Towers of Babel: Combining Images, Language, and 3D Geometry for
                Learning Multimodal Vision
       Xiaoshi Wu¹     Hadar Averbuch-Elor²     Jin Sun²     Noah Snavely²
          ¹Tsinghua University     ²Cornell Tech, Cornell University
Figure 1: Our WikiScenes dataset combines 3D reconstructions, images, and language descriptions for dozens of landmarks, like the
Barcelona and Reims Cathedrals pictured above. WikiScenes enables ...
                                    