Academic Paper

A Large Model’s Ability to Identify 3D Objects as a Function of Viewing Angle
Document Type
Conference
Source
2024 IEEE International Conference on Artificial Intelligence and eXtended and Virtual Reality (AIxVR), pp. 281-288, Jan. 2024
Subject
Computing and Processing
Solid modeling
Three-dimensional displays
Virtual reality
Cameras
Object recognition
Artificial intelligence
Robots
Multimodal interaction
Virtual Reality
CLIP
3D Models
Language
ISSN
2771-7453
Abstract
Virtual reality is increasingly used to support embodied AI agents, such as robots, which frequently engage in ‘sim-to-real’ based learning approaches. At the same time, tools such as large vision-and-language models offer new capabilities that tie into a wide variety of tasks. In order to understand how such agents can learn from simulated environments, we explore a language model’s ability to recover the type of object represented by a photorealistic 3D model as a function of the 3D perspective from which the model is viewed. We used photogrammetry to create 3D models of commonplace objects and rendered 2D images of these models from a fixed set of 420 virtual camera perspectives. A well-studied image and language model (CLIP) was used to generate text (i.e., prompts) corresponding to these images. Using multiple instances of various object classes, we studied which camera perspectives were most likely to return accurate text categorizations for each class of object.
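The pipeline described in the abstract has two computational steps: sampling a fixed set of camera perspectives around each 3D model, and scoring each rendered view against candidate object-category text embeddings. The sketch below illustrates both steps under stated assumptions: the 21 × 20 = 420 azimuth/elevation decomposition is hypothetical (the paper states only that 420 fixed perspectives were used), and the embeddings here are placeholder vectors standing in for CLIP's image and text encoders, whose outputs are compared by cosine similarity.

```python
import math

def camera_poses(n_azimuth=21, n_elevation=20, radius=1.0):
    """Sample camera positions on a sphere centered on the object.

    The 21 x 20 = 420 grid is an assumed decomposition of the paper's
    "fixed set of 420 virtual camera perspectives".
    """
    poses = []
    for i in range(n_azimuth):
        az = 2.0 * math.pi * i / n_azimuth          # azimuth in [0, 2*pi)
        for j in range(n_elevation):
            # Offset by 0.5 so samples avoid the degenerate poles.
            el = math.pi * (j + 0.5) / n_elevation - math.pi / 2.0
            poses.append((radius * math.cos(el) * math.cos(az),
                          radius * math.cos(el) * math.sin(az),
                          radius * math.sin(el)))
    return poses

def classify(image_emb, text_embs):
    """Return the label whose text embedding is most cosine-similar
    to the image embedding -- the zero-shot scoring rule CLIP uses.
    Embeddings here are plain lists standing in for encoder outputs.
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    return max(text_embs, key=lambda label: cos(image_emb, text_embs[label]))
```

In the study's setting, `classify` would be evaluated once per rendered perspective, so accuracy can be aggregated per camera position to reveal which viewing angles yield reliable categorizations.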