Academic Paper

Graph-Based Environment Representation for Vision-and-Language Navigation in Continuous Environments
Document Type
Conference
Source
ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8331-8335, Apr. 2024
Subject
Bioengineering
Communication, Networking and Broadcast Technologies
Computing and Processing
Signal Processing and Analysis
Training
Visualization
Navigation
Convolution
Semantics
Object detection
Acoustics
Vision-Language Navigation in Continuous Environments
Environment Representation
Graph
Object Detection
Language
ISSN
2379-190X
Abstract
The Vision-and-Language Navigation in Continuous Environments (VLN-CE) task requires an agent to follow a language instruction in a realistic environment. Understanding the environment is crucial, yet current methods are relatively simple and direct, without delving into the interplay between language instructions and visual context. Therefore, we propose a novel environment representation. First, we construct an Environment Representation Graph (ERG) through object detection to express the environment at the semantic level. Then, relational representations of object-object and object-agent pairs in the ERG are learned through a Graph Convolutional Network (GCN), yielding a continuous ERG expression. Subsequently, we combine the ERG expression with object label embeddings to obtain the environment representation. Finally, a new cross-modal attention navigation framework is proposed, incorporating our environment representation and a specialized loss function for ERG training. Experimental results demonstrate the effectiveness of our approach in achieving commendable performance on VLN-CE tasks.
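The pipeline in the abstract (object-detection graph → GCN relational encoding → fusion with label embeddings) can be sketched as follows. This is a minimal, hedged illustration in NumPy, not the paper's actual implementation: the adjacency matrix, feature dimensions, and the simple concatenation-based fusion are all assumptions made for the example.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One standard GCN layer: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W).

    A: (n, n) adjacency over detected objects (no self-loops).
    H: (n, d_in) node features; W: (d_in, d_out) learnable weights.
    """
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    deg = A_hat.sum(axis=1)                     # node degrees (>= 1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt    # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0.0)      # ReLU activation

rng = np.random.default_rng(0)
n_objects, d_in, d_out = 4, 8, 8

# Hypothetical ERG: nodes are detected objects, edges encode
# object-object relations (here a random symmetric graph).
A = (rng.random((n_objects, n_objects)) > 0.5).astype(float)
A = np.triu(A, 1)
A = A + A.T

H = rng.standard_normal((n_objects, d_in))       # detector features per object
W = rng.standard_normal((d_in, d_out))
label_emb = rng.standard_normal((n_objects, d_out))  # stand-in label embeddings

erg_expr = gcn_layer(A, H, W)
# Fuse GCN output with object label embeddings (concatenation is an
# assumption; the paper may combine them differently).
env_repr = np.concatenate([erg_expr, label_emb], axis=1)
print(env_repr.shape)  # (4, 16)
```

The resulting per-object environment representation would then feed a cross-modal attention module alongside the instruction encoding; that step is omitted here.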