Academic Paper

ViT-BEVSeg: A Hierarchical Transformer Network for Monocular Birds-Eye-View Segmentation
Document Type
Conference
Source
2022 International Joint Conference on Neural Networks (IJCNN), pp. 1-7, Jul. 2022
Subject
Bioengineering
Computing and Processing
General Topics for Engineers
Power, Energy and Industry Applications
Robotics and Control Systems
Signal Processing and Analysis
Keywords
Training
Robot kinematics
Semantics
Transformers
Feature extraction
Decoding
Convolutional neural networks
Vision Transformer
Bird's Eye View
Autonomous Driving
Language
English
ISSN
2161-4407
Abstract
Generating a detailed near-field perceptual model of the environment is an important and challenging problem for both self-driving vehicles and autonomous mobile robots. A Bird's Eye View (BEV) map is a commonly used panoptic representation: a simplified 2D view of the vehicle's surroundings with accurate semantic-level segmentation, suitable for many downstream tasks. Current state-of-the-art approaches to generating BEV maps employ a Convolutional Neural Network (CNN) backbone to create feature maps, which are passed through a spatial transformer that projects the derived features onto the BEV coordinate frame. In this paper, we evaluate the use of vision transformers (ViT) as a backbone architecture for generating BEV maps. Our network architecture, ViT-BEVSeg, employs standard vision transformers to generate a multi-scale representation of the input image. The resulting representation is then provided as input to a spatial transformer decoder module, which outputs segmentation maps in the BEV grid. We evaluate our approach on the nuScenes dataset, demonstrating a considerable improvement in performance relative to state-of-the-art approaches. Code is available at https://github.com/robotvisionmu/ViT-BEVSeg.
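
To make the data flow described in the abstract concrete, below is a minimal PyTorch sketch, not the authors' implementation (see the linked repository for that). It assumes a timm ViT backbone, uses a single feature scale rather than the paper's multi-scale hierarchy, and substitutes a hypothetical learned grid_sample projection plus a small convolutional head for the paper's spatial transformer decoder module; the class count and BEV grid size are illustrative.

# Minimal sketch of the image -> ViT features -> BEV segmentation pipeline.
# Not the authors' code. Assumptions: a recent timm (forward_features returns
# the full token sequence), a learned grid_sample projection standing in for
# the paper's spatial transformer decoder, and illustrative sizes throughout.
import torch
import torch.nn as nn
import torch.nn.functional as F
import timm


class ViTBEVSegSketch(nn.Module):  # hypothetical class name
    def __init__(self, num_classes: int = 14, bev_size: int = 200):
        super().__init__()
        # Standard vision transformer backbone (num_classes=0 drops the
        # classification head; pretrained=False keeps the sketch offline).
        self.backbone = timm.create_model(
            "vit_small_patch16_224", pretrained=False, num_classes=0
        )
        embed_dim = self.backbone.embed_dim
        # Learned sampling grid in [-1, 1] mapping each BEV cell to an
        # image-plane location; a stand-in for the geometric projection.
        self.bev_grid = nn.Parameter(
            torch.rand(1, bev_size, bev_size, 2) * 2 - 1
        )
        # Small convolutional head producing per-class BEV logits.
        self.head = nn.Sequential(
            nn.Conv2d(embed_dim, 128, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, num_classes, 1),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # Patch tokens from the ViT, reshaped into a 2D feature map.
        tokens = self.backbone.forward_features(image)  # (B, 1+N, C)
        patch_tokens = tokens[:, 1:]                    # drop the CLS token
        b, n, c = patch_tokens.shape
        h = w = int(n ** 0.5)
        feat = patch_tokens.transpose(1, 2).reshape(b, c, h, w)
        # Resample image-plane features onto the BEV grid.
        grid = self.bev_grid.expand(b, -1, -1, -1)
        bev_feat = F.grid_sample(feat, grid, align_corners=False)
        return self.head(bev_feat)                      # (B, classes, H, W)


if __name__ == "__main__":
    model = ViTBEVSegSketch()
    logits = model(torch.randn(1, 3, 224, 224))
    print(logits.shape)  # torch.Size([1, 14, 200, 200])

The learned sampling grid here is only a placeholder for the view transformation; the actual projection in ViT-BEVSeg is performed by the spatial transformer decoder described in the paper.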