Academic Paper

A Language Prior Based Focal Loss for Visual Question Answering
Document Type
Conference
Source
2021 IEEE International Conference on Multimedia and Expo (ICME), pp. 1-6, Jul. 2021
Subject
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
Signal Processing and Analysis
Training
Deep learning
Visualization
Computational modeling
Refining
Predictive models
Knowledge discovery
Visual Question Answering
Language Priors
Focal Loss
Language
ISSN
1945-788X
Abstract
According to current research, one of the major challenges facing Visual Question Answering (VQA) models is overdependence on language priors at the expense of the visual modality. VQA models tend to predict answers based only on superficial correlations between the first few words of the question and the frequency of related answer candidates. To address this issue, we propose a novel Language Prior based Focal Loss (LP-Focal Loss) that rescales the standard cross-entropy loss. Specifically, we employ a question-only branch to capture the language bias for each answer candidate given the corresponding question input. The LP-Focal Loss then dynamically assigns lower weights to biased answers when computing the training loss, thereby reducing the contribution of more-biased instances in the training split. Extensive experiments show that the LP-Focal Loss can be applied to common baseline VQA models in general, and achieves significantly better performance on the VQA-CP v2 dataset, with an overall 18% accuracy boost over benchmark models.
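The abstract does not give the exact loss formula, but the described mechanism (rescaling cross-entropy by a question-only bias estimate, in the style of focal loss) can be sketched as follows. The modulating factor `(1 - p_bias) ** gamma` and the name `lp_focal_loss` are illustrative assumptions, not the paper's published formulation:

```python
import math

def lp_focal_loss(p_model, p_bias, gamma=2.0):
    """Hypothetical LP-Focal Loss sketch for one ground-truth answer.

    p_model: probability the full VQA model assigns to the ground-truth answer.
    p_bias:  probability a question-only branch assigns to the same answer,
             used as a proxy for the language prior.
    The cross-entropy term -log(p_model) is rescaled by (1 - p_bias)**gamma,
    so answers the question-only branch already predicts confidently
    (i.e., heavily biased instances) contribute less to training.
    """
    return -((1.0 - p_bias) ** gamma) * math.log(p_model)

# With equal model confidence, the more-biased sample is down-weighted.
unbiased_loss = lp_focal_loss(p_model=0.6, p_bias=0.1)
biased_loss = lp_focal_loss(p_model=0.6, p_bias=0.9)
```

With `p_bias = 0` the loss reduces to plain cross-entropy, which matches the abstract's framing of the method as a rescaling of the standard loss.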