Academic Paper

On Vocabulary Reliance in Scene Text Recognition
Document Type
Conference
Source
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11422-11431, Jun. 2020
Subject
Computing and Processing
Keywords
Vocabulary; Text recognition; Decoding; Training; Training data; Visualization; Context modeling; Language
ISSN
2575-7075
Abstract
The pursuit of high performance on public benchmarks has been the driving force for research in scene text recognition, and notable progress has been achieved. However, a close investigation reveals the startling fact that state-of-the-art methods perform well on images with words within the vocabulary but generalize poorly to images with words outside the vocabulary. We call this phenomenon "vocabulary reliance". In this paper, we establish an analytical framework, in which different datasets, metrics and module combinations for quantitative comparison are devised, to conduct an in-depth study of the problem of vocabulary reliance in scene text recognition. Key findings include: (1) Vocabulary reliance is ubiquitous, i.e., all existing algorithms exhibit this characteristic to some degree; (2) Attention-based decoders prove weak in generalizing to words outside the vocabulary, while segmentation-based decoders perform well in utilizing visual features; (3) Context modeling is highly coupled with the prediction layers. These findings provide new insights and can benefit future research in scene text recognition. Furthermore, we propose a simple yet effective mutual learning strategy that allows models of two families (attention-based and segmentation-based) to learn collaboratively. This remedy alleviates the problem of vocabulary reliance and significantly improves overall scene text recognition performance.
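The abstract does not give the exact formulation of the proposed mutual learning strategy; as a rough illustration, mutual learning schemes of this kind typically add a KL-divergence term that pulls each model's per-character prediction toward its peer's. The sketch below is a generic, hypothetical version of such a loss (all function names and values are illustrative, not taken from the paper):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete character distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mutual_learning_loss(supervised_loss, p_self, p_peer, weight=1.0):
    """Supervised loss plus a KL term pulling this model's prediction
    toward the peer model's prediction (generic mutual learning)."""
    return supervised_loss + weight * kl_divergence(p_peer, p_self)

# Toy character distributions from the two recognizer families
# (hypothetical values for illustration only).
p_attn = [0.7, 0.2, 0.1]   # attention-based decoder
p_seg  = [0.6, 0.3, 0.1]   # segmentation-based decoder

# Each model is trained on its own supervised loss plus the peer term.
loss_attn = mutual_learning_loss(0.5, p_attn, p_seg)
loss_seg  = mutual_learning_loss(0.4, p_seg, p_attn)
```

Because the KL term is asymmetric, each model receives its own peer-regularized objective; intuitively, the segmentation branch's stronger use of visual features can discourage the attention branch from over-relying on the training vocabulary.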