Academic Journal Article

두 개의 인코더를 이용한 장면 텍스트 인식 (Scene Text Recognition Using Two Encoders)
Document Type
Academic Journal
Source
제어로봇시스템학회 논문지 (Journal of Institute of Control, Robotics and Systems). 2023-12, 29(12):973-979
Subject
deep learning
scene text recognition
convolutional neural network
transformer
Language
Korean
ISSN
1976-5622
2233-4335
Abstract
Despite significant advances in scene text recognition, current models still face substantial challenges, particularly with irregular text images featuring complex backgrounds, curved text, diverse fonts, and distortions. Convolutional neural network (CNN)-based text recognition networks have demonstrated strong performance, but they struggle with these irregular cases. Recently, transformer-based feature extractors have shown advantages in extracting global features from images, especially irregular text images. Through self-attention, these transformers establish information connections between different parts of the image, mitigating the impact of uneven character distribution. This study proposes multi-encoder scene text recognition (MESTR), a hybrid approach that combines a CNN-based and a transformer-based feature extractor. MESTR extracts local and global features from text images simultaneously and integrates both types of features to improve performance. During training, we employ a guiding connectionist temporal classification (CTC) decoder [6] as a compensatory training strategy for the attentional decoder. Experiments on seven benchmarks demonstrate the efficacy and robustness of MESTR, and ablation studies validate the effectiveness of the proposed algorithm for scene text recognition.
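To illustrate the idea behind the abstract, the sketch below shows, in plain NumPy, how a self-attention step mixes information across all patch positions (the "global" view attributed to the transformer branch) and how such global features can be fused with local CNN-like features. This is a minimal illustration only: the function names, dimensions, and the random linear projection standing in for a learned fusion layer are assumptions, not the actual MESTR architecture.

```python
import numpy as np

def self_attention(x):
    """Single-head self-attention over a sequence of patch features.

    Each output position is a weighted sum of every input position,
    which is how a transformer encoder links distant image regions."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                 # (seq, seq) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)  # softmax, numerically stable
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x                            # (seq, d) globally mixed features

def fuse_features(local_feats, global_feats, rng):
    """Concatenate local (CNN-like) and global (transformer-like) features,
    then project back to the original width.  The random linear map is a
    stand-in for a learned fusion layer, purely for illustration."""
    fused = np.concatenate([local_feats, global_feats], axis=-1)  # (seq, 2d)
    w = rng.standard_normal((fused.shape[-1], local_feats.shape[-1])) * 0.01
    return fused @ w                              # (seq, d)

rng = np.random.default_rng(0)
seq_len, d = 8, 16                           # 8 patch positions, 16-dim features
local = rng.standard_normal((seq_len, d))    # stand-in for a CNN feature map
global_ = self_attention(local)              # globally mixed view of the same map
out = fuse_features(local, global_, rng)
print(out.shape)                             # (8, 16)
```

Because every row of the attention weights spans all positions, each fused feature carries context from the whole image, which is the property the abstract credits with handling unevenly distributed characters.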