Academic Journal Article

두 개의 인코더를 이용한 장면 텍스트 인식 (Scene Text Recognition Using Two Encoders)
Document Type
Academic Journal
Source
제어로봇시스템학회 논문지 (Journal of Institute of Control, Robotics and Systems). 2023-12, 29(12):973-979
Subject
deep learning
scene text recognition
convolutional neural network
transformer
Language
Korean
ISSN
1976-5622
2233-4335
Abstract
Despite significant advances in scene text recognition, current models still face substantial challenges, particularly with irregular text images featuring complex backgrounds, curved text, diverse fonts, and distortions. Convolutional neural network (CNN)-based text recognition networks have demonstrated strong performance, but they struggle with these irregular cases. Recently, transformer-based feature extractors have shown advantages in extracting global features from images, especially irregular text images. Through self-attention, these transformers establish information connections between different parts of the image, mitigating the impact of uneven character distribution. This study proposes multi-encoder scene text recognition (MESTR), a hybrid approach that combines a CNN-based and a transformer-based feature extractor. MESTR extracts local and global features from text images simultaneously and integrates both types of features to improve performance. During training, we employ a guiding connectionist temporal classification (CTC) decoder [6] as a compensatory training strategy for the attentional decoder. Experiments on seven benchmarks demonstrate the efficacy and robustness of MESTR, and ablation studies validate the effectiveness of the proposed algorithm for scene text recognition.
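To illustrate the idea behind the abstract, the sketch below shows, in plain NumPy, how a self-attention step mixes information across all patch positions (the "global" view attributed to the transformer branch) and how such global features can be fused with local CNN-like features. This is a minimal illustration only: the function names, dimensions, and the random linear projection standing in for a learned fusion layer are assumptions, not the actual MESTR architecture.

```python
import numpy as np

def self_attention(x):
    """Single-head self-attention over a sequence of patch features.

    Each output position is a weighted sum of every input position,
    which is how a transformer encoder links distant image regions."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                 # (seq, seq) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)  # softmax, numerically stable
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x                            # (seq, d) globally mixed features

def fuse_features(local_feats, global_feats, rng):
    """Concatenate local (CNN-like) and global (transformer-like) features,
    then project back to the original width.  The random linear map is a
    stand-in for a learned fusion layer, purely for illustration."""
    fused = np.concatenate([local_feats, global_feats], axis=-1)  # (seq, 2d)
    w = rng.standard_normal((fused.shape[-1], local_feats.shape[-1])) * 0.01
    return fused @ w                              # (seq, d)

rng = np.random.default_rng(0)
seq_len, d = 8, 16                           # 8 patch positions, 16-dim features
local = rng.standard_normal((seq_len, d))    # stand-in for a CNN feature map
global_ = self_attention(local)              # globally mixed view of the same map
out = fuse_features(local, global_, rng)
print(out.shape)                             # (8, 16)
```

Because every row of the attention weights spans all positions, each fused feature carries context from the whole image, which is the property the abstract credits with handling unevenly distributed characters.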