Academic Article

CFNAM-PG: Bridging Phonetic and Glyphic Information for Chinese Full Name and Abbreviation Matching Based on Simbert and DenseNet
Document Type
article
Source
International Journal of Computational Intelligence Systems, Vol 17, Iss 1, Pp 1-14 (2024)
Subject
Near homophone
Near homoglyph
Multimodal feature fusion
Full name
Abbreviation matching
Electronic computers. Computer science
QA75.5-76.95
Language
English
ISSN
1875-6883
Abstract
Matching abbreviated names with their full names (full-abbr matching) plays a key role in data integration, address matching, information retrieval, and other fields. Traditional full-abbr matching techniques often struggle with near homophones and near homoglyphs. First, a near-homophone full-abbr matching model based on Simbert and VGG was proposed. It integrates character and speech features, leverages a speech recognition model, and combines linguistic knowledge with neural networks through a brain-like dual-process cognitive learning mechanism. Second, to address near-homoglyph full-abbr matching in Chinese, a DenseNet-based model that fuses glyph structure and image features was proposed, in which statistical feature extractors produce separate feature vectors for the stroke, Wubi, and structural features of each glyph. Lastly, the near-homophone and near-homoglyph models were coupled to work together on the full-abbr matching task, with expert knowledge serving as a component of the feature optimizer. Experimental results showed that the integrated model significantly increased the matching accuracy to 87.5%, demonstrating a 12.3% improvement.