Academic paper

Classifying Textual Components of Bilingual Documents with Decision-Tree Support Vector Machines

Document Type

Conference

Author

Lin, Xiao-Rong; Guo, Chien-Yang; Chang, Fu

Source

2011 International Conference on Document Analysis and Recognition (ICDAR), pp. 498-502, Sep. 2011

Subject

Computing and Processing
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
General Topics for Engineers
Training
Feature extraction
Accuracy
Testing
Support vector machines
Training data
Shape
bilingual document
component
decision-tree support vector machine
script and language identification

Language

ISSN

1520-5363
2379-2140

Abstract

In this paper, we propose a method for classifying textual entities of bilingual documents written in Chinese and English. In contrast to earlier work that performed classification at the level of text lines or documents, we apply our method at the level of textual components, because Chinese components must first be identified before they can be merged into intact characters and sent to a Chinese recognizer. To cope with a large training data set containing 365,672 samples, we employ a decision-tree support vector machine (DTSVM) method, which decomposes a given data space into small regions and trains local SVMs on those regions. By applying this method to train classifiers on various combinations of feature types, we were able to complete each training process within 3,500 seconds and achieve higher than 99.6% test accuracy in classifying each textual component as Chinese, alphanumeric, or punctuation. Moreover, the classification showed no strong bias towards any of the three categories.
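The decomposition idea in the abstract (partition the data space with a decision tree, then train a local SVM in each region) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes two classes labelled -1/+1 rather than the paper's three, a simplified split rule (one median threshold per feature, scored by Gini impurity), and a linear SVM trained with a Pegasos-style sub-gradient method. The names `build_dtsvm`, `predict`, `svm_leaf_sizes`, and the `ceiling` parameter are hypothetical and chosen only for this example.

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Pegasos-style sub-gradient descent for a linear SVM; labels are -1/+1."""
    rng = random.Random(seed)
    w = [0.0] * (len(X[0]) + 1)              # last weight acts as the bias
    order, t = list(range(len(X))), 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)
            xi = list(X[i]) + [1.0]
            margin = y[i] * sum(a * b for a, b in zip(w, xi))
            w = [(1.0 - eta * lam) * a for a in w]       # regularisation shrink
            if margin < 1.0:                              # hinge-loss step
                w = [a + eta * y[i] * b for a, b in zip(w, xi)]
    return w

def gini(labels):
    p = sum(1 for v in labels if v == 1) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(X, y):
    """Try one median threshold per feature; return (feature, threshold) or None."""
    best = None
    for f in range(len(X[0])):
        thr = sorted(x[f] for x in X)[len(X) // 2]
        left = [y[i] for i, x in enumerate(X) if x[f] < thr]
        right = [y[i] for i, x in enumerate(X) if x[f] >= thr]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or score < best[0]:
            best = (score, f, thr)
    return None if best is None else (best[1], best[2])

def build_dtsvm(X, y, ceiling):
    """Split until a region is pure or holds at most `ceiling` samples,
    then fit a local SVM on the samples of that region only."""
    if len(set(y)) == 1:
        return ("const", y[0], len(X))       # pure region: no SVM needed
    split = best_split(X, y) if len(X) > ceiling else None
    if split is None:
        return ("svm", train_linear_svm(X, y), len(X))
    f, thr = split
    lo = [i for i, x in enumerate(X) if x[f] < thr]
    hi = [i for i, x in enumerate(X) if x[f] >= thr]
    return ("node", f, thr,
            build_dtsvm([X[i] for i in lo], [y[i] for i in lo], ceiling),
            build_dtsvm([X[i] for i in hi], [y[i] for i in hi], ceiling))

def predict(tree, x):
    """Route x down the tree, then apply the classifier stored at the leaf."""
    while tree[0] == "node":
        _, f, thr, left, right = tree
        tree = left if x[f] < thr else right
    if tree[0] == "const":
        return tree[1]
    score = sum(a * b for a, b in zip(tree[1], list(x) + [1.0]))
    return 1 if score >= 0 else -1

def svm_leaf_sizes(tree):
    """Number of training samples behind each local SVM."""
    if tree[0] == "node":
        return svm_leaf_sizes(tree[3]) + svm_leaf_sizes(tree[4])
    return [tree[2]] if tree[0] == "svm" else []

# Toy XOR-style data: no single linear boundary separates the two classes.
rng = random.Random(0)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(400)]
y = [1 if (a > 0) == (b > 0) else -1 for a, b in X]
tree = build_dtsvm(X, y, ceiling=100)
accuracy = sum(predict(tree, x) == t for x, t in zip(X, y)) / len(X)
print(f"training accuracy: {accuracy:.3f}")
print("local SVM sizes:", svm_leaf_sizes(tree))
```

Each local SVM sees at most `ceiling` samples, so its training cost is bounded by the region size rather than by the size of the full data set, which is the property the abstract relies on to handle 365,672 samples.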
