Academic paper

Utilizing image-based features in biomedical document classification
Document Type
Conference
Source
2015 IEEE International Conference on Image Processing (ICIP), pp. 4451-4455, Sep. 2015
Subject
Computing and Processing
Signal Processing and Analysis
DNA
Biomedical imaging
Decision trees
Optical character recognition software
Vegetation
Context
Proteins
image-based features
OCR
document classification
document-representation
bioinformatics
Language
Abstract
Images form a rich information source that remains underutilized in biomedical document classification. We present work that uses both image- and text-based features to identify articles of interest, in this case articles pertaining to cis-regulatory modules in the context of gene networks. Extending our recently introduced idea of using OCR-based features to identify DNA content in images, we combine image- and text-based classifiers to categorize documents as relevant or irrelevant to cis-regulatory modules. Using a set of hundreds of articles, marked by experts as relevant or irrelevant to cis-regulatory modules, we train and test image-based and text-based classifiers, as well as classifiers integrating both. Our results indicate that the integrated classifiers perform best, with Recall, F-measure, and Utility all above 0.9, demonstrating the significance of incorporating image data, and specifically OCR-based features, into the document categorization process. Moreover, the use of character-distribution properties to represent images is directly relevant to other biomedical images containing text (e.g., RNA, proteins). Diagrams and other images containing text are also prevalent outside the biomedical domain, so the work stands to be applicable and beneficial in other application areas.
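The abstract's idea of representing an image by the character distribution of its OCR output can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`char_distribution`, `dna_fraction`, `recall_and_f1`), the restriction to the four-letter DNA alphabet, and the use of plain letter frequencies are assumptions made for illustration only, and the OCR step itself is assumed to have already produced a text string. The Utility measure is omitted because the record does not define it.

```python
from collections import Counter

# Assumption: DNA content is signalled by the letters A, C, G, T.
DNA_ALPHABET = "ACGT"


def char_distribution(ocr_text: str) -> dict[str, float]:
    """Return the relative frequency of each DNA letter among all
    alphabetic characters in the OCR output (hypothetical feature)."""
    letters = [ch for ch in ocr_text.upper() if ch.isalpha()]
    if not letters:
        return {base: 0.0 for base in DNA_ALPHABET}
    counts = Counter(letters)
    total = len(letters)
    return {base: counts[base] / total for base in DNA_ALPHABET}


def dna_fraction(ocr_text: str) -> float:
    """Fraction of alphabetic characters that are A, C, G, or T."""
    return sum(char_distribution(ocr_text).values())


def recall_and_f1(predicted: list[bool], actual: list[bool]) -> tuple[float, float]:
    """Recall and F-measure (F1) for binary relevant/irrelevant labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, f1
```

Under these assumptions, an image whose OCR text consists mostly of A, C, G, and T yields a `dna_fraction` close to 1, while ordinary caption text yields a much lower value. Such a score could serve as one image-based feature alongside text-based features in a combined classifier.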