Academic Article

Extracting Domain Information using Deep Learning
Document Type
Conference
Source
Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (learning), pp. 1-7
Subject
Domain Informational Vocabulary Extraction (DIVE)
entity extraction
neural networks
text mining
Language
English
Abstract
Across various scientific domains, the digital publication of technical documents, often in the form of conference or journal article submissions, is the first accessible instance of new human knowledge in the respective fields. Synthesizing and curating this information is a slow and difficult process that often requires non-trivial human expertise. Given the ever-increasing rate of these publications and the natural limitations of manual approaches, a computational solution to this problem is the paramount need of the hour. One of the central tasks is to extract important phrases and terminology from a scientific article. Although many tools are available for extracting keywords and named entities from a document, a key challenge is to determine how important they are with regard to the context of the entire document. In some cases, the entities important to an article may be new vocabulary that has not appeared, or has rarely appeared, in previous data. In other cases, there are existing entities that are less important for the particular article but are weighted more heavily by models based on prior knowledge. In this paper, we investigate how deep learning methods may be used to address this issue. We have developed a computational tool that provides entity extraction and expert curation functionality. The tool has been integrated with the publication pipeline used by the American Society of Plant Biologists. Using the author feedback mechanism in our deployed tool, we were able to create an expert-annotated dataset based on articles submitted over an entire year. Using this new gold-standard dataset for supervised training, we are now able to contrast several methods for the entity extraction task. We use the NeuroNER tool to investigate the effectiveness of deep neural networks in this task and contrast it with tools based on a variety of other methods, such as ABNER (using CRF) and DIVE (using an ensemble of regular expression rules, keyword dictionaries, and ontology files). Our results show that DIVE's ensemble of methods has higher precision than the pre-trained CRF models included in ABNER. However, early results from training NeuroNER on author annotations show very promising improvement in predicting the important words in the documents.
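For readers unfamiliar with the rule- and dictionary-based extraction approach the abstract attributes to DIVE, the following minimal Python sketch illustrates the general idea of combining regular expression rules, a keyword dictionary, and ontology terms into a single candidate set. The patterns, dictionary entries, and function names here are illustrative assumptions for exposition, not DIVE's actual rules or API.

```python
import re

# Illustrative stand-ins for the three knowledge sources (assumed examples,
# not the actual DIVE rule set, dictionary, or ontology files).
REGEX_RULES = [
    re.compile(r"\b[A-Z][a-z]+ [a-z]+\b"),   # e.g. binomial-style species names
    re.compile(r"\b[A-Z]{2,}\d*\b"),         # e.g. gene/protein-style symbols
]
KEYWORD_DICTIONARY = {"auxin", "photosynthesis", "arabidopsis thaliana"}
ONTOLOGY_TERMS = {"flower development", "root hair elongation"}

def extract_entities(text: str) -> set:
    """Return candidate domain entities matched by any of the three sources."""
    candidates = set()
    for rule in REGEX_RULES:
        candidates.update(m.group(0) for m in rule.finditer(text))
    lowered = text.lower()
    for term in KEYWORD_DICTIONARY | ONTOLOGY_TERMS:
        if term in lowered:
            candidates.add(term)
    return candidates

if __name__ == "__main__":
    sample = "Auxin signalling in Arabidopsis thaliana regulates flower development."
    print(extract_entities(sample))
```

An ensemble of this kind needs no training data, which is why, as the abstract notes, it can be contrasted with supervised CRF models (ABNER) and neural sequence labelers (NeuroNER) trained on the author-annotated gold standard.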
