
Identifying computer-generated text using statistical analysis
Document Type
Conference
Source
2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Dec. 2017, pp. 1504-1511
Subject
Communication, Networking and Broadcast Technologies
Signal Processing and Analysis
Feature extraction
Linear regression
Standards
Data mining
Loss measurement
Natural languages
Language
Abstract
Computer-based automatically generated text is used in various applications (e.g., text summarization, machine translation) and has come to play an important role in daily life. However, computer-generated text may convey confusing information due to translation errors and inappropriate wording caused by faulty language processing, which could be a critical issue in presidential elections and product advertisements. Previous methods for detecting computer-generated text typically estimate text fluency, but this may not be useful in the near future due to the development of neural-network-based natural language generation that produces wording close to human-crafted wording. A different approach to detecting computer-generated text is thus needed. We hypothesize that human-crafted wording is more consistent than that of a computer. For instance, Zipf's law states that the most frequent word in human-written text occurs approximately twice as often as the second most frequent word, nearly three times as often as the third most frequent word, and so on. We found that this does not hold for computer-generated text. We therefore propose a method for identifying computer-generated text on the basis of statistics. First, the word-frequency distributions are compared with the corresponding Zipfian distributions to extract frequency features. Next, complex-phrase features are extracted because human-generated text contains more complex phrases than computer-generated text. Finally, the higher consistency of human-generated text is quantified at the sentence level using phrasal verbs and at the paragraph level using coreference resolution relationships, which are integrated into consistency features. The combination of the frequency, complex-phrase, and consistency features was evaluated on 100 English books written originally in English and 100 English books translated from Finnish.
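The frequency features described above rest on Zipf's law: the word of rank r should occur about 1/r times as often as the most frequent word. As a minimal illustrative sketch (not the authors' implementation; the function name and deviation measure are assumptions), one can score how far a text's observed rank-frequency curve departs from that ideal Zipfian curve:

```python
from collections import Counter
import math

def zipf_deviation(text):
    """Mean absolute log-deviation between observed word frequencies
    and the ideal Zipfian frequencies (rank r -> top_freq / r).
    Higher values suggest the text departs from Zipf's law.
    Illustrative sketch only, not the paper's exact feature."""
    words = text.lower().split()
    # Observed frequencies, sorted from most to least frequent.
    freqs = sorted(Counter(words).values(), reverse=True)
    top = freqs[0]
    deviations = [
        abs(math.log(f) - math.log(top / r))  # log-space gap at rank r
        for r, f in enumerate(freqs, start=1)
    ]
    return sum(deviations) / len(deviations)
```

A perfectly Zipfian text (e.g., one word occurring twice and another once) scores 0, while a text whose words are all equally frequent scores above 0.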
The results show that our method achieves better performance (accuracy = 98.0%; equal error rate = 2.9%) than the best existing method for books, which is based on parse-tree feature extraction. Evaluation on two other languages (French and Dutch) showed similar results. The proposed method thus works consistently across languages.
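The equal error rate reported above is the operating point at which the false-acceptance rate (computer-generated text accepted as human) equals the false-rejection rate (human text flagged as computer-generated). A sketch of how it can be computed from detector scores, assuming higher scores indicate human-written text (the function and score lists are illustrative, not from the paper):

```python
def equal_error_rate(scores_pos, scores_neg):
    """Sweep a decision threshold over detector scores and return the
    approximate rate at which the false-acceptance rate (FAR, negatives
    scoring at or above the threshold) equals the false-rejection rate
    (FRR, positives scoring below it). Illustrative sketch only."""
    thresholds = sorted(set(scores_pos) | set(scores_neg))
    best_gap, eer = 1.0, 1.0
    for t in thresholds:
        far = sum(s >= t for s in scores_neg) / len(scores_neg)
        frr = sum(s < t for s in scores_pos) / len(scores_pos)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

With perfectly separated scores the EER is 0; the 2.9% figure above means the two error rates cross at 0.029.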