Academic Paper

Detection of Anorexic Girls-In Blog Posts Written in Hebrew Using a Combined Heuristic AI and NLP Method
Document Type
Periodical
Source
IEEE Access. 10:34800-34814, 2022
Subject
Aerospace
Bioengineering
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
Engineered Materials, Dielectrics and Plasmas
Engineering Profession
Fields, Waves and Electromagnetics
General Topics for Engineers
Geoscience
Nuclear Engineering
Photonics and Electrooptics
Power, Energy and Industry Applications
Robotics and Control Systems
Signal Processing and Analysis
Transportation
Social networking (online)
Blogs
Task analysis
Feature extraction
Depression
Support vector machines
Anxiety disorders
Mental disorders
natural language processing
supervised machine learning
text analysis
text classification
text processing
Language
ISSN
2169-3536
Abstract
In this study, we aim to detect girls who are suspected of being anorexic from social media texts written in Hebrew. We constructed a dataset containing 100 blog posts written by females who are probably anorexic and 100 blog posts written by females who are likely to be non-anorexic. The construction of this dataset was supervised and approved by an international expert on anorexia. We tested several text classification (TC) methods, using various feature sets (content-based and style-based), five machine learning (ML) methods, three RNN models, four BERT models, three basic preprocessing methods, three feature filtering methods, and parameter tuning. The main findings are as follows. A set of 50 word n-grams (mostly word unigrams) supplied by an expert proved to be a good basic detector. A heuristic process based on the random forest ML method overcame a combinatorial explosion and led to a significant improvement over the baseline result at a level of $P = 0.01$. Applying an iterative process that tests combinations of “k out of $n'$”, where $n' < n$ (n is the number of feature sets), led to a result of 90.63%, using a combination of 300 features from ten feature sets.
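The “k out of $n'$” idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `score` callable, and the rule for choosing the $n'$ shortlist (ranking each feature set by its individual score) are assumptions made only for illustration. In the paper the heuristic is based on a random forest classifier; here `score` stands in for any evaluation routine that returns a quality measure for a combination of feature sets.

```python
from itertools import combinations
from math import comb
from typing import Callable, Sequence, Tuple


def best_combination(
    feature_sets: Sequence[str],
    score: Callable[[Tuple[str, ...]], float],
    n_prime: int,
    k: int,
) -> Tuple[Tuple[str, ...], float]:
    """Exhaustively test every k-subset of a shortlist of n_prime feature sets.

    Assumption: the shortlist is the n_prime feature sets with the highest
    individual score. The source does not specify how the shortlist is built.
    """
    if not 1 <= k <= n_prime <= len(feature_sets):
        raise ValueError("need 1 <= k <= n_prime <= len(feature_sets)")

    # Rank feature sets by their stand-alone score and keep the top n_prime.
    ranked = sorted(feature_sets, key=lambda name: score((name,)), reverse=True)
    shortlist = ranked[:n_prime]

    # Evaluate each k-subset of the shortlist and remember the best one.
    best_combo: Tuple[str, ...] = ()
    best_score = float("-inf")
    for combo in combinations(shortlist, k):
        current = score(combo)
        if current > best_score:
            best_combo, best_score = combo, current
    return best_combo, best_score


def search_space_size(n: int, n_prime: int, k: int) -> Tuple[int, int]:
    """Return (k-subsets of all n sets, k-subsets of the n_prime shortlist)."""
    return comb(n, k), comb(n_prime, k)
```

Restricting the search to a shortlist is what keeps the number of evaluated combinations manageable: each call to `score` would typically involve training and cross-validating a classifier, so shrinking n to $n'$ reduces the cost by the ratio of the two binomial coefficients returned by `search_space_size`.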