Academic Paper

Training the Genre Classifier for Automatic Classification of Web Pages
Document Type
Conference
Source
2007 29th International Conference on Information Technology Interfaces (ITI 2007), pp. 93-98, Jun. 2007
Subject
Communication, Networking and Broadcast Technologies
Computing and Processing
Signal Processing and Analysis
Web pages
Machine learning algorithms
Search engines
Internet
Feature extraction
Data mining
Decision trees
Bagging
Testing
Africa
genre classification
web page
genre features
ensemble algorithm
Language
ISSN
1330-1012
Abstract
This paper presents experiments on classifying web pages by genre. First, a corpus of 1539 manually labeled web pages was prepared. Second, 502 genre features were selected based on the literature and on observation of the corpus. Third, these features were extracted from the corpus to obtain a data set. Finally, two machine learning algorithms, one that induces decision trees (J48) and one ensemble algorithm (bagging), were trained and tested on the data set. On average, the ensemble algorithm achieved 17% better precision and 1.6% better accuracy but slightly worse recall; the F-measure did not vary significantly. The results indicate that classification by genre could be a useful addition to search engines.
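
The abstract compares a single decision tree with a bagged ensemble on precision, recall, accuracy and F-measure. The sketch below is not the authors' code. It only illustrates, under stated assumptions, how the votes of several trees can be combined by majority and how the four measures can be computed for one genre treated one-vs-rest. The function names `majority_vote` and `genre_metrics`, the genre labels, and all toy predictions are hypothetical and invented for illustration; the trees themselves are not trained here.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine the labels predicted by several models into a single label."""
    return Counter(predictions).most_common(1)[0][0]


def genre_metrics(true_labels, predicted_labels, genre):
    """Precision, recall, F-measure and accuracy for one genre (one-vs-rest)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == genre and p == genre)
    fp = sum(1 for t, p in pairs if t != genre and p == genre)
    fn = sum(1 for t, p in pairs if t == genre and p != genre)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "accuracy": accuracy}


# Toy data, invented for illustration only (not taken from the paper's corpus).
true_labels = ["blog", "blog", "shop", "faq", "blog", "shop"]

# Outputs of three hypothetical trees, each assumed to have been trained
# on a different bootstrap sample of the training pages.
tree_outputs = [
    ["blog", "shop", "shop", "faq", "blog", "blog"],
    ["blog", "blog", "shop", "blog", "shop", "shop"],
    ["blog", "blog", "faq", "blog", "blog", "shop"],
]

# Bagging-style prediction: one majority vote per page across the trees.
ensemble = [majority_vote(votes) for votes in zip(*tree_outputs)]

single_tree = genre_metrics(true_labels, tree_outputs[0], "blog")
metrics = genre_metrics(true_labels, ensemble, "blog")
```

Comparing `single_tree` with `metrics` shows how the measures can move in different directions for the same genre, which is why the abstract reports them separately. The numbers from this toy data say nothing about the paper's actual results.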