Academic Paper

A machine-learning approach to discovering company home pages
Document Type
Conference
Source
4th IEEE International Conference on Digital Ecosystems and Technologies (DEST), 2010, pp. 361-366, Apr 2010
Subject
Computing and Processing
Communication, Networking and Broadcast Technologies
Companies
Web pages
Training
Logistics
Feature extraction
Biological system modeling
Search engines
Language
ISSN
2150-4938
2150-4946
Abstract
For many marketing and business applications, it is necessary to know the home page of a company specified only by its company name. If we require the home page for a small number of big companies, this task is readily accomplished via use of Internet search engines or access to domain registration lists. However, if the entities of interest are small companies, these approaches can lead to mismatches, particularly if a specified company lacks a home page. We address this problem using a supervised machine-learning approach in which we train a binary classification model. We classify potential website matches for each company name based on a set of explanatory features extracted from the content on each candidate website. Our approach is related to web-based business intelligence in two ways: (1) we build the training set for our learning algorithms through crowdsourcing tools and illustrate their potential for business research, and (2) the success of our model allows one to easily use corporate home pages as data inputs into other research projects. Through the successful use of crowdsourcing, our approach is able to identify a correct home page or recognize that a valid home page does not exist with an accuracy that is 57% better than simply taking the highest ranked search engine result as the correct match.
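The classification step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which features or which learning algorithm the authors used, so everything below is an assumption made for illustration: the three features (title overlap, name in domain, inverse search rank), the plain logistic regression trained by stochastic gradient descent, the 0.5 threshold, and all company names and URLs, which are invented toy data. Each candidate page is scored against the company name, and the best candidate is returned only if its score clears the threshold; otherwise the sketch reports that no valid home page was found.

```python
import math

def features(company_name, page):
    """Hypothetical features; the abstract does not list the real feature set."""
    name_tokens = set(company_name.lower().split())
    title_tokens = set(page["title"].lower().split())
    url = page["url"].lower()
    title_overlap = len(name_tokens & title_tokens) / len(name_tokens)
    name_in_domain = 1.0 if any(tok in url for tok in name_tokens) else 0.0
    rank_score = 1.0 / page["rank"]  # search-engine rank, 1 = top result
    return [1.0, title_overlap, name_in_domain, rank_score]  # leading 1.0 = bias

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(weights, x):
    """Estimated probability that the page is the company's home page."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))

def train(rows, labels, epochs=2000, lr=0.5):
    """Logistic regression fitted by stochastic gradient descent (assumed model)."""
    weights = [0.0] * len(rows[0])
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = score(weights, x)
            weights = [w + lr * (y - p) * xi for w, xi in zip(weights, x)]
    return weights

def best_match(company_name, candidates, weights, threshold=0.5):
    """Return the best candidate URL, or None if no candidate looks like a home page."""
    scored = [(score(weights, features(company_name, c)), c["url"]) for c in candidates]
    if not scored:
        return None
    p, url = max(scored)
    return url if p >= threshold else None

# Invented toy training set standing in for crowdsourced labels (1 = correct home page).
labeled = [
    ("Acme Widgets", {"url": "acmewidgets.example", "title": "Acme Widgets Home", "rank": 2}, 1),
    ("Acme Widgets", {"url": "bizlist.example", "title": "Business directory listing", "rank": 1}, 0),
    ("Borealis Bakery", {"url": "borealisbakery.example", "title": "Borealis Bakery | Fresh Bread", "rank": 1}, 1),
    ("Borealis Bakery", {"url": "reviews.example", "title": "Bakery reviews near you", "rank": 2}, 0),
    ("Cobalt Plumbing", {"url": "socialsite.example", "title": "Cobalt Plumbing profile", "rank": 1}, 0),
]
rows = [features(name, page) for name, page, _ in labeled]
labels = [label for _, _, label in labeled]
weights = train(rows, labels)

directory_page = {"url": "maps.example", "title": "Florist listings", "rank": 1}
own_site = {"url": "deltaflorist.example", "title": "Delta Florist", "rank": 2}

print(best_match("Delta Florist", [directory_page, own_site], weights))
print(best_match("Delta Florist", [directory_page], weights))
```

In this toy setup the top-ranked search result is a directory listing, so taking the highest-ranked result would give a mismatch, while the classifier can prefer the lower-ranked candidate or decline to pick any candidate at all. The threshold controls how readily the sketch concludes that a company has no home page.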