Academic paper

Overcoming Data Sparsity in Predicting User Characteristics from Behavior through Graph Embeddings
Document Type
Conference
Source
2020 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Dec. 2020, pp. 32-36
Subject
Computing and Processing
Uniform resource locators
Training
Social networking (online)
Supervised learning
Semisupervised learning
Task analysis
Testing
graph embeddings
demographic prediction
network representation learning
Language
English
ISSN
2473-991X
Abstract
Understanding user characteristics such as demographic information is useful for personalizing the online content promoted to users. However, it is difficult to obtain such data for every user who visits a website. Since demographic data can be collected for some users, their behavior can be used to predict the attributes of users for whom it is unknown. In the setting of online news consumption, we can infer user attributes from the articles they view. Most existing models take a supervised learning approach to this task. However, by representing the user-URL interactions as a network, we can convert it into a semi-supervised learning problem and learn embeddings for users. Graph embeddings have become popular in recent years, with research mainly focusing on algorithmic developments. However, while we have an intuitive understanding of the problems they may overcome, such as data sparsity, this question remains unexplored in the domain of demographic prediction from behavior. In this paper, we first investigate the effectiveness of user embeddings generated through network representation learning by comparing their predictive performance with that of other traditional feature sets, including content- and item-based features. We find that the embeddings provide a general-purpose user representation across two prediction tasks: (1) gender prediction (classification) and (2) age prediction (regression). Second, we explore the advantages of these embeddings over the other methods in two cases of data sparsity, where (1) the training and testing sets of users are temporally split and (2) the user labels are imbalanced. In both cases, the embeddings outperform the baseline.
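
To make the semi-supervised framing concrete, the sketch below (not the authors' code; data, node names, and hyperparameters are hypothetical) builds a user-URL bipartite graph, learns node embeddings with uniform random walks plus skip-gram (a DeepWalk-style simplification of network representation learning), and trains a classifier on the labeled users to predict gender for an unlabeled one. An analogous Ridge regression on the same embeddings would cover the age-prediction (regression) task.

```python
# Illustrative sketch only: bipartite user-URL graph -> random-walk
# embeddings -> gender classification for unlabeled users.
import random

import networkx as nx
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# Toy interaction data: (user_id, url_id) view pairs (hypothetical).
interactions = [
    ("u1", "url_a"), ("u1", "url_b"),
    ("u2", "url_b"), ("u2", "url_c"),
    ("u3", "url_a"), ("u3", "url_c"),
    ("u4", "url_c"), ("u4", "url_d"),
]
labels = {"u1": 0, "u2": 1, "u3": 0}  # known gender labels; u4 is unlabeled

# Build the bipartite user-URL interaction graph.
G = nx.Graph()
G.add_edges_from(interactions)

def random_walks(graph, num_walks=10, walk_length=20, seed=0):
    """Generate uniform random walks starting from every node."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for node in graph.nodes():
            walk = [node]
            while len(walk) < walk_length:
                neighbors = list(graph.neighbors(walk[-1]))
                if not neighbors:
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Learn node embeddings by running skip-gram (Word2Vec) over the walks.
walks = random_walks(G)
model = Word2Vec(sentences=walks, vector_size=32, window=5,
                 min_count=0, sg=1, workers=1, seed=0)

# Semi-supervised use: fit on labeled users, predict the unlabeled user.
train_users = list(labels)
X_train = [model.wv[u] for u in train_users]
y_train = [labels[u] for u in train_users]

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Predicted gender class for u4:", clf.predict([model.wv["u4"]])[0])
```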