Academic Paper

PPGloVe: Privacy-Preserving GloVe for Training Word Vectors in the Dark
Document Type
Periodical
Source
IEEE Transactions on Information Forensics and Security (IEEE Trans. Inform. Forensic Secur.). 19:3644-3658, 2024
Subject
Signal Processing and Analysis
Computing and Processing
Communication, Networking and Broadcast Technologies
Training
Computational modeling
Data models
Task analysis
Cryptography
Arithmetic
Privacy
Privacy preservation
Data security
Word representation
Cloud computing
Language
ISSN
1556-6013
1556-6021
Abstract
Words are treated as atomic units in natural language processing tasks, and representing them as vectors is a fundamental step that supports subsequent computation. GloVe is a widely used machine learning model for training word vectors. Training high-quality word vectors with GloVe generally requires a large corpus and substantial computing resources, which makes it difficult for users to train word vectors on their own. A natural choice today is to outsource training to the cloud. However, such cloud-based training services raise serious privacy concerns that must be addressed. In this paper, we design, implement, and evaluate PPGloVe, the first system framework that supports privacy-preserving word vector training using GloVe over the encrypted data of multiple participants. We first decompose the training task and show that previous privacy-preserving machine learning techniques are not practical for it. We then construct a new secure training strategy that tightly integrates lightweight cryptographic techniques with GloVe to support privacy-preserving GloVe training in the cloud. By design, the participants' corpora and the trained word vectors are kept private throughout the training process. Extensive experiments on three datasets of different scales demonstrate that PPGloVe produces word vectors of quality comparable to plaintext training, with practically affordable overhead.