Academic Paper

User-Driven Synthetic Dataset Generation With Quantifiable Differential Privacy
Document Type
Periodical
Source
IEEE Transactions on Services Computing (IEEE Trans. Serv. Comput.). 16(5):3812-3826, Jan 2023
Subject
Computing and Processing
General Topics for Engineers
Data privacy
Synthetic data
Data protection
Privacy
Government
Data models
Distributed databases
Differential privacy
data privacy
data hit rate
$k$-level
synthetic dataset hunting
Language
ISSN
1939-1374
2372-0204
Abstract
Recently, releasing data to a third party for secondary analysis has become a trend in service computing. However, data owners are concerned that doing so may expose individuals’ records, in violation of regulations such as the European Union's General Data Protection Regulation. Differential privacy has been proposed as a possible solution to this problem. The privacy budget $\varepsilon$ in differential privacy is intended for theoretical interpretation, but its practical application in measuring the risk of data disclosure has not been well studied, especially for sampling-based synthetic datasets. Moreover, methods that let data owners release datasets with quantifiable privacy levels and explicit utility have yet to be well developed. In this paper, we present an intuitive approach for defining the privacy level (i.e., data hit rate and $k$-level) and utility level (i.e., basic statistics and a series of data mining models), and we quantify the privacy budget $\varepsilon$ for evaluating the risk and utility of private data. In addition, we propose two user-driven synthetic dataset hunting methods to generate a synthetic dataset with a specified privacy objective, enabling the data owner (e.g., governments and financial companies) to understand the possible privacy risk and thereby release datasets with a confirmed privacy level. To the best of our knowledge, this is the first method that allows data providers to automatically generate synthetic datasets with a quantifiable privacy level for open data services.