Academic Paper

Managing Personal Identifiable Information in Data Lakes
Document Type
Periodical
Source
IEEE Access. 12:32164-32180 2024
Subject
Aerospace
Bioengineering
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
Engineered Materials, Dielectrics and Plasmas
Engineering Profession
Fields, Waves and Electromagnetics
General Topics for Engineers
Geoscience
Nuclear Engineering
Photonics and Electrooptics
Power, Energy and Industry Applications
Robotics and Control Systems
Signal Processing and Analysis
Transportation
Big Data applications
Metadata
Computer architecture
General Data Protection Regulation
Feature extraction
Privacy
Data lakes
Identification of persons
Knowledge discovery
Data collection
Data lake
personal identifiable information metadata
personal data
data discovery
entity linking
data removal
Language
ISSN
2169-3536
Abstract
Privacy is a fundamental human right according to the Universal Declaration of Human Rights of the United Nations. The General Data Protection Regulation (GDPR), which took effect in the European Union in 2018, was a turning point in the management of personal data, specifically personal identifiable information (PII). Although many privacy laws existed before it, the GDPR brought privacy into the regulatory spotlight. Its two most important novelties are the seven basic principles governing the processing of personal data and the large fines defined for violations of the regulation. Many other countries have followed the EU by adopting similar legislation. Personal data management processes in companies, especially in analytical systems and Data Lakes, must comply with these regulatory requirements. For Data Lakes, there are no standard architectures or solutions for discovering personal identifiable information, matching data about the same person from different sources, or removing expired personal data. Existing Data Lake architectures and metadata models must be upgraded to support these functions. The goal of this paper is to study current Data Lake architectures and metadata models and to propose enhancements that improve the collection, discovery, storage, processing, and removal of personal identifiable information. To that end, a new metadata model that supports the handling of personal identifiable information in a Data Lake is proposed.