학술논문

Manipulating Data Lakes Intelligently With Java Annotations
Document Type
Periodical
Source
IEEE Access Access, IEEE. 12:34903-34917 2024
Subject
Aerospace
Bioengineering
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
Engineered Materials, Dielectrics and Plasmas
Engineering Profession
Fields, Waves and Electromagnetics
General Topics for Engineers
Geoscience
Nuclear Engineering
Photonics and Electrooptics
Power, Energy and Industry Applications
Robotics and Control Systems
Signal Processing and Analysis
Transportation
Big Data applications
Java
Annotations
Databases
Software engineering
Artificial intelligence
Structured Query Language
Data lakes
Enterprise resource planning
Object oriented databases
Data lake
data precipitation
data stewards
enterprise-level applications
impedance mismatch
java annotations
JAMDL
object-oriented
ORMapping
software framework
Language
ISSN
2169-3536
Abstract
Data lakes are typically large data repositories where enterprises store data in a variety of data formats. From the perspective of data storage, data can be categorized into structured, semi-structured, and unstructured data. On the one hand, due to the complexity of data forms and transformation procedures, many enterprises simply pour valuable data into data lakes without organizing and managing them effectively. This can create data silos (or data islands) or even data swamps, with the result that some data will be permanently invisible. Although data are integrated into a data lake, they are simply physically stored in the same environment and cannot be correlated with other data to leverage their precious value. On the other hand, processing data from a data lake into a desired format is always a difficult and tedious task that requires experienced programming skills, such as conversion from structured to semi-structured. In this article, a novel software framework called Java Annotation for Manipulating Data Lakes (JAMDL) that can manage heterogeneous data is proposed. This approach uses Java annotations to express the properties of data in metadata (data about data) so that the data can be converted into different formats and managed efficiently in a data lake. Furthermore, this article suggests using artificial intelligence (AI) translation models to generate Data Manipulation Language (DML) operations for data manipulation and uses AI recommendation models to improve the visibility of data when data precipitation occurs.