Academic Paper

Efficient Large Scale NLP Feature Engineering with Apache Spark
Document Type
Conference
Source
2022 IEEE 12th Annual Computing and Communication Workshop and Conference (CCWC), pp. 0274-0280, Jan. 2022
Subject
Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
General Topics for Engineers
Photonics and Electrooptics
Power, Energy and Industry Applications
Robotics and Control Systems
Signal Processing and Analysis
Runtime
Conferences
Natural languages
Pipelines
Cluster computing
Big Data
Feature extraction
natural language processing
NLP
machine learning
distributed systems
big data
Apache Spark
Language
Abstract
Feature engineering is a computationally expensive, time-consuming step in the end-to-end machine learning pipeline. Large amounts of text data are generated by many heterogeneous sources and platforms on the internet, and the compute resources needed to extract valuable features from these big datasets are increasing significantly. In this research, we evaluate the runtime of the RDD and Spark SQL APIs of the Apache Spark framework when extracting text features from the English Wikipedia corpus. Our results show that the Spark SQL API delivers significantly better runtime performance than the RDD API.
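
To make the comparison in the abstract concrete, the following is a minimal sketch of one toy text feature (a whitespace token count) expressed through both APIs. It assumes PySpark is installed and a local Spark session. The feature, the function names `token_count` and `compare_apis`, and the sample inputs are illustrative assumptions and are not taken from the paper, whose actual feature set and code are not shown in this record.

```python
def token_count(text):
    """Toy feature (an assumption, not the paper's): number of whitespace-separated tokens."""
    return len(text.split())


def compare_apis(docs):
    """Compute the same token count with the RDD API and the Spark SQL (DataFrame) API.

    Hypothetical helper for illustration. PySpark is imported lazily so that
    token_count can be used without Spark being installed.
    """
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder.master("local[*]")
        .appName("feature-sketch")
        .getOrCreate()
    )
    try:
        # RDD API: an opaque Python function is applied to every record.
        rdd_counts = spark.sparkContext.parallelize(docs).map(token_count).collect()

        # Spark SQL API: the feature is built from built-in column expressions,
        # which Spark can plan and optimize as a whole.
        df = spark.createDataFrame([(d,) for d in docs], ["text"])
        trimmed = F.trim(F.col("text"))
        n_tokens = (
            F.when(trimmed == "", F.lit(0))  # blank text has no tokens
            .otherwise(F.size(F.split(trimmed, r"\s+")))
            .alias("n_tokens")
        )
        sql_counts = [row["n_tokens"] for row in df.select(n_tokens).collect()]
    finally:
        spark.stop()
    return rdd_counts, sql_counts
```

Calling `compare_apis(["Apache Spark is fast", "text features"])` should return matching counts from both paths for simple space-separated text. Timing each path on a large corpus is the kind of runtime measurement the abstract describes, although this sketch makes no claim about the magnitude of any difference.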