Academic Paper
Efficient Large Scale NLP Feature Engineering with Apache Spark
Document Type
Conference
Author
Source
2022 IEEE 12th Annual Computing and Communication Workshop and Conference (CCWC), pp. 0274-0280, Jan. 2022
Subject
Language
Abstract
Feature engineering is a computationally expensive and time-consuming step in the end-to-end machine learning pipeline. Large amounts of text data are generated by many heterogeneous sources and platforms on the internet, and the compute resources needed to extract valuable features from these large datasets are growing significantly. In this research, we evaluate the runtime of the RDD and Spark SQL APIs of the Apache Spark framework when extracting text features from the English Wikipedia corpus. Our results show that the Spark SQL API delivers significantly better runtime performance than the RDD API.