Academic Paper

Efficient Large Scale NLP Feature Engineering with Apache Spark

Document Type

Conference

Author

Esmaeilzadeh, Armin; Heidari, Maryam; Abdolazimi, Reyhaneh; Hajibabaee, Parisa; Malekzadeh, Masoud

Source

2022 IEEE 12th Annual Computing and Communication Workshop and Conference (CCWC), pp. 0274-0280, Jan. 2022

Subject

Communication, Networking and Broadcast Technologies
Components, Circuits, Devices and Systems
Computing and Processing
General Topics for Engineers
Photonics and Electrooptics
Power, Energy and Industry Applications
Robotics and Control Systems
Signal Processing and Analysis
Runtime
Conferences
Natural languages
Pipelines
Cluster computing
Big Data
Feature extraction
natural language processing
NLP
machine learning
distributed systems
big data
Apache Spark

Language

Abstract

Feature engineering is a computationally expensive and time-consuming step in the end-to-end machine learning pipeline. Large amounts of text data are being generated across many heterogeneous sources and platforms on the internet, and the compute resources needed to extract valuable features from these large datasets are increasing significantly. In this research, we evaluate the runtime of the RDD and Spark SQL APIs of the Apache Spark framework when extracting text features from the English Wikipedia corpus. Our results demonstrate a significant runtime advantage for the Spark SQL API over the RDD API.

