Academic Paper

FeedRef2022: A Named Entity Recognition Dataset for Extracting Indicators of Compromise
Document Type
Conference
Source
2022 IEEE International Conference on Big Data (Big Data), pp. 2578-2584, Dec. 2022
Subject
Communication, Networking and Broadcast Technologies
Computing and Processing
Engineering Profession
Geoscience
Robotics and Control Systems
Signal Processing and Analysis
Measurement
Pain
Annotations
Big Data
Predictive models
Cyber threat intelligence
Behavioral sciences
Named entity recognition
Information extraction
Transfer learning
Language
Abstract
With the increasing use of the internet, cyber threats and malicious activities are becoming ubiquitous. To avoid being caught off guard by attacks, gathering sufficient information about different threats is crucial. According to the Pyramid of Pain, Indicators of Compromise (IOCs) are the simplest artifacts to observe, and they help cybersecurity professionals design corresponding precautions. Cyber Threat Intelligence (CTI) is data that describes current threat events, threat actors’ targets, and attack behaviors; hence, collecting and analyzing CTI in advance can help defend against cyberattacks. In this paper, we construct a named entity recognition (NER) dataset, FeedRef2022, by collecting 1,854 threat intelligence reports and labeling them with our annotation method. Additionally, we fine-tune four pre-trained language models and compare the efficiency of each model. Among the four models, we find that the fine-tuned ELECTRA model can correctly extract new IOCs, and that the FeedRef2022 dataset can be used to train NER models for detecting IOCs.