Academic Paper

Machine Learning & Concept Drift based Approach for Malicious Website Detection
Document Type
Conference
Source
2020 International Conference on COMmunication Systems & NETworkS (COMSNETS), pp. 582-585, Jan 2020
Subject
Communication, Networking and Broadcast Technologies
Feature extraction
Uniform resource locators
Forestry
Supervised learning
Malware
Training data
Machine learning
URL Feature Extraction
Malicious Website Detection
Concept Drifts
Feature Vectors
Gradient Boosted Trees
Random Forest
Feedforward Neural Networks
Language
ISSN
2155-2509
Abstract
The rapid increase in the number of available cyber attack vectors and in the frequency of cyber attacks necessitates robust cybersecurity systems. Malicious websites are a significant threat to cybersecurity. Attackers use them for illegal activities such as disrupting systems by implanting malware, gaining unauthorized access to systems, or illegally collecting personal information. We propose and implement an approach for classifying websites as malicious or benign given their Uniform Resource Locator (URL) as input. Using the URL provided by the user, we collect lexical, host-based, and content-based features for the website. These features are fed into a supervised machine learning algorithm, which classifies the URL as malicious or benign. The models are trained on a dataset consisting of multiple malicious and benign URLs. We have evaluated the classification accuracy of Random Forest, Gradient Boosted Decision Tree, and Deep Neural Network classifiers. One loophole in the use of machine learning for detection is that the same training data is available to attackers. Attackers can exploit this data to alter the features associated with malicious URLs so that the supervised learning algorithms classify them as benign. Further, owing to the dynamic nature of malicious websites, we also propose a paradigm for detecting and countering these manually induced concept drifts.
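The abstract names lexical features as one input to the classifiers but does not list them. The following is a minimal sketch of what lexical feature extraction from a URL string might look like. The function name `lexical_features` and the specific features chosen here are assumptions for illustration, not the authors' feature set.

```python
import re
from urllib.parse import urlparse


def lexical_features(url: str) -> dict:
    """Compute a few illustrative lexical features from a URL string.

    The feature set is hypothetical; the paper's actual features
    are not given in the abstract.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""  # hostname is None for malformed URLs
    return {
        "url_length": len(url),
        "host_length": len(host),
        "path_length": len(parsed.path),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_hyphens": url.count("-"),
        "has_at_symbol": "@" in url,
        # A raw IPv4 address in place of a domain name (format check only).
        "host_is_ipv4": bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host)),
        "uses_https": parsed.scheme == "https",
    }


features = lexical_features("http://192.168.0.1/login-update/index.php?id=42")
```

A dictionary like `features` could be converted into a numeric vector and combined with host-based and content-based features before being passed to a supervised classifier such as a random forest.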