Phishing Detection with Traditional Machine Learning and Deep Learning Algorithms

10 Dec 2023

Reading time ~2 minutes

Introduction

Phishing remains one of the most pervasive methods attackers use to steal sensitive information. This project tackles phishing detection by training and comparing six models on the same dataset: four traditional machine learning algorithms (Decision Tree, Random Forest, Logistic Regression, Naive Bayes) and two deep learning models (CNN-LSTM and Multi-Layer Perceptron).

Dataset

The dataset used in this project, "Dataset of Malicious and Benign Webpages", was sourced from Kaggle. It is used to train and evaluate every model on a binary classification task: labeling each webpage as either malicious or benign.

Traditional Machine Learning Algorithms

Decision Tree

Decision Trees (DT) classify webpages by applying a sequence of learned if/else rules to their features. Because a fitted tree can be inspected directly, it shows which features and thresholds contribute to flagging a page as a potential threat.
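As a minimal sketch of how such rules can be read off a fitted tree, the snippet below trains scikit-learn's DecisionTreeClassifier on a handful of made-up rows. The feature names (url_length, num_dots, uses_https) and all values are hypothetical and are not taken from the project's dataset.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features per page: [url_length, num_dots, uses_https]
feature_names = ["url_length", "num_dots", "uses_https"]
X = [
    [22, 1, 1],
    [30, 2, 1],
    [18, 1, 1],
    [95, 6, 0],
    [120, 8, 0],
    [80, 5, 0],
]
y = [0, 0, 0, 1, 1, 1]  # 0 = benign, 1 = malicious (toy labels)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# Render the learned if/else rules as plain text
rules = export_text(tree, feature_names=feature_names)
print(rules)

# Classify an unseen (also made-up) page
prediction = tree.predict([[110, 7, 0]])[0]
```

On real data the printed rules can grow long, so limiting max_depth is one way to keep them readable.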

Random Forest

Random Forest is an ensemble learning technique that trains many decision trees on random subsets of the data and features, then combines their predictions. Averaging over many trees reduces the overfitting a single tree is prone to, which generally improves accuracy and makes predictions more stable.
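To illustrate the combining step, the sketch below fits a small scikit-learn forest and checks that its predicted probabilities equal the average over its individual trees. The data is synthetic (make_classification) and only stands in for the project's dataset; the number of trees is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, not the Kaggle webpage dataset
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X, y)

# Each fitted tree outputs class probabilities; the forest averages them
per_tree = np.stack([t.predict_proba(X[:5]) for t in forest.estimators_])
averaged = per_tree.mean(axis=0)
forest_proba = forest.predict_proba(X[:5])

# True when the manual average matches the forest's own output
print(np.allclose(averaged, forest_proba))
```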

Logistic Regression

Logistic Regression is a classic linear classifier. It passes a weighted sum of the input features through the logistic (sigmoid) function to produce a probability, which is then thresholded into a binary outcome: legitimate or phishing. Its simplicity and efficiency make it a valuable addition to our detection arsenal.
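The mapping from features to a binary outcome can be sketched in a few lines of plain Python. The two features, the weights, and the bias below are hand-picked for illustration; in practice they would be learned from the training data.

```python
import math

def predict_proba(features, weights, bias):
    """Return P(phishing | features) via the logistic (sigmoid) function."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def classify(features, weights, bias, threshold=0.5):
    """Map the probability to a binary label: 1 = phishing, 0 = legitimate."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0

# Hypothetical parameters for two illustrative features:
# [number of dots in the URL, 1 if the page uses HTTPS else 0]
weights = [0.8, -1.5]
bias = -2.0

suspicious = [6, 0]  # many dots, no HTTPS
ordinary = [1, 1]    # few dots, HTTPS

p_suspicious = predict_proba(suspicious, weights, bias)
p_ordinary = predict_proba(ordinary, weights, bias)
print(classify(suspicious, weights, bias), classify(ordinary, weights, bias))
```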

Naive Bayes

Naive Bayes applies Bayes’ theorem under the simplifying assumption that features are conditionally independent given the class. It trains quickly and scales well to large datasets, which makes it an effective choice for identifying phishing attempts.
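A minimal sketch of probabilistic classification with Naive Bayes, assuming the input is page text turned into word counts. The toy snippets and labels are invented, and the choice of scikit-learn's MultinomialNB is an assumption about the variant, which the write-up does not specify.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy page snippets, invented for illustration only
texts = [
    "login verify account update",
    "verify password account secure",
    "free prize claim account",
    "news sports weather today",
    "recipes cooking dinner ideas",
    "weather forecast today local",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = malicious, 0 = benign

# Word counts feed a multinomial Naive Bayes model with Laplace smoothing
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(texts, labels)

new_pages = ["verify your account", "weather today"]
predictions = model.predict(new_pages)
probabilities = model.predict_proba(new_pages)  # columns: P(benign), P(malicious)
print(predictions)
```

Words the model never saw during training (such as "your" here) are simply ignored by the vectorizer.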

Deep Learning Algorithms

Convolutional Neural Network - Long Short-Term Memory (CNN-LSTM)

Combining a Convolutional Neural Network (CNN) with a Long Short-Term Memory (LSTM) network lets our model capture both local patterns (through the convolutional layers) and longer-range sequential dependencies (through the LSTM) in webpage data. This hybrid architecture tied for the highest accuracy in our tests.
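To make the data flow through such a hybrid concrete, here is a NumPy-only forward pass: character embedding, 1D convolution, max pooling, an LSTM, and a sigmoid output. It is a sketch under several assumptions: the input is taken to be a URL string, all layer sizes are arbitrary, and the weights are random and untrained, so the score is meaningless. A real model would be built and trained in a deep learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EMBED, FILTERS, KERNEL, HIDDEN = 128, 8, 16, 3, 12  # arbitrary sizes

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode_url(url, max_len=32):
    """Map characters to integer ids, truncated or zero-padded to max_len."""
    ids = [min(ord(c), VOCAB - 1) for c in url[:max_len]]
    return np.array(ids + [0] * (max_len - len(ids)))

def conv1d_relu(x, kernels, bias):
    """Valid 1D convolution plus ReLU. x: (T, D), kernels: (K, D, F)."""
    k = kernels.shape[0]
    windows = np.stack([x[t:t + k] for t in range(x.shape[0] - k + 1)])
    return np.maximum(np.einsum("tkd,kdf->tf", windows, kernels) + bias, 0.0)

def max_pool1d(x, size):
    """Non-overlapping max pooling along the time axis."""
    steps = x.shape[0] // size
    return x[:steps * size].reshape(steps, size, -1).max(axis=1)

def lstm_last_state(x, W, U, b):
    """Run one LSTM layer over x (T, F) and return the final hidden state."""
    h = np.zeros(U.shape[0])
    c = np.zeros(U.shape[0])
    for x_t in x:
        i, f, o, g = np.split(x_t @ W + h @ U + b, 4)  # gate pre-activations
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # update cell state
        h = sigmoid(o) * np.tanh(c)                    # update hidden state
    return h

# Random, untrained parameters (illustration only)
params = {
    "embedding": rng.normal(0, 0.1, (VOCAB, EMBED)),
    "conv_k": rng.normal(0, 0.1, (KERNEL, EMBED, FILTERS)),
    "conv_b": np.zeros(FILTERS),
    "lstm_W": rng.normal(0, 0.1, (FILTERS, 4 * HIDDEN)),
    "lstm_U": rng.normal(0, 0.1, (HIDDEN, 4 * HIDDEN)),
    "lstm_b": np.zeros(4 * HIDDEN),
    "out_w": rng.normal(0, 0.1, HIDDEN),
    "out_b": 0.0,
}

def cnn_lstm_forward(url, p=params):
    embedded = p["embedding"][encode_url(url)]                  # (32, 8)
    features = conv1d_relu(embedded, p["conv_k"], p["conv_b"])  # (30, 16)
    pooled = max_pool1d(features, size=2)                       # (15, 16)
    h = lstm_last_state(pooled, p["lstm_W"], p["lstm_U"], p["lstm_b"])
    return float(sigmoid(h @ p["out_w"] + p["out_b"]))

score = cnn_lstm_forward("http://example.com/login")
print(score)
```

The convolution summarizes short character windows, and the LSTM then reads those summaries in order, which is the division of labor the hybrid relies on.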

Multi-Layer Perceptron (MLP)

A Multi-Layer Perceptron (MLP) is a fully connected feed-forward neural network and a fundamental building block of deep learning. Its stacked hidden layers learn increasingly abstract combinations of the input features, which helps it pick up the non-linear patterns that characterize phishing pages.
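As a rough sketch of the layered structure, the snippet below trains scikit-learn's MLPClassifier with two hidden layers and inspects the shape of each weight matrix. The data is synthetic and the layer sizes (32 and 16) are arbitrary assumptions, not the project's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for numeric webpage features
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale inputs, then fit an MLP with two hidden layers
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0),
)
model.fit(X_train, y_train)

# One weight matrix per layer: input->32, 32->16, 16->output
layer_shapes = [w.shape for w in model[-1].coefs_]
test_accuracy = model.score(X_test, y_test)
print(layer_shapes, test_accuracy)
```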

Results

Algorithm	Accuracy
Decision Tree	99.1%
Random Forest	99.5%
Logistic Regression	97.3%
Naive Bayes	97%
CNN-LSTM	99.5%
MLP	98.5%
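A comparison like the one in the table can be produced by fitting every model on the same training split and scoring each on the same held-out split. The sketch below does this with scikit-learn on synthetic data, so its accuracies will not match the table. CNN-LSTM is left out because it needs a deep learning framework, and GaussianNB is an assumed Naive Bayes variant.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the Kaggle webpage dataset
X, y = make_classification(n_samples=600, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "MLP": MLPClassifier(max_iter=1000, random_state=0),
}

# Fit every model on the same training split and score on the same test split
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))

# Print a tab-separated table, best accuracy first
for name, acc in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}\t{acc:.1%}")
```

Keeping the split fixed across models is what makes the resulting numbers directly comparable.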

Conclusion

In conclusion, we trained and tested each algorithm individually on the same dataset of malicious and benign webpages, which allows a direct comparison of Decision Tree, Random Forest, Logistic Regression, Naive Bayes, CNN-LSTM, and Multi-Layer Perceptron for phishing detection. Random Forest and CNN-LSTM achieved the highest accuracy (99.5%), followed by Decision Tree (99.1%) and MLP (98.5%), while Logistic Regression (97.3%) and Naive Bayes (97%) trailed slightly.

For further details on the implementation and results, refer to the complete project documentation available here >>