Phishing Website Detection Using a Machine Learning Classification Approach

Authors

  • Ibnu Arifin, Institut Informatika dan Bisnis Darmajaya
  • Chairani, Institut Informatika dan Bisnis Darmajaya

DOI:

https://doi.org/10.35314/yja1d830

Keywords:

Phishing, Machine Learning, Random Forest, Phishing Detection, Website Classification

Abstract

Phishing is an increasingly prevalent form of cybercrime, with millions of attacks recorded annually. This study develops a phishing website detection model using a machine learning classification pipeline comprising data preprocessing, feature selection, and model validation. The dataset, obtained from the UCI Machine Learning Repository, consists of 235,795 URLs with a relatively balanced class distribution: 100,945 phishing and 134,850 non-phishing. After data cleaning and feature selection, 21 features were retained, all of them screened to be safe from potential data leakage. Two algorithms, decision tree and random forest, were evaluated using 10-fold cross-validation, which was adopted to minimize data leakage and ensure reliable model evaluation. Random forest achieved an average accuracy of 97.78%, slightly below the decision tree's 98.02%. However, random forest discriminated between the two classes better, with a ROC-AUC of 99.73% and a PR-AUC of 99.78%, compared with 99.49% and 99.40% for the decision tree. A Wilcoxon test further confirmed that the performance difference between the two algorithms is statistically significant. Overall, although the decision tree demonstrates strong classification performance, random forest proves more consistent and reliable in detecting phishing websites, making it the preferable choice in a cybersecurity context.
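
The abstract does not say which Wilcoxon variant was applied or to which scores. A minimal sketch, assuming a two-sided Wilcoxon signed-rank test on paired per-fold scores from the same 10 folds, could look like the following. The function name `wilcoxon_signed_rank` and the per-fold values are hypothetical placeholders chosen for illustration; they are not results from the study.

```python
from itertools import product


def wilcoxon_signed_rank(a, b):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (W, p), where W = min(W+, W-). Zero differences are dropped
    and tied absolute differences receive average ranks.
    """
    # Round to suppress floating-point noise so that ties are detected.
    diffs = [round(x - y, 12) for x, y in zip(a, b)]
    diffs = [d for d in diffs if d != 0]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks within tie groups.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    # Exact null distribution: every rank is positive or negative with
    # probability 1/2, so enumerate all 2**n sign assignments.
    total = sum(ranks)
    at_least_as_extreme = 0
    for signs in product((0, 1), repeat=n):
        s = sum(r for r, keep in zip(ranks, signs) if keep)
        if min(s, total - s) <= w:
            at_least_as_extreme += 1
    return w, at_least_as_extreme / 2 ** n


# Hypothetical per-fold ROC-AUC scores (placeholders, NOT the paper's data).
rf_auc = [0.91, 0.93, 0.92, 0.94, 0.90, 0.95, 0.92, 0.93, 0.91, 0.94]
dt_auc = [0.89, 0.90, 0.91, 0.90, 0.88, 0.92, 0.90, 0.89, 0.90, 0.91]

w_stat, p_value = wilcoxon_signed_rank(rf_auc, dt_auc)
print(f"W = {w_stat}, exact two-sided p = {p_value:.6f}")
```

With ten folds, the smallest two-sided exact p-value this test can produce is 2/1024 (about 0.002), reached only when one model scores higher in every fold. Exhaustive enumeration is practical here because 10 pairs give only 1,024 sign assignments; for larger samples a library routine or normal approximation would be the usual choice.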

Published

16-09-2025

Section

Articles

How to Cite

Phishing Website Detection Using a Machine Learning Classification Approach. (2025). INOVTEK Polbeng - Seri Informatika, 10(3), 1498-1508. https://doi.org/10.35314/yja1d830