Advanced International Journal for Research

E-ISSN: 3048-7641     Impact Factor: 9.11

A Widely Indexed Open Access Peer Reviewed Multidisciplinary Bi-monthly Scholarly International Journal

Real-Time Phishing Website Detection Using Lexical URL Features with Weighted Soft Voting Ensemble

Author(s) Prof. Dr. Thiyagarajan A, Ms. Kaniska Devi B, Ms. Manisha T, Ms. Harisha S
Country India
Abstract Phishing attacks remain one of the most financially damaging threats in modern cybersecurity. Conventional blacklist-based defences prove insufficient against zero-day phishing URLs not yet logged by threat intelligence services. This work investigates whether a machine learning framework operating exclusively on lexical features extracted from raw URL strings can deliver high-accuracy phishing detection without accessing, downloading, or rendering the target webpage. Seventeen lexical features are extracted and organized across five conceptual groups: length and structure, special characters, Shannon entropy, typosquatting indicators, and suspicious keyword patterns. Two ensemble classifiers, Random Forest (RF) and XGBoost, are individually trained on two benchmark datasets, and their outputs are fused through a Weighted Soft Voting algorithm that assigns calibrated, confidence-based weights to each model. Experiments on the UCI Phishing Websites Dataset (11,055 instances) and the Kaggle Phishing URL Detection Dataset yield training-phase accuracies of 99.34%, 99.51%, and 99.40% for RF, XGBoost, and the ensemble, respectively. The brand edit distance feature, a novel typosquatting detection measure, proves the single most discriminative lexical feature. A graded three-tier risk scoring mechanism (Low / Medium / High) provides actionable outputs beyond binary classification, and sub-millisecond inference confirms practical suitability for real-time browser or network gateway deployment.
Keywords phishing detection; lexical URL features; Random Forest; XGBoost; Weighted Soft Voting; typosquatting; ensemble learning; real-time classification; cybersecurity; machine learning.
Field Computer Applications
Published In Volume 7, Issue 3, May-June 2026
Published On 2026-05-15
DOI https://doi.org/10.63363/aijfr.2026.v07i03.5511
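
The abstract names several building blocks without showing how they fit together: Shannon entropy, a brand edit distance for typosquatting, weighted soft voting, and a three-tier risk score. The following is a minimal Python sketch of those ideas and is not the authors' implementation. The brand list, the function names, the equal model weights, and the 0.3 / 0.7 risk thresholds are all illustrative assumptions, since the abstract does not specify them.

```python
import math
from collections import Counter
from urllib.parse import urlparse

# Hypothetical brand list; the paper's actual list is not given in the abstract.
BRANDS = ("paypal", "google", "amazon")


def shannon_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of the character distribution."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def brand_edit_distance(url: str) -> int:
    """Smallest edit distance between any hostname label and any known brand."""
    host = urlparse(url).hostname or url
    labels = [label for label in host.split(".") if label]
    return min(edit_distance(label, brand) for label in labels for brand in BRANDS)


def weighted_soft_vote(p_rf: float, p_xgb: float,
                       w_rf: float = 0.5, w_xgb: float = 0.5) -> float:
    """Weighted average of two phishing probabilities (weights are assumed)."""
    return (w_rf * p_rf + w_xgb * p_xgb) / (w_rf + w_xgb)


def risk_tier(p_phish: float, low: float = 0.3, high: float = 0.7) -> str:
    """Map a fused probability to Low / Medium / High (thresholds are assumed)."""
    if p_phish < low:
        return "Low"
    if p_phish < high:
        return "Medium"
    return "High"
```

For example, `brand_edit_distance("http://paypa1.com/login")` compares the hostname labels `paypa1` and `com` against each brand and returns the smallest distance found. A small non-zero value suggests a near-miss spelling of a known brand. In a full pipeline, the probabilities passed to `weighted_soft_vote` would come from the trained RF and XGBoost models.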
