Advanced International Journal for Research
E-ISSN: 3048-7641 • Impact Factor: 9.11
A Widely Indexed Open Access Peer Reviewed Multidisciplinary Bi-monthly Scholarly International Journal
Real-Time Phishing Website Detection Using Lexical URL Features with Weighted Soft Voting Ensemble
| Author(s) | Prof. Dr. Thiyagarajan A, Ms. Kaniska Devi B, Ms. Manisha T, Ms. Harisha S |
|---|---|
| Country | India |
| Abstract | Phishing attacks remain one of the most financially damaging threats in modern cybersecurity. Conventional blacklist-based defences are insufficient against zero-day phishing URLs that threat intelligence services have not yet logged. This work investigates whether a machine learning framework operating exclusively on lexical features extracted from raw URL strings can deliver high-accuracy phishing detection without accessing, downloading, or rendering the target webpage. Seventeen lexical features are extracted and organized into five conceptual groups: length and structure, special characters, Shannon entropy, typosquatting indicators, and suspicious keyword patterns. Two ensemble classifiers, Random Forest (RF) and XGBoost, are trained individually on two benchmark datasets, and their outputs are fused by a Weighted Soft Voting algorithm that assigns calibrated, confidence-based weights to each model. Experiments on the UCI Phishing Websites Dataset (11,055 instances) and the Kaggle Phishing URL Detection Dataset yield training-phase accuracies of 99.34%, 99.51%, and 99.40% for RF, XGBoost, and the ensemble, respectively. The brand edit distance feature, a novel typosquatting detection measure, proves to be the single most discriminative lexical feature. A graded three-tier risk scoring mechanism (Low / Medium / High) provides actionable outputs beyond binary classification, and sub-millisecond inference confirms practical suitability for real-time browser or network gateway deployment. |
| Keywords | phishing detection; lexical URL features; Random Forest; XGBoost; Weighted Soft Voting; typosquatting; ensemble learning; real-time classification; cybersecurity; machine learning. |
| Field | Computer Applications |
| Published In | Volume 7, Issue 3, May-June 2026 |
| Published On | 2026-05-15 |
| DOI | https://doi.org/10.63363/aijfr.2026.v07i03.5511 |
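
The abstract names several building blocks without giving their definitions: Shannon entropy, a brand edit distance, weighted soft voting, and a three-tier risk score. The following is a minimal Python sketch of how such pieces could fit together, not the authors' implementation. The brand list, the function names, the example weights, and the Low/Medium/High thresholds are all hypothetical, and the weights are passed in directly rather than calibrated from model confidence as the paper describes.

```python
import math
from collections import Counter

# Hypothetical brand list; the abstract does not say which brands are used.
BRANDS = ("paypal", "google", "amazon")


def shannon_entropy(s: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def brand_edit_distance(host: str) -> int:
    """Smallest edit distance between any hostname label and a known brand."""
    labels = host.lower().split(".")
    return min(edit_distance(label, b) for label in labels for b in BRANDS)


def weighted_soft_vote(probs, weights) -> float:
    """Weighted average of per-model phishing probabilities."""
    return sum(p * w for p, w in zip(probs, weights)) / sum(weights)


def risk_tier(p: float, low: float = 0.3, high: float = 0.7) -> str:
    """Map a fused probability to a tier; the thresholds are assumptions."""
    if p < low:
        return "Low"
    if p < high:
        return "Medium"
    return "High"
```

For example, `brand_edit_distance("paypa1.com")` is small because the label `paypa1` differs from `paypal` by a single substitution, which is the kind of signal a typosquatting feature is meant to capture. A fused score such as `weighted_soft_vote([0.9, 0.7], [0.4, 0.6])` can then be passed to `risk_tier` to obtain a graded output instead of a binary label.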
CrossRef DOI is assigned to each research paper published in our journal.
The AIJFR DOI prefix is 10.63363/aijfr.
All research papers published on this website are licensed under Creative Commons Attribution-ShareAlike 4.0 International License, and all rights belong to their respective authors/researchers.