Introduction
BUMATARA is a phishing detection system built for real-world usability. It classifies whether a submitted URL or email address is malicious, combining classic machine learning with large language models (LLMs) for content-level analysis. Designed for both performance and explainability, the system uses handcrafted features, tree-based classifiers, and LLM reasoning to deliver judgments with human-readable insight.
Built during a hackathon with my team, this project is now publicly deployed and fully usable via bumatara.com
.
Dataset & Features
Multi-level detection is achieved by using separate feature pipelines for URLs, email addresses, and website content:
- URL Classification
Uses 23 handcrafted features extracted from user-submitted URLs, such as:- Domain structure & subdomain length
- Use of HTTPS or suspicious port numbers
- Character entropy, special symbol ratios, redirects
Dataset Source β
- Email Address Classification
Built on the CEAS dataset, the system derives:- Domain reputation
- TLD analysis (e.g., uncommon extensions)
- Numeric character ratio, length patterns
Dataset Source β
- Web Content Analysis
For URL inputs, the system scrapes:- Meta tags, title content, inline forms, scripts
- Then uses a DeepSeek v2 LLM to evaluate if the page mimics legitimate services or contains suspicious intent.
Model & Architecture
The backend architecture blends tree-based models for speed with LLM inference for depth.
- Classification Layer
- URLs: XGBoost model trained on 23 engineered features.
- Emails: CatBoost model fine-tuned on CEAS + custom features.
- Model performance:
- π 96% accuracy for phishing URLs
- βοΈ 81% accuracy for phishing email addresses
- LLM Analysis Layer
- Web scraping captures HTML-based content.
- Uses OpenRouter API to query DeepSeek v2 LLM.
- LLM returns a natural-language assessment like:
βThis page mimics a PayPal login and uses IP-based redirection, commonly found in phishing pages.β
System Workflow
- User submits either a URL or email address.
- The system detects the input type and routes it to:
- XGBoost (URL) or CatBoost (email) model.
- Extracted features are classified.
- If the input is a URL:
- Web content is scraped and sent to the LLM for further analysis.
- The final verdict and explanation are displayed on the site and stored for audit/logging.
Deployment & Tools
- Backend: Python, Flask REST API
- Frontend: Laravel (fully deployed)
- Database: MySQL for logging input history and results
- LLM Access: OpenRouter API using DeepSeek v2
- Deployment: Hosted on AWS EC2 (Ubuntu)
Explore the live system here:
π BUMATARA - Phishing Checker
Result
- β 96% accuracy on malicious URL detection using XGBoost
- β 81% accuracy for email classification using CatBoost
- β LLM layer adds contextual reasoning and boosts explainability
- π§ Combines fast inference with deep semantic review for better trustworthiness
Conclusion
BUMATARA demonstrates how hybrid AI systems can combine traditional ML pipelines with modern LLMs to solve real-world cybersecurity problems. As the AI engineer on this project, I designed and deployed the models then integrated the OpenRouter LLM pipeline for content analysis.
This system shows that phishing detection doesnβt have to be a black box. By blending structure-based classification with explainable AI, BUMATARA balances precision with usability β a practical solution for individual users and organizations alike.
π Future work includes expanding the LLM prompt library, refining the email classifier with newer datasets, and rolling out multi-language phishing detection.