• Home
  • About
    • Malinda Ratnaduhita photo

      Malinda Ratnaduhita

      AI Engineer | Data Scientist

    • Learn More
    • LinkedIn
    • Instagram
    • Github
  • Posts
    • All Posts
    • All Tags
  • Projects

BUMATARA - Phishing URL Checker

30 Jun 2025

Reading time ~3 minutes

Introduction

BUMATARA is a phishing detection system built for real-world usability. It classifies whether a submitted URL or email address is malicious, combining classic machine learning with large language models (LLMs) for content-level analysis. Designed for both performance and explainability, the system uses handcrafted features, tree-based classifiers, and LLM reasoning to deliver judgments with human-readable insight.

Built during a hackathon with my team, this project is now publicly deployed and fully usable via bumatara.com.

Dataset & Features

Multi-level detection is achieved by using separate feature pipelines for URLs, email addresses, and website content:

  • URL Classification
    Uses 23 handcrafted features extracted from user-submitted URLs, such as:
    • Domain structure & subdomain length
    • Use of HTTPS or suspicious port numbers
    • Character entropy, special symbol ratios, redirects
      Dataset Source β†’
  • Email Address Classification
    Built on the CEAS dataset, the system derives:
    • Domain reputation
    • TLD analysis (e.g., uncommon extensions)
    • Numeric character ratio, length patterns
      Dataset Source β†’
  • Web Content Analysis
    For URL inputs, the system scrapes:
    • Meta tags, title content, inline forms, scripts
    • Then uses a DeepSeek v2 LLM to evaluate if the page mimics legitimate services or contains suspicious intent.

Model & Architecture

The backend architecture blends tree-based models for speed with LLM inference for depth.

  1. Classification Layer
    • URLs: XGBoost model trained on 23 engineered features.
    • Emails: CatBoost model fine-tuned on CEAS + custom features.
    • Model performance:
      • πŸ” 96% accuracy for phishing URLs
      • βœ‰οΈ 81% accuracy for phishing email addresses
  2. LLM Analysis Layer
    • Web scraping captures HTML-based content.
    • Uses OpenRouter API to query DeepSeek v2 LLM.
    • LLM returns a natural-language assessment like:
      β€œThis page mimics a PayPal login and uses IP-based redirection, commonly found in phishing pages.”

System Workflow

  1. User submits either a URL or email address.
  2. The system detects the input type and routes it to:
    • XGBoost (URL) or CatBoost (email) model.
  3. Extracted features are classified.
  4. If the input is a URL:
    • Web content is scraped and sent to the LLM for further analysis.
  5. The final verdict and explanation are displayed on the site and stored for audit/logging.

Deployment & Tools

  • Backend: Python, Flask REST API
  • Frontend: Laravel (fully deployed)
  • Database: MySQL for logging input history and results
  • LLM Access: OpenRouter API using DeepSeek v2
  • Deployment: Hosted on AWS EC2 (Ubuntu)

Explore the live system here:
πŸ”— BUMATARA - Phishing Checker

Result

  • βœ… 96% accuracy on malicious URL detection using XGBoost
  • βœ… 81% accuracy for email classification using CatBoost
  • βœ… LLM layer adds contextual reasoning and boosts explainability
  • 🧠 Combines fast inference with deep semantic review for better trustworthiness

Conclusion

BUMATARA demonstrates how hybrid AI systems can combine traditional ML pipelines with modern LLMs to solve real-world cybersecurity problems. As the AI engineer on this project, I designed and deployed the models then integrated the OpenRouter LLM pipeline for content analysis.

This system shows that phishing detection doesn’t have to be a black box. By blending structure-based classification with explainable AI, BUMATARA balances precision with usability β€” a practical solution for individual users and organizations alike.

πŸš€ Future work includes expanding the LLM prompt library, refining the email classifier with newer datasets, and rolling out multi-language phishing detection.

See Project β†’
My Tasks on GitHub β†’



Phishing DetectionXGBoostLLMCatBoostFlaskCybersecurityAI EngineeringOpenRouter Share Tweet +1