Sentiment Analysis Using Transformers

26 Jul 2024

Reading time ~2 minutes

Introduction

This project showcases a complete sentiment analysis pipeline built from scratch using Python and Hugging Face Transformers. The target data consists of tweets about the 2024 Indonesian presidential candidates, each labeled ‘Positive’ or ‘Negative’. The project compares two modeling approaches:

  • A traditional baseline using Naive Bayes
  • A modern deep learning solution using fine-tuned BERT

Working solo, I handled every stage: data cleaning, class balancing, model training, evaluation, and benchmarking. The goal was to explore the contrast between classic machine learning and transformer-based NLP in a real-world political sentiment context.

Dataset

The dataset contains labeled tweets mentioning three candidates:

  • Anies Baswedan
  • Prabowo Subianto
  • Ganjar Pranowo

Each record includes columns such as Text, Tweet Count, and label (positive/negative). The per-candidate files were combined, preprocessed, and split into training and test sets.
Dataset >>
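
A minimal sketch of how the per-candidate files might be combined and split; the filenames below are hypothetical, and the label column name follows the description above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical per-candidate files; the actual filenames may differ.
files = ["anies_tweets.csv", "prabowo_tweets.csv", "ganjar_tweets.csv"]
df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

# Stratified split preserves the positive/negative ratio in both sets.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)
```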

Data Preprocessing

Robust preprocessing was critical given the noisy nature of real-world tweets. The steps, consolidated in the code sketch after this list, included:

  • Text Cleaning:
    • Removed emojis, non-ASCII characters, mentions (@), hashtags, and URLs
    • Lowercased text, stripped punctuation
  • Handling Missing Data:
    • Rows with null values were dropped
  • Class Balancing:
    • Used RandomOverSampler to address imbalance between sentiment classes
  • Tokenization:
    • For Naive Bayes: CountVectorizer + TfidfTransformer
    • For BERT: Hugging Face’s BERT tokenizer with attention masks, padding, and truncation
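
Taken together, those steps might look like the sketch below, continuing from the combined df above. The column names Text and label follow the dataset description; the exact regexes and max_length are assumptions:

```python
import re
from imblearn.over_sampling import RandomOverSampler
from transformers import BertTokenizer

def clean_text(text: str) -> str:
    """Strip URLs, mentions, hashtags, non-ASCII characters, and punctuation."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)    # URLs
    text = re.sub(r"[@#]\w+", " ", text)             # mentions and hashtags
    text = text.encode("ascii", "ignore").decode()   # emojis / non-ASCII
    text = re.sub(r"[^a-z\s]", " ", text.lower())    # lowercase, drop punctuation
    return re.sub(r"\s+", " ", text).strip()

df = df.dropna(subset=["Text", "label"])
df["Text"] = df["Text"].astype(str).map(clean_text)

# Oversample the minority class so both sentiments are equally represented.
ros = RandomOverSampler(random_state=42)
X, y = ros.fit_resample(df[["Text"]], df["label"])

# BERT tokenization with padding, truncation, and attention masks.
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
encodings = tokenizer(
    X["Text"].tolist(), padding=True, truncation=True,
    max_length=128, return_tensors="tf",
)
```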

Model Architecture

Baseline: Naive Bayes

  • Applied CountVectorizer + TfidfTransformer for feature extraction
  • Used MultinomialNB for sentiment classification
  • Fast, interpretable, and a solid performance baseline
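
A minimal scikit-learn sketch of this baseline; train_texts, train_labels, and test_texts are placeholder names:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Raw counts -> TF-IDF weighting -> multinomial Naive Bayes classifier.
nb_model = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

nb_model.fit(train_texts, train_labels)
nb_predictions = nb_model.predict(test_texts)
```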

Fine-Tuned BERT (Transformers)

  • Leveraged Hugging Face’s bert-base-multilingual-cased model
  • Built a custom Keras model with:
    • BERT encoding layer (TFBertModel)
    • Dense layers for classification
    • Softmax activation
  • Trained using tokenized inputs with attention masks
  • Evaluated using accuracy, precision, recall, and F1-score
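
A sketch of such a model in the Keras functional API, assuming a 128-token sequence length; the hidden size, learning rate, and head layout are illustrative guesses rather than the project's exact architecture:

```python
import tensorflow as tf
from transformers import TFBertModel

MAX_LEN = 128  # assumed maximum sequence length

input_ids = tf.keras.layers.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.layers.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")

# Pooled [CLS] representation from BERT feeds a small classification head.
bert = TFBertModel.from_pretrained("bert-base-multilingual-cased")
pooled = bert(input_ids, attention_mask=attention_mask).pooler_output

hidden = tf.keras.layers.Dense(64, activation="relu")(pooled)
outputs = tf.keras.layers.Dense(2, activation="softmax")(hidden)

model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=outputs)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```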

Result

Naive Bayes:

  • Accuracy: ~85%
  • Performed well on both sentiment classes with balanced precision and recall

Fine-Tuned BERT:

  • Accuracy: 93%
  • Significantly outperformed Naive Bayes on all metrics
  • High-quality sentiment classification even with informal or sarcastic tweets
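
For reference, a sketch of how such per-class metrics might be computed with scikit-learn; test_encodings and y_test are placeholder names continuing from the sketches above:

```python
import numpy as np
from sklearn.metrics import classification_report

# Turn softmax probabilities into hard class predictions.
probs = model.predict(
    {"input_ids": test_encodings["input_ids"],
     "attention_mask": test_encodings["attention_mask"]}
)
y_pred = np.argmax(probs, axis=1)

# Per-class precision, recall, and F1 alongside overall accuracy.
print(classification_report(y_test, y_pred, target_names=["Negative", "Positive"]))
```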

Conclusion

This project demonstrates the effectiveness of fine-tuning transformer models for sentiment analysis, particularly in noisy and politically charged tweet data. Compared to a Naive Bayes baseline, BERT showed superior accuracy and generalization across both sentiment classes.

Built entirely as a solo project, this work reflects my ability to handle real-world data preprocessing, apply modern NLP techniques, and evaluate model performance rigorously. It also highlights the practical strengths of transformer-based models for sentiment mining in social media.

Go to Kaggle >>



Tags: Sentiment Analysis, Naive Bayes, BERT, Transformers, NLP, HuggingFace, Data Cleaning