🐦 End-to-end Twitter sentiment analysis pipeline comparing Logistic Regression (TF‑IDF), a PyTorch NN with GloVe embeddings, and fine-tuned BERT & DistilBERT models.

Sentiment Classifiers on Twitter Dataset

Overview

The goal is to develop a sentiment classifier in Python using four different approaches on a given English-language Twitter dataset. Results and comparisons are documented in distinct reports for each implementation. The four approaches are:

  1. Logistic Regression with TF-IDF
  2. Neural Network in PyTorch with pretrained GloVe embeddings (loaded in Word2Vec format)
  3. Fine-tuning pretrained BERT (without Hugging Face's Trainer)
  4. Fine-tuning pretrained DistilBERT (without Hugging Face's Trainer)

Detailed description

Dataset

The dataset contains three columns:

  • ID: Unique text identifier
  • Text: Tweet content
  • Label: Sentiment (0 = negative, 1 = positive)

Training, validation, and test sets are pre-split.
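
For orientation, a minimal loading sketch is shown below. The file names train.csv, val.csv, and test.csv are assumptions; the actual split files may be named differently.

```python
import pandas as pd

# Hypothetical file names for the pre-split data; adjust to the actual files.
train_df = pd.read_csv("train.csv")   # columns: ID, Text, Label
val_df = pd.read_csv("val.csv")
test_df = pd.read_csv("test.csv")     # test labels are held out (Kaggle)

print(train_df.columns.tolist())          # ['ID', 'Text', 'Label']
print(train_df["Label"].value_counts())   # class balance: 0 = negative, 1 = positive
```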

Preprocessing

Preprocessing includes four functions: my_lower, my_stopword, my_unpunct, and my_lemmatize. For each classifier, we experiment to determine which preprocessing methods improve performance. All implementations share the same preprocessing options.
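
A minimal sketch of the four helpers is given below. Only the function names come from the project; the NLTK-based implementations are assumptions.

```python
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

_STOPWORDS = set(stopwords.words("english"))
_LEMMATIZER = WordNetLemmatizer()

def my_lower(text: str) -> str:
    """Lowercase the tweet."""
    return text.lower()

def my_unpunct(text: str) -> str:
    """Strip punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))

def my_stopword(text: str) -> str:
    """Remove English stopwords."""
    return " ".join(tok for tok in text.split() if tok not in _STOPWORDS)

def my_lemmatize(text: str) -> str:
    """Lemmatize each token with WordNet."""
    return " ".join(_LEMMATIZER.lemmatize(tok) for tok in text.split())

# Example: apply a chosen subset of the steps to one tweet.
tweet = "Loving the new update!!! It's working great :)"
print(my_lemmatize(my_stopword(my_unpunct(my_lower(tweet)))))
```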

Vectorization and Embeddings

Approach 1: TF-IDF vectorization. The sparse TF-IDF matrix feeds directly into logistic regression.
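
A minimal scikit-learn sketch of this pipeline, using illustrative toy texts rather than the project's tweets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Illustrative texts/labels; in the project these come from the tweet dataset.
train_texts = ["i love this", "worst day ever", "so happy today", "this is awful"]
train_labels = [1, 0, 1, 0]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_texts)   # sparse TF-IDF matrix

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train_labels)

val_texts, val_labels = ["love it", "awful experience"], [1, 0]
preds = clf.predict(vectorizer.transform(val_texts))
print(preds, f1_score(val_labels, preds))
```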

Approach 2: Uses pretrained GloVe embeddings (glove.6B/glove.twitter.27B) converted to Word2Vec format. Tweet embeddings are generated by averaging word vectors.
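
A sketch of that embedding step with gensim, assuming a locally downloaded glove.twitter.27B.100d.txt file (the exact file and dimensionality are assumptions):

```python
import numpy as np
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# Assumed local file names for the pretrained GloVe vectors.
glove_path = "glove.twitter.27B.100d.txt"
w2v_path = "glove.twitter.27B.100d.w2v.txt"

# One-time conversion: prepend the word2vec header so gensim can load the file.
glove2word2vec(glove_path, w2v_path)
vectors = KeyedVectors.load_word2vec_format(w2v_path, binary=False)
dim = vectors.vector_size

def tweet_embedding(text: str) -> np.ndarray:
    """Average the GloVe vectors of the in-vocabulary tokens of one tweet."""
    tokens = [t for t in text.lower().split() if t in vectors]
    if not tokens:
        return np.zeros(dim, dtype=np.float32)
    return np.mean([vectors[t] for t in tokens], axis=0)

emb = tweet_embedding("loving the new update")
print(emb.shape)   # (100,) -- fed to the PyTorch network as the tweet feature
```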

Approaches 3-4: BERT and DistilBERT use their own pretrained contextual embeddings.
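
A minimal sketch of one fine-tuning step written as a plain PyTorch loop (no Trainer), shown here for distilbert-base-uncased; the BERT run swaps in bert-base-uncased. The mini-batch is illustrative, not the project's data pipeline, and the classification-head setup is an assumption.

```python
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased"   # or "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
optimizer = AdamW(model.parameters(), lr=2e-5)

# Illustrative mini-batch; the project feeds batches from a DataLoader.
texts = ["i love this", "worst day ever"]
labels = torch.tensor([1, 0]).to(device)
batch = tokenizer(texts, padding=True, truncation=True, max_length=64,
                  return_tensors="pt").to(device)

model.train()
outputs = model(**batch, labels=labels)   # loss computed internally (cross-entropy)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
print(float(outputs.loss))
```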

Experiments

Experimentation combines manual runs with automated hyperparameter tuning using the Optuna framework. Key aspects explored:

  • Hyperparameter optimization
  • Embedding dimensions
  • Regularization techniques
  • Optimization algorithms
  • Architectural variations

All experimentation aims to prevent underfitting and overfitting and to maximize validation performance.
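
A minimal Optuna sketch is shown below, tuning the logistic-regression pipeline on toy data as an illustration; the neural models are tuned analogously with their own search spaces (learning rate, dropout, hidden size, and so on). The specific parameter ranges here are assumptions.

```python
import optuna
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Illustrative data; the project uses the pre-split train/validation tweets.
train_texts, train_labels = ["i love this", "worst day ever", "so happy", "awful"], [1, 0, 1, 0]
val_texts, val_labels = ["love it", "so bad"], [1, 0]

def objective(trial: optuna.Trial) -> float:
    # Hypothetical search space: regularization strength and n-gram range.
    c = trial.suggest_float("C", 1e-3, 1e2, log=True)
    ngram_max = trial.suggest_int("ngram_max", 1, 2)
    vec = TfidfVectorizer(ngram_range=(1, ngram_max))
    X_tr, X_val = vec.fit_transform(train_texts), vec.transform(val_texts)
    clf = LogisticRegression(C=c, max_iter=1000).fit(X_tr, train_labels)
    return accuracy_score(val_labels, clf.predict(X_val))

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params, study.best_value)
```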

Evaluation

Metrics include accuracy, precision, recall, and F1-score. Visualizations include:

  • Learning curves
  • ROC curves
  • Confusion matrices
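
A sketch of how these metrics and plot inputs can be computed with scikit-learn, using illustrative labels and scores rather than actual model outputs:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
                             confusion_matrix, roc_curve, auc)

# Illustrative gold labels and model scores for the positive class.
y_true = np.array([1, 0, 1, 1, 0, 0])
y_score = np.array([0.9, 0.2, 0.6, 0.4, 0.1, 0.7])
y_pred = (y_score >= 0.5).astype(int)

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
fpr, tpr, _ = roc_curve(y_true, y_score)   # points for the ROC plot
print(f"acc={acc:.2f} precision={prec:.2f} recall={rec:.2f} "
      f"f1={f1:.2f} auc={auc(fpr, tpr):.2f}")
print(confusion_matrix(y_true, y_pred))
```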

Notes

  • Results for each model on the test set were obtained via a private Kaggle competition as part of the course for which these assignments were developed. Consequently, we cannot reproduce those test‑set results locally.
  • Reports for BERT and DistilBERT models are combined in a single document.

License

This project is licensed under the MIT License. See LICENSE for details.
