The goal is to develop a sentiment classifier in Python using four different approaches on a given English-language Twitter dataset. Results and comparisons are documented in distinct reports for each implementation. The four approaches are:
- Logistic Regression with TF-IDF
- Neural Network in PyTorch with Word2Vec embeddings
- Fine-tuning pretrained BERT (without Hugging Face's `Trainer`)
- Fine-tuning pretrained DistilBERT (without Hugging Face's `Trainer`)
The dataset contains three columns:
- ID: Unique text identifier
- Text: Tweet content
- Label: Sentiment (0 = negative, 1 = positive)
Training, validation, and test sets are pre-split.
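For orientation, a minimal sketch of loading the pre-split files with pandas (the file names below are assumptions, not the actual dataset paths):

```python
# Hedged sketch: adjust the file names to the actual pre-split dataset files.
import pandas as pd

train_df = pd.read_csv("train.csv")   # columns: ID, Text, Label
valid_df = pd.read_csv("valid.csv")
test_df = pd.read_csv("test.csv")     # Kaggle test set; Label may be absent

print(train_df.head())
print(train_df["Label"].value_counts())  # 0 = negative, 1 = positive
```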
Preprocessing includes four functions: `my_lower`, `my_stopword`, `my_unpunct`, and `my_lemmatize`. For each classifier, we experiment to determine which preprocessing methods improve performance. All implementations share the same preprocessing options.
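The repository's implementations are authoritative; as a rough sketch, the four functions could be written with NLTK as follows (the bodies below are assumptions for illustration):

```python
# Illustrative sketch only -- the actual my_* functions may differ in detail.
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def my_lower(text: str) -> str:
    return text.lower()

def my_unpunct(text: str) -> str:
    return text.translate(str.maketrans("", "", string.punctuation))

def my_stopword(text: str) -> str:
    return " ".join(w for w in text.split() if w not in STOPWORDS)

def my_lemmatize(text: str) -> str:
    return " ".join(LEMMATIZER.lemmatize(w) for w in text.split())
```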
Approach 1: TF-IDF vectorization. The sparse TF-IDF matrix feeds directly into logistic regression.
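A minimal sketch of that pipeline with scikit-learn, reusing the DataFrames from the loading sketch above (the vectorizer and regularization settings are placeholders, not the tuned values):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Fit TF-IDF on the training tweets only, then transform the validation tweets.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
X_train = vectorizer.fit_transform(train_df["Text"])
X_valid = vectorizer.transform(valid_df["Text"])

# The sparse TF-IDF matrix feeds directly into logistic regression.
clf = LogisticRegression(C=1.0, max_iter=1000)
clf.fit(X_train, train_df["Label"])
print("validation F1:", f1_score(valid_df["Label"], clf.predict(X_valid)))
```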
Approach 2: Uses pretrained GloVe embeddings (glove.6B/glove.twitter.27B) converted to Word2Vec format. Tweet embeddings are generated by averaging word vectors.
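One common route for the conversion and averaging uses gensim; a hedged sketch (file name and embedding dimension are assumptions):

```python
# Convert a GloVe file to Word2Vec text format, then average word vectors per tweet.
import numpy as np
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

glove2word2vec("glove.twitter.27B.100d.txt", "glove.twitter.27B.100d.w2v.txt")
wv = KeyedVectors.load_word2vec_format("glove.twitter.27B.100d.w2v.txt")

def tweet_embedding(text: str, dim: int = 100) -> np.ndarray:
    """Average the vectors of in-vocabulary tokens; zero vector if none match."""
    vecs = [wv[w] for w in text.split() if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```

The resulting fixed-length tweet vectors are the inputs to the PyTorch network.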
Approaches 3-4: BERT and DistilBERT use their own pretrained contextual embeddings.
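Because Hugging Face's `Trainer` is off the table, fine-tuning is done with a plain PyTorch loop. A condensed sketch for BERT (model name, sequence length, batch size, epochs, and learning rate are illustrative; swap in `distilbert-base-uncased` for Approach 4):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2).to(device)

# Tokenize the tweets and pair them with labels (train_df from the loading sketch).
enc = tokenizer(list(train_df["Text"]), truncation=True, padding=True,
                max_length=64, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"],
                        torch.tensor(train_df["Label"].values))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):
    for input_ids, attention_mask, labels in loader:
        optimizer.zero_grad()
        out = model(input_ids=input_ids.to(device),
                    attention_mask=attention_mask.to(device),
                    labels=labels.to(device))
        out.loss.backward()  # cross-entropy loss computed inside the model
        optimizer.step()
```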
Each implementation includes manual experiments and automated hyperparameter tuning with the Optuna framework. Key aspects explored:
- Hyperparameter optimization
- Embedding dimensions
- Regularization techniques
- Optimization algorithms
- Architectural variations
All experimentation aims to prevent underfitting and overfitting and to maximize validation performance.
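As an illustration of how Optuna is wired in, a minimal study tuning the logistic-regression baseline could look as follows, reusing the TF-IDF features from the Approach 1 sketch (search space and trial count are assumptions):

```python
import optuna
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def objective(trial: optuna.Trial) -> float:
    # Search over the inverse regularization strength on a log scale.
    C = trial.suggest_float("C", 1e-3, 1e2, log=True)
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(X_train, train_df["Label"])
    return f1_score(valid_df["Label"], clf.predict(X_valid))

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```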
Metrics include accuracy, precision, recall, and F1-score. Visualizations include:
- Learning curves
- ROC curves
- Confusion matrices
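A short evaluation sketch using scikit-learn metrics and matplotlib displays, applicable to any of the four models (`y_score` is the predicted probability of the positive class):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (ConfusionMatrixDisplay, RocCurveDisplay,
                             accuracy_score, precision_recall_fscore_support)

def evaluate(y_true, y_pred, y_score):
    # Scalar metrics reported for each model.
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary")
    print(f"acc={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")

    # Confusion matrix and ROC curve plots.
    ConfusionMatrixDisplay.from_predictions(y_true, y_pred)
    RocCurveDisplay.from_predictions(y_true, y_score)
    plt.show()
```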
Results for each model on the test set were obtained via a private Kaggle competition run as part of the course for which these assignments were developed; consequently, we cannot reproduce those test-set results locally.

Reports for the BERT and DistilBERT models are combined in a single document.
This project is licensed under the MIT License. See LICENSE for details.