The goal is to develop a sentiment classifier in Python using four different approaches on a given English-language Twitter dataset. Results and comparisons are documented in distinct reports for each implementation. The four approaches are:
- Logistic Regression with TF-IDF
- Neural Network in PyTorch with Word2Vec embeddings
- Fine-tuning pretrained BERT (without Hugging Face's `Trainer`)
- Fine-tuning pretrained DistilBERT (without Hugging Face's `Trainer`)
The dataset contains three columns:
- ID: Unique text identifier
- Text: Tweet content
- Label: Sentiment (0 = negative, 1 = positive)
Training, validation, and test sets are pre-split.
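For orientation, a minimal sketch of loading the pre-split files with pandas (the file names below are assumptions, not the actual dataset paths):

```python
# Hedged sketch: adjust the file names to the actual pre-split dataset files.
import pandas as pd

train_df = pd.read_csv("train.csv")   # columns: ID, Text, Label
valid_df = pd.read_csv("valid.csv")
test_df = pd.read_csv("test.csv")     # Kaggle test set; Label may be absent

print(train_df.head())
print(train_df["Label"].value_counts())  # 0 = negative, 1 = positive
```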
Preprocessing includes four functions: `my_lower`, `my_stopword`, `my_unpunct`, and `my_lemmatize`. For each classifier, we experiment to determine which preprocessing methods improve performance. All implementations share the same preprocessing options.
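The repository's implementations are authoritative; as a rough sketch, the four functions could be written with NLTK as follows (the bodies below are assumptions for illustration):

```python
# Illustrative sketch only -- the actual my_* functions may differ in detail.
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def my_lower(text: str) -> str:
    return text.lower()

def my_unpunct(text: str) -> str:
    return text.translate(str.maketrans("", "", string.punctuation))

def my_stopword(text: str) -> str:
    return " ".join(w for w in text.split() if w not in STOPWORDS)

def my_lemmatize(text: str) -> str:
    return " ".join(LEMMATIZER.lemmatize(w) for w in text.split())
```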
Approach 1: TF-IDF vectorization. The sparse TF-IDF matrix feeds directly into logistic regression.
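A minimal sketch of that pipeline with scikit-learn, reusing the DataFrames from the loading sketch above (the vectorizer and regularization settings are placeholders, not the tuned values):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Fit TF-IDF on the training tweets only, then transform the validation tweets.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
X_train = vectorizer.fit_transform(train_df["Text"])
X_valid = vectorizer.transform(valid_df["Text"])

# The sparse TF-IDF matrix feeds directly into logistic regression.
clf = LogisticRegression(C=1.0, max_iter=1000)
clf.fit(X_train, train_df["Label"])
print("validation F1:", f1_score(valid_df["Label"], clf.predict(X_valid)))
```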
Approach 2: Uses pretrained GloVe embeddings (glove.6B/glove.twitter.27B) converted to Word2Vec format. Tweet embeddings are generated by averaging word vectors.
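One common route for the conversion and averaging uses gensim; a hedged sketch (file name and embedding dimension are assumptions):

```python
# Convert a GloVe file to Word2Vec text format, then average word vectors per tweet.
import numpy as np
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

glove2word2vec("glove.twitter.27B.100d.txt", "glove.twitter.27B.100d.w2v.txt")
wv = KeyedVectors.load_word2vec_format("glove.twitter.27B.100d.w2v.txt")

def tweet_embedding(text: str, dim: int = 100) -> np.ndarray:
    """Average the vectors of in-vocabulary tokens; zero vector if none match."""
    vecs = [wv[w] for w in text.split() if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```

The resulting fixed-length tweet vectors are the inputs to the PyTorch network.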
Approaches 3-4: BERT and DistilBERT use their own pretrained contextual embeddings.
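Because Hugging Face's `Trainer` is off the table, fine-tuning is done with a plain PyTorch loop. A condensed sketch for BERT (model name, sequence length, batch size, epochs, and learning rate are illustrative; swap in `distilbert-base-uncased` for Approach 4):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2).to(device)

# Tokenize the tweets and pair them with labels (train_df from the loading sketch).
enc = tokenizer(list(train_df["Text"]), truncation=True, padding=True,
                max_length=64, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"],
                        torch.tensor(train_df["Label"].values))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):
    for input_ids, attention_mask, labels in loader:
        optimizer.zero_grad()
        out = model(input_ids=input_ids.to(device),
                    attention_mask=attention_mask.to(device),
                    labels=labels.to(device))
        out.loss.backward()  # cross-entropy loss computed inside the model
        optimizer.step()
```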
Each implementation includes manual experiments and automated hyperparameter tuning with the Optuna framework. Key aspects explored:
- Hyperparameter optimization
- Embedding dimensions
- Regularization techniques
- Optimization algorithms
- Architectural variations
All experimentation aims to prevent underfitting and overfitting and to maximize validation performance.
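As an illustration of how Optuna is wired in, a minimal study tuning the logistic-regression baseline could look as follows, reusing the TF-IDF features from the Approach 1 sketch (search space and trial count are assumptions):

```python
import optuna
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def objective(trial: optuna.Trial) -> float:
    # Search over the inverse regularization strength on a log scale.
    C = trial.suggest_float("C", 1e-3, 1e2, log=True)
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(X_train, train_df["Label"])
    return f1_score(valid_df["Label"], clf.predict(X_valid))

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```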
Metrics include accuracy, precision, recall, and F1-score. Visualizations include:
- Learning curves
- ROC curves
- Confusion matrices
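A short evaluation sketch using scikit-learn metrics and matplotlib displays, applicable to any of the four models (`y_score` is the predicted probability of the positive class):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (ConfusionMatrixDisplay, RocCurveDisplay,
                             accuracy_score, precision_recall_fscore_support)

def evaluate(y_true, y_pred, y_score):
    # Scalar metrics reported for each model.
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary")
    print(f"acc={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")

    # Confusion matrix and ROC curve plots.
    ConfusionMatrixDisplay.from_predictions(y_true, y_pred)
    RocCurveDisplay.from_predictions(y_true, y_score)
    plt.show()
```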
Results for each model on the test set were obtained via a private Kaggle competition run as part of the course for which these assignments were developed; consequently, we cannot reproduce those test-set results locally.

Reports for the BERT and DistilBERT models are combined in a single document.
This project is licensed under the MIT License. See LICENSE for details.