DPSentimentAnalysis

DPSentimentAnalysis is a sentiment analysis project that uses BERT embeddings and logistic regression models (including a DP-SGD variant) to classify tweets into positive and negative sentiments. The project integrates with a Qdrant vector database for efficient storage and retrieval of embeddings.

Project Setup

Prerequisites

Python: Ensure Python 3.8+ is installed.
Docker: Install Docker to run the Qdrant database.
Dependencies: Install the required Python packages.

Installation Steps

Clone the repository:

   git clone https://github.com/yeabmoh/DPSentimentAnalysis.git
   cd DPSentimentAnalysis

Create a virtual environment:

   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

   pip install -r requirements.txt

Create the qdrant_data folder:

   mkdir qdrant_data

Running the Project

Step 1: Start Qdrant

Start the Qdrant database using Docker:

   docker run -d -p 6333:6333 -v $(pwd)/qdrant_data:/qdrant/storage --name qdrant qdrant/qdrant
   docker start qdrant

-p 6333:6333: Maps port 6333 on your machine to Qdrant's port.
-v $(pwd)/qdrant_data:/qdrant/storage: Mounts the qdrant_data folder for persistent storage. $(pwd)/qdrant_data must resolve to the absolute path of the qdrant_data folder you created with mkdir, so run this command from the repository root.
--name qdrant: Names the container qdrant.
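Before moving on, it can help to confirm that something is actually listening on the mapped port. The following is a minimal sketch using only the standard library; the function name qdrant_is_reachable is hypothetical and not part of this repository, and a successful TCP connection only shows that the port is open, not that Qdrant itself is healthy.

```python
import socket


def qdrant_is_reachable(host: str = "localhost", port: int = 6333,
                        timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable)
        # when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling qdrant_is_reachable() with the defaults checks the port mapping used in the docker run command above.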

Step 2: Run the Sentiment Analysis Pipeline

Run the main script:

python run.py --model logistic

Options:
- --model logistic: Use the standard logistic regression model
- --model dp_logistic: Use the DP-SGD logistic regression model
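For illustration, the two options above could be parsed with argparse roughly as follows. This is a sketch, not the actual contents of run.py: the build_parser name and the choice to make --model required are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the two documented --model values."""
    parser = argparse.ArgumentParser(
        description="Run the sentiment analysis pipeline.")
    parser.add_argument(
        "--model",
        choices=["logistic", "dp_logistic"],
        required=True,  # assumption: the real script may define a default
        help="logistic = standard model, dp_logistic = DP-SGD variant",
    )
    return parser
```

With choices set, argparse rejects any other model name with a usage error before the pipeline starts.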

The script will:

Check if the qdrant_data folder is empty.
If empty, preprocess the dataset and store embeddings in Qdrant.
Load train and test data from Qdrant.
Train and evaluate the specified model.
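The sequence above amounts to a simple branch on whether the storage folder has any contents. The sketch below shows that control flow with placeholder step descriptions; storage_is_empty and plan_steps are hypothetical names, and the real run.py may organize this differently.

```python
from pathlib import Path


def storage_is_empty(path) -> bool:
    """Treat a missing folder or a folder with no entries as empty."""
    p = Path(path)
    return not p.is_dir() or not any(p.iterdir())


def plan_steps(storage_dir, model_name: str) -> list:
    """Return the pipeline steps, in order, that a run would perform."""
    steps = []
    if storage_is_empty(storage_dir):
        # Preprocessing only happens when no data has been stored yet.
        steps.append("preprocess dataset and store embeddings in Qdrant")
    steps.append("load train and test data from Qdrant")
    steps.append(f"train and evaluate model '{model_name}'")
    return steps
```

On a first run plan_steps("qdrant_data", "logistic") returns three steps; once the folder is populated, the preprocessing step is skipped.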

Docker Setup for Qdrant

Starting Qdrant

To start the Qdrant container:

docker start qdrant

Stopping Qdrant

To stop the Qdrant container:

docker stop qdrant

Removing Qdrant Data

To clear the contents of the qdrant_data folder:

rm -rf qdrant_data/*
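On systems without rm, or when the cleanup should happen from Python, the same effect can be sketched with the standard library. The clear_dir helper is hypothetical and not part of this repository. Unlike the shell glob above, it also removes hidden entries, and stopping the container first is probably the safer order.

```python
import shutil
from pathlib import Path


def clear_dir(path) -> int:
    """Delete everything inside path but keep path itself.

    Returns the number of top-level entries removed.
    """
    removed = 0
    for entry in Path(path).iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)   # real directory: remove recursively
        else:
            entry.unlink()         # file or symlink: remove the entry only
        removed += 1
    return removed
```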

Folder Structure

DPSentimentAnalysis/
├── data/
│   ├── qdrant_db/          # Qdrant client and utility functions
│   ├── scripts/            # Scripts for preprocessing and loading data
│   ├── settings.py         # Configuration settings
├── models/
│   ├── logistic.py         # Logistic regression model
│   ├── dp_logistic.py      # DP-SGD logistic regression model
├── qdrant_data/            # Qdrant database storage (ignored by Git)
├── requirements.txt        # Python dependencies
├── run.py                  # Main script to run the pipeline
├── README.md               # Project documentation

Key Features

BERT Tokenization:

Preprocesses tweets using the bert-base-uncased tokenizer.
Stores embeddings in Qdrant for efficient retrieval.

Logistic Regression Models:

Standard logistic regression (logistic).
Differentially private logistic regression (dp_logistic).
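To make the difference between the two models concrete, here is a minimal pure-Python sketch of one DP-SGD update for logistic regression: each example's gradient is clipped to a maximum L2 norm, the clipped gradients are summed, Gaussian noise is added, and the result is averaged. It is not the code in models/dp_logistic.py; the function names and the parameters max_norm and noise_multiplier are assumptions. It also omits privacy accounting, so it illustrates the mechanism only.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def clip(grad, max_norm: float):
    """Scale grad down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(w, batch_x, batch_y, lr, max_norm, noise_multiplier, rng):
    """One DP-SGD update: clip per-example gradients, add noise, average."""
    summed = [0.0] * len(w)
    for x, y in zip(batch_x, batch_y):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        grad = [(p - y) * xi for xi in x]      # log-loss gradient, one example
        for j, g in enumerate(clip(grad, max_norm)):
            summed[j] += g
    sigma = noise_multiplier * max_norm        # noise std scales with the clip
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    n = len(batch_x)
    return [wi - lr * g / n for wi, g in zip(w, noisy)]
```

With noise_multiplier set to 0 and a large max_norm, the update reduces to ordinary mini-batch gradient descent, which is a convenient sanity check when comparing the two models.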

Qdrant Integration:

Uses Qdrant as a vector database for storing and retrieving embeddings.

Modular Design:

Preprocessing, data loading, and model training are modular and reusable.

Troubleshooting

Issue: qdrant_data is empty

Ensure the Qdrant container is running:

    docker start qdrant

If the folder is still empty, rerun the preprocessing step:

    python run.py --model logistic

Issue: `X_train` contains `nan` values

Check bert_preprocessing_script.py for issues with tokenization or vector creation.
Ensure the PointStruct objects are correctly created and inserted into Qdrant.
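To narrow down where the bad values come from, it helps to find which rows are affected. The helper below is a small standard-library sketch; rows_with_nan is a hypothetical name, and it assumes the vectors are available as nested sequences of floats.

```python
import math


def rows_with_nan(vectors) -> list:
    """Return the indices of rows that contain at least one NaN value."""
    return [
        i for i, row in enumerate(vectors)
        if any(math.isnan(v) for v in row)
    ]
```

If X_train is a NumPy array, np.isnan(X_train).any(axis=1) yields the equivalent boolean mask. The offending indices can then be traced back to the tweets that produced them.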

Issue: Qdrant container not starting

Check if the container is already running:

    docker ps

If not, start it:

    docker start qdrant
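Note that docker start only works if the container already exists; if it was never created or has been removed, the docker run command from Step 1 is needed instead. The sketch below distinguishes the three cases by parsing the output of docker ps -a. The function names are hypothetical, and the mapping of status text to states is an approximation based on the "Up ..." prefix Docker prints for running containers.

```python
import subprocess


def read_docker_ps() -> str:
    """Return 'name<TAB>status' lines for all containers, running or not."""
    result = subprocess.run(
        ["docker", "ps", "-a", "--format", "{{.Names}}\t{{.Status}}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def container_state(ps_output: str, name: str = "qdrant") -> str:
    """Classify a container as 'running', 'stopped', or 'missing'."""
    for line in ps_output.splitlines():
        parts = line.split("\t", 1)
        if len(parts) == 2 and parts[0].strip() == name:
            # Docker reports running containers with a status like "Up 3 minutes".
            return "running" if parts[1].strip().startswith("Up") else "stopped"
    return "missing"
```

A result of "stopped" calls for docker start qdrant, while "missing" means the container has to be created again with docker run.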

Future Improvements

Add support for additional models.
Implement advanced evaluation metrics.
Optimize Qdrant queries for large datasets.

License

This project is licensed under the MIT License. See the LICENSE file for details.
