DPSentimentAnalysis is a sentiment analysis project that uses BERT embeddings and logistic regression models (including a DP-SGD variant) to classify tweets into positive and negative sentiments. The project integrates with a Qdrant vector database for efficient storage and retrieval of embeddings.
- Project Setup
- Running the Project
- Docker Setup for Qdrant
- Folder Structure
- Key Features
- Troubleshooting
- Python: Ensure Python 3.8+ is installed.
- Docker: Install Docker to run the Qdrant database.
- Dependencies: Install the required Python packages.
- Clone the repository:
git clone https://github.com/yeabmoh/DPSentimentAnalysis.git
cd DPSentimentAnalysis
- Create a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Create the qdrant_data folder:
mkdir qdrant_data
Start the Qdrant database using Docker:
docker run -d -p 6333:6333 -v $(pwd)/qdrant_data:/qdrant/storage --name qdrant qdrant/qdrant
docker start qdrant
- -p 6333:6333: Maps port 6333 on your machine to Qdrant's HTTP port 6333 inside the container.
- -v $(pwd)/qdrant_data:/qdrant/storage: Mounts the qdrant_data folder for persistent storage. Make sure $(pwd)/qdrant_data is an absolute path to the qdrant_data folder you just created in the repository root with the mkdir command.
- --name qdrant: Names the container qdrant.
Run the main script:
python run.py --model logistic
- Options:
- --model logistic: Use the standard logistic regression model.
- --model dp_logistic: Use the DP-SGD logistic regression model.
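For orientation, a flag with two allowed values like this is commonly declared with argparse. The sketch below is an assumption about how such an option could be parsed, not a copy of run.py; the function name parse_args and the default value are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; --model selects which classifier to train."""
    parser = argparse.ArgumentParser(
        description="Train and evaluate a sentiment classifier."
    )
    parser.add_argument(
        "--model",
        choices=["logistic", "dp_logistic"],
        default="logistic",  # assumption: the real script may require the flag
        help="logistic = standard model, dp_logistic = DP-SGD variant",
    )
    return parser.parse_args(argv)


# Passing an explicit list mimics: python run.py --model dp_logistic
args = parse_args(["--model", "dp_logistic"])
print(f"Selected model: {args.model}")
```

With choices set, argparse rejects any other value and prints the allowed ones, which gives a clearer error than a failed lookup later in the script.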
The script will:
- Check if the qdrant_data folder is empty.
- If empty, preprocess the dataset and store embeddings in Qdrant.
- Load train and test data from Qdrant.
- Train and evaluate the specified model.
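The four steps above can be sketched as a small driver function. This is an illustration of the control flow only, assuming the steps are available as callables; storage_is_empty, run_pipeline, and all of their parameters are hypothetical names, not the actual contents of run.py.

```python
from pathlib import Path


def storage_is_empty(folder):
    """Return True if the folder is missing or contains no entries."""
    path = Path(folder)
    return not path.exists() or not any(path.iterdir())


def run_pipeline(model_name, storage_dir, preprocess, load_data, trainers):
    """Run the steps in order; every callable is supplied by the caller."""
    if storage_is_empty(storage_dir):
        preprocess()  # embed the dataset and store the vectors in Qdrant
    train, test = load_data()  # read the train and test splits back
    return trainers[model_name](train, test)  # train and evaluate the model


# Usage with stand-in callables (placeholders, not the project's functions).
# The folder name below is made up so that it is treated as empty.
result = run_pipeline(
    "logistic",
    "example_missing_folder",
    preprocess=lambda: print("preprocessing..."),
    load_data=lambda: (["train rows"], ["test rows"]),
    trainers={"logistic": lambda train, test: "evaluation placeholder"},
)
print(result)
```

Keeping the emptiness check in one helper makes it straightforward to skip the expensive preprocessing step on later runs.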
To start the Qdrant container:
docker start qdrant
To stop the Qdrant container:
docker stop qdrant
To clear the qdrant_data folder:
rm -rf qdrant_data/*
DPSentimentAnalysis/
├── data/
│ ├── qdrant_db/ # Qdrant client and utility functions
│ ├── scripts/ # Scripts for preprocessing and loading data
│ ├── settings.py # Configuration settings
├── models/
│ ├── logistic.py # Logistic regression model
│ ├── dp_logistic.py # DP-SGD logistic regression model
├── qdrant_data/ # Qdrant database storage (ignored by Git)
├── requirements.txt # Python dependencies
├── run.py # Main script to run the pipeline
├── README.md # Project documentation
- BERT Tokenization:
- Preprocesses tweets using the bert-base-uncased tokenizer.
- Stores embeddings in Qdrant for efficient retrieval.
- Logistic Regression Models:
- Standard logistic regression (logistic).
- Differentially private logistic regression (dp_logistic).
- Qdrant Integration:
- Uses Qdrant as a vector database for storing and retrieving embeddings.
- Modular Design:
- Preprocessing, data loading, and model training are modular and reusable.
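To make the difference between the two models concrete, here is a minimal sketch of a single DP-SGD update for logistic regression in plain Python. It is not the code in models/dp_logistic.py: the function names, parameters, and toy data are assumptions, and the sketch does not track the privacy budget. The core idea is to clip each example's gradient to a fixed L2 norm and add Gaussian noise to the summed gradient before the step.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dp_sgd_step(w, batch, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update on a batch of (features, label) pairs."""
    total = [0.0] * len(w)
    for x, y in batch:
        # Per-example gradient of the log loss: (sigmoid(w.x) - y) * x
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
        grad = [err * xi for xi in x]
        # Clip this example's gradient to L2 norm <= clip_norm.
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            total[i] += g * scale
    # Gaussian noise with std = noise_multiplier * clip_norm on the sum.
    std = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, std) for t in total]
    # Average over the batch and take a gradient step.
    return [wi - lr * n / len(batch) for wi, n in zip(w, noisy)]


# Toy usage: two-dimensional features, binary labels (made-up data).
rng = random.Random(0)
weights = [0.0, 0.0]
data = [([1.0, 0.5], 1), ([-1.0, -0.3], 0), ([0.8, 0.9], 1), ([-0.7, -1.2], 0)]
for _ in range(50):
    weights = dp_sgd_step(weights, data, lr=0.5, clip_norm=1.0,
                          noise_multiplier=0.5, rng=rng)
print(weights)
```

Setting noise_multiplier to 0 and clip_norm to a very large value reduces the update to ordinary mini-batch gradient descent, which corresponds to the standard logistic model.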
- Ensure the Qdrant container is running:
docker start qdrant
- If the folder is still empty, rerun the preprocessing step:
python run.py --model logistic
- Check bert_preprocessing_script.py for issues with tokenization or vector creation.
- Ensure the PointStruct objects are correctly created and inserted into Qdrant.
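One way to catch bad vectors early is to validate them before they are wrapped in PointStruct objects. The sketch below is an assumption about what such a check could look like; find_bad_vectors and expected_dim are hypothetical names, and the right dimension depends on how the preprocessing script builds its vectors.

```python
import math


def find_bad_vectors(vectors, expected_dim=None):
    """Return (index, reason) pairs for vectors that should not be stored."""
    problems = []
    for i, vec in enumerate(vectors):
        if expected_dim is not None and len(vec) != expected_dim:
            problems.append((i, f"expected {expected_dim} values, got {len(vec)}"))
        elif not all(math.isfinite(v) for v in vec):
            problems.append((i, "contains NaN or infinite values"))
    return problems


# Made-up vectors: the second holds a NaN, the third has the wrong length.
vectors = [[0.1, 0.2], [float("nan"), 0.3], [0.5]]
for index, reason in find_bad_vectors(vectors, expected_dim=2):
    print(f"vector {index}: {reason}")
```

Running a check like this on a small sample of the dataset can show whether a problem comes from vector creation or from the insertion step.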
- Check if the container is already running:
docker ps
If not, start it:
docker start qdrant
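The same check can be done from Python before running the pipeline. This is a minimal sketch that only tests whether something accepts TCP connections on the mapped port (6333 in the docker run command above); port_is_open is a hypothetical helper, and an open port does not prove that the listener is Qdrant.

```python
import socket


def port_is_open(host="localhost", port=6333, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


if port_is_open():
    print("Something is listening on localhost:6333.")
else:
    print("Nothing is listening on localhost:6333; try: docker start qdrant")
```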
- Add support for additional models.
- Implement advanced evaluation metrics.
- Optimize Qdrant queries for large datasets.
License
This project is licensed under the MIT License. See the LICENSE file for details.