pubMedNLP

Transformer-based Question Answering System trained on PubMed data.

Contributors:

Kenneth Styppa (GitHub alias 'KennyLoRI' and 'Kenneth Styppa')
Daniel Bogacz (GitHub alias 'bgzdaniel')
Arjan Siddhpura (GitHub alias 'arjansiddhpura')

Important remark: It appears that commits from 'Kenneth Styppa' are only shown in the full commit history, not in the commit history of individual files. That is, when opening a file on GitHub and inspecting the contributions to it, commits from 'Kenneth Styppa' will not appear. They only appear in the commit history of the full branch.

Overview

This project combines Kedro, Langchain, ChromaDB, and llama2.cpp to build a retrieval-augmented generation system for medical question answering. The project is structured into modular pipelines that can be run end-to-end to first obtain the data, then preprocess and embed it, and finally answer queries over the retrieved information, similar to a Q&A chatbot. Thanks to this modularity, using the finished Q&A system only requires a different command-line call.
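
To make the retrieve-then-generate flow described above concrete, here is a minimal, self-contained sketch. It is not the project's code: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for ChromaDB, and the example abstracts, function names, and prompt wording are all made up for illustration. The final step (sending the prompt to the language model) is left out.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts (a stand-in for a real encoder)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, documents, k=2):
    """Return the k documents most similar to the question."""
    query_vec = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(question, context_docs):
    """Assemble the retrieved context and the question into one prompt string."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical mini-corpus; the real system retrieves PubMed abstracts.
abstracts = [
    "Metformin lowers blood glucose in patients with type 2 diabetes.",
    "Aspirin reduces the risk of recurrent myocardial infarction.",
    "Statins lower LDL cholesterol levels.",
]
question = "What does metformin do in type 2 diabetes?"
top = retrieve(question, abstracts, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In the actual system, the prompt would then be passed to the locally running language model to generate the answer.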

Technologies Used

Kedro: Kedro is a development workflow framework that facilitates the creation, visualization, and deployment of modular data pipelines.
Langchain: Langchain is a framework for developing applications powered by language models, including information retrievers, text generation pipelines and other wrappers to facilitate a seamless integration of LLM-related open-source software.
ChromaDB: ChromaDB is an open-source vector storage system designed for efficiently storing and retrieving vector embeddings.
llama2.cpp: llama2.cpp implements Meta's LLaMa2 architecture in efficient C/C++ to enable a fast local runtime.

Installation & set-up

Prerequisites:
- Ensure you have Python installed on your system. The Python version should be 3.10.
- Ensure you have conda installed on your system.
- Create a folder where you want to store the project, e.g. pubMedNLP.

Create a Conda Environment:

Create a conda environment and activate it:

conda create --name your_project_env python=3.10
conda activate your_project_env
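
Once the environment is active, a short check like the following can confirm that the interpreter matches the required version. This is only a sketch; the helper name `matches_required` is made up for illustration.

```python
import sys

REQUIRED = (3, 10)  # Python version named in the prerequisites


def matches_required(version_info=None, required=REQUIRED):
    """Return True if the major.minor version equals the required one."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == required


if matches_required():
    print("Python version OK")
else:
    print(f"Expected Python {REQUIRED[0]}.{REQUIRED[1]}, found {sys.version.split()[0]}")
```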

Clone the Repository into your working directory:

git clone https://github.com/KennyLoRI/pubMedNLP.git

On macOS, set the PKG_CONFIG_PATH:

export PKG_CONFIG_PATH="/opt/homebrew/opt/openblas/lib/pkgconfig"

Then switch to the working directory of the project:

cd pubMedNLP

Install Dependencies:
```
pip install -r requirements.txt
```
Llama.cpp GPU installation: (When using CPU only, skip this step.)

This part can be slightly tricky, depending on the system on which the installation is done. We do NOT recommend installing on Windows: it has been tested, but it requires multiple additional components to be downloaded. Please contact Daniel Bogacz for details.

Linux:
```
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
```
MacOS:
```
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
```
If anything goes wrong in this step, please contact Daniel Bogacz for Linux installation issues and Kenneth Styppa for macOS installation issues. Also refer to the installation guides provided here and here.

Download the Chroma store and model files and place them in the right location:
- Go to this Google Drive link and download the ChromaDB store (folder called chroma_store_abstracts) as well as the llama2.cpp model files.
- Insert the ChromaDB store at pubMedNLP/kedronlp/
- Insert the model file into pubMedNLP/kedronlp/data/06_models/ and keep the name
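
A small script along these lines can help verify that the downloaded files ended up in the locations listed above. It is a sketch under assumptions: the helper name `check_layout` is hypothetical, and because the model file name is not stated here, the check only tests that the models folder contains at least one file.

```python
from pathlib import Path


def check_layout(repo_root):
    """Return a list of problems found in the expected folder layout."""
    root = Path(repo_root) / "kedronlp"
    problems = []

    # The ChromaDB store is expected as a folder directly under kedronlp/.
    if not (root / "chroma_store_abstracts").is_dir():
        problems.append("missing folder: kedronlp/chroma_store_abstracts")

    # The model file is expected under kedronlp/data/06_models/.
    models = root / "data" / "06_models"
    if not models.is_dir() or not any(p.is_file() for p in models.iterdir()):
        problems.append("no model file found in: kedronlp/data/06_models")

    return problems


# Run from the folder that contains the cloned repository.
for problem in check_layout("pubMedNLP"):
    print(problem)
```

An empty result means both locations look as expected; it does not verify that the files themselves are complete.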

Usage

Using the Q&A system

Navigate to the kedronlp folder in your terminal:
```
cd pubMedNLP/kedronlp
```
Activate the Q&A System:
```
kedro run --pipeline=chat
```

Interact with the system:

Ask your question

Please enter your question (use *word* for abbreviations or special terms): [your_question]

Ask another question (and so on)

Note: Running the system for the first time might take a few additional seconds because the model has to be initialized. All questions following the first one should be answered within a few seconds. If an answer takes more than 30 seconds to complete, your GPU might not have been detected automatically. You can check this by setting verbose = True in the parameters.yml file and then looking at the model initialization output. If it prints OPEN_BLAS = 1 somewhere, your GPU has been detected and everything should be fine. If not, please reach out to us via the emails provided in the documentation.
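
If the initialization output is long, the marker mentioned in the note can also be searched for programmatically. The sketch below assumes you have captured the output as a string (for example by redirecting it to a file); the function name and the sample log line are made up for illustration.

```python
import re


def gpu_marker_present(init_output, pattern=r"OPEN_BLAS\s*=\s*1\b"):
    """Return True if the marker from the note appears in the output."""
    return re.search(pattern, init_output) is not None


# Made-up example line; real initialization output will look different.
sample_output = "system info: n_threads = 8 | OPEN_BLAS = 1 | other flags"
print(gpu_marker_present(sample_output))
```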

Trouble-shooting:

If you encounter an issue during usage, install pyspellchecker separately and try again:
```
pip install pyspellchecker
```

When encountering issues with the Llama.cpp installation, make sure you have the NVIDIA CUDA Toolkit installed. Check with:

nvcc --version

Something similar to the following should appear:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Feb__8_05:53:42_Coordinated_Universal_Time_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0

Also make sure that CMake is installed on your system.
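
Both checks can be combined in a short script. The following is a sketch, not part of the project: it looks for `cmake` and `nvcc` on the PATH and, if `nvcc` is found, extracts the release number from output shaped like the example above. The helper names are hypothetical.

```python
import re
import shutil
import subprocess


def parse_cuda_release(nvcc_output):
    """Extract the release number (e.g. '12.1') from `nvcc --version` output."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def check_toolchain():
    """Report whether cmake is on the PATH and which CUDA release nvcc reports."""
    report = {"cmake": shutil.which("cmake") is not None, "cuda_release": None}
    nvcc = shutil.which("nvcc")
    if nvcc:
        result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
        report["cuda_release"] = parse_cuda_release(result.stdout)
    return report


print(check_toolchain())
```

A `cuda_release` of `None` means that `nvcc` was not found on the PATH or that its output did not contain a release number.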

Optional usage possibilities

Visualize the pipeline:
- Use built-in features from Kedro to get an overview of the pipeline in your browser
```
kedro viz
```
Test the preprocessing pipeline:
- Note: This is not advised, since extracting the abstracts from PubMed and embedding them may take a long time (and the PubMed API is not entirely stable):

kedro run --pipeline=data_processing

Create paragraphs out of abstracts:
- For this, the file extract_data.csv is required. Place it in kedronlp/data/01_raw. See here for the data. Go to kedronlp/scripts.
```
python create_paragraphs.py
```
Embedding of abstracts or paragraphs:
- For embedding abstracts the file extract_data.csv is required. Place it in kedronlp/data/01_raw. See here for the data. Go to kedronlp/scripts.
```
python abstract2vec.py
```
- For embedding paragraphs the file paragraphs.csv is required. Place it in kedronlp/data/01_raw. See here for the data. Go to kedronlp/scripts.
```
python paragraph2vec.py
```
Loading embeddings to the vector database ChromaDB:
- For loading abstract-based embeddings to the vector database, the file abstract_metadata_embeddings.csv is required. Place it in kedronlp/data/01_raw/. See here for the data. Go to kedronlp/scripts.
```
python vec2chroma.py --granularity abstracts
```
- For loading paragraph-based embeddings to the vector database, the file abstract_metadata_embeddings.csv is required. Place it in kedronlp/data/01_raw/. See here for the data. Go to kedronlp/scripts.
```
python vec2chroma.py --granularity paragraphs
```
Running Validation and Evaluation:
- Validation and evaluation require BLEURT. First clone BLEURT:
```
git clone https://github.com/google-research/bleurt.git
```
Go into the subfolder 'bleurt':
```
cd bleurt
```
Specifically for macOS: Because the tensorflow package has a different name on macOS, the install requirements have to be changed. Go to bleurt/setup.py and, in the list variable install_requires, change the entry tensorflow to tensorflow-macos. It should look like the following:
```
install_requires = [
 "pandas", "numpy", "scipy", "tensorflow-macos", "tf-slim>=1.1", "sentencepiece"
]
```
Save the file.
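
If you prefer not to edit the file by hand, the same change can be scripted. This is a sketch with a hypothetical helper name; it replaces only the exact quoted entry `"tensorflow"` and leaves other entries untouched, so running it twice does no harm.

```python
import re


def use_tensorflow_macos(setup_source):
    """Swap the exact quoted entry 'tensorflow' for 'tensorflow-macos'."""
    # The closing quote must follow 'tensorflow' directly, so entries such as
    # "tensorflow-macos" or "tf-slim>=1.1" are not matched.
    return re.sub(r"""(["'])tensorflow\1""", r"\1tensorflow-macos\1", setup_source)


example = 'install_requires = [\n "pandas", "numpy", "scipy", "tensorflow", "tf-slim>=1.1", "sentencepiece"\n]'
print(use_tensorflow_macos(example))

# To apply it to the real file, run something like this from inside the
# cloned bleurt folder:
#   from pathlib import Path
#   path = Path("setup.py")
#   path.write_text(use_tensorflow_macos(path.read_text()))
```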

Install BLEURT with the following:
```
pip install . 
```
The BLEURT model we used can be found here. Place it under pubMedNLP/kedronlp/scripts/evaluation.
- Download the abstract-based ChromaDB store (folder called chroma_store_abstracts) from here. The paragraph-based vector database has to be created locally, because it no longer fit into the Google Drive folder. Please follow the steps above in 'Loading embeddings to the vector database ChromaDB' for paragraph-based embeddings. This should create the paragraph-based ChromaDB store called chroma_store_paragraphs. Go to kedronlp/scripts/evaluation.
```
python valid_and_eval.py
```

Project Structure

data/: This directory contains the raw and processed data as well as the model files used by the project.
src/: The source code of the project is organized into modules within this directory.
conf/: Configuration files for Kedro and other tools are stored here. If you want to run the pipelines with different retrievers or hyperparameters, you can change them in the parameters.yml file and they will automatically be propagated to all necessary files.
scripts/: Contains scripts we used while developing the project.
scripts/evaluation: Contains files and scripts to perform a validation and evaluation.
notebooks/: Contains tests and analysis notebooks. Have a look here to see our own evaluation of the system.
docs/: Documentation related to the project.

Acknowledgments

We thank Prof. Gertz for this engaging course and Satya for her time to give us helpful advice.
