Skip to content

A Question-answering RAG (Retrieval-augmented generation) pipeline for positions de thèses de l'ENC (ENCPOS).

Notifications You must be signed in to change notification settings

chartes/encpos-qa-rag

Repository files navigation

encpos-qa-rag

Python 3.10 Conda

Jupyter Notebook Streamlit

This repository contains notebooks, code, application and ressources for the RAG LLM pipeline experiments for "Postions de thèses" corpora de l'École nationale des chartes.

Installation

  • Clone the repository:
git clone https://github.com/chartes/encpos-qa-rag.git
cd encpos-qa-rag/

Option A: Using makefile

make

Option B: Manual installation

  • create a conda environment:
conda env create -f environment.yml
  • activate the environment:
conda activate qa_rag_env
  • install requirements:
pip3 install -r requirements.txt
  • First start by download retrievers.zip
  • Unzip the file in the data/ directory

(To use Generation part) Run LMStudio server

In LMStudio, download and serve the LLM mistral-nemo-instruct-2407 (model we use for this experiment).

Tip

config.yml contains the configuration for the notebooks, including the paths to the data and the retrievers. You can modify it to suit your needs.

Warning

Some data are already calculated and stored in the data/ directory, you can use them directly without re-running the notebooks.

Fichier Description
01-prepare_chunk_corpus.ipynb Data analysis, preprocessing and chunking of the corpus of longform abstracts
02-build_vectordb.ipynb Vectorstore database creation (Retriever)
03-assemble_rag.ipynb Assemble the RAG pipeline with the retriever and the reader (generation part with LLM model)

Check a specific documentation for streamlit application

@inproceedings{terjo2025from,
  title     = {From questions to insights: a reproducible question-answering pipeline for historiographical corpus exploration},
  author    = {Lucas Terriel and Vincent Jolivet},
  booktitle = {Proceedings of the Digital Humanities Conference (DH2025)},
  year      = {2025},
  address   = {Lisbon, Portugal},
  month     = {July 14-18},
  institution = {École nationale des chartes – PSL, France},
  note      = {Presented at DH2025, NOVA-FCSH}
}

About

A Question-answering RAG (Retrieval-augmented generation) pipeline for positions de thèses de l'ENC (ENCPOS).

Resources

Stars

Watchers

Forks

Packages

No packages published