This repository contains the notebooks, code, application and resources for the RAG LLM pipeline experiments on the "Positions de thèses" corpus of the École nationale des chartes.
- Clone the repository:
git clone https://github.com/chartes/encpos-qa-rag.git
cd encpos-qa-rag/
make
- Create a conda environment:
conda env create -f environment.yml
- Activate the environment:
conda activate qa_rag_env
- Install the requirements:
pip3 install -r requirements.txt
- First, download `retrievers.zip`.
- Unzip the archive into the `data/` directory (a Python alternative is sketched below).
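If you prefer to script this step, here is a minimal Python sketch, assuming `retrievers.zip` has been downloaded to the repository root:

```python
import zipfile
from pathlib import Path

# Assumes retrievers.zip was downloaded to the repository root.
archive = Path("retrievers.zip")
target = Path("data")
target.mkdir(exist_ok=True)

# Extract the prebuilt retriever files into data/.
with zipfile.ZipFile(archive) as zf:
    zf.extractall(target)
```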
- (To use the generation part) Run the LM Studio server: in LM Studio, download and serve the LLM `mistral-nemo-instruct-2407` (the model used for this experiment).
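Once the server is running, you can check that the model answers requests. The sketch below is only an illustration and assumes LM Studio's default OpenAI-compatible endpoint (`http://localhost:1234/v1`); adjust the URL and model identifier to match your local setup.

```python
from openai import OpenAI

# LM Studio exposes an OpenAI-compatible API; the base URL below is its default.
# The API key can be any non-empty string for a local server.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="mistral-nemo-instruct-2407",
    messages=[{"role": "user", "content": "Bonjour, peux-tu te présenter ?"}],
)
print(response.choices[0].message.content)
```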
Tip: `config.yml` contains the configuration for the notebooks, including the paths to the data and the retrievers. You can modify it to suit your needs.
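For example, the configuration can be loaded in a notebook with PyYAML; the exact keys depend on the repository's `config.yml`, so the snippet below simply prints what it finds rather than assuming specific names.

```python
import yaml

# Load the notebook configuration (paths to the data and to the retrievers).
with open("config.yml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

# Inspect the available keys before wiring them into the notebooks.
print(config)
```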
Warning: Some data are already computed and stored in the `data/` directory; you can use them directly without re-running the notebooks.
| File | Description |
|---|---|
| `01-prepare_chunk_corpus.ipynb` | Data analysis, preprocessing and chunking of the corpus of long-form abstracts |
| `02-build_vectordb.ipynb` | Vector store database creation (retriever) |
| `03-assemble_rag.ipynb` | Assembly of the RAG pipeline with the retriever and the reader (generation with the LLM) |
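To give an idea of what the third notebook assembles, here is a deliberately simplified retrieve-then-generate sketch. It is not the repository's actual pipeline: it uses a toy TF-IDF retriever in place of the vector store built in `02-build_vectordb.ipynb`, and assumes LM Studio's default local endpoint for the generation step.

```python
from openai import OpenAI
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy chunks standing in for the chunked thesis abstracts produced by notebook 01.
chunks = [
    "Cette thèse porte sur les cartulaires normands du XIIe siècle.",
    "Étude de la diplomatique royale sous Philippe Auguste.",
    "Les registres de la chancellerie royale au XIVe siècle.",
]
question = "Quelles positions de thèses portent sur la chancellerie royale ?"

# Retrieval step: rank chunks by TF-IDF cosine similarity with the question.
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(chunks + [question])
scores = cosine_similarity(matrix[len(chunks)], matrix[: len(chunks)]).ravel()
context = chunks[scores.argmax()]

# Generation step: ask the locally served LLM to answer from the retrieved context.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
answer = client.chat.completions.create(
    model="mistral-nemo-instruct-2407",
    messages=[
        {
            "role": "user",
            "content": f"Contexte : {context}\n\nQuestion : {question}\n"
                       "Réponds en t'appuyant uniquement sur le contexte.",
        }
    ],
)
print(answer.choices[0].message.content)
```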
See the dedicated documentation for the Streamlit application.
@inproceedings{terjo2025from,
title = {From questions to insights: a reproducible question-answering pipeline for historiographical corpus exploration},
author = {Lucas Terriel and Vincent Jolivet},
booktitle = {Proceedings of the Digital Humanities Conference (DH2025)},
year = {2025},
address = {Lisbon, Portugal},
month = {July 14-18},
institution = {École nationale des chartes – PSL, France},
note = {Presented at DH2025, NOVA-FCSH}
}