This project introduces an adaptive learning system developed in Python, leveraging Retrieval Augmented Generation (RAG).
- Advanced LaTeX Ingestion Pipeline:
- Robustly processes .tex files, including a pre-processing step for common custom LaTeX command expansion.
- Parses content using pylatexenc for accurate structural understanding.
- Identifies conceptual topics based on document hierarchy (sections, subsections).
- Strategically chunks text using langchain_text_splitters.RecursiveCharacterTextSplitter for optimal retrieval.
- Vector Knowledge Base: Employs Weaviate to store text chunks and their Sentence Transformers embeddings, enabling powerful semantic search.
- Intelligent Retrieval System: Utilizes hybrid search strategies (semantic + keyword) to fetch the most relevant context for question generation and answer evaluation.
- RAG-Powered Question Generation: Leverages an LLM (e.g., Gemini 2.0 Flash) to dynamically generate diverse questions (factual, conceptual, analytical) at varying difficulty levels, grounded in retrieved context.
- LLM-Based Answer Evaluation: Assesses learner responses against the source context using an LLM, providing quantitative scores and qualitative, constructive feedback.
- Comprehensive Learner Profile Management: Tracks learner progress, concept-specific knowledge scores, interaction history, and (future) Spaced Repetition System (SRS) data within an SQLite database.
- Adaptive Learning Engine:
- Constructs a curriculum map from ingested documents.
- Intelligently selects questions to target weak areas or introduce new concepts based on learner performance.
- Adjusts question difficulty and the amount of context provided.
- Allows learners to focus their study on specific documents/topics.
- Spaced Repetition System (SRS): Basic implementation to calculate next review dates for mastered concepts.
- Scalable FastAPI Backend: Exposes all system functionalities through a well-defined RESTful API, built for robustness and asynchronous operations.
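The selection behaviour of the Adaptive Learning Engine can be illustrated with a minimal sketch. This is not the code in adaptive_engine/question_selector.py; the function names, the 0-10 score scale, the mastery threshold, and the probability of introducing new material are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Optional

MASTERY_THRESHOLD = 7.0  # hypothetical cutoff on an assumed 0-10 knowledge score


def select_next_concept(
    curriculum: List[str],
    scores: Dict[str, float],
    rng: Optional[random.Random] = None,
    new_concept_prob: float = 0.3,
) -> Optional[str]:
    """Pick the next concept to quiz: a weak one, or one not yet seen."""
    rng = rng or random.Random()
    unseen = [c for c in curriculum if c not in scores]
    weak = [c for c in curriculum if c in scores and scores[c] < MASTERY_THRESHOLD]

    # Sometimes introduce new material; always do so when nothing is weak.
    if unseen and (not weak or rng.random() < new_concept_prob):
        return unseen[0]  # curriculum order stands in for document order
    if weak:
        return min(weak, key=lambda c: scores[c])  # lowest score first
    return None  # everything is seen and mastered


def pick_difficulty(score: Optional[float]) -> str:
    """Map a knowledge score to a difficulty label (thresholds are assumptions)."""
    if score is None or score < 4.0:
        return "easy"
    if score < MASTERY_THRESHOLD:
        return "medium"
    return "hard"
```

When every concept is seen and mastered the function returns None, which is the point where an SRS review schedule would plausibly take over.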
The system features a modular design, primarily within the src/ directory, promoting separation of concerns and maintainability:
- data_ingestion/: latex_processor.py, latex_parser.py, pdf_parser.py, document_loader.py, concept_tagger.py, chunker.py, vector_store_manager.py
- retrieval/: retriever.py
- generation/: question_generator_rag.py
- evaluation/: answer_evaluator.py
- learner_model/: profile_manager.py, knowledge_tracker.py
- interaction/: answer_handler.py
- adaptive_engine/: question_selector.py, srs_scheduler.py
- api/: main_api.py, models.py
- pipeline.py: Orchestrates key flows.
- app.py: Interactive CLI entry point.
- config.py: Centralized configuration management.
This project utilizes a modern, robust technology stack:
- Core Language: Python 3.10+
- Vector Database: Weaviate (local Docker instance or cloud)
- Embedding Models: Sentence Transformers (e.g., all-MiniLM-L6-v2)
- LaTeX Parsing: pylatexenc
- Text Chunking: langchain-text-splitters
- LLM Interaction: Google Gemini API (leveraging aiohttp for efficient asynchronous calls)
- API Framework: FastAPI, Uvicorn
- Data Validation: Pydantic
- Learner Profile Storage: SQLite3
- Asynchronous Programming: asyncio
- Configuration: python-dotenv for .env file management
- Testing Frameworks: unittest, unittest.mock
rag_math_project/
├── data/
│ ├── raw_latex/
│ ├── raw_pdfs/
│ ├── parsed_content/
│ │ ├── from_latex/
│ │ └── from_pdf/
│ ├── learner_profiles.sqlite3
│ └── processed_documents_log.txt
├── src/
│ ├── adaptive_engine/
│ ├── api/
│ ├── data_ingestion/
│ ├── evaluation/
│ ├── generation/
│ ├── interaction/
│ ├── learner_model/
│ ├── retrieval/
│ ├── __init__.py
│ ├── app.py
│ ├── config.py
│ └── pipeline.py
├── tests/
│ # ... (module-specific test directories)
├── .env
├── requirements.txt
└── README.md
- Prerequisites: Python 3.10+, Docker & Docker Compose, Weaviate access, LLM API key. (Optional: Mathpix API credentials.)
- Installation:
- Clone: git clone <your-repository-url> && cd rag_math_project
- Environment: python -m venv .venv && source .venv/bin/activate (or .venv\Scripts\activate on Windows)
- Install uv: pip install uv
- Dependencies: uv pip install . (standard install) or uv pip install -e . (editable install, for development)
- Weaviate: Ensure an instance is running (e.g., docker-compose up -d if using a local compose file).
- Environment Variables: Create .env in the project root with GEMINI_API_KEY, etc. (see src/config.py for all options).
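As a rough idea of how the environment variables might be consumed, here is a small sketch of a settings loader. It is not the contents of src/config.py: the Settings class, the load_settings function, the WEAVIATE_URL variable, and its default of http://localhost:8080 are assumptions; only GEMINI_API_KEY is named in the setup steps.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    gemini_api_key: str
    weaviate_url: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    api_key = env.get("GEMINI_API_KEY", "")
    if not api_key:
        # Fail early with a clear message rather than at the first LLM call.
        raise RuntimeError("GEMINI_API_KEY is not set; add it to .env or the shell")
    return Settings(
        gemini_api_key=api_key,
        # Assumed variable name and default for a local Docker instance.
        weaviate_url=env.get("WEAVIATE_URL", "http://localhost:8080"),
    )
```

With python-dotenv, calling load_dotenv() before load_settings() would copy the values from .env into the process environment.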
- a. Data Ingestion & Interactive CLI Demo:
- Place .tex files in data/raw_latex/.
- Run: python -m src.app
- b. API Server:
- Ensure data is ingested.
- Run: python -m src.api.main_api or uvicorn src.api.main_api:app --reload --host 0.0.0.0 --port 8000
- Access Swagger UI at http://localhost:8000/docs.
- c. Running Tests:
- All: python -m unittest discover tests
- Specific: python -m unittest tests.module_name.test_file_name
The FastAPI backend provides a comprehensive set of endpoints. Key examples:
- GET /api/v1/topics: Lists available top-level learning topics.
- POST /api/v1/interaction/start: Initiates a learning session.
- POST /api/v1/interaction/submit_answer: Submits and evaluates a learner's answer.
- GET /api/v1/health: API health check.
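A client could call these endpoints with nothing more than the standard library. The sketch below builds a submit_answer request and checks the health endpoint. The endpoint paths and port come from this README, but the JSON field names (session_id, question_id, answer_text) are placeholders; the real request schema is whatever the Swagger UI at /docs reports.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # port used by the uvicorn command


def build_submit_answer_request(
    session_id: str, question_id: str, answer: str
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the submit_answer endpoint."""
    # Field names are hypothetical; consult /docs for the actual schema.
    payload = {
        "session_id": session_id,
        "question_id": question_id,
        "answer_text": answer,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/interaction/submit_answer",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def check_health(timeout: float = 5.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    with urllib.request.urlopen(f"{BASE_URL}/api/v1/health", timeout=timeout) as resp:
        return resp.status == 200
```

Sending the request is then a matter of passing it to urllib.request.urlopen while the API server is running.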
This project is an actively developed prototype with a strong foundation. Current areas of focus and known limitations include:
- Document Type Focus: Primarily robust for LaTeX ingestion. PDF processing is currently basic and experimental.
- Challenge: Advanced LaTeX Parsing: While the system handles common custom LaTeX commands, ensuring reliable parsing for a diverse range of highly complex or obscure user-defined macros is an ongoing challenge. This can occasionally lead to incomplete or inaccurate content extraction.
- Challenge: Semantic Chunking for Mathematical Content: The current text chunking strategy (RecursiveCharacterTextSplitter) is general-purpose. For highly structured mathematical content, it can sometimes inadvertently separate semantically linked units (e.g., a theorem from its proof, or a definition from its explanatory examples). Refining chunking logic to be more context-aware for mathematical discourse is a key area for improvement.
- Challenge: Mathematical PDF Extraction: The existing PDF parser struggles with accurately extracting densely mathematical text, particularly complex equations and their surrounding layout. This significantly impacts the quality of ingested PDF content.
- Database Scalability: The current SQLite-based learner profile storage, while suitable for prototyping and single-user scenarios, would need to be migrated to a more robust database system (e.g., PostgreSQL) for production deployment. This is particularly important for handling concurrent write operations and ensuring data integrity in a multi-user environment.
- Spaced Repetition System (SRS): Implemented with basic logic; full SM-2 algorithm integration is planned.
- Curriculum Structure: Currently derived from document hierarchy. Explicit prerequisite definition between concepts is future work.
- Error Handling & Logging: Robust, but opportunities for further enhancement and more granular logging exist.
- User Interface: The interactive CLI (app.py) serves for demonstration and testing. A dedicated frontend application would consume the API for a richer user experience.
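To make the chunking limitation concrete, here is one possible pre-splitting pass that keeps a proof attached to the statement it proves. It is a sketch under stated assumptions, not the project's chunker: the function name and the list of statement environments are hypothetical, and it assumes theorem-like environments contain no blank lines (proofs may).

```python
import re
from typing import List

# Environments whose following proof should stay in the same chunk (assumed names).
STATEMENT_ENVS = ("theorem", "lemma", "proposition", "corollary")
_STATEMENT_END = re.compile(r"\\end\{(%s)\}\s*$" % "|".join(STATEMENT_ENVS))
_PROOF_START = re.compile(r"^\s*\\begin\{proof\}")


def _proof_open(chunk: str) -> bool:
    """True if the chunk has started a proof that it has not yet closed."""
    return chunk.count(r"\begin{proof}") > chunk.count(r"\end{proof}")


def split_keeping_proofs(text: str) -> List[str]:
    """Split on blank lines, but glue each proof to the statement before it."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    for para in paragraphs:
        continues_proof = bool(chunks) and _proof_open(chunks[-1])
        starts_proof = (
            bool(chunks)
            and _PROOF_START.match(para) is not None
            and _STATEMENT_END.search(chunks[-1]) is not None
        )
        if continues_proof or starts_proof:
            chunks[-1] = chunks[-1] + "\n\n" + para
        else:
            chunks.append(para)
    return chunks
```

The resulting units could still be passed to a general-purpose splitter when they exceed the target chunk size, at the cost of breaking very long proofs.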
The vision for this project includes several exciting enhancements:
- Enhanced SRS: Implement a full-fledged SM-2 algorithm with dynamic ease factors and review intervals.
- Sophisticated Curriculum Graph:
- Allow manual definition and LLM-assisted inference of prerequisites between concepts.
- Utilize graph traversal for more nuanced "adjacent possible" concept recommendations.
- Advanced Question Generation: Develop capabilities for more interactive question types, such as "fill-in-the-missing-step" for proofs or auto-generating cloze deletions from definitions.
- Improved LaTeX & PDF Processing:
- Develop more resilient LaTeX pre-processing to handle a wider array of custom macros.
- Investigate and integrate specialized OCR and document analysis tools for superior mathematical PDF parsing.
- Context-Aware Semantic Chunking: Research and implement advanced chunking strategies tailored for technical and mathematical documents to better preserve logical units.
- Multi-Modal Content Support: Extend ingestion to handle images, diagrams, and other embedded media within technical documents.
- Dedicated Web Frontend: Build a responsive web application for a seamless and engaging learner experience.
- Performance Optimization: Profile and optimize for large-scale knowledge bases and concurrent users.
- Learner Analytics & Reporting: Provide dashboards for learners and educators to track progress, identify challenging concepts, and assess system effectiveness.
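For reference, the SM-2 update mentioned in the SRS item can be sketched as follows. This follows one widely used formulation of SM-2 (quality grades 0-5, starting ease factor 2.5, minimum ease 1.3); the class and function names are hypothetical, and how an LLM evaluation score would map onto the 0-5 quality grade is left open.

```python
from dataclasses import dataclass
from datetime import date, timedelta

MIN_EASE = 1.3  # SM-2 lower bound on the ease factor


@dataclass
class ReviewState:
    repetitions: int = 0    # consecutive successful reviews
    interval_days: int = 0  # days until the next review
    ease: float = 2.5       # SM-2 starting ease factor


def sm2_update(state: ReviewState, quality: int) -> ReviewState:
    """Return the state after one review graded 0 (blackout) to 5 (perfect)."""
    if not 0 <= quality <= 5:
        raise ValueError("quality must be between 0 and 5")
    if quality >= 3:
        # Successful recall: 1 day, then 6 days, then grow by the ease factor.
        if state.repetitions == 0:
            interval = 1
        elif state.repetitions == 1:
            interval = 6
        else:
            interval = round(state.interval_days * state.ease)
        repetitions = state.repetitions + 1
    else:
        # Failed recall: restart the repetition sequence.
        repetitions, interval = 0, 1
    ease = state.ease + (0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02))
    return ReviewState(repetitions, interval, max(MIN_EASE, ease))


def next_review(today: date, state: ReviewState) -> date:
    """Date on which the concept is next due."""
    return today + timedelta(days=state.interval_days)
```

Three perfect reviews in a row yield intervals of 1, 6, and 16 days; a failed review resets the interval to 1 day and lowers the ease factor.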
A Retrieval-Augmented Generation (RAG) system for mathematics education, designed to provide personalized learning experiences through adaptive question generation and concept tracking.
- Adaptive Learning: Personalized question generation based on learner's knowledge level
- Concept Tracking: Monitors learner progress across mathematical concepts
- Curriculum Mapping: Structured learning paths with concept dependencies
- LaTeX Support: Full support for mathematical notation and equations
- Vector Search: Efficient semantic search for relevant mathematical content
- Knowledge Graph: Tracks relationships between mathematical concepts
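Curriculum mapping with concept dependencies can be pictured as a directed graph of prerequisites. The sketch below uses the standard-library graphlib module rather than any graph library the project may rely on; the function names and the example concepts are illustrative assumptions.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def learning_path(prerequisites: Dict[str, Set[str]]) -> List[str]:
    """Order concepts so that every prerequisite comes before its dependents.

    `prerequisites` maps each concept to the set of concepts it depends on.
    graphlib raises CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(prerequisites).static_order())


def unlocked_concepts(
    prerequisites: Dict[str, Set[str]], mastered: Set[str]
) -> List[str]:
    """Concepts not yet mastered whose prerequisites are all mastered."""
    return sorted(
        concept
        for concept, deps in prerequisites.items()
        if concept not in mastered and deps <= mastered
    )


example = {
    "limits": set(),
    "derivatives": {"limits"},
    "integrals": {"derivatives"},
}
path = learning_path(example)               # limits, derivatives, integrals
ready = unlocked_concepts(example, {"limits"})  # only derivatives is unlocked
```

The unlocked set is a natural candidate pool when the system needs to introduce a new concept to a learner.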
rag_math_project/
├── src/
│ ├── data_ingestion/ # Data processing and ingestion
│ ├── knowledge_graph/ # Knowledge graph management
│ ├── learning/ # Learning session management
│ ├── question_gen/ # Question generation
│ ├── retrieval/ # Vector search and retrieval
│ └── utils/ # Utility functions
├── tests/ # Test suite
├── data/ # Data storage
└── config/ # Configuration files
- Clone the repository:
git clone https://github.com/yourusername/rag-math-project.git
cd rag-math-project
- Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -e ".[test]"
- Create a .env file in the project root:
cp .env.example .env
- Update the environment variables in .env with your configuration:
WEAVIATE_URL=your_weaviate_url
WEAVIATE_API_KEY=your_api_key
OPENAI_API_KEY=your_openai_api_key
- Start the FastAPI server:
uvicorn src.main:app --reload
- Access the API documentation at
http://localhost:8000/docs
The project uses pytest for testing. To run the tests:
# Run all tests
pytest
# Run tests with coverage report
pytest --cov=src
# Run specific test file
pytest tests/test_question_selector_pytest.py
# Run tests in parallel
pytest -n auto
The project follows PEP 8 guidelines. To check code style:
flake8 src tests
Type hints are used throughout the project. To check types:
mypy src
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Weaviate for vector search capabilities
- Sentence Transformers for text embeddings
- FastAPI for the web framework
- PyTorch Geometric for graph operations
For questions and support, please open an issue in the GitHub repository.