This repository contains code and resources related to evaluating and using Large Language Models (LLMs) in the healthcare domain. It includes examples ranging from basic PyTorch implementations to advanced transformer models, robustness experiments with medical images, and guidelines for constructing your own LLMs.
The repository is organized as follows:
- `/src/`: Core source code and utilities
  - `/src/runners/`: Model inference runners for different models (Llama, LLaVA)
  - `/src/data_processing/`: Data preparation and modification utilities
  - `/src/evaluation/`: Evaluation metrics and tools
  - `/src/utils/`: Common utilities and helpers
- `/experiments/`: Experiment notebooks organized by task
  - `/experiments/radiologist/`: Chest X-ray interpretation experiments (includes robustness tests)
  - `/experiments/surgical_tools/`: Surgical tool identification experiments
- `/tutorials/`: Educational notebooks demonstrating LLM concepts
  - `01-PyTorch-Basics.ipynb`: Basic concepts using PyTorch
  - `02-Transformer-Basics.ipynb`: Introduction to transformer models
  - `03-Building-LLM.ipynb`: Guide to building your own LLM
  - `04-Instruction-Tuning.ipynb`: Instruction fine-tuning of models
  - `05-Llama3-Pretrained.ipynb`: Working with pretrained Llama 3 models
  - `06-LLM-Robustness.ipynb`: Testing LLM robustness
  - `07_GRPO_Qwen_0_5_Instruct.ipynb`: GRPO fine-tuning with Qwen
  - `08_Tiny_VLM_Training.ipynb`: Training tiny vision-language models
- `/results/`: Results and performance analysis
  - `/results/analysis/`: Model performance analysis
  - `/results/monitoring/`: Runtime monitoring statistics
  - `/results/radiologist/`: Radiologist task results
  - `/results/surgical_tools/`: Surgical tool detection results
- `/data/`: Datasets, image data, and evaluation metadata
- `/docs/`: Documentation and research notes
- `/tests/`: Test files and sample data
- `/archived_files/`: Legacy code and deprecated experiments
The experiments cover a range of large language and vision-language models:
- LLMs (text-only models)
  - Llama 3 (various sizes)
  - GPT models (via API)
- Vision-Language Models (VLMs)
  - LLaVA-Med (medical domain specialized)
  - Gemini (Google's multimodal model)
  - Gemma Vision (Google's open VLM)
  - CheXagent (chest X-ray specialized model)
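The text-only models are typically driven through the runner scripts in `/src/runners/`. As a rough illustration, here is a minimal sketch of querying a Llama 3 model, assuming the Hugging Face `transformers` library; the model ID, prompt, and generation settings are illustrative rather than the repository's exact configuration.

```python
# Minimal sketch: querying a text-only LLM with the Hugging Face transformers library.
# The model ID and prompt are illustrative, not the exact setup used by /src/runners/.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # hypothetical choice of size/variant
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "List three common findings on a chest X-ray."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```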
The radiologist experiments evaluate how different models perform at interpreting chest X-rays:
- Base performance tests (standard images)
- Robustness tests using perturbed images (noise, artifacts)
- Adversarial sample testing
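As a rough illustration of the perturbation step used in the robustness tests, the sketch below adds Gaussian noise to an image. A synthetic array stands in for a real chest X-ray, and the repository's own perturbation utilities (under `/src/data_processing/`) may differ in detail.

```python
# Minimal sketch: Gaussian-noise perturbation for robustness testing.
# The synthetic image is a stand-in for a real chest X-ray.
import numpy as np
from PIL import Image

def add_gaussian_noise(image: Image.Image, sigma: float = 25.0) -> Image.Image:
    """Return a grayscale copy of the image with additive Gaussian noise."""
    arr = np.asarray(image.convert("L"), dtype=np.float32)
    noisy = arr + np.random.normal(0.0, sigma, arr.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))

# Synthetic stand-in for a chest X-ray; replace with a real image for actual tests.
xray = Image.fromarray((np.random.rand(224, 224) * 255).astype(np.uint8))
perturbed = add_gaussian_noise(xray, sigma=25.0)
perturbed.save("example_cxr_noisy.png")
```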
The surgical tool experiments test model performance at identifying surgical instruments:
- Visual recognition of tools in surgical scenes
- Evaluation across different surgical procedures
- Performance comparison across model types
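For the cross-model comparison, results can be summarized as per-procedure accuracy. The sketch below is illustrative only: the record fields (`model`, `procedure`, `pred`, `label`) are hypothetical, and the actual result files live under `/results/surgical_tools/`.

```python
# Minimal sketch: per-model, per-procedure accuracy from prediction records.
from collections import defaultdict

def accuracy_by_procedure(records):
    """records: iterable of dicts with 'model', 'procedure', 'pred', 'label' keys."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        key = (r["model"], r["procedure"])
        total[key] += 1
        correct[key] += int(r["pred"] == r["label"])
    return {key: correct[key] / total[key] for key in total}

example = [
    {"model": "LLaVA-Med", "procedure": "cholecystectomy", "pred": "grasper", "label": "grasper"},
    {"model": "Gemini", "procedure": "cholecystectomy", "pred": "scissors", "label": "grasper"},
]
print(accuracy_by_procedure(example))
```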
- Multimodal Medical Evaluation: Testing of vision-language models on medical images
- Robustness Analysis: Assessment of model performance under various perturbations
- Performance Monitoring: Tools to track and analyze model performance metrics
- Educational Content: Tutorials explaining LLM fundamentals and implementation
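As an example of the kind of runtime statistics the monitoring tools collect, here is a minimal, hypothetical latency tracker; the class and method names are illustrative, not the repository's API.

```python
# Minimal sketch: timing calls and collecting simple latency statistics.
import time
import statistics

class LatencyMonitor:
    def __init__(self):
        self.samples = []

    def timed(self, fn, *args, **kwargs):
        """Run fn, record its wall-clock duration, and return its result."""
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        self.samples.append(time.perf_counter() - start)
        return result

    def summary(self):
        ordered = sorted(self.samples)
        return {
            "calls": len(ordered),
            "mean_s": statistics.mean(ordered),
            "p95_s": ordered[int(0.95 * (len(ordered) - 1))],
        }

monitor = LatencyMonitor()
monitor.timed(sum, range(1_000_000))  # stand-in for a model inference call
print(monitor.summary())
```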
- Prerequisites:
  - Python 3.x
  - Jupyter Notebook or JupyterLab
  - PyTorch, Transformers, and other libraries (see notebook imports; a quick environment check is sketched after this list)
- Clone the repository:

      git clone https://github.com/yourusername/llm-healthcare.git
      cd llm-healthcare

- Explore the content:
  - Start with the tutorials to understand the concepts
  - Review the experiment notebooks for practical evaluations
  - Use the monitoring tools to track performance metrics
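As referenced under the prerequisites, a quick way to confirm the core libraries are available before opening the notebooks; this is a hedged sketch, and the authoritative dependency list is whatever each notebook imports.

```python
# Quick environment check for the core dependencies named in the prerequisites.
# The package list here is a minimal guess, not a complete requirements file.
import importlib

for package in ("torch", "transformers", "numpy", "PIL"):
    try:
        module = importlib.import_module(package)
        print(f"{package}: {getattr(module, '__version__', 'installed')}")
    except ImportError:
        print(f"{package}: MISSING - install it before running the notebooks")
```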
This project is licensed under the terms of the license included in the LICENSE file.
- The MIMIC-CXR dataset (Johnson et al.)
- Harvard-FairVLMed benchmark
- Contributors to the open-source LLM ecosystem