FlexLLama is a lightweight, extensible, and user-friendly self-hosted tool that easily runs multiple llama.cpp server instances with OpenAI v1 API compatibility. It's designed to manage multiple models across different GPUs, making it a powerful solution for local AI development and deployment.
- Multiple llama.cpp instances - Run different models simultaneously
- Multi-GPU support - Distribute models across different GPUs
- OpenAI v1 API compatible - Drop-in replacement for OpenAI endpoints
- Real-time dashboard - Monitor model status with a web interface
- Chat & completions - Full chat and text completion support
- Embeddings & reranking - Supports models for embeddings and reranking
- Auto-start - Automatically start default runners on launch
- Model switching - Dynamically load/unload models as needed
- Install FlexLLama:

  From GitHub:

  ```bash
  pip install git+https://github.com/yazon/flexllama.git
  ```

  From local source (after cloning):

  ```bash
  # git clone https://github.com/yazon/flexllama.git
  # cd flexllama
  pip install .
  ```

- Create your configuration: Copy the example configuration file to create your own. If you installed from a local clone, you can run:

  ```bash
  cp backend/config_example.json config.json
  ```

  If you installed from git, you may need to download it from the repository (see the sketch after these steps).

- Edit `config.json`: Update `config.json` with the correct paths for your `llama-server` binary and your model files (`.gguf`).

- Run FlexLLama:

  ```bash
  python main.py config.json
  ```

  or

  ```bash
  flexllama config.json
  ```

- Open the dashboard: http://localhost:8080
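If you installed FlexLLama from git and have no local checkout, one way to fetch the example configuration is straight from the repository. This is a sketch: the raw URL assumes the file still lives at `backend/config_example.json` on the default branch.

```bash
# Assumed raw URL; adjust the branch or path if the repository layout differs.
curl -fsSL https://raw.githubusercontent.com/yazon/flexllama/main/backend/config_example.json \
  -o config.json
```

Once FlexLLama is running, any OpenAI-compatible client can talk to it. A minimal smoke test with curl, assuming you configured a model with the alias `my-model` (as in the configuration example later in this README):

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```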
FlexLLama can be run using Docker and Docker Compose. We provide profiles for both CPU-only and GPU-accelerated (NVIDIA CUDA) environments.
- Clone the repository:

  ```bash
  git clone https://github.com/yazon/flexllama.git
  cd flexllama
  ```
After cloning, you can proceed with the quick start script or a manual setup.
For an easier start, the `docker-start.sh` helper script automates several setup steps. It checks your Docker environment, builds the correct image (CPU or GPU), and provides the commands to launch FlexLLama.
- Make the script executable:

  ```bash
  chmod +x docker-start.sh
  ```

- Run the script: Use the `--gpu` flag for NVIDIA GPU support.

  For CPU-only setup:

  ```bash
  ./docker-start.sh
  ```

  For GPU-accelerated setup:

  ```bash
  ./docker-start.sh --gpu
  ```

- Follow the on-screen instructions: The script will guide you.
Manual Docker and Docker Compose Setup
If you prefer to run the steps manually, follow this guide:
- Place your models:

  ```bash
  # Create the models directory if it doesn't exist
  mkdir -p models

  # Copy your .gguf model files into it
  cp /path/to/your/model.gguf models/
  ```

- Configure your models (a scripted sketch of the GPU tweak follows this list):

  ```bash
  # Edit the Docker configuration to point to your models
  #  • CPU-only: keep "n_gpu_layers": 0
  #  • GPU: set "n_gpu_layers" to e.g. 99 and specify "main_gpu": 0
  ```

- Build and Start FlexLLama with Docker Compose (Recommended): Use the `--profile` flag to select your environment. The service will be available at http://localhost:8080.

  For CPU-only:

  ```bash
  docker compose --profile cpu up --build -d
  ```

  For GPU support (NVIDIA CUDA):

  ```bash
  docker compose --profile gpu up --build -d
  ```

- View Logs: To monitor the output of your services, you can view their logs in real time.

  For the CPU service:

  ```bash
  docker compose --profile cpu logs -f
  ```

  For the GPU service:

  ```bash
  docker compose --profile gpu logs -f
  ```

  (Press `Ctrl+C` to stop viewing the logs.)

- (Alternative) Using `docker run`: You can also build and run the containers manually.

  For CPU-only:

  ```bash
  # Build the image
  docker build -t flexllama:latest .

  # Run the container
  docker run -d -p 8080:8080 \
    -v $(pwd)/models:/app/models:ro \
    -v $(pwd)/docker/config.json:/app/config.json:ro \
    flexllama:latest
  ```

  For GPU support (NVIDIA CUDA):

  ```bash
  # Build the image
  docker build -f Dockerfile.cuda -t flexllama-gpu:latest .

  # Run the container
  docker run -d --gpus all -p 8080:8080 \
    -v $(pwd)/models:/app/models:ro \
    -v $(pwd)/docker/config.json:/app/config.json:ro \
    flexllama-gpu:latest
  ```

- Open the dashboard: Access the FlexLLama dashboard in your browser at http://localhost:8080.
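If you want to script the GPU tweak from the "Configure your models" step, here is a minimal sketch. It assumes `docker/config.json` follows the same schema as the examples in the configuration section below and that `jq` is installed.

```bash
# Hypothetical edit: offload the first configured model fully to GPU 0.
# Adjust the index and values to match your own docker/config.json.
jq '.models[0].n_gpu_layers = 99 | .models[0].main_gpu = 0' docker/config.json \
  > docker/config.json.tmp && mv docker/config.json.tmp docker/config.json
```

Once the container is up, a quick sanity check from the host; `/health` and `/v1/models` are the paths referenced elsewhere in this README, though the exact response bodies may differ:

```bash
# Is the service running?
docker compose --profile cpu ps    # or --profile gpu

# Does the API answer on port 8080?
curl http://localhost:8080/health
curl http://localhost:8080/v1/models
```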
Edit `config.json` to configure your runners and models. A basic single-runner example:
```json
{
  "auto_start_runners": true,
  "api": {
    "host": "0.0.0.0",
    "port": 8080,
    "health_endpoint": "/health"
  },
  "runner1": {
    "type": "llama-server",
    "path": "/path/to/llama-server",
    "host": "127.0.0.1",
    "port": 8085
  },
  "models": [
    {
      "runner": "runner1",
      "model": "/path/to/model.gguf",
      "model_alias": "my-model",
      "n_ctx": 4096,
      "n_gpu_layers": 99,
      "main_gpu": 0
    }
  ]
}
```
A multi-GPU example, with a chat model on GPU 0 and an embedding model on GPU 1:

```json
{
  "runner_gpu0": {
    "path": "/path/to/llama-server",
    "port": 8085
  },
  "runner_gpu1": {
    "path": "/path/to/llama-server",
    "port": 8086
  },
  "models": [
    {
      "runner": "runner_gpu0",
      "model": "/path/to/chat-model.gguf",
      "model_alias": "chat-model",
      "main_gpu": 0,
      "n_gpu_layers": 99
    },
    {
      "runner": "runner_gpu1",
      "model": "/path/to/embedding-model.gguf",
      "model_alias": "embedding-model",
      "embedding": true,
      "main_gpu": 1,
      "n_gpu_layers": 99
    }
  ]
}
```
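With the multi-GPU configuration above in place, the embedding model can be exercised through the OpenAI-style embeddings route. This is a sketch that assumes FlexLLama exposes `/v1/embeddings` in the same way it exposes the chat endpoints:

```bash
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "embedding-model",
    "input": "FlexLLama routes this request to runner_gpu1."
  }'
```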
Runner Options:
- `path`: Path to the llama-server binary
- `host` / `port`: Where to run this instance
- `extra_args`: Additional arguments for llama-server (applied to all models using this runner)

Model Options:

Core Settings:
- `runner`: Which runner to use for this model
- `model`: Path to the .gguf model file
- `model_alias`: Name to use in API calls

Model Types:
- `embedding`: Set to `true` for embedding models
- `reranking`: Set to `true` for reranking models
- `mmproj`: Path to a multimodal projection file (for vision models)

Performance & Memory:
- `n_ctx`: Context window size (e.g., 4096, 8192, 32768)
- `n_batch`: Batch size for processing (e.g., 256, 512)
- `n_threads`: Number of CPU threads to use
- `main_gpu`: Which GPU to use (0, 1, 2, ...)
- `n_gpu_layers`: How many layers to offload to the GPU (99 for all layers)
- `tensor_split`: Array defining how to split the model across GPUs (e.g., [1.0, 0.0])
- `offload_kqv`: Whether to offload the key-value cache to the GPU (`true`/`false`)
- `use_mlock`: Lock the model in RAM to prevent swapping (`true`/`false`)

Optimization:
- `flash_attn`: Enable flash attention for faster processing (`true`/`false`)
- `split_mode`: How to split model layers ("row" or other modes)
- `cache-type-k`: Key cache quantization type (e.g., "q8_0")
- `cache-type-v`: Value cache quantization type (e.g., "q8_0")

Chat & Templates:
- `chat_template`: Chat template format (e.g., "mistral-instruct", "gemma")
- `jinja`: Enable Jinja templating (`true`/`false`)

Advanced Options:
- `rope-scaling`: RoPE scaling method (e.g., "linear")
- `rope-scale`: RoPE scaling factor (e.g., 2)
- `yarn-orig-ctx`: Original context size for YaRN scaling
- `pooling`: Pooling method for embeddings (e.g., "cls")
- `args`: Additional custom arguments passed directly to llama-server for this specific model (string, e.g., "--custom-flag --param value"). These are applied after all other model parameters and before the runner's `extra_args`.
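To illustrate how the model-level `args` interact with the runner-level `extra_args`, here is a minimal sketch of a single-runner `config.json`. The flag values (`--verbose`, `--custom-flag --param value`) are illustrative assumptions rather than recommendations, and the paths are placeholders.

```bash
# Writes a fresh, illustrative config.json; back up an existing one first.
# For "my-model", llama-server receives the model parameters first, then the
# model-level "args", then the runner-level "extra_args".
cat > config.json <<'EOF'
{
  "auto_start_runners": true,
  "api": { "host": "0.0.0.0", "port": 8080, "health_endpoint": "/health" },
  "runner1": {
    "path": "/path/to/llama-server",
    "host": "127.0.0.1",
    "port": 8085,
    "extra_args": "--verbose"
  },
  "models": [
    {
      "runner": "runner1",
      "model": "/path/to/model.gguf",
      "model_alias": "my-model",
      "n_ctx": 8192,
      "flash_attn": true,
      "args": "--custom-flag --param value"
    }
  ]
}
EOF
```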
You can validate your configuration file and run a suite of tests to ensure the application is working correctly.
To validate your `config.json` file, run `config.py` and provide the path to your configuration file. This will check for correct formatting and required fields.

```bash
python backend/config.py config.json
```
A successful validation will print a confirmation message. If there are errors, they will be displayed with details on how to fix them.
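The same validator can be pointed at the Docker configuration if you keep it at `docker/config.json`, as the Docker instructions above assume:

```bash
python backend/config.py docker/config.json
```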
The `tests/` directory contains scripts for different testing purposes. All test scripts generate detailed logs in the `tests/logs/{session_id}/` directory.
Prerequisites:
- For `test_basic.py` and `test_all_models.py`, the main application must be running (`flexllama config.json`).
- For `test_model_switching.py`, the main application should not be running.
`test_basic.py` performs basic checks on the API endpoints to ensure they are responsive.
```bash
# Run basic tests against the default URL (http://localhost:8080)
python tests/test_basic.py
```
What it tests:
- `/v1/models` and `/health` endpoints
- `/v1/chat/completions` with both regular and streaming responses
- Concurrent request handling
`test_all_models.py` runs a comprehensive test suite against every model defined in your `config.json`.
```bash
# Test all configured models
python tests/test_all_models.py config.json
```
What it tests:
- Model loading and health checks
- Chat completions (regular and streaming) for each model
- Response time and error handling
`test_model_switching.py` verifies the dynamic loading and unloading of models.
```bash
# Run model switching tests
python tests/test_model_switching.py config.json
```
What it tests:
- Dynamic model loading and switching
- Runner state management and health monitoring
- Proper cleanup of resources
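A typical end-to-end test session, using only the commands shown above, might look like the sketch below; the backgrounding with `&` and the `kill` are illustrative, and you can just as well run the server in a separate terminal.

```bash
# 1. Start FlexLLama and keep it running for the first two suites.
flexllama config.json &
SERVER_PID=$!
sleep 5   # give the runners a moment to come up

# 2. Basic endpoint checks, then the full per-model suite.
python tests/test_basic.py
python tests/test_all_models.py config.json

# 3. The switching tests require that the main application is NOT running.
kill "$SERVER_PID"
python tests/test_model_switching.py config.json

# Detailed logs for each run are written under tests/logs/{session_id}/.
```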
This project is licensed under the BSD-3-Clause License. See the `LICENSE` file for details.
Ready to run multiple LLMs like a pro? Edit your `config.json` and start FlexLLama!