# LocalAI Backend Architecture

This directory contains the core backend infrastructure for LocalAI, including the gRPC protocol definition, multi-language Dockerfiles, and language-specific backend implementations.

## Overview

LocalAI uses a unified gRPC-based architecture that allows different programming languages to implement AI backends while maintaining consistent interfaces and capabilities. The backend system supports multiple hardware acceleration targets and provides a standardized way to integrate various AI models and frameworks.

## Architecture Components

### 1. Protocol Definition (`backend.proto`)

The `backend.proto` file defines the gRPC service interface that all backends must implement. This ensures consistency across different language implementations and provides a contract for communication between LocalAI core and backend services.

#### Core Services

- **Text Generation**: `Predict`, `PredictStream` for LLM inference
- **Embeddings**: `Embedding` for text vectorization
- **Image Generation**: `GenerateImage` for stable diffusion and image models
- **Audio Processing**: `AudioTranscription`, `TTS`, `SoundGeneration`
- **Video Generation**: `GenerateVideo` for video synthesis
- **Object Detection**: `Detect` for computer vision tasks
- **Vector Storage**: `StoresSet`, `StoresGet`, `StoresFind` for RAG operations
- **Reranking**: `Rerank` for document relevance scoring
- **Voice Activity Detection**: `VAD` for audio segmentation

#### Key Message Types

- **`PredictOptions`**: Comprehensive configuration for text generation
- **`ModelOptions`**: Model loading and configuration parameters
- **`Result`**: Standardized response format
- **`StatusResponse`**: Backend health and memory usage information
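The services and message types above fit together as one protobuf service definition. The fragment below is an abbreviated, illustrative sketch of its shape — the authoritative RPC list and full field sets live in `backend.proto` itself:

```proto
syntax = "proto3";
package backend;

service Backend {
  rpc Health(HealthMessage) returns (Reply) {}
  rpc LoadModel(ModelOptions) returns (Result) {}
  rpc Predict(PredictOptions) returns (Reply) {}
  rpc PredictStream(PredictOptions) returns (stream Reply) {}
  rpc Status(HealthMessage) returns (StatusResponse) {}
  // ...plus Embedding, GenerateImage, TTS, AudioTranscription,
  // Rerank, VAD, StoresSet/Get/Find, Detect, GenerateVideo, etc.
}

message Result {
  string message = 1;
  bool success = 2;
}
```

Because every backend speaks this one contract, the core can treat a Python transformers backend and a C++ llama.cpp backend identically.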

### 2. Multi-Language Dockerfiles

The backend system provides language-specific Dockerfiles that handle the build environment and dependencies for different programming languages:

- `Dockerfile.python`
- `Dockerfile.golang`
- `Dockerfile.llama-cpp`

### 3. Language-Specific Implementations

#### Python Backends (`python/`)
- **transformers**: Hugging Face Transformers framework
- **vllm**: High-performance LLM inference
- **mlx**: Apple Silicon optimization
- **diffusers**: Stable Diffusion models
- **Audio**: bark, coqui, faster-whisper, kitten-tts
- **Vision**: mlx-vlm, rfdetr
- **Specialized**: rerankers, chatterbox, kokoro

#### Go Backends (`go/`)
- **whisper**: OpenAI Whisper speech recognition in Go, via bindings to the GGML-based whisper.cpp
- **stablediffusion-ggml**: Stable Diffusion in Go, via bindings to a GGML-based C++ implementation
- **huggingface**: Hugging Face model integration
- **piper**: Text-to-speech synthesis in Go, via C bindings to rhasspy/piper
- **bark-cpp**: Bark TTS models in Go, via C++ bindings
- **local-store**: Vector storage backend

#### C++ Backends (`cpp/`)
- **llama-cpp**: Llama.cpp integration
- **grpc**: gRPC utilities and helpers

## Hardware Acceleration Support

### CUDA (NVIDIA)
- **Versions**: CUDA 11.x, 12.x
- **Features**: cuBLAS, cuDNN, TensorRT optimization
- **Targets**: x86_64, ARM64 (Jetson)

### ROCm (AMD)
- **Features**: HIP, rocBLAS, MIOpen
- **Targets**: AMD GPUs with ROCm support

### Intel
- **Features**: oneAPI, Intel Extension for PyTorch
- **Targets**: Intel GPUs, XPUs, CPUs

### Vulkan
- **Features**: Cross-platform GPU acceleration
- **Targets**: Windows, Linux, Android, macOS

### Apple Silicon
- **Features**: MLX framework, Metal Performance Shaders
- **Targets**: M1/M2/M3 Macs

## Backend Registry (`index.yaml`)

The `index.yaml` file serves as a central registry for all available backends, providing:

- **Metadata**: Name, description, license, icons
- **Capabilities**: Hardware targets and optimization profiles
- **Tags**: Categorization for discovery
- **URLs**: Source code and documentation links
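A registry entry ties this metadata together per backend. The snippet below is purely illustrative — the field names mirror the categories above, but the values are hypothetical; consult `index.yaml` in this directory for the exact schema:

```yaml
- name: "whisper"
  description: "Speech recognition via whisper.cpp bindings"
  license: "MIT"
  tags:
    - speech-to-text
    - audio
  urls:
    - https://github.com/ggml-org/whisper.cpp
```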

## Building Backends

### Prerequisites
- Docker with multi-architecture support
- Appropriate hardware drivers (CUDA, ROCm, etc.)
- Build tools (make, cmake, compilers)

### Build Commands

Example build commands using Docker:

```bash
# Build Python backend
docker build -f backend/Dockerfile.python \
  --build-arg BACKEND=transformers \
  --build-arg BUILD_TYPE=cublas12 \
  --build-arg CUDA_MAJOR_VERSION=12 \
  --build-arg CUDA_MINOR_VERSION=0 \
  -t localai-backend-transformers .

# Build Go backend
docker build -f backend/Dockerfile.golang \
  --build-arg BACKEND=whisper \
  --build-arg BUILD_TYPE=cpu \
  -t localai-backend-whisper .

# Build C++ backend
docker build -f backend/Dockerfile.llama-cpp \
  --build-arg BACKEND=llama-cpp \
  --build-arg BUILD_TYPE=cublas12 \
  -t localai-backend-llama-cpp .
```

For ARM64/macOS builds, Docker cannot be used; build with the Makefile in the respective backend directory instead.

### Build Types

- **`cpu`**: CPU-only optimization
- **`cublas11`**: CUDA 11.x with cuBLAS
- **`cublas12`**: CUDA 12.x with cuBLAS
- **`hipblas`**: ROCm with rocBLAS
- **`intel`**: Intel oneAPI optimization
- **`vulkan`**: Vulkan-based acceleration
- **`metal`**: Apple Metal optimization

## Backend Development

### Creating a New Backend

1. **Choose Language**: Select Python, Go, or C++ based on requirements
2. **Implement Interface**: Implement the gRPC service defined in `backend.proto`
3. **Add Dependencies**: Create appropriate requirements files
4. **Configure Build**: Set up Dockerfile and build scripts
5. **Register Backend**: Add an entry to `index.yaml`
6. **Test Integration**: Verify gRPC communication and functionality

### Backend Structure

```
backend-name/
├── backend.py/go/cpp    # Main implementation
├── requirements.txt     # Dependencies
├── Dockerfile           # Build configuration
├── install.sh           # Installation script
├── run.sh               # Execution script
├── test.sh              # Test script
└── README.md            # Backend documentation
```

### Required gRPC Methods

At minimum, backends must implement:
- `Health()` - Service health check
- `LoadModel()` - Model loading and initialization
- `Predict()` - Main inference endpoint
- `Status()` - Backend status and metrics
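In Python terms, the minimal contract can be modeled as an abstract base class. This is an illustrative sketch only — a real backend implements the servicer generated from `backend.proto`, and the names `BackendService` and `EchoBackend` here are hypothetical:

```python
from abc import ABC, abstractmethod

class BackendService(ABC):
    """Illustrative model of the four required backend methods."""

    @abstractmethod
    def Health(self) -> str:
        """Report service liveness."""

    @abstractmethod
    def LoadModel(self, model: str, **options) -> bool:
        """Load and initialize a model; return success."""

    @abstractmethod
    def Predict(self, prompt: str, **options) -> str:
        """Run inference and return the generated output."""

    @abstractmethod
    def Status(self) -> dict:
        """Report backend state and resource usage."""

class EchoBackend(BackendService):
    """Hypothetical stand-in backend that echoes its input."""

    def __init__(self):
        self.model = None

    def Health(self) -> str:
        return "OK"

    def LoadModel(self, model: str, **options) -> bool:
        self.model = model
        return True

    def Predict(self, prompt: str, **options) -> str:
        return f"[{self.model}] {prompt}"

    def Status(self) -> dict:
        return {"state": "READY" if self.model else "UNINITIALIZED"}
```

Keeping the contract this small is what lets the core drive very different engines through one code path.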

## Integration with LocalAI Core

Backends communicate with LocalAI core through gRPC:

1. **Service Discovery**: Core discovers available backends
2. **Model Loading**: Core requests model loading via `LoadModel`
3. **Inference**: Core sends requests via `Predict` or specialized endpoints
4. **Streaming**: Core handles streaming responses for real-time generation
5. **Monitoring**: Core tracks backend health and performance
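Streaming (step 4) maps naturally onto a generator on the backend side: each yielded chunk corresponds to one `Reply` in the `stream Reply` that `PredictStream` returns over gRPC. A stdlib-only sketch, with a hypothetical `predict_stream` in place of the generated servicer method:

```python
from typing import Iterator

def predict_stream(prompt: str, chunk_size: int = 8) -> Iterator[str]:
    """Hypothetical streaming endpoint: yield the reply in chunks,
    analogous to PredictStream's `stream Reply` over gRPC."""
    reply = f"echo: {prompt}"
    for i in range(0, len(reply), chunk_size):
        yield reply[i:i + chunk_size]

# Core-side consumption: forward chunks to the client as they arrive.
chunks = list(predict_stream("hello backend"))
assert "".join(chunks) == "echo: hello backend"
```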

## Performance Optimization

### Memory Management
- **Model Caching**: Efficient model loading and caching
- **Batch Processing**: Optimize for multiple concurrent requests
- **Memory Pinning**: GPU memory optimization for CUDA/ROCm

### Hardware Utilization
- **Multi-GPU**: Support for tensor parallelism
- **Mixed Precision**: FP16/BF16 for memory efficiency
- **Kernel Fusion**: Optimized CUDA/ROCm kernels

## Troubleshooting

### Common Issues

1. **gRPC Connection**: Verify the backend service is running and accessible
2. **Model Loading**: Check model paths and dependencies
3. **Hardware Detection**: Ensure appropriate drivers and libraries are installed
4. **Memory Issues**: Monitor GPU memory usage and model sizes
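For the first issue, a plain TCP probe is a quick sanity check before reaching for gRPC tooling. This helper is illustrative (host and port are whatever your backend is configured with), and note its limitation: a successful connect proves something is listening, not that the `Health()` RPC would succeed.

```python
import socket

def backend_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port.
    Reachability only - follow up with the Health() RPC for a real check."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```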

## Contributing

When contributing to the backend system:

1. **Follow Protocol**: Implement the exact gRPC interface
2. **Add Tests**: Include comprehensive test coverage
3. **Document**: Provide clear usage examples
4. **Optimize**: Consider performance and resource usage
5. **Validate**: Test across different hardware targets