In development
Enterprise-ready distributed AI model serving platform that transforms the single-node Ollama architecture into a horizontally scalable, fault-tolerant, high-performance distributed system.
OllamaMax is a comprehensive distributed AI model platform built for enterprise-scale deployments. It provides:
- Distributed Architecture: Horizontal scaling across multiple nodes with P2P networking
- Consensus Engine: Raft-based consensus for cluster coordination and consistency
- Intelligent Scheduling: Advanced load balancing and resource optimization
- Proxy Management CLI: Comprehensive command-line tools for proxy monitoring and control
- Enterprise Security: JWT authentication, TLS encryption, audit logging
- Comprehensive Monitoring: Prometheus metrics, health checks, observability
- Production Ready: Kubernetes deployment, Helm charts, operational runbooks
```mermaid
graph TB
    subgraph "Load Balancer Layer"
        LB[Load Balancer]
    end

    subgraph "API Gateway Layer"
        API1[API Server 1]
        API2[API Server 2]
        API3[API Server 3]
    end

    subgraph "Consensus Layer"
        RAFT[Raft Consensus Engine]
        LEADER[Leader Election]
    end

    subgraph "P2P Network Layer"
        P2P1[P2P Node 1]
        P2P2[P2P Node 2]
        P2P3[P2P Node 3]
        DHT[Distributed Hash Table]
    end

    subgraph "Scheduler Layer"
        SCHED[Intelligent Scheduler]
        LB_ALG[Load Balancing]
        HEALTH[Health Checker]
    end

    subgraph "Model Management"
        MODELS[Model Registry]
        REPL[Replication Manager]
        SYNC[Sync Manager]
        DIST[Distribution Engine]
    end

    subgraph "Storage Layer"
        STORAGE1[Distributed Storage 1]
        STORAGE2[Distributed Storage 2]
        STORAGE3[Distributed Storage 3]
        CAS[Content-Addressed Storage]
    end

    subgraph "Monitoring & Observability"
        PROM[Prometheus Metrics]
        GRAFANA[Grafana Dashboards]
        ALERTS[Alert Manager]
        LOGS[Centralized Logging]
    end

    LB --> API1
    LB --> API2
    LB --> API3

    API1 --> RAFT
    API2 --> RAFT
    API3 --> RAFT
    RAFT --> LEADER
    LEADER --> SCHED

    API1 --> P2P1
    API2 --> P2P2
    API3 --> P2P3
    P2P1 --> DHT
    P2P2 --> DHT
    P2P3 --> DHT

    SCHED --> LB_ALG
    SCHED --> HEALTH
    SCHED --> MODELS

    MODELS --> REPL
    MODELS --> SYNC
    MODELS --> DIST

    REPL --> STORAGE1
    REPL --> STORAGE2
    REPL --> STORAGE3
    SYNC --> CAS

    API1 --> PROM
    API2 --> PROM
    API3 --> PROM
    PROM --> GRAFANA
    PROM --> ALERTS
```
| Component | Purpose | Technology |
|---|---|---|
| API Server | REST API gateway and web interface | Gin, WebSockets, JWT |
| P2P Network | Peer-to-peer networking and discovery | libp2p, DHT, PubSub |
| Consensus Engine | Cluster coordination and leader election | HashiCorp Raft |
| Scheduler | Intelligent request routing and load balancing | Custom algorithms |
| Proxy Manager | Load balancing and instance management | HTTP reverse proxy, CLI tools |
| Model Manager | Distributed model registry and replication | Content-addressed storage |
| Security Layer | Authentication, authorization, encryption | JWT, TLS, audit logging |
| Monitoring | Metrics, health checks, observability | Prometheus, Grafana |
- Go 1.24.5+
- Docker & Docker Compose (for containerized deployment)
- Kubernetes 1.25+ (for production deployment)
- Helm 3.0+ (for Kubernetes package management)
```bash
# Clone repository
git clone https://github.com/khryptorgraphics/ollamamax.git
cd ollamamax/ollama-distributed

# Install dependencies
go mod download

# Build the binary
make build

# Run with default configuration
./bin/ollama-distributed start

# Monitor proxy status
./bin/ollama-distributed proxy status

# View cluster instances
./bin/ollama-distributed proxy instances

# Monitor metrics in real-time
./bin/ollama-distributed proxy metrics --watch
```
```bash
# Build Docker image
docker build -t ollamamax:latest .

# Run single node
docker run -p 11434:11434 -p 8080:8080 -p 9090:9090 ollamamax:latest

# Run cluster with Docker Compose
docker-compose up -d
```
```bash
# Install with Helm
helm repo add ollamamax https://charts.ollamamax.com
helm install ollamamax ollamamax/ollamamax-cluster

# Or deploy with kubectl
kubectl apply -f k8s/
```
```bash
# Start a distributed node
./ollama-distributed start [--config config.yaml] [--peers peer1,peer2]

# Check node status
./ollama-distributed status

# Join existing cluster
./ollama-distributed join --peers node1:8080,node2:8080
```
The proxy CLI provides comprehensive tools for managing the distributed Ollama proxy:
```bash
# Check proxy status
./ollama-distributed proxy status [--json] [--api-url URL]

# List registered instances
./ollama-distributed proxy instances [--json] [--api-url URL]

# Monitor performance metrics
./ollama-distributed proxy metrics [--json] [--api-url URL]

# Real-time metrics monitoring
./ollama-distributed proxy metrics --watch [--interval 5]
```
```bash
# Start node and monitor proxy
./ollama-distributed start &
./ollama-distributed proxy status

# Monitor cluster health
./ollama-distributed proxy instances --json | jq '.instances[] | select(.status=="healthy")'

# Watch metrics in real-time
./ollama-distributed proxy metrics --watch --interval 10

# Check specific API endpoint
./ollama-distributed proxy status --api-url http://node2:8080
```
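The `--json` output can be reduced further with jq for scripted health checks; the `.instances[].status` field shape is assumed from the cluster-health example above:

```bash
# Count healthy instances (field names assumed from the --json output above)
./ollama-distributed proxy instances --json \
  | jq '[.instances[] | select(.status=="healthy")] | length'
```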
The system supports multiple environment configurations:
```bash
# Development
export OLLAMA_ENVIRONMENT=development
export OLLAMA_LOG_LEVEL=debug

# Staging
export OLLAMA_ENVIRONMENT=staging
export OLLAMA_LOG_LEVEL=info

# Production
export OLLAMA_ENVIRONMENT=production
export OLLAMA_LOG_LEVEL=warn
export OLLAMA_TLS_ENABLED=true
export OLLAMA_AUTH_ENABLED=true
```
| Environment | Config File | Purpose |
|---|---|---|
| Development | `config/dev.yaml` | Local development settings |
| Staging | `config/staging.yaml` | Pre-production testing |
| Production | `config/prod.yaml` | Production deployment |
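For example, a production node can be started against its environment file using the `--config` flag shown earlier:

```bash
# Start a node with the production configuration
./bin/ollama-distributed start --config config/prod.yaml
```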
```yaml
# Example production configuration
node:
  environment: production
  region: us-west-2
  zone: us-west-2a

api:
  listen: "0.0.0.0:11434"
  tls:
    enabled: true
    cert_file: "/etc/tls/server.crt"
    key_file: "/etc/tls/server.key"

security:
  auth:
    enabled: true
    method: jwt
    token_expiry: 24h

metrics:
  enabled: true
  listen: "0.0.0.0:9090"

logging:
  level: info
  format: json
  output: stdout
```
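A configuration file can be checked before rollout with the `config-validate` command from the Troubleshooting section; whether it accepts a `--config` flag like `start` does is an assumption here:

```bash
# Validate the production configuration (the --config flag is assumed
# to work as it does for the start command)
./ollama-distributed config-validate --config config/prod.yaml
```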
```text
GET    /api/v1/cluster/status           # Cluster health and status
GET    /api/v1/cluster/leader           # Current cluster leader
POST   /api/v1/cluster/join             # Join existing cluster
POST   /api/v1/cluster/leave            # Leave cluster gracefully

GET    /api/v1/nodes                    # List all cluster nodes
GET    /api/v1/nodes/{id}               # Get specific node details
POST   /api/v1/nodes/{id}/drain         # Drain node for maintenance
POST   /api/v1/nodes/{id}/undrain       # Return node to service

GET    /api/v1/models                   # List available models
GET    /api/v1/models/{name}            # Get model details
POST   /api/v1/models/{name}/download   # Download model to cluster
DELETE /api/v1/models/{name}            # Remove model from cluster

POST   /api/v1/generate                 # Text generation
POST   /api/v1/chat                     # Chat completion
POST   /api/v1/embeddings               # Text embeddings

GET    /api/v1/health                   # System health check
GET    /api/v1/metrics                  # Prometheus metrics
GET    /api/v1/transfers                # Model transfer status
GET    /api/v1/ws                       # WebSocket endpoint
```
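A minimal inference call against `/api/v1/generate` might look as follows; the request body fields (`model`, `prompt`) are assumptions modeled on the upstream Ollama API:

```bash
# Example text generation request; body fields follow the upstream
# Ollama API shape and are an assumption here
curl -X POST http://localhost:11434/api/v1/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "llama2:7b", "prompt": "Explain Raft consensus in one paragraph."}'
```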
All API endpoints require JWT authentication:
```bash
# Obtain JWT token
curl -X POST http://localhost:11434/auth/login \
  -H "Content-Type: application/json" \
  -d '{"username":"admin","password":"secret"}'

# Use token in requests
curl -H "Authorization: Bearer <jwt-token>" \
  http://localhost:11434/api/v1/cluster/status
```
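For scripting, the token can be captured into a shell variable; the `.token` field name in the login response is an assumption:

```bash
# Capture the token (the "token" response field name is an assumption)
TOKEN=$(curl -s -X POST http://localhost:11434/auth/login \
  -H "Content-Type: application/json" \
  -d '{"username":"admin","password":"secret"}' | jq -r '.token')

# Reuse it for subsequent requests
curl -H "Authorization: Bearer $TOKEN" \
  http://localhost:11434/api/v1/models
```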
```bash
# Add Helm repository
helm repo add ollamamax https://charts.ollamamax.com
helm repo update

# Install with production values
helm install ollamamax ollamamax/ollamamax-cluster \
  --namespace ollamamax \
  --create-namespace \
  --values values-production.yaml
```
```bash
# Create namespace
kubectl create namespace ollamamax

# Deploy core components
kubectl apply -f k8s/configmap.yaml
kubectl apply -f k8s/secret.yaml
kubectl apply -f k8s/service.yaml
kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/ingress.yaml

# Deploy monitoring stack
kubectl apply -f k8s/monitoring/
```
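After applying the manifests, the rollout can be verified before sending traffic; the `ollama-cluster` deployment name is taken from the operations examples later in this document:

```bash
# Confirm the deployment is available and pods are scheduled
kubectl -n ollamamax rollout status deployment/ollama-cluster
kubectl -n ollamamax get pods -o wide
```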
- Security: TLS certificates configured and valid
- Authentication: JWT secrets rotated and secure
- Monitoring: Prometheus metrics collection enabled
- Alerting: Alert rules configured for critical metrics
- Backup: Automated backup procedures in place
- Logging: Centralized log aggregation configured
- Network: Firewall rules and network policies applied
- Storage: Persistent volumes configured for data
- Scaling: Horizontal Pod Autoscaler configured
- Health Checks: Liveness and readiness probes configured
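A quick post-deployment smoke test can confirm the security and health items above; the `node1` hostname is a placeholder:

```bash
# TLS handshake and health endpoint (node1 is a placeholder hostname)
curl -v https://node1:11434/api/v1/health

# Unauthenticated requests to protected endpoints should be rejected (expect 401)
curl -s -o /dev/null -w '%{http_code}\n' https://node1:11434/api/v1/cluster/status
```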
Key metrics exposed for monitoring:
```text
# Node Metrics
ollama_nodes_total                 # Total nodes in cluster
ollama_nodes_online                # Online nodes count
ollama_nodes_offline               # Offline nodes count

# Model Metrics
ollama_models_total                # Total models in registry
ollama_models_loaded               # Currently loaded models
ollama_model_replicas              # Model replication factor

# Request Metrics
ollama_requests_total              # Total API requests
ollama_requests_duration_seconds   # Request duration histogram
ollama_requests_errors_total       # Request error count

# System Metrics
ollama_cpu_usage_percent           # CPU utilization
ollama_memory_usage_bytes          # Memory usage
ollama_network_bytes_total         # Network I/O
ollama_disk_usage_bytes            # Disk utilization

# P2P Metrics
ollama_peers_connected             # Connected peers
ollama_peer_connections_total      # Total P2P connections
ollama_peer_messages_total         # P2P message count

# Consensus Metrics
ollama_consensus_leader_changes    # Leadership changes
ollama_consensus_log_entries       # Raft log entries
ollama_consensus_applied_entries   # Applied log entries
```
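These metrics can be inspected directly from the metrics listener configured earlier, or aggregated through PromQL; the `/metrics` path and the `prometheus:9090` address are assumptions:

```bash
# Scrape the raw metrics endpoint (assumes the :9090 listener serves /metrics)
curl -s http://localhost:9090/metrics | grep '^ollama_'

# Query a Prometheus server for the 5-minute API error ratio
# (prometheus:9090 is a placeholder address)
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=rate(ollama_requests_errors_total[5m]) / rate(ollama_requests_total[5m])'
```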
Pre-built dashboards are available in `monitoring/grafana/`:
- Cluster Overview: High-level cluster health and status
- Node Performance: Individual node metrics and resources
- Model Management: Model distribution and replication status
- API Performance: Request rates, latency, and error rates
- P2P Network: Peer connectivity and network health
- Consensus Health: Raft consensus and leadership metrics
Critical alerts are configured in `monitoring/alerts/`:
- Node Down: Node becomes unreachable
- High CPU Usage: CPU utilization > 80%
- Memory Pressure: Memory usage > 85%
- Disk Full: Disk usage > 90%
- API Errors: Error rate > 5%
- Consensus Split: Leadership instability
- Model Replication: Under-replicated models
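Firing alerts can also be listed straight from Alertmanager's v2 API, which is handy for scripted on-call checks; the `alertmanager:9093` address is a placeholder:

```bash
# List the names of currently firing alerts (alertmanager:9093 is a placeholder)
curl -s 'http://alertmanager:9093/api/v2/alerts?active=true' \
  | jq -r '.[].labels.alertname'
```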
| Feature | Implementation | Status |
|---|---|---|
| TLS Encryption | TLS 1.3 with strong cipher suites | ✅ Enabled |
| JWT Authentication | HMAC-SHA256 signed tokens | ✅ Enabled |
| RBAC Authorization | Role-based access control | ✅ Enabled |
| Audit Logging | Comprehensive audit trail | ✅ Enabled |
| Network Security | Firewall rules and network policies | ✅ Enabled |
| Secret Management | Kubernetes secrets integration | ✅ Enabled |
| Certificate Rotation | Automated certificate management | ✅ Enabled |
```yaml
security:
  tls:
    enabled: true
    min_version: "1.3"
    cert_file: "/etc/tls/server.crt"
    key_file: "/etc/tls/server.key"
    ca_file: "/etc/tls/ca.crt"

  auth:
    enabled: true
    method: jwt
    token_expiry: 24h
    secret_key: "${JWT_SECRET}"

  audit:
    enabled: true
    log_file: "/var/log/ollama/audit.log"
    format: json

  firewall:
    enabled: true
    allowed_ips:
      - "10.0.0.0/8"
      - "172.16.0.0/12"
      - "192.168.0.0/16"
```
```bash
# Check cluster health
ollama-distributed status

# Add new node to cluster
ollama-distributed join --peers=node1:4001,node2:4001

# Drain node for maintenance
kubectl drain ollama-node-1 --delete-emptydir-data --ignore-daemonsets

# Scale cluster horizontally
kubectl scale deployment ollama-cluster --replicas=5

# Download model to cluster
curl -X POST http://localhost:11434/api/v1/models/llama2:7b/download

# Check model distribution
curl http://localhost:11434/api/v1/models/llama2:7b

# Enable auto-distribution
curl -X POST http://localhost:11434/api/v1/distribution/auto-configure \
  -H "Content-Type: application/json" \
  -d '{"enabled": true}'

# Backup cluster state
kubectl exec -it ollama-leader -- \
  ollama-distributed backup --output=/backup/cluster-state.tar.gz

# Restore from backup
kubectl exec -it ollama-leader -- \
  ollama-distributed restore --input=/backup/cluster-state.tar.gz
```
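The backup command pairs naturally with a scheduled job for the checklist's automated-backup item; this cron sketch is illustrative only, and the pod name and output path are placeholders:

```bash
# Illustrative cron entry: nightly backup at 02:00 (% is escaped for crontab)
0 2 * * * kubectl exec ollama-leader -- ollama-distributed backup --output=/backup/cluster-state-$(date +\%F).tar.gz
```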
| Issue | Symptoms | Solution |
|---|---|---|
| Split Brain | Multiple leaders elected | Check network connectivity, restart nodes |
| High Latency | Slow API responses | Check node resources, scale cluster |
| Model Load Failure | Models not downloading | Check storage capacity, network connectivity |
| Authentication Errors | 401 responses | Verify JWT tokens, check secret configuration |
| Memory Issues | Out-of-memory errors | Increase node memory, optimize model allocation |
```bash
# View cluster logs
kubectl logs -f deployment/ollama-cluster

# Check P2P network status
ollama-distributed p2p-status

# Validate configuration
ollama-distributed config-validate

# Run health diagnostics
ollama-distributed diagnose --full
```
```yaml
# Kubernetes resource configuration
resources:
  requests:
    cpu: "2"
    memory: "4Gi"
  limits:
    cpu: "4"
    memory: "8Gi"

# Go runtime tuning
env:
  - name: GOGC
    value: "100"
  - name: GOMAXPROCS
    value: "4"
```
| Metric | Scale-Up Trigger | Scale-Down Trigger |
|---|---|---|
| CPU Usage | > 70% for 5 minutes | < 30% for 10 minutes |
| Memory Usage | > 80% for 3 minutes | < 40% for 15 minutes |
| Request Queue | > 100 pending | < 10 pending |
| Response Time | > 500 ms average | < 100 ms average |
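The CPU trigger above maps directly onto a Kubernetes Horizontal Pod Autoscaler; a minimal sketch, assuming the `ollama-cluster` deployment from the operations examples and illustrative replica bounds:

```bash
# Autoscale on CPU at the 70% trigger (min/max replicas are illustrative)
kubectl autoscale deployment ollama-cluster \
  --namespace ollamamax \
  --cpu-percent=70 --min=3 --max=10
```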
We welcome contributions! Please see our Contributing Guide for details.
```bash
# Set up development environment
make dev-setup

# Run tests
make test

# Run integration tests
make test-integration

# Build and test locally
make build-local

# Submit pull request
git checkout -b feature/your-feature
git commit -am "Add your feature"
git push origin feature/your-feature
```
- Test Coverage: Minimum 80% code coverage required
- Code Style: Use `gofmt` and `golint` for consistent formatting
- Documentation: All public APIs must be documented
- Security: Security review required for auth/crypto changes
- CLI Reference - Complete command-line guide
- Proxy CLI Implementation - Technical implementation details
- Architecture Overview - System design and components
- API Reference - REST API documentation
- Configuration Guide - Setup and configuration
- Deployment Guide - Production deployment
- Security Guide - Security best practices
- Monitoring Guide - Observability and metrics
- Troubleshooting - Common issues and solutions
- GitHub Issues: Report bugs and request features
- Discussions: Community discussions and Q&A
- Documentation: Comprehensive documentation
For enterprise support, consulting, and professional services:
- Email: [email protected]
- Support Portal: https://giggahost.com
This project is licensed under the MIT License - see the LICENSE file for details.
- Ollama Team: For the original single-node implementation
- HashiCorp: For the excellent Raft consensus library
- libp2p Team: For robust P2P networking primitives
- Prometheus & Grafana: For comprehensive monitoring solutions
- Kubernetes Community: For container orchestration platform
Built with ❤️ for the AI community | Website | Documentation | Community