A high-performance Rust-based REST API service for generating sentence embeddings using transformer models. Built with Axum and rust-bert for fast, scalable NLP workloads.
- Fast Performance: 2-4x faster than Python equivalents using the LibTorch backend
- REST API: Simple HTTP endpoints for embedding generation
- Batch Processing: Process multiple texts in a single request
- GPU Support: Optional CUDA acceleration (6x performance improvement)
- Thread-Safe: Concurrent request handling with async mutex protection
- Docker Ready: Multi-stage builds with production-ready containers
- Health Monitoring: Built-in health checks and logging
- Rust 1.70+
- Docker (optional)
- CUDA toolkit (optional, for GPU acceleration)
# Build the service
cargo build --release
# Run the service
cargo run
# Service will be available at http://localhost:9000
# Build and run with Docker Compose
docker-compose up -d
# View logs
docker-compose logs -f embeddings-service
# Production deployment with nginx
docker-compose --profile production up -d
GET /health
Response:
{
  "status": "healthy",
  "models": ["all-minilm-l6-v2"]
}
POST /embeddings
Content-Type: application/json
{
  "texts": ["Hello world", "How are you?"],
  "model": "all-minilm-l6-v2"
}
Response:
{
  "embeddings": [[0.1, 0.2, ...], [0.3, 0.4, ...]],
  "model": "all-minilm-l6-v2",
  "dimensions": 384
}
GET /embeddings?text=Hello%20world&model=all-minilm-l6-v2
- all-minilm-l6-v2: Default model (384 dimensions)
- Sentence transformers model optimized for semantic similarity
- Cached locally in `~/.cache/.rustbert/` (or `/app/.cache/.rustbert/` in Docker)
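Internally, `all-minilm-l6-v2` maps onto rust-bert's sentence-embeddings pipeline; the standalone sketch below shows how such a model can be loaded and queried (weights are downloaded into the cache directory above on first use; the exact wiring in this service may differ):

```rust
// Standalone sketch: load the MiniLM sentence-embeddings model via rust-bert
// and encode a small batch of texts.
use rust_bert::pipelines::sentence_embeddings::{
    SentenceEmbeddingsBuilder, SentenceEmbeddingsModelType,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Downloads (or reuses) the cached model files, then builds the pipeline.
    let model = SentenceEmbeddingsBuilder::remote(SentenceEmbeddingsModelType::AllMiniLmL6V2)
        .create_model()?;

    // `encode` takes a batch of sentences and returns one 384-dimensional vector per input.
    let embeddings = model.encode(&["Hello world", "How are you?"])?;
    println!("{} vectors, {} dims", embeddings.len(), embeddings[0].len());
    Ok(())
}
```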
| Variable | Default | Description |
|---|---|---|
| `PORT` | `9000` | Server port |
| `RUST_LOG` | `info` | Logging level |
| `TORCH_CUDA_VERSION` | - | CUDA version for GPU builds (e.g., `cu124`) |
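For illustration, the service can resolve these variables with their documented defaults roughly like this (a sketch; the actual configuration code may differ):

```rust
use std::env;

fn main() {
    // Fall back to the documented default when PORT is unset or not a valid number.
    let port: u16 = env::var("PORT")
        .ok()
        .and_then(|value| value.parse().ok())
        .unwrap_or(9000);

    let addr = format!("0.0.0.0:{port}");
    println!("binding to {addr}");
}
```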
# docker-compose.yaml
services:
  embeddings-service:
    build: .
    ports:
      - "3000:3000"
    environment:
      - RUST_LOG=info
      - PORT=3000
    volumes:
      - model_cache:/app/.cache/.rustbert

# Named volume so downloaded models persist across container restarts
volumes:
  model_cache:
- CPU: 2-4x faster than Python-based solutions
- GPU: 6x improvement with CUDA acceleration
- Memory: 512MB minimum, 2GB recommended for production
- First Build: 5-15 minutes (downloads LibTorch)
- Models are loaded once at startup
- Thread-safe model access with async mutex
- Batch processing for multiple texts
- Memory-efficient Docker builds (~200MB final image)
# Debug build
cargo build
# Release build (optimized)
cargo build --release
# Quick syntax check
cargo check
# Run unit tests
cargo test
# Run with logging
RUST_LOG=debug cargo test
# Build with CUDA support
docker build --build-arg TORCH_CUDA_VERSION=cu124 -t embeddings-service-gpu .
- Web Framework: Axum with CORS support
- ML Backend: rust-bert with LibTorch
- Model Management: Arc-wrapped state with async mutex
- Serialization: Serde for JSON handling
- Logging: Tracing with configurable levels
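Put together, those pieces roughly correspond to the sketch below (a simplified illustration assuming axum 0.7, tokio, rust-bert, and serde; the real handler signatures, error handling, and CORS layer in this repository may differ):

```rust
use std::sync::Arc;

use axum::{extract::State, routing::{get, post}, Json, Router};
use rust_bert::pipelines::sentence_embeddings::{
    SentenceEmbeddingsBuilder, SentenceEmbeddingsModel, SentenceEmbeddingsModelType,
};
use tokio::sync::Mutex;

// The model is loaded once and shared behind an Arc + async Mutex, so
// concurrent requests queue on the lock instead of loading separate copies.
#[derive(Clone)]
struct AppState {
    model: Arc<Mutex<SentenceEmbeddingsModel>>,
}

#[derive(serde::Deserialize)]
struct EmbeddingsRequest {
    texts: Vec<String>,
}

async fn health() -> &'static str {
    "healthy"
}

async fn embeddings(
    State(state): State<AppState>,
    Json(request): Json<EmbeddingsRequest>,
) -> Json<Vec<Vec<f32>>> {
    // Hold the lock only for the duration of the batched encode call.
    let model = state.model.lock().await;
    let vectors = model.encode(&request.texts).expect("encoding failed");
    Json(vectors)
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load the model once at startup, off the async worker threads.
    let model = tokio::task::spawn_blocking(|| {
        SentenceEmbeddingsBuilder::remote(SentenceEmbeddingsModelType::AllMiniLmL6V2)
            .create_model()
    })
    .await??;

    let state = AppState { model: Arc::new(Mutex::new(model)) };
    let app = Router::new()
        .route("/health", get(health))
        .route("/embeddings", post(embeddings))
        .with_state(state);

    let listener = tokio::net::TcpListener::bind("0.0.0.0:9000").await?;
    axum::serve(listener, app).await?;
    Ok(())
}
```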
- Memory: 2GB limit, 512MB reservation
- CPU: 1.0 limit, 0.5 reservation
- Storage: Volume for model cache persistence
- Health check endpoint with 30s intervals
- Structured logging with tracing
- Container restart policies
- Optional nginx reverse proxy
- Fork the repository
- Create a feature branch
- Run tests: `cargo test`
- Submit a pull request
This project is licensed under the MIT License.