Initial configuration #slow #windows #6648
Introduction

Hello, the LocalAI stack seems amazing, so I tried to set up a docker compose to run it locally. It kind of works, but everything is extremely slow, no matter my setup. The server itself hangs for many minutes before opening/starting the API, even when models and backends are already downloaded. In fact, depending on the number (and size) of models, it can be quicker to download them again than to parse the ones already present during startup. Then, once started, if I send a prompt to a model, it takes several minutes before it starts processing it; the inference itself then seems as quick as expected. I'm creating this message in Discussions instead of Issues because I really don't know where the problem comes from, and I think several existing issues might be related to my problem.

My setup (hardware+software)
My setup (compose files)

As stated above, my last try and the provided logs were done with the compose files coming from https://github.com/mudler/LocalAGI.

docker-compose.yaml:

```yaml
services:
  localai:
    # See https://localai.io/basics/container/#standard-container-images for
    # a list of available container images (or build your own with the provided Dockerfile)
    # Available images with CUDA, ROCm, SYCL, Vulkan
    # Image list (quay.io): https://quay.io/repository/go-skynet/local-ai?tab=tags
    # Image list (dockerhub): https://hub.docker.com/r/localai/localai
    image: localai/localai:master
    command:
      - ${MODEL_NAME:-gemma-3-4b-it-qat}
      - ${MULTIMODAL_MODEL:-moondream2-20250414}
      - ${IMAGE_MODEL:-sd-1.5-ggml}
      - granite-embedding-107m-multilingual
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/readyz"]
      interval: 60s
      timeout: 10m
      retries: 120
    ports:
      - 8081:8080
    environment:
      - DEBUG=true
      #- LOCALAI_API_KEY=sk-1234567890
    volumes:
      - ./volumes/models:/models
      - ./volumes/backends:/backends
      - ./volumes/images:/tmp/generated/images
  localrecall:
    image: quay.io/mudler/localrecall:main
    ports:
      - 8080
    environment:
      - COLLECTION_DB_PATH=/db
      - EMBEDDING_MODEL=granite-embedding-107m-multilingual
      - FILE_ASSETS=/assets
      - OPENAI_API_KEY=sk-1234567890
      - OPENAI_BASE_URL=http://localai:8080
    volumes:
      - ./volumes/localrag/db:/db
      - ./volumes/localrag/assets/:/assets
  localrecall-healthcheck:
    depends_on:
      localrecall:
        condition: service_started
    image: busybox
    command: ["sh", "-c", "until wget -q -O - http://localrecall:8080 > /dev/null 2>&1; do echo 'Waiting for localrecall...'; sleep 1; done; echo 'localrecall is up!'"]
  sshbox:
    build:
      context: .
      dockerfile: Dockerfile.sshbox
    ports:
      - "22"
    environment:
      - SSH_USER=root
      - SSH_PASSWORD=root
      - DOCKER_HOST=tcp://dind:2375
    depends_on:
      dind:
        condition: service_healthy
  dind:
    image: docker:dind
    privileged: true
    environment:
      - DOCKER_TLS_CERTDIR=""
    healthcheck:
      test: ["CMD", "docker", "info"]
      interval: 10s
      timeout: 5s
      retries: 3
  localagi:
    depends_on:
      localai:
        condition: service_healthy
      localrecall-healthcheck:
        condition: service_completed_successfully
      dind:
        condition: service_healthy
    build:
      context: .
      dockerfile: Dockerfile.webui
    ports:
      - 8080:3000
    #image: quay.io/mudler/localagi:master
    environment:
      - LOCALAGI_MODEL=${MODEL_NAME:-gemma-3-4b-it-qat}
      - LOCALAGI_MULTIMODAL_MODEL=${MULTIMODAL_MODEL:-moondream2-20250414}
      - LOCALAGI_IMAGE_MODEL=${IMAGE_MODEL:-sd-1.5-ggml}
      - LOCALAGI_LLM_API_URL=http://localai:8080
      #- LOCALAGI_LLM_API_KEY=sk-1234567890
      - LOCALAGI_LOCALRAG_URL=http://localrecall:8080
      - LOCALAGI_STATE_DIR=/pool
      - LOCALAGI_TIMEOUT=5m
      - LOCALAGI_ENABLE_CONVERSATIONS_LOGGING=false
      - LOCALAGI_SSHBOX_URL=root:root@sshbox:22
      - DOCKER_HOST=tcp://dind:2375
    extra_hosts:
      - "host.docker.internal:host-gateway"
    volumes:
      - ./volumes/localagi/:/pool
```

docker-compose.nvidia.yaml:

```yaml
services:
  localai:
    extends:
      file: docker-compose.yaml
      service: localai
    environment:
      - LOCALAI_SINGLE_ACTIVE_BACKEND=true
      - DEBUG=true
    image: localai/localai:master-gpu-nvidia-cuda-12
    # For images with python backends, use:
    # image: localai/localai:master-cublas-cuda12-ffmpeg
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
  dind:
    extends:
      file: docker-compose.yaml
      service: dind
  localrecall:
    extends:
      file: docker-compose.yaml
      service: localrecall
  localrecall-healthcheck:
    extends:
      file: docker-compose.yaml
      service: localrecall-healthcheck
  localagi:
    extends:
      file: docker-compose.yaml
      service: localagi
```
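For reference, a stack defined like this is typically brought up by pointing compose directly at the NVIDIA file, since it pulls the base services in via extends. A minimal sketch; the exact invocation below is an assumption based on the file names, not something shown in the logs:

```
# Run from the directory that contains both compose files
docker compose -f docker-compose.nvidia.yaml up -d

# Tail LocalAI's startup; with DEBUG=true this includes model/backend loading
docker compose -f docker-compose.nvidia.yaml logs -f localai
```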
The Logs

Here are some logs I want to highlight, but you can also find the complete logs here.

```
# second startup (more than half an hour to start, even if models and backends were already downloaded/installed during the first startup)
11:45AM DBG Setting logging to debug
12:21PM INF LocalAI API is listening! Please connect to the endpoint for API documentation. endpoint=http://0.0.0.0:8080
# navigate to chat endpoint
12:50PM INF Success ip=172.24.0.1 latency=163.212451ms method=GET status=200 url=/
12:50PM INF Success ip=172.24.0.1 latency=21.570322ms method=GET status=200 url=/chat/gemma-3-4b-it-qat
# "send" my prompt to model gemma-3-4b-it-qat (~7 minutes for TTFT)
12:52PM DBG context local model name not found, setting to the first model first model name=moondream2-20250414
12:58PM DBG guessDefaultsFromFile: NGPULayers set NGPULayers=99999999
12:58PM DBG guessDefaultsFromFile: template already set name=gemma-3-4b-it-qat
12:58PM DBG templated message for chat: <start_of_turn>user
please answer back `OK` (and nothing else) as fast as you can<end_of_turn>
12:58PM DBG Stream request received
12:58PM DBG Loading GRPC Process: /backends/cuda12-llama-cpp/run.sh
12:58PM DBG Wait for the service to start up
12:58PM DBG GRPC Service Ready
12:59PM DBG Sending chunk: {"created":1761051501,"object":"chat.completion.chunk","id":"9f362a2e-3095-4a2d-8875-d0b48edc301f","model":"gemma-3-4b-it-qat","choices":[{"index":0,"finish_reason":"","delta":{"content":"OK"}}],"usage":{"prompt_tokens":25,"completion_tokens":1,"total_tokens":26}}
12:59PM DBG Sending chunk failed: connection closed
12:59PM DBG Stream ended
2:18PM INF Success ip=172.24.0.1 latency=22.389575ms method=GET status=200 url=/chat/moondream2-20250414
# "send" my prompt to model moondream2-20250414 (~3 minutes for TTFT)
2:19PM DBG context local model name not found, setting to the first model first model name=gemma-3-4b-it-qat
2:21PM DBG guessDefaultsFromFile: NGPULayers set NGPULayers=99999999
2:21PM DBG guessDefaultsFromFile: template already set name=moondream2-20250414
2:21PM DBG Prompt (before templating):
Question: please answer back `OK` (and nothing else) as fast as you can
2:21PM DBG Stream request received
2:21PM INF Success ip=172.24.0.1 latency=1m49.418067173s method=POST status=200 url=/v1/chat/completions
2:21PM DBG Sending chunk: {"created":1761056463,"object":"chat.completion.chunk","id":"3989898c-df78-41e2-87fd-6fb322894587","model":"moondream2-20250414","choices":[{"index":0,"finish_reason":"","delta":{"role":"assistant","content":""}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
2:21PM DBG Stopping all backends except 'moondream2-20250414'
2:21PM DBG Deleting process gemma-3-4b-it-qat
2:21PM INF BackendLoader starting backend=llama-cpp modelID=moondream2-20250414 o.model=moondream2-text-model-f16_ct-vicuna.gguf
2:21PM DBG Loading model in memory from file: /models/moondream2-text-model-f16_ct-vicuna.gguf
2:21PM DBG Loading Model moondream2-20250414 with gRPC (file: /models/moondream2-text-model-f16_ct-vicuna.gguf) (backend: llama-cpp): {backendString:llama-cpp model:moondream2-text-model-f16_ct-vicuna.gguf modelID:moondream2-20250414 context:{emptyCtx:{}} gRPCOptions:0xc00076e608 externalBackends:map[] grpcAttempts:20 grpcAttemptsDelay:2 parallelRequests:false}
2:21PM DBG GRPC Service Started
2:21PM DBG Wait for the service to start up
2:21PM DBG GRPC Service Ready
2:22PM DBG Sending chunk: {"created":1761056463,"object":"chat.completion.chunk","id":"3989898c-df78-41e2-87fd-6fb322894587","model":"moondream2-20250414","choices":[{"index":0,"finish_reason":"","delta":{"content":"ok"}}],"usage":{"prompt_tokens":24,"completion_tokens":1,"total_tokens":25}}
2:22PM DBG Sending chunk failed: connection closed
2:22PM DBG No choices in the response, skipping
2:22PM DBG No choices in the response, skipping
2:22PM ERR Stream ended with error: rpc error: code = Canceled desc = context canceled
Error rpc error: code = Canceled desc = context canceled
```

Additional Notes

1/ With this run I'm using master (because it's the LocalAGI docker-compose file without modification) and I receive the following error in the chat:

2/ Despite the logs

3/ I don't like to compare with ollama, because LocalAI seems much more complete in terms of API endpoints, hence capabilities, but since it works on my setup and that's the alternative I use:
Other issues/discussions I found that might be related (similar logs to what I saw during my different attempts):
Conclusion

I spent approximately a week trying different configurations and different models (in case my setup was not the actual cause), with no success so far. I'm able to get it "working", meaning LocalAI is able to respond, but I'm not able to make it as fast as it should be. And from what I understand from my searches, I'm not the only one struggling with the initial configuration. So I really hope someone here can help and give some explanation.
I got help on the Discord server from @lunamidori5 (thanks a lot) and everything is resolved now.

I'm not sure I'm using the correct vocabulary, as I'm a bit new to the Docker environment, but the gist is about WSL and where the Docker engine is running.

Basically, I had installed Docker Desktop on Windows, my docker compose files + volumes were stored on my Windows C: drive, and I executed the docker compose command directly from Windows. All wrong.

As suggested, I moved everything to my WSL distro (compose files & volumes) and ran the docker compose command from WSL itself. Now the server starts instantly and the models are loaded instantly.

Apparently the speed difference is about 10000x. I didn't do the maths myself, but given my experience I'm inclined to believe it without verification (even downloading models seems much quicker).
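A minimal sketch of that kind of workflow, where the distro name (Ubuntu) and the project path are placeholders rather than the exact setup described above:

```
# From Windows: open a shell inside the WSL distro (name is an example)
wsl -d Ubuntu

# Keep the project on the Linux filesystem, NOT under /mnt/c
mkdir -p ~/localai && cd ~/localai

# Sanity check: a filesystem type of 9p (or drvfs) here means you are still on the Windows drive
df -T .

# With Docker Desktop's WSL integration enabled, run compose from inside WSL
docker compose -f docker-compose.nvidia.yaml up -d
```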
Edit: additional insights from someone with the same problem:

Edit: Windows copy helper. To copy your volumes into the WSL distro, you might want to use robocopy instead of the classic Windows drag-and-drop / copy-paste method.
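A sketch of such a copy from a Windows command prompt; the source path, target path, and distro name are placeholders, and /E copies subdirectories including empty ones:

```
robocopy C:\localai\volumes \\wsl.localhost\Ubuntu\home\me\localai\volumes /E
:: On older Windows builds the WSL share is reachable as \\wsl$\Ubuntu\... instead
```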