Initial configuration #slow #windows #6648
Introduction

Hello, the LocalAI stack seems amazing, so I tried to set up a docker compose to run it locally. It kind of works, but everything is extremely slow, no matter my setup. The server itself hangs for many minutes before opening/starting the API, even when models and backends are already downloaded. In fact, depending on the number (and size) of models, it can be quicker to download them again than to parse the ones already present during startup. Then, once started, if I send a prompt to a model, it takes several minutes before it starts processing it; the inference itself then seems as quick as expected. I'm creating this message in Discussions instead of Issues because I really don't know where the problem comes from, and I think several existing issues might be related to my problem.

My setup (hardware+software)
My setup (compose files)

As stated above, my last try and the provided logs were done with the compose files coming from https://github.com/mudler/LocalAGI.

docker-compose.yaml:

```yaml
services:
  localai:
    # See https://localai.io/basics/container/#standard-container-images for
    # a list of available container images (or build your own with the provided Dockerfile)
    # Available images with CUDA, ROCm, SYCL, Vulkan
    # Image list (quay.io): https://quay.io/repository/go-skynet/local-ai?tab=tags
    # Image list (dockerhub): https://hub.docker.com/r/localai/localai
    image: localai/localai:master
    command:
      - ${MODEL_NAME:-gemma-3-4b-it-qat}
      - ${MULTIMODAL_MODEL:-moondream2-20250414}
      - ${IMAGE_MODEL:-sd-1.5-ggml}
      - granite-embedding-107m-multilingual
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/readyz"]
      interval: 60s
      timeout: 10m
      retries: 120
    ports:
      - 8081:8080
    environment:
      - DEBUG=true
      #- LOCALAI_API_KEY=sk-1234567890
    volumes:
      - ./volumes/models:/models
      - ./volumes/backends:/backends
      - ./volumes/images:/tmp/generated/images
  localrecall:
    image: quay.io/mudler/localrecall:main
    ports:
      - 8080
    environment:
      - COLLECTION_DB_PATH=/db
      - EMBEDDING_MODEL=granite-embedding-107m-multilingual
      - FILE_ASSETS=/assets
      - OPENAI_API_KEY=sk-1234567890
      - OPENAI_BASE_URL=http://localai:8080
    volumes:
      - ./volumes/localrag/db:/db
      - ./volumes/localrag/assets/:/assets
  localrecall-healthcheck:
    depends_on:
      localrecall:
        condition: service_started
    image: busybox
    command: ["sh", "-c", "until wget -q -O - http://localrecall:8080 > /dev/null 2>&1; do echo 'Waiting for localrecall...'; sleep 1; done; echo 'localrecall is up!'"]
  sshbox:
    build:
      context: .
      dockerfile: Dockerfile.sshbox
    ports:
      - "22"
    environment:
      - SSH_USER=root
      - SSH_PASSWORD=root
      - DOCKER_HOST=tcp://dind:2375
    depends_on:
      dind:
        condition: service_healthy
  dind:
    image: docker:dind
    privileged: true
    environment:
      - DOCKER_TLS_CERTDIR=""
    healthcheck:
      test: ["CMD", "docker", "info"]
      interval: 10s
      timeout: 5s
      retries: 3
  localagi:
    depends_on:
      localai:
        condition: service_healthy
      localrecall-healthcheck:
        condition: service_completed_successfully
      dind:
        condition: service_healthy
    build:
      context: .
      dockerfile: Dockerfile.webui
    ports:
      - 8080:3000
    #image: quay.io/mudler/localagi:master
    environment:
      - LOCALAGI_MODEL=${MODEL_NAME:-gemma-3-4b-it-qat}
      - LOCALAGI_MULTIMODAL_MODEL=${MULTIMODAL_MODEL:-moondream2-20250414}
      - LOCALAGI_IMAGE_MODEL=${IMAGE_MODEL:-sd-1.5-ggml}
      - LOCALAGI_LLM_API_URL=http://localai:8080
      #- LOCALAGI_LLM_API_KEY=sk-1234567890
      - LOCALAGI_LOCALRAG_URL=http://localrecall:8080
      - LOCALAGI_STATE_DIR=/pool
      - LOCALAGI_TIMEOUT=5m
      - LOCALAGI_ENABLE_CONVERSATIONS_LOGGING=false
      - LOCALAGI_SSHBOX_URL=root:root@sshbox:22
      - DOCKER_HOST=tcp://dind:2375
    extra_hosts:
      - "host.docker.internal:host-gateway"
    volumes:
      - ./volumes/localagi/:/pool
```

docker-compose.nvidia.yaml:

```yaml
services:
  localai:
    extends:
      file: docker-compose.yaml
      service: localai
    environment:
      - LOCALAI_SINGLE_ACTIVE_BACKEND=true
      - DEBUG=true
    image: localai/localai:master-gpu-nvidia-cuda-12
    # For images with python backends, use:
    # image: localai/localai:master-cublas-cuda12-ffmpeg
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
  dind:
    extends:
      file: docker-compose.yaml
      service: dind
  localrecall:
    extends:
      file: docker-compose.yaml
      service: localrecall
  localrecall-healthcheck:
    extends:
      file: docker-compose.yaml
      service: localrecall-healthcheck
  localagi:
    extends:
      file: docker-compose.yaml
      service: localagi
```
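For reference, a stack defined like this is typically brought up by pointing compose directly at the NVIDIA file, since it pulls the base services in via extends. A minimal sketch; the exact invocation below is an assumption based on the file names, not something shown in the logs:

```
# Run from the directory that contains both compose files
docker compose -f docker-compose.nvidia.yaml up -d

# Tail LocalAI's startup; with DEBUG=true this includes model/backend loading
docker compose -f docker-compose.nvidia.yaml logs -f localai
```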
The Logs

Here are some logs I want to highlight, but you can also find the complete logs here.

```
# second startup (more than half an hour to start, even if models and backends were already downloaded/installed during the first startup)
11:45AM DBG Setting logging to debug
12:21PM INF LocalAI API is listening! Please connect to the endpoint for API documentation. endpoint=http://0.0.0.0:8080
# navigate to chat endpoint
12:50PM INF Success ip=172.24.0.1 latency=163.212451ms method=GET status=200 url=/
12:50PM INF Success ip=172.24.0.1 latency=21.570322ms method=GET status=200 url=/chat/gemma-3-4b-it-qat
# "send" my prompt to model gemma-3-4b-it-qat (~7 minutes for TTFT)
12:52PM DBG context local model name not found, setting to the first model first model name=moondream2-20250414
12:58PM DBG guessDefaultsFromFile: NGPULayers set NGPULayers=99999999
12:58PM DBG guessDefaultsFromFile: template already set name=gemma-3-4b-it-qat
12:58PM DBG templated message for chat: <start_of_turn>user
please answer back `OK` (and nothing else) as fast as you can<end_of_turn>
12:58PM DBG Stream request received
12:58PM DBG Loading GRPC Process: /backends/cuda12-llama-cpp/run.sh
12:58PM DBG Wait for the service to start up
12:58PM DBG GRPC Service Ready
12:59PM DBG Sending chunk: {"created":1761051501,"object":"chat.completion.chunk","id":"9f362a2e-3095-4a2d-8875-d0b48edc301f","model":"gemma-3-4b-it-qat","choices":[{"index":0,"finish_reason":"","delta":{"content":"OK"}}],"usage":{"prompt_tokens":25,"completion_tokens":1,"total_tokens":26}}
12:59PM DBG Sending chunk failed: connection closed
12:59PM DBG Stream ended
2:18PM INF Success ip=172.24.0.1 latency=22.389575ms method=GET status=200 url=/chat/moondream2-20250414
# "send" my prompt to model moondream2-20250414 (~3 minutes for TTFT)
2:19PM DBG context local model name not found, setting to the first model first model name=gemma-3-4b-it-qat
2:21PM DBG guessDefaultsFromFile: NGPULayers set NGPULayers=99999999
2:21PM DBG guessDefaultsFromFile: template already set name=moondream2-20250414
2:21PM DBG Prompt (before templating):
Question: please answer back `OK` (and nothing else) as fast as you can
2:21PM DBG Stream request received
2:21PM INF Success ip=172.24.0.1 latency=1m49.418067173s method=POST status=200 url=/v1/chat/completions
2:21PM DBG Sending chunk: {"created":1761056463,"object":"chat.completion.chunk","id":"3989898c-df78-41e2-87fd-6fb322894587","model":"moondream2-20250414","choices":[{"index":0,"finish_reason":"","delta":{"role":"assistant","content":""}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
2:21PM DBG Stopping all backends except 'moondream2-20250414'
2:21PM DBG Deleting process gemma-3-4b-it-qat
2:21PM INF BackendLoader starting backend=llama-cpp modelID=moondream2-20250414 o.model=moondream2-text-model-f16_ct-vicuna.gguf
2:21PM DBG Loading model in memory from file: /models/moondream2-text-model-f16_ct-vicuna.gguf
2:21PM DBG Loading Model moondream2-20250414 with gRPC (file: /models/moondream2-text-model-f16_ct-vicuna.gguf) (backend: llama-cpp): {backendString:llama-cpp model:moondream2-text-model-f16_ct-vicuna.gguf modelID:moondream2-20250414 context:{emptyCtx:{}} gRPCOptions:0xc00076e608 externalBackends:map[] grpcAttempts:20 grpcAttemptsDelay:2 parallelRequests:false}
2:21PM DBG GRPC Service Started
2:21PM DBG Wait for the service to start up
2:21PM DBG GRPC Service Ready
2:22PM DBG Sending chunk: {"created":1761056463,"object":"chat.completion.chunk","id":"3989898c-df78-41e2-87fd-6fb322894587","model":"moondream2-20250414","choices":[{"index":0,"finish_reason":"","delta":{"content":"ok"}}],"usage":{"prompt_tokens":24,"completion_tokens":1,"total_tokens":25}}
2:22PM DBG Sending chunk failed: connection closed
2:22PM DBG No choices in the response, skipping
2:22PM DBG No choices in the response, skipping
2:22PM ERR Stream ended with error: rpc error: code = Canceled desc = context canceled
Error rpc error: code = Canceled desc = context canceled
```

Additional Notes

1/ With this run I'm using master (because it's the LocalAGI docker-compose file without modification) and I receive the following error in the chat:

2/ Despite the logs

3/ I don't like to compare with ollama, because LocalAI seems much more complete in terms of API endpoints, hence capabilities, but since it works on my setup and that's the alternative I use:
Other issues/discussions I found that might be related (similar logs to what I saw during my different attempts):
Conclusion

I spent approximately a week trying different configurations and different models (in case my setup was not the actual cause), with no success so far. I'm able to get it "working", meaning LocalAI is able to respond, but I'm not able to make it as fast as it should be. And from what I understand from my searches, I'm not the only one struggling with the initial configuration. So I really hope someone here can help and give some explanation.
I got help on the Discord server from @lunamidori5 (thanks a lot) and everything is resolved now.

I'm not sure I'm using the correct vocabulary, as I'm a bit new to the Docker environment, but the gist is about WSL and where the Docker engine is running.

Basically, I had installed Docker Desktop on Windows, my docker compose files + volumes were stored on my Windows C: drive, and I executed the docker compose command directly from Windows. All wrong.

As suggested, I moved everything to my WSL distro (compose files & volumes) and ran the docker compose command from WSL itself. Now the server starts instantly and the models are loaded instantly.

Apparently the speed difference is about 10000x. I didn't do the maths myself, but given my experience I'm inclined to believe it without verification (even downloading models seems much quicker).
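A minimal sketch of that kind of workflow, where the distro name (Ubuntu) and the project path are placeholders rather than the exact setup described above:

```
# From Windows: open a shell inside the WSL distro (name is an example)
wsl -d Ubuntu

# Keep the project on the Linux filesystem, NOT under /mnt/c
mkdir -p ~/localai && cd ~/localai

# Sanity check: a filesystem type of 9p (or drvfs) here means you are still on the Windows drive
df -T .

# With Docker Desktop's WSL integration enabled, run compose from inside WSL
docker compose -f docker-compose.nvidia.yaml up -d
```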
Edit: additional insights from someone with the same problem:

Edit: Windows copy helper. To copy your volumes into the WSL distro, you might want to use robocopy instead of the classic Windows drag-and-drop / copy-paste method.
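A sketch of such a copy from a Windows command prompt; the source path, target path, and distro name are placeholders, and /E copies subdirectories including empty ones:

```
robocopy C:\localai\volumes \\wsl.localhost\Ubuntu\home\me\localai\volumes /E
:: On older Windows builds the WSL share is reachable as \\wsl$\Ubuntu\... instead
```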