
Bug: moondream2 inference not correct (severe quality degradation compared to reference) #8037

@cmp-nct

Description


What happened?

Moondream2 is a superb vision model, but through llama.cpp it performs at a quality below vanilla llava-1.
@vikhyat maybe you'd like to take a look?

I compared images using Python and using llama.cpp, both in fp16 format.
moondream2 does roughly recognize images and the language part seems to work, but the quality is totally off through llama.cpp.
When asked about spatial information (like the lower left corner) it tends to just return anything from the left side, or even a random object.
In Python, the response is precise and surprisingly accurate.
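
For reference, this is roughly how I ran the Python side. This is a minimal sketch: the `encode_image()` / `answer_question()` calls follow the usage documented for the Hugging Face "vikhyatk/moondream2" revision, and the image file name is just a placeholder.

```python
# Python-side reference check (sketch). Assumes the "vikhyatk/moondream2"
# revision on Hugging Face, which exposes encode_image()/answer_question()
# via trust_remote_code. "test.jpg" is the same image fed to llava-cli.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vikhyatk/moondream2"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

image = Image.open("test.jpg")
image_embeds = model.encode_image(image)
print(image_embeds.shape)  # number of image embeddings produced by the encoder

print(model.answer_question(image_embeds,
                            "What is in the lower left corner?",
                            tokenizer))
```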

I looked a bit deeper (https://github.com/vikhyat/moondream/blob/main/moondream/vision_encoder.py) and the vision encoder appears to support multiple resolutions, while llama.cpp runs it in llava-1.5 mode.

However, for my test image llama.cpp creates 729 input embeddings, and Python does the same.
So it's not just the input embedding count; something deeper is going wrong. My guess is that the sampling/patches are mixed up somehow.
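
For what it's worth, 729 is exactly a 27x27 patch grid, which is what you would expect from a 378x378 input with 14x14 patches (my assumption about the SigLIP-style encoder moondream2 uses). So the matching count only tells us the grid size agrees, not that the patches end up in the right positions.

```python
# Quick sanity check on the embedding count. Assumes a 378x378 input
# resolution and 14x14 patches, which is my reading of moondream2's
# SigLIP-style vision encoder (not confirmed from the llama.cpp side).
image_size = 378
patch_size = 14
patches_per_side = image_size // patch_size   # 27
print(patches_per_side ** 2)                  # 729 image embeddings
```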

For reference: moondream2 support was merged here: #6899

Name and Version

abd894a

What operating system are you seeing the problem on?

No response

Relevant log output

Below is an example image:
[test image attachment]

Prompt: <image>\n\nQuestion: What is in the lower left corner?\n\nAnswer:
Answer in Python: "In the lower left corner, there is a green sticky note pad."
Answer from llava-cli: "A cup of coffee is in the lower left corner."
(I used the officially supplied GGUF files.)


Labels: bug-unconfirmed, medium severity, stale
