
Conversation

DarkLight1337 (Member) commented May 27, 2025

Key changes:

  • float32 models will be downcast to whatever dtype is prioritized by the platform (defined by the first item in current_platform.supported_dtypes). This means that models may be downcast to bfloat16 instead of float16.
  • Attempt to detect the safetensors dtype from the HF Hub repo if the dtype is not specified in the configuration file. This avoids accidentally casting float16 weights to bfloat16 when the dtype is missing from the config on HF Hub. (See the sketch after this description.)
  • Exclude some models (gemma2, gemma3, plamo2, glm4) from being downcast to float16 because of reported numerical stability issues.
  • Clean up some related log messages.

Related to #17123
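
Below is a minimal sketch of the resolution order described above, in case it helps reviewers. resolve_dtype and read_safetensors_dtypes are hypothetical helpers (only current_platform.supported_dtypes comes from the actual change), so this is illustrative rather than the PR's implementation:

import json
import struct

import torch


def read_safetensors_dtypes(path: str) -> set[str]:
    # A .safetensors file begins with an 8-byte little-endian header length,
    # followed by a JSON header mapping tensor names to metadata (including
    # "dtype"), so the checkpoint dtype can be read without loading weights.
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return {m["dtype"] for name, m in header.items() if name != "__metadata__"}


def resolve_dtype(config_dtype, supported_dtypes, safetensors_dtype=None):
    # If the config does not specify a dtype, fall back to the dtype detected
    # from the safetensors checkpoint (if any); otherwise assume float32.
    if config_dtype is None:
        config_dtype = safetensors_dtype or torch.float32

    # float32 checkpoints are downcast to the platform's preferred dtype,
    # i.e. the first entry of current_platform.supported_dtypes.
    if config_dtype == torch.float32:
        return supported_dtypes[0]

    # Otherwise keep the checkpoint dtype when the platform supports it.
    if config_dtype in supported_dtypes:
        return config_dtype
    return supported_dtypes[0]

For example, resolve_dtype(None, [torch.bfloat16, torch.float16], safetensors_dtype=torch.float16) keeps float16, which is exactly the accidental-bfloat16 case the second bullet guards against.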

@github-actions (bot)

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of tests to catch errors quickly. You can run the other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Signed-off-by: DarkLight1337 <[email protected]>
DarkLight1337 requested a review from ywang96 as a code owner, May 27, 2025 13:43
mergify bot added the multi-modality (Related to multi-modality (#4194)) label, May 27, 2025
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
DarkLight1337 changed the title from [Core] Refactor dtype resolution to [Core] Rework dtype resolution, May 27, 2025
mergify bot commented May 28, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @DarkLight1337.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label, May 28, 2025
Isotr0py (Member) left a comment

Overall LGTM, thanks for the rework!

DarkLight1337 added the ready label (ONLY add when PR is ready to merge/full CI is needed), May 30, 2025
mergify bot removed the needs-rebase label, May 30, 2025
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
@DarkLight1337 (Member Author)

@Isotr0py (Member)
Hmm, I tried the test pipeline with this PR and it passes locally:

pytest -s -v tests/models/language/pooling/test_snowflake_arctic_embed.py -k test_embed_models_mteb -m 'not core_model'
VLLM: 0.6811808101978006
SentenceTransformers: 0.6811455579280281
Difference: -3.525226977252327e-05
PASSED

CI outputs:

[2025-05-30T12:18:42Z] VLLM: torch.float32 0.6811515659720713
[2025-05-30T12:18:42Z] SentenceTransformer: torch.float32 0.6806553028675326
[2025-05-30T12:18:42Z] Difference: -0.0004962631045386523
[2025-05-30T12:18:43Z] FAILED

It seems the degradation is on the sentence-transformers side again... ☹️

noooop (Collaborator) commented May 31, 2025

torch.float16 0.6811108652277692
torch.bfloat16 0.6803641333652566

This result is closer to bfloat16, so I think the model was converted to bfloat16 somewhere.

I don't know why SentenceTransformer's model_kwargs={"torch_dtype": dtype} is not working (the model still uses torch.float32), but torch.set_default_dtype(dtype) does.

import mteb
import torch
from sentence_transformers import SentenceTransformer

MTEB_EMBED_TASKS = ["STS12"]

model_name = "Snowflake/snowflake-arctic-embed-m-long"


def run_mteb_embed_task(encoder, tasks):
    tasks = mteb.get_tasks(tasks=tasks)
    evaluation = mteb.MTEB(tasks=tasks)
    results = evaluation.run(encoder, verbosity=0, output_folder=None)

    main_score = results[0].scores["test"][0]["main_score"]
    return main_score


def run_mteb_embed_task_st(model_name, tasks, dtype):
    # Uncommenting this line is what actually changes the model dtype;
    # model_kwargs={"torch_dtype": dtype} alone does not.
    # torch.set_default_dtype(dtype)
    model = SentenceTransformer(model_name,
                                trust_remote_code=True,
                                model_kwargs={"torch_dtype": dtype})

    model_dtype = next(model.parameters()).dtype
    st_main_score = run_mteb_embed_task(model, tasks)

    print(model_dtype, st_main_score)


run_mteb_embed_task_st(model_name, MTEB_EMBED_TASKS, torch.float16)
run_mteb_embed_task_st(model_name, MTEB_EMBED_TASKS, torch.bfloat16)

  • This means that models may be downcast to bfloat16 instead of float16.

I don't think using bfloat16 is a good idea for embedding models.
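
For reference, bfloat16 keeps float32's exponent range but has fewer mantissa bits than float16, so its element-wise precision is coarser; a quick check with torch.finfo shows the gap (illustrative only, not from the test output above):

import torch

# Machine epsilon: the smallest relative spacing between representable values.
print(torch.finfo(torch.float16).eps)   # 0.0009765625 (10 mantissa bits)
print(torch.finfo(torch.bfloat16).eps)  # 0.0078125    (7 mantissa bits)

That coarser spacing is consistent with the larger score gap seen for bfloat16 above.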

noooop (Collaborator) commented May 31, 2025

I can reproduce this error if torch.set_default_dtype(torch.bfloat16) is called somewhere:

import pytest
import torch
from tests.models.language.pooling.mteb_utils import mteb_test_embed_models
from tests.models.utils import EmbedModelInfo

MODELS = [
    EmbedModelInfo("Snowflake/snowflake-arctic-embed-m-long",
                   is_matryoshka=False,
                   architecture="NomicBertModel",
                   dtype="half",
                   enable_test=True),
]

# Simulate some other code having changed the global default dtype.
torch.set_default_dtype(torch.bfloat16)

@pytest.mark.parametrize("model_info", MODELS)
def test_embed_models_mteb(hf_runner, vllm_runner,
                           model_info: EmbedModelInfo) -> None:
    mteb_test_embed_models(hf_runner, vllm_runner, model_info)

Wrapping the hf_runner block below in set_default_torch_dtype(vllm_dtype) fixes it:

    # set_default_torch_dtype is assumed to be a context manager that
    # temporarily overrides torch's default dtype (vLLM has a helper with
    # this name), so the SentenceTransformer model is built in vllm_dtype.
    with set_default_torch_dtype(vllm_dtype):
        with hf_runner(
            model_info.name, is_sentence_transformer=True,
            dtype=vllm_dtype) as hf_model:

            if hf_model_callback is not None:
                hf_model_callback(hf_model)

            st_main_score = run_mteb_embed_task(hf_model, MTEB_EMBED_TASKS)

            st_dtype = next(hf_model.model.parameters()).dtype

    default_dtype = torch.get_default_dtype()
    print("default_dtype: ", default_dtype)
    print("VLLM:", vllm_dtype, vllm_main_score)
    print("SentenceTransformers:", st_dtype, st_main_score)
    print("Difference:", st_main_score - vllm_main_score)

    assert st_main_score == pytest.approx(vllm_main_score, abs=MTEB_EMBED_TOL)

You may also need

# ruff: noqa: SIM117

to silence ruff's SIM117 warning about the nested with statements.

That said, running SentenceTransformers and vLLM with the same dtype might prevent us from discovering potential issues.

@DarkLight1337 (Member Author)

I see. Let's default to float16 for embedding models for now, then change it to float32 in your PR.

noooop (Collaborator) commented May 31, 2025

Now there is a serious problem: I don't know why SentenceTransformer's model_kwargs={"torch_dtype": dtype} is not working, but torch.set_default_dtype(dtype) does.
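
One possible workaround (untested here, just a sketch): SentenceTransformer is a torch.nn.Module, so the loaded model can be cast explicitly after construction instead of relying on model_kwargs:

import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m-long",
                            trust_remote_code=True)
# Cast the whole module explicitly, rather than relying on
# model_kwargs={"torch_dtype": ...}, which did not take effect above.
model = model.to(torch.float16)
print(next(model.parameters()).dtype)  # torch.float16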

Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
DarkLight1337 merged commit 6aa8f9a into vllm-project:main, Jun 1, 2025
63 checks passed
DarkLight1337 deleted the refactor-dtype-resolution branch, June 1, 2025 03:04
amitm02 pushed a commit to amitm02/vllm that referenced this pull request Jun 1, 2025
amitm02 pushed a commit to amitm02/vllm that referenced this pull request Jun 1, 2025

Labels

multi-modality (Related to multi-modality (#4194)), ready (ONLY add when PR is ready to merge/full CI is needed)
