Merged

Changes from all commits (70 commits)
2d36c83
Reorganize tests
DarkLight1337 Aug 23, 2024
9561d6b
Fix chameleon test
DarkLight1337 Aug 23, 2024
93e8707
Remove unnecessary `pytest.mark`
DarkLight1337 Aug 23, 2024
96c4d68
Update timings
DarkLight1337 Aug 23, 2024
e351790
Update timings
DarkLight1337 Aug 24, 2024
019a010
Rename and split multimodal tests
DarkLight1337 Aug 24, 2024
44324a8
Skip qwen-vl tests
DarkLight1337 Aug 24, 2024
37c7d36
Merge branch 'upstream' into reorganize-models-tests
DarkLight1337 Aug 24, 2024
794ba26
Define `core_model` marker
DarkLight1337 Aug 24, 2024
6630dbe
Merge branch 'upstream' into reorganize-models-tests
DarkLight1337 Aug 26, 2024
864be29
Merge branch 'upstream' into reorganize-models-tests
DarkLight1337 Aug 26, 2024
e781692
Remove notebook
DarkLight1337 Aug 26, 2024
160c929
Update timings
DarkLight1337 Aug 26, 2024
36a8357
Merge branch 'upstream' into reorganize-models-tests
DarkLight1337 Aug 28, 2024
755d97e
Fix imports
DarkLight1337 Aug 28, 2024
69bf5d4
Move multi-gpu tests for basic correctness and models
DarkLight1337 Aug 28, 2024
b76d784
Also move the chunked prefill tests
DarkLight1337 Aug 28, 2024
2afa1ff
Merge branch 'upstream' into reorganize-models-tests
DarkLight1337 Aug 28, 2024
efe5cd0
Fix CPU tests
DarkLight1337 Aug 28, 2024
c0d2304
Avoid checking value at import time to avoid problems when performing…
DarkLight1337 Aug 28, 2024
88f30c8
Merge branch 'upstream' into reorganize-models-tests
DarkLight1337 Aug 28, 2024
20c1e2b
Merge branch 'upstream' into reorganize-models-tests
DarkLight1337 Aug 29, 2024
4a5abd7
Fuse tests together
DarkLight1337 Aug 29, 2024
9c2d851
Fix missing args
DarkLight1337 Aug 29, 2024
2690409
Simplify
DarkLight1337 Aug 29, 2024
81a33e0
Fix max tokens
DarkLight1337 Aug 29, 2024
07db9f5
Fix CUDA reinitialization problem for bart
DarkLight1337 Aug 29, 2024
424ae60
Fix type error
DarkLight1337 Aug 29, 2024
77814e5
Specify model
DarkLight1337 Aug 29, 2024
8a1f53e
Remove error suppression as it causes confusion
DarkLight1337 Aug 29, 2024
5ab47bc
Work around CUDA reinitialization error
DarkLight1337 Aug 29, 2024
d74b4ad
Merge branch 'upstream' into reorganize-models-tests
DarkLight1337 Aug 30, 2024
1715f88
Merge branch 'upstream' into reorganize-models-tests
DarkLight1337 Aug 30, 2024
f11f5de
Fix CUDA reinitialization error
DarkLight1337 Aug 30, 2024
dc42191
Fix template
DarkLight1337 Aug 30, 2024
646ee5b
Merge branch 'upstream' into reorganize-models-tests
DarkLight1337 Aug 30, 2024
a6f3a38
Move multimodal distributed tests back into their own file as I can't…
DarkLight1337 Aug 30, 2024
6068127
Remove trailing tab
DarkLight1337 Aug 30, 2024
0db50dd
Add comment
DarkLight1337 Aug 30, 2024
54bd56e
Merge branch 'upstream' into reorganize-models-tests
DarkLight1337 Aug 30, 2024
c17ca7c
Merge branch 'upstream' into reorganize-models-tests
DarkLight1337 Aug 31, 2024
56b61e0
Update timings
DarkLight1337 Aug 31, 2024
3e48b64
Merge branch 'upstream' into reorganize-models-tests
DarkLight1337 Aug 31, 2024
bc1a2de
Merge branch 'upstream' into reorganize-models-tests
DarkLight1337 Sep 2, 2024
ac63126
Move new model tests files
DarkLight1337 Sep 2, 2024
cc50c9b
Remove mark
DarkLight1337 Sep 2, 2024
0255bd3
Merge branch 'upstream' into reorganize-models-tests
DarkLight1337 Sep 4, 2024
1198ac4
Update docstrings
DarkLight1337 Sep 4, 2024
1e0201e
Merge branch 'upstream' into reorganize-models-tests
DarkLight1337 Sep 5, 2024
0b31855
Fix import
DarkLight1337 Sep 5, 2024
507fcdd
Fix unnecessary move
DarkLight1337 Sep 5, 2024
6a6b6f6
Avoid import error during test collection on CPU
DarkLight1337 Sep 5, 2024
56c5eb4
Merge branch 'upstream' into reorganize-models-tests
DarkLight1337 Sep 5, 2024
d14c527
Fix broken link
DarkLight1337 Sep 5, 2024
98a5ea3
Merge branch 'upstream' into reorganize-models-tests
DarkLight1337 Sep 5, 2024
556dc9b
Update docs for Qwen-VL
DarkLight1337 Sep 5, 2024
feb0a8f
Merge branch 'upstream' into reorganize-models-tests
DarkLight1337 Sep 7, 2024
dc0f844
Merge branch 'upstream' into reorganize-models-tests
DarkLight1337 Sep 7, 2024
de800ff
Remove marker
DarkLight1337 Sep 7, 2024
26eb3f7
Merge branch 'upstream' into reorganize-models-tests
DarkLight1337 Sep 7, 2024
fea9e64
format
DarkLight1337 Sep 7, 2024
846576a
Merge branch 'upstream' into reorganize-models-tests
DarkLight1337 Sep 11, 2024
3c54839
Move new tests
DarkLight1337 Sep 11, 2024
f86a2ce
Merge branch 'upstream' into reorganize-models-tests
DarkLight1337 Sep 12, 2024
63622c4
Move test
DarkLight1337 Sep 12, 2024
122d2a5
Merge branch 'upstream' into reorganize-models-tests
DarkLight1337 Sep 12, 2024
8ae76f1
Merge branch 'upstream' into reorganize-models-tests
DarkLight1337 Sep 13, 2024
efc487c
Merge branch 'upstream' into reorganize-models-tests
DarkLight1337 Sep 13, 2024
42cb113
Update lazy import
DarkLight1337 Sep 13, 2024
fddea8e
Update path
DarkLight1337 Sep 13, 2024
10 changes: 4 additions & 6 deletions .buildkite/run-cpu-test.sh
@@ -23,12 +23,10 @@ docker exec cpu-test-avx2 bash -c "python3 examples/offline_inference.py"
# Run basic model test
docker exec cpu-test bash -c "
pip install pytest matplotlib einops transformers_stream_generator
pytest -v -s tests/models -m \"not vlm\" --ignore=tests/models/test_embedding.py \
--ignore=tests/models/test_oot_registration.py \
--ignore=tests/models/test_registry.py \
--ignore=tests/models/test_fp8.py \
--ignore=tests/models/test_jamba.py \
--ignore=tests/models/test_danube3_4b.py" # Mamba and Danube3-4B on CPU is not supported
pytest -v -s tests/models/decoder_only/language \
--ignore=tests/models/test_fp8.py \
--ignore=tests/models/decoder_only/language/test_jamba.py \
--ignore=tests/models/decoder_only/language/test_danube3_4b.py" # Mamba and Danube3-4B on CPU is not supported

# Run compressed-tensor test
docker exec cpu-test bash -c "
70 changes: 45 additions & 25 deletions .buildkite/test-pipeline.yaml
@@ -94,7 +94,6 @@ steps:
- pytest -v -s entrypoints/test_chat_utils.py
- pytest -v -s entrypoints/offline_mode # Needs to avoid interference with other tests


- label: Distributed Tests (4 GPUs) # 10min
working_dir: "/vllm-workspace/tests"
num_gpus: 4
@@ -164,30 +163,13 @@
- python3 tensorize_vllm_model.py --model facebook/opt-125m serialize --serialized-directory /tmp/ --suffix v1 && python3 tensorize_vllm_model.py --model facebook/opt-125m deserialize --path-to-tensors /tmp/vllm/facebook/opt-125m/v1/model.tensors
- python3 offline_inference_encoder_decoder.py

- label: Models Test # 1hr10min
source_file_dependencies:
- vllm/
- tests/models
commands:
- pip install -e ./plugins/vllm_add_dummy_model
- pytest -v -s models/test_oot_registration.py # it needs a clean process
- pytest -v -s models -m \"not vlm\" --ignore=models/test_oot_registration.py

- label: torch compile integration test
source_file_dependencies:
- vllm/
commands:
- pytest -v -s ./compile/test_full_graph.py
- pytest -v -s ./compile/test_wrapper.py


- label: Vision Language Models Test # 42min
#mirror_hardwares: [amd]
source_file_dependencies:
- vllm/
commands:
- pytest -v -s models -m vlm

- label: Prefix Caching Test # 7min
#mirror_hardwares: [amd]
source_file_dependencies:
@@ -286,6 +268,45 @@
commands:
- pytest -v -s tool_use

##### models test #####

- label: Basic Models Test # 3min
source_file_dependencies:
- vllm/
- tests/models
commands:
- pip install -e ./plugins/vllm_add_dummy_model
- pytest -v -s models/test_oot_registration.py # it needs a clean process
- pytest -v -s models/*.py --ignore=models/test_oot_registration.py

- label: Decoder-only Language Models Test # 1h3min
#mirror_hardwares: [amd]
source_file_dependencies:
- vllm/
- tests/models/decoder_only/language
commands:
- pytest -v -s models/decoder_only/language

- label: Decoder-only Multi-Modal Models Test # 56min
#mirror_hardwares: [amd]
source_file_dependencies:
- vllm/
- tests/models/decoder_only/audio_language
- tests/models/decoder_only/vision_language
commands:
- pytest -v -s models/decoder_only/audio_language
- pytest -v -s models/decoder_only/vision_language

- label: Other Models Test # 5min
#mirror_hardwares: [amd]
source_file_dependencies:
- vllm/
- tests/models/embedding/language
- tests/models/encoder_decoder/language
commands:
- pytest -v -s models/embedding/language
- pytest -v -s models/encoder_decoder/language

##### 1 GPU test #####
##### multi gpus test #####

@@ -311,11 +332,11 @@
- tests/distributed/
commands:
- # the following commands are for the first node, with ip 192.168.10.10 (ray environment already set up)
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep -q 'Same node test passed'
Review thread on this line:

Member: why add grep?

@DarkLight1337 (member, author), Sep 13, 2024: Since I have added an `if __name__ == "__main__"` guard to avoid executing the code during test collection, I use grep to ensure that the code inside is actually run during the test.
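A minimal sketch of the pattern being described, with hypothetical stand-in logic (only the printed marker string and the `__main__` guard are taken from the discussion above; the real check lives in distributed/test_same_node.py):

```python
# Sketch of the guard-plus-grep pattern: the test body runs only when the
# file is launched directly (e.g. via torchrun), not during pytest
# collection, and it prints a marker string that the CI command greps for
# to confirm the code actually executed.
import os


def test_same_node() -> None:
    # Stand-in check; the real test compares host identity across workers
    # depending on VLLM_TEST_SAME_HOST.
    expect_same = os.environ.get("VLLM_TEST_SAME_HOST", "1") == "1"
    assert isinstance(expect_same, bool)
    print("Same node test passed")


if __name__ == "__main__":
    # Only executed when run as a script, never at import/collection time.
    test_same_node()
```

The pipeline command then pipes the script's output through `grep -q 'Same node test passed'`, so a body that never actually runs fails the step.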

- VLLM_MULTI_NODE=1 pytest -v -s distributed/test_multi_node_assignment.py
- VLLM_MULTI_NODE=1 pytest -v -s distributed/test_pipeline_parallel.py
- # the following commands are for the second node, with ip 192.168.10.11 (ray environment already set up)
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep -q 'Same node test passed'

- label: Distributed Tests (2 GPUs) # 28min
#mirror_hardwares: [amd]
@@ -328,11 +349,10 @@
- vllm/model_executor/models/
- tests/distributed/
commands:
- VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py
- TARGET_TEST_SUITE=L4 pytest -v -s distributed/test_basic_distributed_correctness.py
- pytest -v -s distributed/test_basic_distributed_correctness_enc_dec.py
- pytest -v -s distributed/test_chunked_prefill_distributed.py
- pytest -v -s distributed/test_multimodal_broadcast.py
- VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep -q 'Same node test passed'
- TARGET_TEST_SUITE=L4 pytest basic_correctness/ -v -s -m distributed_2_gpus
# Avoid importing model tests that cause CUDA reinitialization error
- pytest models/encoder_decoder/language/test_bart.py models/decoder_only/vision_language/test_broadcast.py -v -s -m distributed_2_gpus
- pytest -v -s spec_decode/e2e/test_integration_dist_tp2.py
- pip install -e ./plugins/vllm_add_dummy_model
- pytest -v -s distributed/test_distributed_oot.py
2 changes: 1 addition & 1 deletion docs/source/models/supported_models.rst
@@ -342,7 +342,7 @@ Note that, as an inference engine, vLLM does not introduce new models. Therefore

We have the following levels of testing for models:

1. **Strict Consistency**: We compare the output of the model with the output of the model in the HuggingFace Transformers library under greedy decoding. This is the most stringent test. Please refer to `test_models.py <https://github.com/vllm-project/vllm/blob/main/tests/models/test_models.py>`_ and `test_big_models.py <https://github.com/vllm-project/vllm/blob/main/tests/models/test_big_models.py>`_ for the models that have passed this test.
1. **Strict Consistency**: We compare the output of the model with the output of the model in the HuggingFace Transformers library under greedy decoding. This is the most stringent test. Please refer to `models tests <https://github.com/vllm-project/vllm/blob/main/tests/models>`_ for the models that have passed this test.
2. **Output Sensibility**: We check if the output of the model is sensible and coherent, by measuring the perplexity of the output and checking for any obvious errors. This is a less stringent test.
3. **Runtime Functionality**: We check if the model can be loaded and run without errors. This is the least stringent test. Please refer to `functionality tests <https://github.com/vllm-project/vllm/tree/main/tests>`_ and `examples <https://github.com/vllm-project/vllm/tree/main/examples>`_ for the models that have passed this test.
4. **Community Feedback**: We rely on the community to provide feedback on the models. If a model is broken or not working as expected, we encourage users to raise issues to report it or open pull requests to fix it. The rest of the models fall under this category.
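For illustration, a simplified sketch of what the "Strict Consistency" level boils down to. This is not the actual implementation; the real helper of the same name lives under tests/models/utils.py and is used by the runners in tests/conftest.py. Both backends decode greedily and the outputs must match token-for-token.

```python
# Simplified sketch (not the actual implementation) of the comparison used
# for "Strict Consistency": each output is a (token_ids, text) pair produced
# by greedy decoding, and the two backends must agree exactly.
from typing import List, Tuple


def check_outputs_equal(
    outputs_0_lst: List[Tuple[List[int], str]],
    outputs_1_lst: List[Tuple[List[int], str]],
    name_0: str,
    name_1: str,
) -> None:
    assert len(outputs_0_lst) == len(outputs_1_lst)
    for i, (out_0, out_1) in enumerate(zip(outputs_0_lst, outputs_1_lst)):
        ids_0, text_0 = out_0
        ids_1, text_1 = out_1
        assert ids_0 == ids_1, (
            f"Prompt {i}: {name_0} and {name_1} produced different token IDs")
        assert text_0 == text_1, (
            f"Prompt {i}: {name_0} and {name_1} produced different text")
```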
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -85,5 +85,6 @@ skip_gitignore = true
[tool.pytest.ini_options]
markers = [
"skip_global_cleanup",
"vlm: run tests for vision language models only",
"core_model: run this model test in each PR instead of just daily",
"distributed_2_gpus: run this test only in distributed tests for 2 GPUs",
]
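As a usage note, a hypothetical test file (not part of this PR; the test names are made up) showing how the markers registered above can be attached to tests and selected with, for example, `pytest -m core_model` or `pytest -m distributed_2_gpus`:

```python
# Hypothetical example of applying the markers defined in pyproject.toml.
import pytest


@pytest.mark.core_model
def test_core_model_smoke():
    # Placeholder body standing in for a per-PR model check.
    assert True


@pytest.mark.distributed_2_gpus
def test_two_gpu_smoke():
    # Placeholder body standing in for a 2-GPU distributed check.
    assert True
```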
62 changes: 62 additions & 0 deletions tests/basic_correctness/test_basic_correctness.py
@@ -15,12 +15,15 @@
from vllm.worker.model_runner import ModelInputForGPUWithSamplingMetadata

from ..models.utils import check_outputs_equal
from ..utils import multi_gpu_test

MODELS = [
"facebook/opt-125m",
"meta-llama/Llama-2-7b-hf",
]

TARGET_TEST_SUITE = os.environ.get("TARGET_TEST_SUITE", "L4")


def test_vllm_gc_ed():
"""Verify vllm instance is GC'ed when it is deleted"""
@@ -70,6 +73,65 @@ def test_models(
)


@multi_gpu_test(num_gpus=2)
@pytest.mark.parametrize(
"model, distributed_executor_backend, attention_backend, "
"test_suite", [
("facebook/opt-125m", "ray", "", "L4"),
("facebook/opt-125m", "mp", "", "L4"),
("meta-llama/Llama-2-7b-hf", "ray", "", "L4"),
("meta-llama/Llama-2-7b-hf", "mp", "", "L4"),
("facebook/opt-125m", "ray", "", "A100"),
("facebook/opt-125m", "mp", "", "A100"),
("facebook/opt-125m", "mp", "FLASHINFER", "A100"),
("meta-llama/Meta-Llama-3-8B", "ray", "FLASHINFER", "A100"),
])
def test_models_distributed(
hf_runner,
vllm_runner,
example_prompts,
model: str,
distributed_executor_backend: str,
attention_backend: str,
test_suite: str,
) -> None:

if test_suite != TARGET_TEST_SUITE:
pytest.skip(f"Skip test for {test_suite}")

if model == "meta-llama/Llama-2-7b-hf" and distributed_executor_backend == "ray" and attention_backend == "" and test_suite == "L4": # noqa
# test ray adag
os.environ['VLLM_USE_RAY_SPMD_WORKER'] = "1"
os.environ['VLLM_USE_RAY_COMPILED_DAG'] = "1"

if attention_backend:
os.environ["VLLM_ATTENTION_BACKEND"] = attention_backend

dtype = "half"
max_tokens = 5

# NOTE: take care of the order. run vLLM first, and then run HF.
# vLLM needs a fresh new process without cuda initialization.
# if we run HF first, the cuda initialization will be done and it
# will hurt multiprocessing backend with fork method (the default method).
with vllm_runner(model,
dtype=dtype,
tensor_parallel_size=2,
distributed_executor_backend=distributed_executor_backend
) as vllm_model:
vllm_outputs = vllm_model.generate_greedy(example_prompts, max_tokens)

with hf_runner(model, dtype=dtype) as hf_model:
hf_outputs = hf_model.generate_greedy(example_prompts, max_tokens)

check_outputs_equal(
outputs_0_lst=hf_outputs,
outputs_1_lst=vllm_outputs,
name_0="hf",
name_1="vllm",
)


def test_model_with_failure(vllm_runner) -> None:
try:
with patch("vllm.model_executor.models.opt.OPTForCausalLM.forward",
55 changes: 55 additions & 0 deletions tests/basic_correctness/test_chunked_prefill.py
@@ -6,11 +6,13 @@

Run `pytest tests/models/test_chunked_prefill.py`.
"""
import os
from contextlib import nullcontext

import pytest

from ..models.utils import check_logprobs_close, check_outputs_equal
from ..utils import multi_gpu_test

MODELS = [
"facebook/opt-125m",
@@ -66,6 +68,59 @@ def test_models(
)


@multi_gpu_test(num_gpus=2)
@pytest.mark.parametrize("distributed_executor_backend", ["ray", "mp"])
@pytest.mark.parametrize("model", MODELS)
def test_models_distributed(
hf_runner,
vllm_runner,
example_prompts,
model: str,
distributed_executor_backend: str,
) -> None:
if (model == "meta-llama/Llama-2-7b-hf"
and distributed_executor_backend == "ray"):
# test ray adag
os.environ['VLLM_USE_RAY_SPMD_WORKER'] = "1"
os.environ['VLLM_USE_RAY_COMPILED_DAG'] = "1"

dtype = "half"
max_tokens = 5
chunked_prefill_token_size = 16

# Add a chunked prefill config.
max_num_seqs = min(chunked_prefill_token_size, 256)
assert chunked_prefill_token_size != -1
enable_chunked_prefill = True
max_num_batched_tokens = chunked_prefill_token_size

# NOTE: take care of the order. run vLLM first, and then run HF.
# vLLM needs a fresh new process without cuda initialization.
# if we run HF first, the cuda initialization will be done and it
# will hurt multiprocessing backend with fork method (the default method).

with vllm_runner(
model,
dtype=dtype,
tensor_parallel_size=2,
max_num_seqs=max_num_seqs,
enable_chunked_prefill=enable_chunked_prefill,
max_num_batched_tokens=max_num_batched_tokens,
distributed_executor_backend=distributed_executor_backend,
) as vllm_model:
vllm_outputs = vllm_model.generate_greedy(example_prompts, max_tokens)

with hf_runner(model, dtype=dtype) as hf_model:
hf_outputs = hf_model.generate_greedy(example_prompts, max_tokens)

check_outputs_equal(
outputs_0_lst=hf_outputs,
outputs_1_lst=vllm_outputs,
name_0="hf",
name_1="vllm",
)


@pytest.mark.parametrize(
"kv_cache_dtype,model",
[("fp8_e4m3",
11 changes: 7 additions & 4 deletions tests/basic_correctness/test_preemption.py
@@ -19,10 +19,13 @@
"facebook/opt-125m",
]

assert ENABLE_ARTIFICIAL_PREEMPT is True, (
"Use an env var VLLM_TEST_ENABLE_ARTIFICIAL_PREEMPT=1. "
"`VLLM_TEST_ENABLE_ARTIFICIAL_PREEMPT=1 pytest "
"tests/basic_correctness/test_preemption.py`")

@pytest.fixture(scope="module", autouse=True)
def check_settings():
assert ENABLE_ARTIFICIAL_PREEMPT is True, (
"Use an env var VLLM_TEST_ENABLE_ARTIFICIAL_PREEMPT=1. "
"`VLLM_TEST_ENABLE_ARTIFICIAL_PREEMPT=1 pytest "
"tests/basic_correctness/test_preemption.py`")


@pytest.fixture
29 changes: 12 additions & 17 deletions tests/conftest.py
@@ -6,8 +6,8 @@
import tempfile
from collections import UserList
from enum import Enum
from typing import (Any, Callable, Dict, List, Optional, Tuple, TypedDict,
TypeVar, Union)
from typing import (Any, Callable, Dict, List, Optional, Tuple, Type,
TypedDict, TypeVar, Union)

import numpy as np
import pytest
@@ -18,6 +18,7 @@
from PIL import Image
from transformers import (AutoModelForCausalLM, AutoTokenizer, BatchEncoding,
BatchFeature)
from transformers.models.auto.auto_factory import _BaseAutoModelClass

from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset
@@ -260,7 +261,7 @@ def __init__(
*,
model_kwargs: Optional[Dict[str, Any]] = None,
is_embedding_model: bool = False,
auto_cls=AutoModelForCausalLM,
auto_cls: Type[_BaseAutoModelClass] = AutoModelForCausalLM,
postprocess_inputs: Callable[[BatchEncoding],
BatchEncoding] = identity,
) -> None:
@@ -292,20 +293,14 @@ def __init__(
trust_remote_code=True,
)

try:
# don't put this import at the top level
# it will call torch.cuda.device_count()
from transformers import AutoProcessor # noqa: F401
self.processor = AutoProcessor.from_pretrained(
model_name,
torch_dtype=torch_dtype,
trust_remote_code=True,
)
except Exception as exc:
logger.warning(
"Unable to auto-load HuggingFace processor for model (%s). "
"Using tokenizer instead. Reason: %s", model_name, exc)
self.processor = self.tokenizer
# don't put this import at the top level
# it will call torch.cuda.device_count()
from transformers import AutoProcessor # noqa: F401
self.processor = AutoProcessor.from_pretrained(
model_name,
torch_dtype=torch_dtype,
trust_remote_code=True,
)

self.postprocess_inputs = postprocess_inputs
