Merged
Changes from all commits (1112 commits)
9799280
[CI/Build]Reduce the time consumption for LoRA tests (#7396)
jeejeelee Aug 14, 2024
ea49e6a
[misc][ci] fix cpu test with plugins (#7489)
youkaichao Aug 14, 2024
dd164d7
[Bugfix][Docs] Update list of mock imports (#7493)
DarkLight1337 Aug 14, 2024
199adbb
[doc] update test script to include cudagraph (#7501)
youkaichao Aug 14, 2024
c134a46
Fix empty output when temp is too low (#2937)
CatherineSue Aug 14, 2024
d3d9cb6
[ci] fix model tests (#7507)
youkaichao Aug 14, 2024
67d115d
[Bugfix][Frontend] Disable embedding API for chat models (#7504)
QwertyJack Aug 14, 2024
70b746e
[Misc] Deprecation Warning when setting --engine-use-ray (#7424)
wallashss Aug 14, 2024
3f674a4
[VLM][Core] Support profiling with multiple multi-modal inputs per pr…
DarkLight1337 Aug 14, 2024
2ecf7b1
[core] [3/N] multi-step args and sequence.py (#7452)
SolitaryThinker Aug 14, 2024
951fdd6
[TPU] Set per-rank XLA cache (#7533)
WoosukKwon Aug 14, 2024
f55a9ae
[Misc] Revert `compressed-tensors` code reuse (#7521)
kylesayrs Aug 14, 2024
22b39e1
llama_index serving integration documentation (#6973)
pavanjava Aug 14, 2024
fc93e56
[Bugfix][TPU] Correct env variable for XLA cache path (#7544)
WoosukKwon Aug 15, 2024
9c1f78d
[Bugfix] update neuron for version > 0.5.0 (#7175)
omrishiv Aug 15, 2024
f4da5f7
[Misc] Update dockerfile for CPU to cover protobuf installation (#7182)
PHILO-HE Aug 15, 2024
21313e0
[Bugfix] Fix default weight loading for scalars (#7534)
mgoin Aug 15, 2024
9c8e2d1
[Bugfix][Harmless] Fix float16 dtype for model_is_embedding (#7566)
mgoin Aug 16, 2024
b67ae00
[Misc] Add quantization config support for speculative model. (#7343)
ShangmingCai Aug 16, 2024
f878c8f
[Feature]: Add OpenAI server prompt_logprobs support #6508 (#7453)
gnpinkert Aug 16, 2024
4cd7d47
[ci/test] rearrange tests and make adag test soft fail (#7572)
youkaichao Aug 16, 2024
3b19e39
Chat method for offline llm (#5049)
nunjunj Aug 16, 2024
e165528
[CI] Move quantization cpu offload tests out of fastcheck (#7574)
mgoin Aug 16, 2024
50b8d08
[Misc/Testing] Use `torch.testing.assert_close` (#7324)
jon-chuang Aug 16, 2024
54bd9a0
register custom op for flash attn and use from torch.ops (#7536)
youkaichao Aug 16, 2024
9587b05
[Core] Use uvloop with zmq-decoupled front-end (#7570)
njhill Aug 16, 2024
6fc5b0f
[CI] Fix crashes of performance benchmark (#7500)
KuntaiDu Aug 16, 2024
0e39a33
[Bugfix][Hardware][AMD][Frontend] add quantization param to embedding…
gongdao123 Aug 16, 2024
ec724a7
support tqdm in notebooks (#7510)
fzyzcjy Aug 16, 2024
e837b62
[Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm (#7210)
charlifu Aug 16, 2024
7fc23be
[Kernel] W8A16 Int8 inside FusedMoE (#7415)
mzusman Aug 16, 2024
855866c
[Kernel] Add tuned triton configs for ExpertsInt8 (#7601)
mgoin Aug 16, 2024
f366f63
[spec decode] [4/N] Move update_flash_attn_metadata to attn backend (…
SolitaryThinker Aug 16, 2024
93478b6
[Core] Fix tracking of model forward time in case of PP>1 (#7440)
sfc-gh-mkeralapura Aug 16, 2024
b3f4e17
[Doc] Add docs for llmcompressor INT8 and FP8 checkpoints (#7444)
mgoin Aug 16, 2024
d4f0f17
[Doc] Update quantization supported hardware table (#7595)
mgoin Aug 16, 2024
9f69856
[Kernel] register punica functions as torch ops (#7591)
bnellnm Aug 16, 2024
7759ae9
[Kernel][Misc] dynamo support for ScalarType (#7594)
bnellnm Aug 16, 2024
37fd47e
[Kernel] fix types used in aqlm and ggml kernels to support dynamo (#…
bnellnm Aug 16, 2024
44f26a9
[Model] Align nemotron config with final HF state and fix lm-eval-sma…
mgoin Aug 16, 2024
e680349
[Bugfix] Fix custom_ar support check (#7617)
bnellnm Aug 17, 2024
6bd1955
.[Build/CI] Enabling passing AMD tests. (#7610)
Alexei-V-Ivanov-AMD Aug 17, 2024
bae888c
[Bugfix] Clear engine reference in AsyncEngineRPCServer (#7618)
ruisearch42 Aug 17, 2024
4706eb6
[aDAG] Unflake aDAG + PP tests (#7600)
rkooo567 Aug 17, 2024
7c0b7ea
[Bugfix] add >= 1.0 constraint for openai dependency (#7612)
metasyn Aug 17, 2024
eed020f
[misc] use nvml to get consistent device name (#7582)
youkaichao Aug 17, 2024
5bf45db
[ci][test] fix engine/logger test (#7621)
youkaichao Aug 17, 2024
d95cc0a
[core][misc] update libcudart finding (#7620)
youkaichao Aug 17, 2024
e73f76e
[Model] Pipeline parallel support for JAIS (#7603)
mrbesher Aug 17, 2024
832163b
[ci][test] allow longer wait time for api server (#7629)
youkaichao Aug 17, 2024
1ef13cf
[Misc]Fix BitAndBytes exception messages (#7626)
jeejeelee Aug 17, 2024
bbf55c4
[VLM] Refactor `MultiModalConfig` initialization and profiling (#7530)
ywang96 Aug 17, 2024
ce14335
[TPU] Skip creating empty tensor (#7630)
WoosukKwon Aug 17, 2024
0c2fa50
[TPU] Use mark_dynamic only for dummy run (#7634)
WoosukKwon Aug 18, 2024
ab7165f
[TPU] Optimize RoPE forward_native2 (#7636)
WoosukKwon Aug 18, 2024
e3b3182
[ Bugfix ] Fix Prometheus Metrics With `zeromq` Frontend (#7279)
robertgshaw2-redhat Aug 18, 2024
40e1360
[CI/Build] Add text-only test for Qwen models (#7475)
alex-jw-brooks Aug 18, 2024
200a2ff
[Misc] Refactor Llama3 RoPE initialization (#7637)
WoosukKwon Aug 19, 2024
ff7ec82
[Core] Optimize SPMD architecture with delta + serialization optimiza…
rkooo567 Aug 19, 2024
f710fb5
[Core] Use flashinfer sampling kernel when available (#7137)
peng1999 Aug 19, 2024
1a36287
[Bugfix] Fix xpu build (#7644)
jikunshang Aug 19, 2024
df845b2
[Misc] Remove Gemma RoPE (#7638)
WoosukKwon Aug 19, 2024
3ac50b4
[MISC] Add prefix cache hit rate to metrics (#7606)
comaniac Aug 19, 2024
dad961e
[Bugfix] fix lora_dtype value type in arg_utils.py - part 2 (#5428)
c3-ali Aug 19, 2024
47b65a5
[core] Multi Step Scheduling (#7000)
SolitaryThinker Aug 19, 2024
7601cb0
[Core] Support tensor parallelism for GGUF quantization (#7520)
Isotr0py Aug 19, 2024
da11523
[Bugfix] Don't disable existing loggers (#7664)
a-ys Aug 19, 2024
43735bf
[TPU] Remove redundant input tensor cloning (#7660)
WoosukKwon Aug 19, 2024
67e02fa
[Bugfix] use StoreBoolean instead of type=bool for --disable-logprobs…
tjohnson31415 Aug 20, 2024
e54ebc2
[doc] fix doc build error caused by msgspec (#7659)
youkaichao Aug 20, 2024
312f761
[Speculative Decoding] Fixing hidden states handling in batch expansi…
abhigoyal1997 Aug 20, 2024
0df7ec0
[ci] Install Buildkite test suite analysis (#7667)
khluu Aug 20, 2024
f4fc733
[Bugfix] support `tie_word_embeddings` for all models (#5724)
zijian-hu Aug 20, 2024
3d8a5f0
[CI] Organizing performance benchmark files (#7616)
KuntaiDu Aug 20, 2024
c4be16e
[misc] add nvidia related library in collect env (#7674)
youkaichao Aug 20, 2024
e6d811d
[XPU] fallback to native implementation for xpu custom op (#7670)
jianyizh Aug 20, 2024
ad28a74
[misc][cuda] add warning for pynvml user (#7675)
youkaichao Aug 20, 2024
b6f99a6
[Core] Refactor executor classes for easier inheritance (#7673)
jikunshang Aug 20, 2024
5288c06
[Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kern…
LucasWilkinson Aug 20, 2024
398521a
[OpenVINO] Updated documentation (#7687)
ilya-lavrenov Aug 20, 2024
aae6927
[VLM][Model] Add test for InternViT vision encoder (#7409)
Isotr0py Aug 20, 2024
c42590f
[Hardware] [Intel GPU] refactor xpu worker/executor (#7686)
jikunshang Aug 20, 2024
2aa00d5
[CI/Build] Pin OpenTelemetry versions and make errors clearer (#7266)
ronensc Aug 20, 2024
c6af027
[Misc] Add jinja2 as an explicit build requirement (#7695)
LucasWilkinson Aug 20, 2024
3b68217
[Core] Add `AttentionState` abstraction (#7663)
Yard1 Aug 20, 2024
6e4658c
[Intel GPU] fix xpu not support punica kernel (which use torch.librar…
jikunshang Aug 20, 2024
9e51b6a
[ci][test] adjust max wait time for cpu offloading test (#7709)
youkaichao Aug 21, 2024
66a9e71
[Core] Pipe `worker_class_fn` argument in Executor (#7707)
Yard1 Aug 21, 2024
b74a125
[ci] try to log process using the port to debug the port usage (#7711)
youkaichao Aug 21, 2024
12e1c65
[Model] Add AWQ quantization support for InternVL2 model (#7187)
Isotr0py Aug 21, 2024
4506641
[Doc] Section for Multimodal Language Models (#7719)
ywang96 Aug 21, 2024
baaedfd
[mypy] Enable following imports for entrypoints (#7248)
DarkLight1337 Aug 21, 2024
dd3fa0e
[Bugfix] Mirror jinja2 in pyproject.toml (#7723)
sasha0552 Aug 21, 2024
c75363f
[BugFix] Avoid premature async generator exit and raise all exception…
njhill Aug 21, 2024
53328d7
[BUG] fix crash on flashinfer backend with cudagraph disabled, when a…
learninmou Aug 21, 2024
6925cdb
[Bugfix][Hardware][CPU] Fix `mm_limits` initialization for CPU backen…
Isotr0py Aug 21, 2024
9b73a2f
[Spec Decoding] Use target model max length as default for draft mode…
njhill Aug 21, 2024
d3c002e
[Bugfix] chat method add_generation_prompt param (#7734)
brian14708 Aug 21, 2024
f7e3b0c
[Bugfix][Frontend] Fix Issues Under High Load With `zeromq` Frontend …
robertgshaw2-redhat Aug 21, 2024
1b32e02
[Bugfix] Pass PYTHONPATH from setup.py to CMake (#7730)
sasha0552 Aug 21, 2024
91f4522
[multi-step] Raise error if not using async engine (#7703)
SolitaryThinker Aug 21, 2024
970dfdc
[Frontend] Improve Startup Failure UX (#7716)
robertgshaw2-redhat Aug 21, 2024
dd53c4b
[misc] Add Torch profiler support (#7451)
SolitaryThinker Aug 21, 2024
1ca0d4f
[Model] Add UltravoxModel and UltravoxConfig (#7615)
petersalas Aug 21, 2024
5844017
[ci] [multi-step] narrow multi-step test dependency paths (#7760)
SolitaryThinker Aug 21, 2024
8678a69
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7…
dsikka Aug 21, 2024
7eebe8c
[distributed][misc] error on same VLLM_HOST_IP setting (#7756)
youkaichao Aug 21, 2024
9984605
[AMD][CI/Build] Disambiguation of the function call for ROCm 6.2 head…
gshtras Aug 21, 2024
7937009
[Kernel] Replaced `blockReduce[...]` functions with `cub::BlockReduce…
ProExpertProg Aug 22, 2024
df1a211
[Model] Fix Phi-3.5-vision-instruct 'num_crops' issue (#7710)
zifeitong Aug 22, 2024
cde9183
[Bug][Frontend] Improve ZMQ client robustness (#7443)
joerunde Aug 22, 2024
aae74ef
Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Ke…
mgoin Aug 22, 2024
eeee1c3
[TPU] Avoid initializing TPU runtime in is_tpu (#7763)
WoosukKwon Aug 22, 2024
8c6f694
[ci] refine dependency for distributed tests (#7776)
youkaichao Aug 22, 2024
b3856be
[Misc] Use torch.compile for GemmaRMSNorm (#7642)
WoosukKwon Aug 22, 2024
a3fce56
[Speculative Decoding] EAGLE Implementation with Top-1 proposer (#6830)
abhigoyal1997 Aug 22, 2024
4f419c0
Fix ShardedStateLoader for vllm fp8 quantization (#7708)
sfc-gh-zhwang Aug 22, 2024
55d63b1
[Bugfix] Don't build machete on cuda <12.0 (#7757)
LucasWilkinson Aug 22, 2024
955b519
[Misc] update fp8 to use `vLLMParameter` (#7437)
dsikka Aug 22, 2024
cc0eaf1
[Bugfix] spec decode handle None entries in topk args in create_seque…
tjohnson31415 Aug 22, 2024
d3b5b98
[Misc] Enhance prefix-caching benchmark tool (#6568)
Jeffwan Aug 22, 2024
57792ed
[Doc] Fix incorrect docs from #7615 (#7788)
petersalas Aug 22, 2024
15310b5
[Bugfix] Use LoadFormat values for `vllm serve --load-format` (#7784)
mgoin Aug 22, 2024
666ad0a
[ci] Cleanup & refactor Dockerfile to pass different Python versions …
khluu Aug 22, 2024
a152246
[Misc] fix typo in triton import warning (#7794)
lsy323 Aug 22, 2024
b903e1b
[Frontend] error suppression cleanup (#7786)
joerunde Aug 22, 2024
c01a6cb
[Ray backend] Better error when pg topology is bad. (#7584)
rkooo567 Aug 23, 2024
fc5ebbd
[Hardware][Intel GPU] refactor xpu_model_runner for tp (#7712)
jikunshang Aug 23, 2024
faeddb5
[misc] Add Torch profiler support for CPU-only devices (#7806)
DamonFool Aug 23, 2024
e25fee5
[BugFix] Fix server crash on empty prompt (#7746)
maxdebayser Aug 23, 2024
35ee2ad
[github][misc] promote asking llm first (#7809)
youkaichao Aug 23, 2024
f1df5db
[Misc] Update `marlin` to use vLLMParameters (#7803)
dsikka Aug 23, 2024
09c7792
Bump version to v0.5.5 (#7823)
simon-mo Aug 23, 2024
9db93de
[Core] Add multi-step support to LLMEngine (#7789)
alexm-redhat Aug 23, 2024
6885fde
[Bugfix] Fix run_batch logger (#7640)
pooyadavoodi Aug 23, 2024
8da48e4
[Frontend] Publish Prometheus metrics in run_batch API (#7641)
pooyadavoodi Aug 24, 2024
d81abef
[Frontend] add json_schema support from OpenAI protocol (#7654)
rockwotj Aug 24, 2024
7d9ffa2
[misc][core] lazy import outlines (#7831)
youkaichao Aug 24, 2024
ea9fa16
[ci][test] exclude model download time in server start time (#7834)
youkaichao Aug 24, 2024
aab0fcd
[ci][test] fix RemoteOpenAIServer (#7838)
youkaichao Aug 24, 2024
80162c4
[Bugfix] Fix Phi-3v crash when input images are of certain sizes (#7840)
zifeitong Aug 25, 2024
8aaf3d5
[Model][VLM] Support multi-images inputs for Phi-3-vision models (#7…
Isotr0py Aug 25, 2024
2059b8d
[Misc] Remove snapshot_download usage in InternVL2 test (#7835)
Isotr0py Aug 25, 2024
70c094a
[misc][cuda] improve pynvml warning (#7852)
youkaichao Aug 25, 2024
1856aff
[Spec Decoding] Streamline batch expansion tensor manipulation (#7851)
njhill Aug 25, 2024
0b76999
[Bugfix]: Use float32 for base64 embedding (#7855)
HollowMan6 Aug 26, 2024
029c71d
[CI/Build] Avoid downloading all HF files in `RemoteOpenAIServer` (#7…
DarkLight1337 Aug 26, 2024
2deb029
[Performance][BlockManagerV2] Mark prefix cache block as computed aft…
comaniac Aug 26, 2024
6653040
[Misc] Update `qqq` to use vLLMParameters (#7805)
dsikka Aug 26, 2024
dd9857f
[Misc] Update `gptq_marlin_24` to use vLLMParameters (#7762)
dsikka Aug 26, 2024
05826c8
[misc] fix custom allreduce p2p cache file generation (#7853)
youkaichao Aug 26, 2024
760e9f7
[Bugfix] neuron: enable tensor parallelism (#7562)
omrishiv Aug 26, 2024
015e6cc
[Misc] Update compressed tensors lifecycle to remove `prefix` from `c…
dsikka Aug 27, 2024
2eedede
[Core] Asynchronous Output Processor (#7049)
megha95 Aug 27, 2024
39178c7
[Tests] Disable retries and use context manager for openai client (#7…
njhill Aug 27, 2024
64cc644
[core][torch.compile] discard the compile for profiling (#7796)
youkaichao Aug 27, 2024
9606c71
Revert #7509 (#7887)
comaniac Aug 27, 2024
6fc4e6e
[Model] Add Mistral Tokenization to improve robustness and chat encod…
patrickvonplaten Aug 27, 2024
9db6421
[CI/Build][VLM] Cleanup multiple images inputs model test (#7897)
Isotr0py Aug 27, 2024
076169f
[Hardware][Intel GPU] Add intel GPU pipeline parallel support. (#7810)
jikunshang Aug 27, 2024
42e932c
[CI/Build][ROCm] Enabling tensorizer tests for ROCm (#7237)
alexeykondrat Aug 27, 2024
b09c755
[Bugfix] Fix phi3v incorrect image_idx when using async engine (#7916)
Isotr0py Aug 27, 2024
ed6f002
[cuda][misc] error on empty CUDA_VISIBLE_DEVICES (#7924)
youkaichao Aug 27, 2024
fc91188
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7766)
dsikka Aug 27, 2024
345be0e
[benchmark] Update TGI version (#7917)
philschmid Aug 27, 2024
5340a2d
[Model] Add multi-image input support for LLaVA-Next offline inferenc…
zifeitong Aug 27, 2024
9c71c97
[mypy] Enable mypy type checking for `vllm/core` (#7229)
jberkhahn Aug 27, 2024
fab5f53
[Core][VLM] Stack multimodal tensors to represent multiple images wit…
petersalas Aug 28, 2024
bc6e42a
[hardware][rocm] allow rocm to override default env var (#7926)
youkaichao Aug 28, 2024
c166e7e
[Bugfix] Allow ScalarType to be compiled with pytorch 2.3 and add che…
bnellnm Aug 28, 2024
51f86bf
[mypy][CI/Build] Fix mypy errors (#7929)
DarkLight1337 Aug 28, 2024
f508e03
[Core] Async_output_proc: Add virtual engine support (towards pipelin…
alexm-redhat Aug 28, 2024
e358053
[Performance] Enable chunked prefill and prefix caching together (#7753)
comaniac Aug 28, 2024
f52a43a
[ci][test] fix pp test failure (#7945)
youkaichao Aug 28, 2024
98c12cf
[Doc] fix the autoAWQ example (#7937)
stas00 Aug 28, 2024
ef9baee
[Bugfix][VLM] Fix incompatibility between #7902 and #7230 (#7948)
DarkLight1337 Aug 28, 2024
b98cc28
[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when availabl…
pavanimajety Aug 28, 2024
e5697d1
[Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize …
rasmith Aug 28, 2024
eeffde1
[TPU] Upgrade PyTorch XLA nightly (#7967)
WoosukKwon Aug 28, 2024
8c56e57
[Doc] fix 404 link (#7966)
stas00 Aug 28, 2024
9f7e830
Merge remote-tracking branch 'upstream/main'
gshtras Aug 28, 2024
fdd9daa
[Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM (#…
mzusman Aug 28, 2024
3cdfe1f
[Bugfix] Make torch registration of punica ops optional (#7970)
bnellnm Aug 28, 2024
5fe12cf
Merge remote-tracking branch 'upstream/main' into v5.5_upstream_merge_rc
gshtras Aug 28, 2024
f5bfb03
Merge remote-tracking branch 'origin/main' into v5.5_upstream_merge_rc
gshtras Aug 28, 2024
ce6bf3a
[torch.compile] avoid Dynamo guard evaluation overhead (#7898)
youkaichao Aug 28, 2024
af59df0
Remove faulty Meta-Llama-3-8B-Instruct-FP8.yaml lm-eval test (#7961)
mgoin Aug 28, 2024
4289cad
[Frontend] Minor optimizations to zmq decoupled front-end (#7957)
njhill Aug 29, 2024
a7f65c2
[torch.compile] remove reset (#7975)
youkaichao Aug 29, 2024
74d5543
[VLM][Core] Fix exceptions on ragged NestedTensors (#7974)
petersalas Aug 29, 2024
ef99a78
Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when …
youkaichao Aug 29, 2024
f205c09
[Bugfix] Unify rank computation across regular decoding and speculati…
jmkuebler Aug 29, 2024
b6ae399
moe exports required for test_moe_rocm. Type fix in sync_llm. Linting
gshtras Aug 29, 2024
384c141
Merge remote-tracking branch 'upstream/main' into v5.5_upstream_merge_rc
gshtras Aug 29, 2024
23aa669
Post merge regression fix
gshtras Aug 29, 2024
3f60f22
[Core] Combine async postprocessor and multi-step (#7921)
alexm-redhat Aug 29, 2024
6b34215
[Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFi…
pavanimajety Aug 29, 2024
c334b18
extend cuda graph size for H200 (#7894)
kushanam Aug 29, 2024
d78789a
[Bugfix] Fix incorrect vocal embedding shards for GGUF model in tenso…
Isotr0py Aug 29, 2024
d28eaef
Merge remote-tracking branch 'upstream/main' into v5.5_upstream_merge_rc
gshtras Aug 29, 2024
216cfb1
Merge remote-tracking branch 'origin/main' into v5.5_upstream_merge_rc
gshtras Aug 29, 2024
a50159e
fp8 bulk convert is no longer experimental
gshtras Aug 29, 2024
86a677d
[misc] update tpu int8 to use new vLLM Parameters (#7973)
dsikka Aug 29, 2024
257afc3
[Neuron] Adding support for context-lenght, token-gen buckets. (#7885)
hbikki Aug 29, 2024
65d921d
Temporary fix for fp8 tp>1 and scaled_mm for different torch versions
gshtras Aug 29, 2024
4e36cd9
Removed redundant checks for awq dequantize as in hip it always uses …
gshtras Aug 29, 2024
8295ea0
linter and unused import
gshtras Aug 29, 2024
4664cea
support bitsandbytes 8-bit and FP4 quantized models (#7445)
chenqianfzh Aug 29, 2024
0c785d3
Add more percentiles and latencies (#7759)
wschin Aug 29, 2024
4abed65
[VLM] Disallow overflowing `max_model_len` for multimodal models (#7998)
DarkLight1337 Aug 30, 2024
428dd14
[Core] Logprobs support in Multi-step (#7652)
afeldman-nm Aug 30, 2024
80c7b08
[TPU] Async output processing for TPU (#8011)
WoosukKwon Aug 30, 2024
34a0e96
[Kernel] changing fused moe kernel chunk size default to 32k (#7995)
avshalomman Aug 30, 2024
dc13e99
[MODEL] add Exaone model support (#7819)
nayohan Aug 30, 2024
2148441
[TPU] Support single and multi-host TPUs on GKE (#7613)
richardsliu Aug 30, 2024
afd39a4
[Bugfix] Fix import error in Exaone model (#8034)
DarkLight1337 Aug 30, 2024
f97be32
[VLM][Model] TP support for ViTs (#7186)
ChristopherCho Aug 30, 2024
98cef6a
[Core] Increase default `max_num_batched_tokens` for multimodal model…
DarkLight1337 Aug 30, 2024
058344f
[Frontend]-config-cli-args (#7737)
KaunilD Aug 30, 2024
2684efc
[TPU][Bugfix] Fix tpu type api (#8035)
WoosukKwon Aug 30, 2024
1248e85
[Model] Adding support for MSFT Phi-3.5-MoE (#7729)
wenxcs Aug 30, 2024
622f8ab
[Bugfix] bugfix and add model test for flashinfer fp8 kv cache. (#8013)
pavanimajety Aug 31, 2024
d05f0a9
[Bugfix] Fix import error in Phi-3.5-MoE (#8052)
DarkLight1337 Aug 31, 2024
4f5d844
[Bugfix] Fix ModelScope models in v0.5.5 (#8037)
NickLucche Aug 31, 2024
8423aef
[BugFix][Core] Multistep Fix Crash on Request Cancellation (#8059)
robertgshaw2-redhat Aug 31, 2024
5231f08
[Frontend][VLM] Add support for multiple multi-modal items (#8049)
ywang96 Aug 31, 2024
5b86b19
[Misc] Optional installation of audio related packages (#8063)
ywang96 Sep 1, 2024
f8d6014
[Model] Add Granite model (#7436)
shawntan Sep 2, 2024
e6a26ed
[SpecDecode][Kernel] Flashinfer Rejection Sampling (#7244)
LiuXiaoxuanPKU Sep 2, 2024
e2b2aa5
[TPU] Align worker index with node boundary (#7932)
WoosukKwon Sep 2, 2024
4ca65a9
[Core][Bugfix] Accept GGUF model without .gguf extension (#8056)
Isotr0py Sep 2, 2024
dd2a6a8
[Bugfix] Fix internlm2 tensor parallel inference (#8055)
Isotr0py Sep 2, 2024
6e36f4f
improve chunked prefill performance
noooop Sep 2, 2024
0fbc669
[Bugfix] Fix single output condition in output processor (#7881)
WoosukKwon Sep 3, 2024
ec26653
[Bugfix][VLM] Add fallback to SDPA for ViT model running on CPU backe…
Isotr0py Sep 3, 2024
bd852f2
[Performance] Enable chunked prefill and prefix caching together (#8120)
comaniac Sep 3, 2024
95a178f
[CI] Only PR reviewers/committers can trigger CI on PR (#8124)
khluu Sep 3, 2024
6d646d0
[Core] Optimize Async + Multi-step (#8050)
alexm-redhat Sep 3, 2024
652c83b
[Misc] Raise a more informative exception in add/remove_logger (#7750)
Yard1 Sep 3, 2024
c02638e
[CI/Build] make pip install vllm work in macos (for import only) (#8118)
tomeras91 Sep 3, 2024
f1575dc
[ci] Fix GHA workflow (#8129)
khluu Sep 3, 2024
0af3abe
[TPU][Bugfix] Fix next_token_ids shape (#8128)
WoosukKwon Sep 3, 2024
dc0b606
[CI] Change PR remainder to avoid at-mentions (#8134)
simon-mo Sep 3, 2024
2188a60
[Misc] Update `GPTQ` to use `vLLMParameters` (#7976)
dsikka Sep 3, 2024
be9f84e
Initial support for compressed-tensors quantization
gshtras Sep 3, 2024
05e67ab
Picking fixes from https://github.com/ROCm/vllm/pull/163/files by @ma…
gshtras Sep 3, 2024
7fd46eb
Merge remote-tracking branch 'upstream/main' into v5.5_upstream_merge_rc
gshtras Sep 3, 2024
7edb2fd
Update Dockerfile to 6.2, update ROCm components, remove Cython (#166)
mawong-amd Sep 4, 2024
46c5fed
Linters and adapting the sync server to upstream API changes
gshtras Sep 4, 2024
abcdce9
More linting
gshtras Sep 4, 2024
2 changes: 1 addition & 1 deletion .buildkite/check-wheel-size.py
@@ -1,7 +1,7 @@
 import os
 import zipfile

-MAX_SIZE_MB = 200
+MAX_SIZE_MB = 250


 def print_top_10_largest_files(zip_file):
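Only the head of check-wheel-size.py is shown in this hunk. For orientation, a minimal sketch of what a wheel-size gate like this does, built around the visible MAX_SIZE_MB constant and print_top_10_largest_files signature; the directory walk and entry point below are assumptions, not the file's actual contents:

import os
import sys
import zipfile

MAX_SIZE_MB = 250  # limit raised from 200 MB to 250 MB in this PR


def print_top_10_largest_files(zip_file):
    # Print the ten largest members of the wheel to help debug oversized builds.
    with zipfile.ZipFile(zip_file, "r") as z:
        entries = [(info.file_size, info.filename) for info in z.infolist()]
        for size, name in sorted(entries, reverse=True)[:10]:
            print(f"{name}: {size / (1024 * 1024):.2f} MB")


def check_wheel_size(directory):
    # Fail (return 1) if any built wheel under `directory` exceeds the limit.
    for root, _, files in os.walk(directory):
        for fname in files:
            if fname.endswith(".whl"):
                path = os.path.join(root, fname)
                size_mb = os.path.getsize(path) / (1024 * 1024)
                if size_mb > MAX_SIZE_MB:
                    print(f"Wheel {path} is {size_mb:.2f} MB, "
                          f"over the {MAX_SIZE_MB} MB limit.")
                    print_top_10_largest_files(path)
                    return 1
    return 0


if __name__ == "__main__":
    sys.exit(check_wheel_size(sys.argv[1]))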
18 changes: 0 additions & 18 deletions .buildkite/download-images.sh

This file was deleted.

12 changes: 12 additions & 0 deletions .buildkite/lm-eval-harness/configs/DeepSeek-V2-Lite-Chat.yaml
@@ -0,0 +1,12 @@
# bash ./run-lm-eval-gsm-vllm-baseline.sh -m deepseek-ai/DeepSeek-V2-Lite-Chat -b "auto" -l 1000 -f 5 -t 2
model_name: "deepseek-ai/DeepSeek-V2-Lite-Chat"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.671
- name: "exact_match,flexible-extract"
value: 0.664
limit: 1000
num_fewshot: 5
trust_remote_code: True
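Each of these configs records a model, its GSM8k baseline scores, and the evaluation settings used to produce them. A minimal sketch of how a CI check might compare a fresh run against such a baseline; the RTOL value and function name are assumptions rather than the harness's actual code:

import yaml
import numpy as np

RTOL = 0.05  # assumed relative tolerance; the real CI may use a different value


def check_lm_eval_results(config_path, results):
    # Compare measured lm-eval metrics against the baselines recorded in the config.
    # `results` maps task name -> metric name -> measured value.
    with open(config_path) as f:
        cfg = yaml.safe_load(f)
    for task in cfg["tasks"]:
        for metric in task["metrics"]:
            measured = results[task["name"]][metric["name"]]
            expected = metric["value"]
            assert np.isclose(expected, measured, rtol=RTOL), (
                f"{task['name']}/{metric['name']}: "
                f"measured {measured}, baseline {expected}")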
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Meta-Llama-3-70B-Instruct-FBGEMM-nonuniform.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m nm-testing/Meta-Llama-3-70B-Instruct-FBGEMM-nonuniform -b auto -l 1000 -f 5
model_name: "nm-testing/Meta-Llama-3-70B-Instruct-FBGEMM-nonuniform"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.905
- name: "exact_match,flexible-extract"
value: 0.905
limit: 1000
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Meta-Llama-3-70B-Instruct.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-70B-Instruct -b 32 -l 250 -f 5
model_name: "meta-llama/Meta-Llama-3-70B-Instruct"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.892
- name: "exact_match,flexible-extract"
value: 0.892
limit: 250
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8A8-FP8-Channelwise-compressed-tensors -b auto -l 1000 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8A8-FP8-Channelwise-compressed-tensors"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.752
- name: "exact_match,flexible-extract"
value: 0.754
limit: 1000
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-FBGEMM-nonuniform -b auto -l 1000 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-Instruct-FBGEMM-nonuniform"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.753
- name: "exact_match,flexible-extract"
value: 0.753
limit: 1000
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-FP8-compressed-tensors-test -b 32 -l 1000 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-FP8-compressed-tensors-test"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.755
- name: "exact_match,flexible-extract"
value: 0.755
limit: 1000
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Meta-Llama-3-8B-Instruct-FP8 -b 32 -l 250 -f 5 -t 1
model_name: "neuralmagic/Meta-Llama-3-8B-Instruct-FP8"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.753
- name: "exact_match,flexible-extract"
value: 0.753
limit: 1000
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Per-Token-Test -b "auto" -l 250 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Per-Token-Test"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.728
- name: "exact_match,flexible-extract"
value: 0.728
limit: 250
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-nonuniform-test -b auto -l 1000 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-Instruct-nonuniform-test"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.758
- name: "exact_match,flexible-extract"
value: 0.759
limit: 1000
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-8B-Instruct -b 32 -l 250 -f 5 -t 1
model_name: "meta-llama/Meta-Llama-3-8B-Instruct"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.756
- name: "exact_match,flexible-extract"
value: 0.752
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-QQQ.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m HandH1998/QQQ-Llama-3-8b-g128 -b 32 -l 1000 -f 5 -t 1
model_name: "HandH1998/QQQ-Llama-3-8b-g128"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.419
- name: "exact_match,flexible-extract"
value: 0.416
limit: 1000
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Minitron-4B-Base-FP8.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m mgoin/Minitron-4B-Base-FP8 -b auto -l 1000 -f 5 -t 1
model_name: "mgoin/Minitron-4B-Base-FP8"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.233
- name: "exact_match,flexible-extract"
value: 0.236
limit: 1000
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash ./run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Mixtral-8x22B-Instruct-v0.1-FP8-dynamic -b "auto" -l 250 -f 5 -t 8
model_name: "neuralmagic/Mixtral-8x22B-Instruct-v0.1-FP8-dynamic"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.86
- name: "exact_match,flexible-extract"
value: 0.86
limit: 250
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash ./run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8 -b "auto" -l 250 -f 5 -t 4
model_name: "neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.624
- name: "exact_match,flexible-extract"
value: 0.624
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Mixtral-8x7B-Instruct-v0.1.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m neuralmagic/Mixtral-8x7B-Instruct-v0.1 -b 32 -l 250 -f 5 -t 4
model_name: "mistralai/Mixtral-8x7B-Instruct-v0.1"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.616
- name: "exact_match,flexible-extract"
value: 0.632
limit: 250
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Qwen2-1.5B-Instruct-FP8W8.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Qwen2-1.5B-Instruct-FP8W8 -b auto -l 1000 -f 5 -t 1
model_name: "nm-testing/Qwen2-1.5B-Instruct-FP8W8"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.578
- name: "exact_match,flexible-extract"
value: 0.585
limit: 1000
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Qwen2-1.5B-Instruct-quantized.w8a8 -b "auto" -l 1000 -f 5 -t 1
model_name: "neuralmagic/Qwen2-1.5B-Instruct-quantized.w8a8"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.593
- name: "exact_match,flexible-extract"
value: 0.588
limit: 1000
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Qwen2-1.5B-Instruct-W8A16-Channelwise -b "auto" -l 1000 -f 5 -t 1
model_name: "nm-testing/Qwen2-1.5B-Instruct-W8A16-Channelwise"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.595
- name: "exact_match,flexible-extract"
value: 0.582
limit: 1000
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Qwen2-57B-A14-Instruct.yaml
@@ -0,0 +1,11 @@
# bash ./run-lm-eval-gsm-vllm-baseline.sh -m Qwen/Qwen2-57B-A14B-Instruct -b "auto" -l 250 -f 5 -t 4
model_name: "Qwen/Qwen2-57B-A14B-Instruct"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.792
- name: "exact_match,flexible-extract"
value: 0.824
limit: 250
num_fewshot: 5
5 changes: 5 additions & 0 deletions .buildkite/lm-eval-harness/configs/models-large.txt
@@ -0,0 +1,5 @@
Meta-Llama-3-70B-Instruct-FBGEMM-nonuniform.yaml
Meta-Llama-3-70B-Instruct.yaml
Mixtral-8x7B-Instruct-v0.1.yaml
Qwen2-57B-A14-Instruct.yaml
DeepSeek-V2-Lite-Chat.yaml
9 changes: 9 additions & 0 deletions .buildkite/lm-eval-harness/configs/models-small.txt
@@ -0,0 +1,9 @@
Meta-Llama-3-8B-Instruct.yaml
Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-INT8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-Channelwise-compressed-tensors.yaml
Minitron-4B-Base-FP8.yaml
Qwen2-1.5B-Instruct-INT8-compressed-tensors.yaml
Qwen2-1.5B-Instruct-FP8W8.yaml
Meta-Llama-3-8B-QQQ.yaml
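These two lists group the configs by the hardware footprint needed to evaluate them (small single-GPU models vs. large multi-GPU ones). A plausible driver loop over one of the lists, shown below; the test entrypoint name and the LM_EVAL_TEST_DATA_FILE variable are assumptions, since the runner script itself is not part of this hunk:

#!/bin/bash
# Hypothetical CI driver: run the correctness check once per listed config.
# Entrypoint name and environment variable are assumptions.
cd .buildkite/lm-eval-harness || exit 1
while read -r config; do
  LM_EVAL_TEST_DATA_FILE="$PWD/configs/${config}" pytest -s test_lm_eval_correctness.py
done < configs/models-small.txt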
46 changes: 46 additions & 0 deletions .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh
@@ -0,0 +1,46 @@
#!/bin/bash
# We can use this script to compute baseline accuracy on GSM for transformers.
#
# Make sure you have lm-eval-harness installed:
# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@9516087b81a61d0e220b22cc1b75be76de23bc10

usage() {
    echo
    echo "Runs lm eval harness on GSM8k using huggingface transformers."
    echo "This pathway is intended to be used to create baselines for "
    echo "our automated nm-test-accuracy workflow"
    echo
    echo "usage: ${0} <options>"
    echo
    echo "  -m    - huggingface stub or local directory of the model"
    echo "  -b    - batch size to run the evaluation at"
    echo "  -l    - limit number of samples to run"
    echo "  -f    - number of fewshot samples to use"
    echo
}

while getopts "m:b:l:f:" OPT; do
  case ${OPT} in
    m )
        MODEL="$OPTARG"
        ;;
    b )
        BATCH_SIZE="$OPTARG"
        ;;
    l )
        LIMIT="$OPTARG"
        ;;
    f )
        FEWSHOT="$OPTARG"
        ;;
    \? )
        usage
        exit 1
        ;;
  esac
done

lm_eval --model hf \
  --model_args pretrained=$MODEL,parallelize=True \
  --tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \
  --batch_size $BATCH_SIZE
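The "# bash ..." comment at the top of each HF-baselined config above records the exact invocation of this script that produced its baseline values. For example, per Meta-Llama-3-8B-Instruct.yaml:

bash .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh -m meta-llama/Meta-Llama-3-8B-Instruct -b 32 -l 250 -f 5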
51 changes: 51 additions & 0 deletions .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh
@@ -0,0 +1,51 @@
#!/bin/bash
# We can use this script to compute baseline accuracy on GSM for vllm.
# We use this for fp8, which HF does not support.
#
# Make sure you have lm-eval-harness installed:
# pip install lm-eval==0.4.3

usage() {
    echo
    echo "Runs lm eval harness on GSM8k using vllm."
    echo "This pathway is intended to be used to create baselines for "
    echo "our automated nm-test-accuracy workflow"
    echo
    echo "usage: ${0} <options>"
    echo
    echo "  -m    - huggingface stub or local directory of the model"
    echo "  -b    - batch size to run the evaluation at"
    echo "  -l    - limit number of samples to run"
    echo "  -f    - number of fewshot samples to use"
    echo "  -t    - tensor parallel size to run at"
    echo
}

while getopts "m:b:l:f:t:" OPT; do
  case ${OPT} in
    m )
        MODEL="$OPTARG"
        ;;
    b )
        BATCH_SIZE="$OPTARG"
        ;;
    l )
        LIMIT="$OPTARG"
        ;;
    f )
        FEWSHOT="$OPTARG"
        ;;
    t )
        TP_SIZE="$OPTARG"
        ;;
    \? )
        usage
        exit 1
        ;;
  esac
done

lm_eval --model vllm \
  --model_args pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,distributed_executor_backend="ray",trust_remote_code=true,max_model_len=4096 \
  --tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \
  --batch_size $BATCH_SIZE
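Likewise, the vLLM-baselined configs (used for formats such as FP8 that HF does not support, per the header comment) record their invocation of this script. For example, per the Qwen2-1.5B w8a8 config above:

bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Qwen2-1.5B-Instruct-quantized.w8a8 -b "auto" -l 1000 -f 5 -t 1

The extra -t flag sets tensor_parallel_size, which the script forwards to vLLM together with distributed_executor_backend="ray" and max_model_len=4096.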