Decode error while inferencing a batch of prompts #340

@SiriusNEO

Description

I'm trying to benchmark the performance of OPT on vLLM. I find that when I pass a relatively large batch of prompts to vLLM, it raises a decode error once the generated sequence length reaches a certain threshold (which makes the problem look like an OOM).

A minimal reproduction of this issue:

from vllm import LLM, SamplingParams

def make_input(bs):
    # Build a batch of `bs` identical short prompts.
    return ["Hello!" for _ in range(bs)]

bs = 128
generate_length = 200

# Create a sampling params object.
sampling_params = SamplingParams(
    temperature=0.8, 
    top_p=0.95, 
    max_tokens=generate_length)

# Create an LLM with randomly initialized (dummy) weights.
llm = LLM(
    model="facebook/opt-125m",
    use_dummy_weights=True,
)
input = make_input(bs)
out = llm.generate(input, sampling_params)

When bs=128, the error happens at approximately the 108th generated token. The error looks like:

Traceback (most recent call last):
  File "vllm-none-problem-repro.py", line 21, in <module>
    out = llm.generate(input, sampling_params)
  File "/llm-bench/vllm-src/vllm/entrypoints/llm.py", line 127, in generate
    return self._run_engine(use_tqdm)
  File "/llm-bench/vllm-src/vllm/entrypoints/llm.py", line 147, in _run_engine
    step_outputs = self.llm_engine.step()
  File "/llm-bench/vllm-src/vllm/engine/llm_engine.py", line 246, in step
    self._decode_sequences(seq_groups)
  File "/llm-bench/vllm-src/vllm/engine/llm_engine.py", line 263, in _decode_sequences
    new_token, new_output_text = detokenize_incrementally(
  File "/llm-bench/vllm-src/vllm/transformers_utils/tokenizer.py", line 73, in detokenize_incrementally
    output_text = tokenizer.convert_tokens_to_string(output_tokens)
  File "/opt/conda/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 533, in convert_tokens_to_string
    return self.backend_tokenizer.decoder.decode(tokens)
TypeError: argument 'tokens': 'NoneType' object cannot be converted to 'PyString'
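
For reference, the same TypeError can be reproduced directly against the Hugging Face tokenizer: convert_ids_to_tokens returns None for a token id outside the vocabulary, and the fast tokenizer's Rust decoder then rejects the None. A minimal sketch of that mechanism (my guess at where the None tokens come from, not a confirmed root cause):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/opt-125m")

# An id past the end of the vocabulary maps to None instead of a token string.
bad_id = len(tok) + 10
tokens = tok.convert_ids_to_tokens([bad_id])
print(tokens)  # [None]

# Feeding that None into the Rust decoder raises the same error as above:
# TypeError: argument 'tokens': 'NoneType' object cannot be converted to 'PyString'
tok.convert_tokens_to_string(tokens)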

If I use a smaller bs, the threshold also increases (>108); for example, it's around 210 when bs=64. It seems there is a limit on bs * length: 128 * 108 ≈ 13.8k and 64 * 210 ≈ 13.4k, so the failure point looks consistent with a fixed total-token budget.
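
A blunt workaround while the root cause is tracked down would be to drop the None entries before handing the tokens to the decoder. This is purely illustrative and is not what vLLM's detokenize_incrementally actually does:

# Hypothetical guard, not vLLM code: skip tokens the tokenizer
# could not map back from an id, instead of crashing in the decoder.
def safe_tokens_to_string(tokenizer, tokens):
    return tokenizer.convert_tokens_to_string(
        [t for t in tokens if t is not None])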

Labels

bug (Something isn't working)
