Decode error while inferencing a batch of prompts #340

@SiriusNEO

Description

I'm trying to benchmark the performance of OPT on vLLM. I find that when I pass a relatively large batch of prompts to vLLM, it raises a decode error once the generated sequence length reaches a certain threshold (which makes the problem look like an OOM).

A minimal reproduction of this issue:

from vllm import LLM, SamplingParams

def make_input(bs):
    # Build a batch of `bs` identical short prompts.
    return ["Hello!" for _ in range(bs)]

bs = 128
generate_length = 200

# Create a sampling params object.
sampling_params = SamplingParams(
    temperature=0.8, 
    top_p=0.95, 
    max_tokens=generate_length)

# Create an LLM with randomly initialized (dummy) weights.
llm = LLM(
    model="facebook/opt-125m",
    use_dummy_weights=True,
)
input = make_input(bs)
out = llm.generate(input, sampling_params)

When bs=128, the error happens at approximately the 108th generated token. The error looks like:

Traceback (most recent call last):
  File "vllm-none-problem-repro.py", line 21, in <module>
    out = llm.generate(input, sampling_params)
  File "/llm-bench/vllm-src/vllm/entrypoints/llm.py", line 127, in generate
    return self._run_engine(use_tqdm)
  File "/llm-bench/vllm-src/vllm/entrypoints/llm.py", line 147, in _run_engine
    step_outputs = self.llm_engine.step()
  File "/llm-bench/vllm-src/vllm/engine/llm_engine.py", line 246, in step
    self._decode_sequences(seq_groups)
  File "/llm-bench/vllm-src/vllm/engine/llm_engine.py", line 263, in _decode_sequences
    new_token, new_output_text = detokenize_incrementally(
  File "/llm-bench/vllm-src/vllm/transformers_utils/tokenizer.py", line 73, in detokenize_incrementally
    output_text = tokenizer.convert_tokens_to_string(output_tokens)
  File "/opt/conda/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 533, in convert_tokens_to_string
    return self.backend_tokenizer.decoder.decode(tokens)
TypeError: argument 'tokens': 'NoneType' object cannot be converted to 'PyString'
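
For reference, the same TypeError can be reproduced directly against the Hugging Face tokenizer: convert_ids_to_tokens returns None for a token id outside the vocabulary, and the fast tokenizer's Rust decoder then rejects the None. A minimal sketch of that mechanism (my guess at where the None tokens come from, not a confirmed root cause):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/opt-125m")

# An id past the end of the vocabulary maps to None instead of a token string.
bad_id = len(tok) + 10
tokens = tok.convert_ids_to_tokens([bad_id])
print(tokens)  # [None]

# Feeding that None into the Rust decoder raises the same error as above:
# TypeError: argument 'tokens': 'NoneType' object cannot be converted to 'PyString'
tok.convert_tokens_to_string(tokens)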

If I use a smaller bs, the threshold also increases (>108); for example, it's around 210 when bs=64. It seems there is a limit on bs * length: 128 * 108 ≈ 13.8k and 64 * 210 ≈ 13.4k, so the failure point looks consistent with a fixed total-token budget.
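
A blunt workaround while the root cause is tracked down would be to drop the None entries before handing the tokens to the decoder. This is purely illustrative and is not what vLLM's detokenize_incrementally actually does:

# Hypothetical guard, not vLLM code: skip tokens the tokenizer
# could not map back from an id, instead of crashing in the decoder.
def safe_tokens_to_string(tokenizer, tokens):
    return tokenizer.convert_tokens_to_string(
        [t for t in tokens if t is not None])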

Labels

bug (Something isn't working)
