
[Feature]: Why can't vLLM embedding truncate longer text automatically? And a strange example here. #20744

@beyondguo

Description

🚀 The feature, motivation and pitch

Here's a very strange example:

I load a model whose max position embeddings is 512, and I want to get text embeddings from it.
I use the following code to load the model and truncate longer texts:

from tqdm import tqdm
from vllm import LLM

embedding_model = LLM(model='FinLang/finance-embeddings-investopedia',
                      task='embedding', tensor_parallel_size=2)
tokenizer = embedding_model.get_tokenizer()

def truncate_texts(texts, tokenizer, max_length):
    """Truncate each text to at most max_length sub-word tokens."""
    new_texts = []
    for text in tqdm(texts, total=len(texts), desc='truncating text'):
        tokens = tokenizer.tokenize(text)
        if len(tokens) > max_length:
            # keep the first max_length tokens and convert back to a string
            truncated_text = tokenizer.convert_tokens_to_string(tokens[:max_length])
            new_texts.append(truncated_text)
        else:
            new_texts.append(text)
    return new_texts
In [30]: len(tokenizer.tokenize(new_texts2[2117]))
Out[30]: 500

In [31]: new_texts2[2117]
Out[31]: '[UNK] [UNK] 、 こんにちは 。 [UNK] はmrt [UNK] [UNK] 会 社 代 [UNK] [UNK] [UNK] [UNK] 社 長 の 小 川 智 也 と [UNK] します 。 本 日 はお [UNK] しい 中 お [UNK] まりいたたきまして 、 [UNK] にありかとうこさいます 。 それては 、 2019 年 12 月 [UNK] [UNK] 2 四 [UNK] [UNK] [UNK] [UNK] [UNK] 明 会 を [UNK] [UNK] させていたたきます 。 ては 、 ます1 [UNK] 目 の [UNK] [UNK] [UNK] [UNK] に [UNK] しててす 。 [UNK] ともmrtの [UNK] [UNK] としましては 、 東 京 大 学 [UNK] 学 部 発 のヘンチャーてありまして 、 [UNK] [UNK] [UNK] の [UNK] [UNK] [UNK] を [UNK] [UNK] か [UNK] めておりますか 、 [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] しておりまして 、 [UNK] [UNK] [UNK] 事 [UNK] の 会 [UNK] [UNK] は25 [UNK] 名 を [UNK] っております 。 こちらは 、 ます [UNK] [UNK] は [UNK] 7 [UNK] 名 [UNK] [UNK] [UNK] しておりますけれとも 、 それ [UNK] 外 の [UNK] [UNK] [UNK] 、 [UNK] [UNK] [UNK] [UNK] [UNK] をはしめとして 、 [UNK] [UNK] [UNK] 事 [UNK] 、 [UNK] [UNK] [UNK] を [UNK] めますと25 [UNK] 名 [UNK] [UNK] 成 しております 。 [UNK] のスライトも [UNK] 愛 させていたたきます 。 [UNK] [UNK] 、 [UNK] ともは [UNK] 国 [UNK] 1 [UNK] の [UNK] [UNK] [UNK] [UNK] や [UNK] 生 方 とともに [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] しておりますか 、 主 [UNK] は [UNK] [UNK] [UNK] [UNK] に [UNK] する [UNK] 生 の [UNK] [UNK] [UNK] のこ [UNK] 介 てこさいます 。 このこ [UNK] 介 に [UNK] しましては 、 [UNK] [UNK] 日 500 名 [UNK] 上 の [UNK] [UNK] を [UNK] 国 の [UNK] [UNK] [UNK] [UNK] に [UNK] [UNK] しております 。 こちらに [UNK] してはますますニースか 高 まっておりまして 、 [UNK] [UNK] [UNK] 事 [UNK] [UNK] [UNK] [UNK] [UNK] いたたく [UNK] [UNK] [UNK] [UNK] も [UNK] えております 。 [UNK] に [UNK] [UNK] 子 会 社 の [UNK] 明 に [UNK] らせていたたきたいと [UNK] います'

In [32]: embedding_model.encode(new_texts2[2117])

output:

----> 1 embedding_model.encode(new_texts2[2117])

File ~/app/Anaconda3-2021.05/envs/tt/lib/python3.10/site-packages/vllm/utils.py:1196, in deprecate_kwargs.<locals>.wrapper.<locals>.inner(*args, **kwargs)
   1189             msg += f" {additional_message}"
   1191         warnings.warn(
   1192             DeprecationWarning(msg),
   1193             stacklevel=3,  # The inner function takes up one level
   1194         )
-> 1196 return fn(*args, **kwargs)

File ~/app/Anaconda3-2021.05/envs/tt/lib/python3.10/site-packages/vllm/entrypoints/llm.py:944, in LLM.encode(self, prompts, pooling_params, prompt_token_ids, use_tqdm, lora_request, prompt_adapter_request)
    941     for pooling_param in pooling_params:
    942         pooling_param.verify(self.llm_engine.model_config)
--> 944 self._validate_and_add_requests(
    945     prompts=parsed_prompts,
    946     params=pooling_params,
    947     lora_request=lora_request,
    948     prompt_adapter_request=prompt_adapter_request,
    949 )
    951 outputs = self._run_engine(use_tqdm=use_tqdm)
    952 return self.engine_class.validate_outputs(outputs,
    953                                           PoolingRequestOutput)

File ~/app/Anaconda3-2021.05/envs/tt/lib/python3.10/site-packages/vllm/entrypoints/llm.py:1354, in LLM._validate_and_add_requests(self, prompts, params, lora_request, prompt_adapter_request, guided_options, priority)
   1352 # Add requests to the engine.
   1353 for i, prompt in enumerate(prompts):
-> 1354     self._add_request(
   1355         prompt,
   1356         params[i] if isinstance(params, Sequence) else params,
   1357         lora_request=lora_request[i] if isinstance(
   1358             lora_request, Sequence) else lora_request,
   1359         prompt_adapter_request=prompt_adapter_request,
   1360         priority=priority[i] if priority else 0,
   1361     )

File ~/app/Anaconda3-2021.05/envs/tt/lib/python3.10/site-packages/vllm/entrypoints/llm.py:1372, in LLM._add_request(self, prompt, params, lora_request, prompt_adapter_request, priority)
   1363 def _add_request(
   1364     self,
   1365     prompt: PromptType,
   (...)
   1369     priority: int = 0,
   1370 ) -> None:
   1371     request_id = str(next(self.request_counter))
-> 1372     self.llm_engine.add_request(
   1373         request_id,
   1374         prompt,
   1375         params,
   1376         lora_request=lora_request,
   1377         prompt_adapter_request=prompt_adapter_request,
   1378         priority=priority,
   1379     )

File ~/app/Anaconda3-2021.05/envs/tt/lib/python3.10/site-packages/vllm/utils.py:1196, in deprecate_kwargs.<locals>.wrapper.<locals>.inner(*args, **kwargs)
   1189             msg += f" {additional_message}"
   1191         warnings.warn(
   1192             DeprecationWarning(msg),
   1193             stacklevel=3,  # The inner function takes up one level
   1194         )
-> 1196 return fn(*args, **kwargs)

File ~/app/Anaconda3-2021.05/envs/tt/lib/python3.10/site-packages/vllm/engine/llm_engine.py:765, in LLMEngine.add_request(self, request_id, prompt, params, arrival_time, lora_request, trace_headers, prompt_adapter_request, priority, inputs)
    755     self._validate_token_prompt(
    756         prompt,
    757         tokenizer=self.get_tokenizer(lora_request=lora_request))
    759 processed_inputs = self.input_preprocessor.preprocess(
    760     prompt,
    761     lora_request=lora_request,
    762     prompt_adapter_request=prompt_adapter_request,
    763 )
--> 765 self._add_processed_request(
    766     request_id=request_id,
    767     processed_inputs=processed_inputs,
    768     params=params,
    769     arrival_time=arrival_time,
    770     lora_request=lora_request,
    771     prompt_adapter_request=prompt_adapter_request,
    772     trace_headers=trace_headers,
    773     priority=priority,
    774 )

File ~/app/Anaconda3-2021.05/envs/tt/lib/python3.10/site-packages/vllm/engine/llm_engine.py:586, in LLMEngine._add_processed_request(self, request_id, processed_inputs, params, arrival_time, lora_request, prompt_adapter_request, trace_headers, priority)
    573     ParallelSampleSequenceGroup.add_request(
    574         request_id,
    575         self,
   (...)
    582         priority=priority,
    583     )
    584     return None
--> 586 self._validate_model_inputs(processed_inputs, lora_request)
    587 # Create the sequences.
    588 block_size = self.cache_config.block_size

File ~/app/Anaconda3-2021.05/envs/tt/lib/python3.10/site-packages/vllm/engine/llm_engine.py:2017, in LLMEngine._validate_model_inputs(self, inputs, lora_request)
   2012 if encoder_inputs is not None:
   2013     self._validate_model_input(encoder_inputs,
   2014                                lora_request,
   2015                                prompt_type="encoder")
-> 2017 self._validate_model_input(decoder_inputs,
   2018                            lora_request,
   2019                            prompt_type="decoder")

File ~/app/Anaconda3-2021.05/envs/tt/lib/python3.10/site-packages/vllm/engine/llm_engine.py:2063, in LLMEngine._validate_model_input(self, prompt_inputs, lora_request, prompt_type)
   2058 else:
   2059     suggestion = (
   2060         "Make sure that `max_model_len` is no smaller than the "
   2061         "number of text tokens.")
-> 2063 raise ValueError(
   2064     f"The {prompt_type} prompt (length {len(prompt_ids)}) is "
   2065     f"longer than the maximum model length of {max_prompt_len}. "
   2066     f"{suggestion}")

ValueError: The decoder prompt (length 952) is longer than the maximum model length of 512. Make sure that `max_model_len` is no smaller than the number of text tokens.

Why?! As you can see, my text is already truncated to 500 tokens, so why does the encode function report that my prompt is 952 tokens long?

This is driving me crazy...
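
To try to pin down where the 952 comes from, this is the kind of check I would run next (just a sketch; whether vLLM adds special tokens or uses a different tokenizer mode internally is a guess on my part):

text = new_texts2[2117]

# tokenize() counts sub-word pieces without special tokens
print(len(tokenizer.tokenize(text)))                               # 500 for me

# __call__ adds special tokens ([CLS]/[SEP]) by default, which should only add ~2
print(len(tokenizer(text).input_ids))

# with special tokens disabled, this should match tokenize() if nothing else differs
print(len(tokenizer(text, add_special_tokens=False).input_ids))

If the last two numbers come out near 952 rather than 500, the mismatch happens when the truncated string is re-tokenized, not inside vLLM's length check itself.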

Alternatives

Why can't vLLM embedding just truncate longer text automatically, like sentence-transformers does?
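
For comparison, this is roughly what I mean (a minimal sketch, assuming the same checkpoint loads with sentence-transformers; the exact max_seq_length value is an assumption):

from sentence_transformers import SentenceTransformer

# sentence-transformers silently truncates every input to model.max_seq_length,
# so over-long texts never raise a length error.
st_model = SentenceTransformer('FinLang/finance-embeddings-investopedia')
print(st_model.max_seq_length)          # presumably 512 for this checkpoint
embeddings = st_model.encode(['some very long text ' * 1000])
print(embeddings.shape)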

Additional context

env: vllm 0.8.5.post1
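
For now, the workaround I am considering is to truncate at the token-ID level and pass the IDs straight to vLLM, so there is no detokenize/re-tokenize round trip. This is only a sketch; I have not verified that TokensPrompt is the intended way to feed pre-tokenized input to the embedding task:

from vllm import LLM
from vllm.inputs import TokensPrompt

embedding_model = LLM(model='FinLang/finance-embeddings-investopedia',
                      task='embedding', tensor_parallel_size=2)
tokenizer = embedding_model.get_tokenizer()

def encode_truncated(texts, max_length=512):
    prompts = []
    for text in texts:
        # truncate on token IDs so the model never sees more than max_length tokens
        ids = tokenizer(text, truncation=True, max_length=max_length).input_ids
        prompts.append(TokensPrompt(prompt_token_ids=ids))
    return embedding_model.encode(prompts)

outputs = encode_truncated(new_texts2)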

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
