
[Feature]: Why can't vLLM embedding truncate longer text automatically? And a strange example here. #20744

@beyondguo

Description

🚀 The feature, motivation and pitch

Here's a very strange example:

I load a model whose max position embeddings is 512, and I want to get text embeddings from it.
I use the following code to load the model and truncate longer texts:

from tqdm import tqdm
from vllm import LLM

embedding_model = LLM(model='FinLang/finance-embeddings-investopedia',
                      task='embedding', tensor_parallel_size=2)
tokenizer = embedding_model.get_tokenizer()

def truncate_texts(texts, tokenizer, max_length):
    """Truncate each text to at most max_length sub-word tokens."""
    new_texts = []
    for text in tqdm(texts, total=len(texts), desc='truncating text'):
        tokens = tokenizer.tokenize(text)
        if len(tokens) > max_length:
            # keep the first max_length tokens and convert back to a string
            truncated_text = tokenizer.convert_tokens_to_string(tokens[:max_length])
            new_texts.append(truncated_text)
        else:
            new_texts.append(text)
    return new_texts
In [30]: len(tokenizer.tokenize(new_texts2[2117]))
Out[30]: 500

In [31]: new_texts2[2117]
Out[31]: '[UNK] [UNK] 、 こんにちは 。 [UNK] はmrt [UNK] [UNK] 会 社 代 [UNK] [UNK] [UNK] [UNK] 社 長 の 小 川 智 也 と [UNK] します 。 本 日 はお [UNK] しい 中 お [UNK] まりいたたきまして 、 [UNK] にありかとうこさいます 。 それては 、 2019 年 12 月 [UNK] [UNK] 2 四 [UNK] [UNK] [UNK] [UNK] [UNK] 明 会 を [UNK] [UNK] させていたたきます 。 ては 、 ます1 [UNK] 目 の [UNK] [UNK] [UNK] [UNK] に [UNK] しててす 。 [UNK] ともmrtの [UNK] [UNK] としましては 、 東 京 大 学 [UNK] 学 部 発 のヘンチャーてありまして 、 [UNK] [UNK] [UNK] の [UNK] [UNK] [UNK] を [UNK] [UNK] か [UNK] めておりますか 、 [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] しておりまして 、 [UNK] [UNK] [UNK] 事 [UNK] の 会 [UNK] [UNK] は25 [UNK] 名 を [UNK] っております 。 こちらは 、 ます [UNK] [UNK] は [UNK] 7 [UNK] 名 [UNK] [UNK] [UNK] しておりますけれとも 、 それ [UNK] 外 の [UNK] [UNK] [UNK] 、 [UNK] [UNK] [UNK] [UNK] [UNK] をはしめとして 、 [UNK] [UNK] [UNK] 事 [UNK] 、 [UNK] [UNK] [UNK] を [UNK] めますと25 [UNK] 名 [UNK] [UNK] 成 しております 。 [UNK] のスライトも [UNK] 愛 させていたたきます 。 [UNK] [UNK] 、 [UNK] ともは [UNK] 国 [UNK] 1 [UNK] の [UNK] [UNK] [UNK] [UNK] や [UNK] 生 方 とともに [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] しておりますか 、 主 [UNK] は [UNK] [UNK] [UNK] [UNK] に [UNK] する [UNK] 生 の [UNK] [UNK] [UNK] のこ [UNK] 介 てこさいます 。 このこ [UNK] 介 に [UNK] しましては 、 [UNK] [UNK] 日 500 名 [UNK] 上 の [UNK] [UNK] を [UNK] 国 の [UNK] [UNK] [UNK] [UNK] に [UNK] [UNK] しております 。 こちらに [UNK] してはますますニースか 高 まっておりまして 、 [UNK] [UNK] [UNK] 事 [UNK] [UNK] [UNK] [UNK] [UNK] いたたく [UNK] [UNK] [UNK] [UNK] も [UNK] えております 。 [UNK] に [UNK] [UNK] 子 会 社 の [UNK] 明 に [UNK] らせていたたきたいと [UNK] います'

In [32]: embedding_model.encode(new_texts2[2117])

output:

----> 1 embedding_model.encode(new_texts2[2117])

File ~/app/Anaconda3-2021.05/envs/tt/lib/python3.10/site-packages/vllm/utils.py:1196, in deprecate_kwargs.<locals>.wrapper.<locals>.inner(*args, **kwargs)
   1189             msg += f" {additional_message}"
   1191         warnings.warn(
   1192             DeprecationWarning(msg),
   1193             stacklevel=3,  # The inner function takes up one level
   1194         )
-> 1196 return fn(*args, **kwargs)

File ~/app/Anaconda3-2021.05/envs/tt/lib/python3.10/site-packages/vllm/entrypoints/llm.py:944, in LLM.encode(self, prompts, pooling_params, prompt_token_ids, use_tqdm, lora_request, prompt_adapter_request)
    941     for pooling_param in pooling_params:
    942         pooling_param.verify(self.llm_engine.model_config)
--> 944 self._validate_and_add_requests(
    945     prompts=parsed_prompts,
    946     params=pooling_params,
    947     lora_request=lora_request,
    948     prompt_adapter_request=prompt_adapter_request,
    949 )
    951 outputs = self._run_engine(use_tqdm=use_tqdm)
    952 return self.engine_class.validate_outputs(outputs,
    953                                           PoolingRequestOutput)

File ~/app/Anaconda3-2021.05/envs/tt/lib/python3.10/site-packages/vllm/entrypoints/llm.py:1354, in LLM._validate_and_add_requests(self, prompts, params, lora_request, prompt_adapter_request, guided_options, priority)
   1352 # Add requests to the engine.
   1353 for i, prompt in enumerate(prompts):
-> 1354     self._add_request(
   1355         prompt,
   1356         params[i] if isinstance(params, Sequence) else params,
   1357         lora_request=lora_request[i] if isinstance(
   1358             lora_request, Sequence) else lora_request,
   1359         prompt_adapter_request=prompt_adapter_request,
   1360         priority=priority[i] if priority else 0,
   1361     )

File ~/app/Anaconda3-2021.05/envs/tt/lib/python3.10/site-packages/vllm/entrypoints/llm.py:1372, in LLM._add_request(self, prompt, params, lora_request, prompt_adapter_request, priority)
   1363 def _add_request(
   1364     self,
   1365     prompt: PromptType,
   (...)
   1369     priority: int = 0,
   1370 ) -> None:
   1371     request_id = str(next(self.request_counter))
-> 1372     self.llm_engine.add_request(
   1373         request_id,
   1374         prompt,
   1375         params,
   1376         lora_request=lora_request,
   1377         prompt_adapter_request=prompt_adapter_request,
   1378         priority=priority,
   1379     )

File ~/app/Anaconda3-2021.05/envs/tt/lib/python3.10/site-packages/vllm/utils.py:1196, in deprecate_kwargs.<locals>.wrapper.<locals>.inner(*args, **kwargs)
   1189             msg += f" {additional_message}"
   1191         warnings.warn(
   1192             DeprecationWarning(msg),
   1193             stacklevel=3,  # The inner function takes up one level
   1194         )
-> 1196 return fn(*args, **kwargs)

File ~/app/Anaconda3-2021.05/envs/tt/lib/python3.10/site-packages/vllm/engine/llm_engine.py:765, in LLMEngine.add_request(self, request_id, prompt, params, arrival_time, lora_request, trace_headers, prompt_adapter_request, priority, inputs)
    755     self._validate_token_prompt(
    756         prompt,
    757         tokenizer=self.get_tokenizer(lora_request=lora_request))
    759 processed_inputs = self.input_preprocessor.preprocess(
    760     prompt,
    761     lora_request=lora_request,
    762     prompt_adapter_request=prompt_adapter_request,
    763 )
--> 765 self._add_processed_request(
    766     request_id=request_id,
    767     processed_inputs=processed_inputs,
    768     params=params,
    769     arrival_time=arrival_time,
    770     lora_request=lora_request,
    771     prompt_adapter_request=prompt_adapter_request,
    772     trace_headers=trace_headers,
    773     priority=priority,
    774 )

File ~/app/Anaconda3-2021.05/envs/tt/lib/python3.10/site-packages/vllm/engine/llm_engine.py:586, in LLMEngine._add_processed_request(self, request_id, processed_inputs, params, arrival_time, lora_request, prompt_adapter_request, trace_headers, priority)
    573     ParallelSampleSequenceGroup.add_request(
    574         request_id,
    575         self,
   (...)
    582         priority=priority,
    583     )
    584     return None
--> 586 self._validate_model_inputs(processed_inputs, lora_request)
    587 # Create the sequences.
    588 block_size = self.cache_config.block_size

File ~/app/Anaconda3-2021.05/envs/tt/lib/python3.10/site-packages/vllm/engine/llm_engine.py:2017, in LLMEngine._validate_model_inputs(self, inputs, lora_request)
   2012 if encoder_inputs is not None:
   2013     self._validate_model_input(encoder_inputs,
   2014                                lora_request,
   2015                                prompt_type="encoder")
-> 2017 self._validate_model_input(decoder_inputs,
   2018                            lora_request,
   2019                            prompt_type="decoder")

File ~/app/Anaconda3-2021.05/envs/tt/lib/python3.10/site-packages/vllm/engine/llm_engine.py:2063, in LLMEngine._validate_model_input(self, prompt_inputs, lora_request, prompt_type)
   2058 else:
   2059     suggestion = (
   2060         "Make sure that `max_model_len` is no smaller than the "
   2061         "number of text tokens.")
-> 2063 raise ValueError(
   2064     f"The {prompt_type} prompt (length {len(prompt_ids)}) is "
   2065     f"longer than the maximum model length of {max_prompt_len}. "
   2066     f"{suggestion}")

ValueError: The decoder prompt (length 952) is longer than the maximum model length of 512. Make sure that `max_model_len` is no smaller than the number of text tokens.

Why?! As you can see, my text is already truncated to 500 tokens, so why does the encode function report that my prompt is 952 tokens long?

This is driving me crazy...
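
To try to pin down where the 952 comes from, this is the kind of check I would run next (just a sketch; whether vLLM adds special tokens or uses a different tokenizer mode internally is a guess on my part):

text = new_texts2[2117]

# tokenize() counts sub-word pieces without special tokens
print(len(tokenizer.tokenize(text)))                               # 500 for me

# __call__ adds special tokens ([CLS]/[SEP]) by default, which should only add ~2
print(len(tokenizer(text).input_ids))

# with special tokens disabled, this should match tokenize() if nothing else differs
print(len(tokenizer(text, add_special_tokens=False).input_ids))

If the last two numbers come out near 952 rather than 500, the mismatch happens when the truncated string is re-tokenized, not inside vLLM's length check itself.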

Alternatives

Why can't vLLM embedding just truncate longer text automatically, like sentence-transformers does?
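
For comparison, this is roughly what I mean (a minimal sketch, assuming the same checkpoint loads with sentence-transformers; the exact max_seq_length value is an assumption):

from sentence_transformers import SentenceTransformer

# sentence-transformers silently truncates every input to model.max_seq_length,
# so over-long texts never raise a length error.
st_model = SentenceTransformer('FinLang/finance-embeddings-investopedia')
print(st_model.max_seq_length)          # presumably 512 for this checkpoint
embeddings = st_model.encode(['some very long text ' * 1000])
print(embeddings.shape)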

Additional context

env: vllm 0.8.5.post1
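
For now, the workaround I am considering is to truncate at the token-ID level and pass the IDs straight to vLLM, so there is no detokenize/re-tokenize round trip. This is only a sketch; I have not verified that TokensPrompt is the intended way to feed pre-tokenized input to the embedding task:

from vllm import LLM
from vllm.inputs import TokensPrompt

embedding_model = LLM(model='FinLang/finance-embeddings-investopedia',
                      task='embedding', tensor_parallel_size=2)
tokenizer = embedding_model.get_tokenizer()

def encode_truncated(texts, max_length=512):
    prompts = []
    for text in texts:
        # truncate on token IDs so the model never sees more than max_length tokens
        ids = tokenizer(text, truncation=True, max_length=max_length).input_ids
        prompts.append(TokensPrompt(prompt_token_ids=ids))
    return embedding_model.encode(prompts)

outputs = encode_truncated(new_texts2)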

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
