🚀 The feature, motivation and pitch
Here's a very strange example. I load a model whose max position embeddings is 512, and I want to get text embeddings from it. I use the following code to load the model and truncate longer texts:
from vllm import LLM
from tqdm import tqdm

embedding_model = LLM(model='FinLang/finance-embeddings-investopedia', task='embedding', tensor_parallel_size=2)
tokenizer = embedding_model.get_tokenizer()

def truncate_texts(texts, tokenizer, max_length):
    new_texts = []
    for text in tqdm(texts, total=len(texts), desc='truncating text'):
        tokens = tokenizer.tokenize(text)
        if len(tokens) > max_length:
            tokens = tokens[:max_length]
            truncate_text = tokenizer.convert_tokens_to_string(tokens)
            new_texts.append(truncate_text)
        else:
            new_texts.append(text)
    return new_texts

The truncated text is indeed only 500 tokens long according to tokenize():

In [30]: len(tokenizer.tokenize(new_texts2[2117]))
Out[30]: 500
In [31]: new_texts2[2117]
Out[31]: '[UNK] [UNK] 、 こんにちは 。 [UNK] はmrt [UNK] [UNK] 会 社 代 [UNK] [UNK] [UNK] [UNK] 社 長 の 小 川 智 也 と [UNK] します 。 本 日 はお [UNK] しい 中 お [UNK] まりいたたきまして 、 [UNK] にありかとうこさいます 。 それては 、 2019 年 12 月 [UNK] [UNK] 2 四 [UNK] [UNK] [UNK] [UNK] [UNK] 明 会 を [UNK] [UNK] させていたたきます 。 ては 、 ます1 [UNK] 目 の [UNK] [UNK] [UNK] [UNK] に [UNK] しててす 。 [UNK] ともmrtの [UNK] [UNK] としましては 、 東 京 大 学 [UNK] 学 部 発 のヘンチャーてありまして 、 [UNK] [UNK] [UNK] の [UNK] [UNK] [UNK] を [UNK] [UNK] か [UNK] めておりますか 、 [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] しておりまして 、 [UNK] [UNK] [UNK] 事 [UNK] の 会 [UNK] [UNK] は25 [UNK] 名 を [UNK] っております 。 こちらは 、 ます [UNK] [UNK] は [UNK] 7 [UNK] 名 [UNK] [UNK] [UNK] しておりますけれとも 、 それ [UNK] 外 の [UNK] [UNK] [UNK] 、 [UNK] [UNK] [UNK] [UNK] [UNK] をはしめとして 、 [UNK] [UNK] [UNK] 事 [UNK] 、 [UNK] [UNK] [UNK] を [UNK] めますと25 [UNK] 名 [UNK] [UNK] 成 しております 。 [UNK] のスライトも [UNK] 愛 させていたたきます 。 [UNK] [UNK] 、 [UNK] ともは [UNK] 国 [UNK] 1 [UNK] の [UNK] [UNK] [UNK] [UNK] や [UNK] 生 方 とともに [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] しておりますか 、 主 [UNK] は [UNK] [UNK] [UNK] [UNK] に [UNK] する [UNK] 生 の [UNK] [UNK] [UNK] のこ [UNK] 介 てこさいます 。 このこ [UNK] 介 に [UNK] しましては 、 [UNK] [UNK] 日 500 名 [UNK] 上 の [UNK] [UNK] を [UNK] 国 の [UNK] [UNK] [UNK] [UNK] に [UNK] [UNK] しております 。 こちらに [UNK] してはますますニースか 高 まっておりまして 、 [UNK] [UNK] [UNK] 事 [UNK] [UNK] [UNK] [UNK] [UNK] いたたく [UNK] [UNK] [UNK] [UNK] も [UNK] えております 。 [UNK] に [UNK] [UNK] 子 会 社 の [UNK] 明 に [UNK] らせていたたきたいと [UNK] います'
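(Side note, a sanity check I could run at this point, sketched below and not part of the original session: tokenizer.tokenize() and a full tokenizer call do not necessarily count tokens the same way, e.g. because of special tokens or because literal "[UNK]" substrings in the detokenized text may be re-split differently, so comparing the two counts on the same string might show where the gap comes from.)

# Hypothetical sanity check: compare tokenize() with a full tokenizer call on the same string.
text = new_texts2[2117]
print(len(tokenizer.tokenize(text)))        # 500 in my case
print(len(tokenizer(text)["input_ids"]))    # may differ, e.g. due to special tokens
                                            # or how "[UNK]" substrings are re-split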
In [32]: embedding_model.encode(new_texts2[2117])
output:
----> 1 embedding_model.encode(new_texts2[2117])
File ~/app/Anaconda3-2021.05/envs/tt/lib/python3.10/site-packages/vllm/utils.py:1196, in deprecate_kwargs.<locals>.wrapper.<locals>.inner(*args, **kwargs)
1189 msg += f" {additional_message}"
1191 warnings.warn(
1192 DeprecationWarning(msg),
1193 stacklevel=3, # The inner function takes up one level
1194 )
-> 1196 return fn(*args, **kwargs)
File ~/app/Anaconda3-2021.05/envs/tt/lib/python3.10/site-packages/vllm/entrypoints/llm.py:944, in LLM.encode(self, prompts, pooling_params, prompt_token_ids, use_tqdm, lora_request, prompt_adapter_request)
941 for pooling_param in pooling_params:
942 pooling_param.verify(self.llm_engine.model_config)
--> 944 self._validate_and_add_requests(
945 prompts=parsed_prompts,
946 params=pooling_params,
947 lora_request=lora_request,
948 prompt_adapter_request=prompt_adapter_request,
949 )
951 outputs = self._run_engine(use_tqdm=use_tqdm)
952 return self.engine_class.validate_outputs(outputs,
953 PoolingRequestOutput)
File ~/app/Anaconda3-2021.05/envs/tt/lib/python3.10/site-packages/vllm/entrypoints/llm.py:1354, in LLM._validate_and_add_requests(self, prompts, params, lora_request, prompt_adapter_request, guided_options, priority)
1352 # Add requests to the engine.
1353 for i, prompt in enumerate(prompts):
-> 1354 self._add_request(
1355 prompt,
1356 params[i] if isinstance(params, Sequence) else params,
1357 lora_request=lora_request[i] if isinstance(
1358 lora_request, Sequence) else lora_request,
1359 prompt_adapter_request=prompt_adapter_request,
1360 priority=priority[i] if priority else 0,
1361 )
File ~/app/Anaconda3-2021.05/envs/tt/lib/python3.10/site-packages/vllm/entrypoints/llm.py:1372, in LLM._add_request(self, prompt, params, lora_request, prompt_adapter_request, priority)
1363 def _add_request(
1364 self,
1365 prompt: PromptType,
(...)
1369 priority: int = 0,
1370 ) -> None:
1371 request_id = str(next(self.request_counter))
-> 1372 self.llm_engine.add_request(
1373 request_id,
1374 prompt,
1375 params,
1376 lora_request=lora_request,
1377 prompt_adapter_request=prompt_adapter_request,
1378 priority=priority,
1379 )
File ~/app/Anaconda3-2021.05/envs/tt/lib/python3.10/site-packages/vllm/utils.py:1196, in deprecate_kwargs.<locals>.wrapper.<locals>.inner(*args, **kwargs)
1189 msg += f" {additional_message}"
1191 warnings.warn(
1192 DeprecationWarning(msg),
1193 stacklevel=3, # The inner function takes up one level
1194 )
-> 1196 return fn(*args, **kwargs)
File ~/app/Anaconda3-2021.05/envs/tt/lib/python3.10/site-packages/vllm/engine/llm_engine.py:765, in LLMEngine.add_request(self, request_id, prompt, params, arrival_time, lora_request, trace_headers, prompt_adapter_request, priority, inputs)
755 self._validate_token_prompt(
756 prompt,
757 tokenizer=self.get_tokenizer(lora_request=lora_request))
759 processed_inputs = self.input_preprocessor.preprocess(
760 prompt,
761 lora_request=lora_request,
762 prompt_adapter_request=prompt_adapter_request,
763 )
--> 765 self._add_processed_request(
766 request_id=request_id,
767 processed_inputs=processed_inputs,
768 params=params,
769 arrival_time=arrival_time,
770 lora_request=lora_request,
771 prompt_adapter_request=prompt_adapter_request,
772 trace_headers=trace_headers,
773 priority=priority,
774 )
File ~/app/Anaconda3-2021.05/envs/tt/lib/python3.10/site-packages/vllm/engine/llm_engine.py:586, in LLMEngine._add_processed_request(self, request_id, processed_inputs, params, arrival_time, lora_request, prompt_adapter_request, trace_headers, priority)
573 ParallelSampleSequenceGroup.add_request(
574 request_id,
575 self,
(...)
582 priority=priority,
583 )
584 return None
--> 586 self._validate_model_inputs(processed_inputs, lora_request)
587 # Create the sequences.
588 block_size = self.cache_config.block_size
File ~/app/Anaconda3-2021.05/envs/tt/lib/python3.10/site-packages/vllm/engine/llm_engine.py:2017, in LLMEngine._validate_model_inputs(self, inputs, lora_request)
2012 if encoder_inputs is not None:
2013 self._validate_model_input(encoder_inputs,
2014 lora_request,
2015 prompt_type="encoder")
-> 2017 self._validate_model_input(decoder_inputs,
2018 lora_request,
2019 prompt_type="decoder")
File ~/app/Anaconda3-2021.05/envs/tt/lib/python3.10/site-packages/vllm/engine/llm_engine.py:2063, in LLMEngine._validate_model_input(self, prompt_inputs, lora_request, prompt_type)
2058 else:
2059 suggestion = (
2060 "Make sure that `max_model_len` is no smaller than the "
2061 "number of text tokens.")
-> 2063 raise ValueError(
2064 f"The {prompt_type} prompt (length {len(prompt_ids)}) is "
2065 f"longer than the maximum model length of {max_prompt_len}. "
2066 f"{suggestion}")
ValueError: The decoder prompt (length 952) is longer than the maximum model length of 512. Make sure that `max_model_len` is no smaller than the number of text tokens.
Why?! As you can see, my text is already truncated to 500 tokens, so why does encode() report that my prompt is 952 tokens long?
This is driving me crazy...
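One workaround I'm considering (a sketch only, assuming LLM.encode accepts token-ID prompts via vllm.inputs.TokensPrompt, so the truncated IDs are used as-is and the lossy detokenized string is never re-tokenized):

from vllm.inputs import TokensPrompt

# Sketch: truncate at the token-ID level and hand the IDs straight to encode(),
# so vLLM's preprocessor never re-tokenizes the detokenized text.
max_len = 512
ids = tokenizer(new_texts2[2117], truncation=True, max_length=max_len)["input_ids"]
output = embedding_model.encode(TokensPrompt(prompt_token_ids=ids))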
Alternatives
Why can't vLLM's embedding path just truncate longer texts automatically, like sentence-transformers does?
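For comparison, this is roughly how it works with sentence-transformers (a sketch from memory, not verified against this exact model): encode() there truncates every input to model.max_seq_length internally, so oversized texts never raise.

from sentence_transformers import SentenceTransformer

# Comparison sketch: sentence-transformers truncates inside encode(), no manual work needed.
st_model = SentenceTransformer('FinLang/finance-embeddings-investopedia')
st_model.max_seq_length = 512
embedding = st_model.encode(new_texts2[2117])  # silently truncated to 512 tokens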
Additional context
env: vllm 0.8.5.post1
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.