
[Bug]: tensor parallel (of 4 cards) gives bad answers in version 0.5.1 and later (compared to 0.4.1) with gptq marlin kernels (compared to gptq) #6258

@orellavie1212

Your current environment

SageMaker ml.g5.12xlarge instance (4x NVIDIA A10G, 24 GB each).
The container is
763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121
from
https://github.com/aws/deep-learning-containers/blob/master/available_images.md

🐛 Describe the bug

from vllm import LLM, SamplingParams
question = "what is the id of the team and what is the subtitute lineup of the home team for the match?"
history = str(["how many games the home team Sevilla won?"])
full_example = f"""
\n \n\nYou are a transformation helper specialist, based on the history helping in transforming user input\nto a more structured and simpler text to a smaller model, which is less smart as you. \n\nMost of the times, the history could help you about entities which are now missing \nfrom the question\nTo illustrate the mission, if the user asked in the history about an entity (like 'Barcelona'), and \nnow he asked about 'team' (could be team, player, or other entity) or it seems to you that the an entity \nis missing in the context, perhaps the entity ('Barcelona') from the history could be the option to fill the gap. \n\nIf there is no entity in the history, please do not hallucinate and offer weird entity, for example if in the history\nyou saw 'home team' and now he just mentioned 'team', replace 'team' with 'home team' (applicable for away team too).\n\nWhen a replacement is occurred, please do not add 'the' as part of the entity, just entity itself.\n\n \nReturn it in valid JSON format according to the schema\n \n\n User question: \n\n {question} \n\n History: \n\n {history}"""

prompts = [full_example]
sampling_params = SamplingParams(temperature=0, top_p=0.95, max_tokens=256)

llm = LLM(model="TechxGenus/Meta-Llama-3-70B-Instruct-GPTQ", tokenizer_mode="auto", gpu_memory_utilization=0.7, guided_decoding_backend="lm-format-enforcer", tensor_parallel_size=4)

The quantization argument is intentionally not passed, so vLLM picks the Marlin (gptq_marlin) kernels instead of the standard GPTQ kernels when they are available.

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

The output for version 0.4.1 is correct:
"what is the id of Sevilla and what is the substitute lineup of Sevilla for the match?"

For all later versions (0.4.2 and above) it returns garbage that does not make sense, e.g. "1900 1900 1900 1900" and strange tokens in the answer.

Notes:

1. As a side check I ran exactly the same setup with TechxGenus/Meta-Llama-3-8B-Instruct-AWQ, and it worked (see the sketch after these notes). My suspicion is that the problem appears when a single shard does not fit in GPU memory (24 GB is not enough for the 70B GPTQ model, unlike the 8B AWQ model) and tensor parallelism is actually needed, so something in the tensor-parallel code may have changed in v0.4.2 and later.

2. As another note, I ran the same Llama 70B GPTQ on a single 48 GB card (v6000), and everything worked great even with v0.5.1 (and of course 0.4.2 and up), so it is almost certainly something in the Megatron-style tensor parallelism.
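For reference, this is roughly the side check from note 1, assuming the same prompts and sampling parameters as in the repro above; llm_awq and outputs_awq are just illustrative names:

# Same setup, but with the 8B AWQ model, whose shards easily fit on one 24 GB A10G.
llm_awq = LLM(model="TechxGenus/Meta-Llama-3-8B-Instruct-AWQ", tokenizer_mode="auto", gpu_memory_utilization=0.7, guided_decoding_backend="lm-format-enforcer", tensor_parallel_size=4)
outputs_awq = llm_awq.generate(prompts, sampling_params)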

If any other data is needed, just comment and mention me.
Thanks
