
[Bug]: tensor parallel (of 4 cards) gives bad answers in version 0.5.1 and later (compared to 0.4.1) with gptq marlin kernels (compared to gptq) #6258

@orellavie1212

Your current environment

SageMaker ml.g5.12xlarge instance (4x NVIDIA A10G, 24 GB each).
The container is
763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121
from
https://github.com/aws/deep-learning-containers/blob/master/available_images.md

🐛 Describe the bug

from vllm import LLM, SamplingParams
question = "what is the id of the team and what is the subtitute lineup of the home team for the match?"
history = str(["how many games the home team Sevilla won?"])
full_example = f"""
\n \n\nYou are a transformation helper specialist, based on the history helping in transforming user input\nto a more structured and simpler text to a smaller model, which is less smart as you. \n\nMost of the times, the history could help you about entities which are now missing \nfrom the question\nTo illustrate the mission, if the user asked in the history about an entity (like 'Barcelona'), and \nnow he asked about 'team' (could be team, player, or other entity) or it seems to you that the an entity \nis missing in the context, perhaps the entity ('Barcelona') from the history could be the option to fill the gap. \n\nIf there is no entity in the history, please do not hallucinate and offer weird entity, for example if in the history\nyou saw 'home team' and now he just mentioned 'team', replace 'team' with 'home team' (applicable for away team too).\n\nWhen a replacement is occurred, please do not add 'the' as part of the entity, just entity itself.\n\n \nReturn it in valid JSON format according to the schema\n \n\n User question: \n\n {question} \n\n History: \n\n {history}"""

prompts = [full_example]
sampling_params = SamplingParams(temperature=0, top_p=0.95, max_tokens=256)

llm = LLM(model="TechxGenus/Meta-Llama-3-70B-Instruct-GPTQ", tokenizer_mode="auto", gpu_memory_utilization=0.7, guided_decoding_backend="lm-format-enforcer", tensor_parallel_size=4)

The quantization argument is intentionally not passed, so vLLM picks the Marlin (gptq_marlin) kernels instead of the standard GPTQ kernels when they are available.

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

The output for version 0.4.1 is correct:
"what is the id of Sevilla and what is the substitute lineup of Sevilla for the match?"

For all later versions (0.4.2 and above) it returns garbage that does not make sense, e.g. "1900 1900 1900 1900" and strange tokens in the answer.

Notes:

1. As a side check I ran exactly the same setup with TechxGenus/Meta-Llama-3-8B-Instruct-AWQ, and it worked (see the sketch after these notes). My suspicion is that the problem appears when a single shard does not fit in GPU memory (24 GB is not enough for the 70B GPTQ model, unlike the 8B AWQ model) and tensor parallelism is actually needed, so something in the tensor-parallel code may have changed in v0.4.2 and later.

2. As another note, I ran the same Llama 70B GPTQ on a single 48 GB card (v6000), and everything worked great even with v0.5.1 (and of course 0.4.2 and up), so it is almost certainly something in the Megatron-style tensor parallelism.
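For reference, this is roughly the side check from note 1, assuming the same prompts and sampling parameters as in the repro above; llm_awq and outputs_awq are just illustrative names:

# Same setup, but with the 8B AWQ model, whose shards easily fit on one 24 GB A10G.
llm_awq = LLM(model="TechxGenus/Meta-Llama-3-8B-Instruct-AWQ", tokenizer_mode="auto", gpu_memory_utilization=0.7, guided_decoding_backend="lm-format-enforcer", tensor_parallel_size=4)
outputs_awq = llm_awq.generate(prompts, sampling_params)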

If any other data is needed, just comment and mention me.
Thanks
