
Conversation

@MasterJH5574
Contributor

Prior to this PR, the TIR attention kernels do not cast the matmul operands to fp32 before multiplying.
For models like Phi-2, which may have large Q/K/V values (on the order of a few hundred), the fp16 multiplication can exceed the range of fp16 and sometimes leads to NaN attention results.

This PR fixes this issue.
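
Below is a minimal NumPy sketch (not the actual TIR kernel changed in this PR) illustrating why the cast matters: with Q/K entries of a few hundred and a typical head dimension, the fp16 dot product overflows fp16's maximum of about 65504, while casting the operands to fp32 keeps the score finite. The shapes and values are illustrative assumptions.

```python
import numpy as np

# Illustrative shapes and values (assumptions, not taken from the PR):
# head_dim = 64, Q/K entries around 300, as the description says can
# happen for Phi-2.
head_dim = 64
q = np.full((1, head_dim), 300.0, dtype=np.float16)
k = np.full((head_dim, 1), 300.0, dtype=np.float16)

# fp16 result: 64 * 300 * 300 = 5.76e6 exceeds fp16's max (~65504),
# so the score overflows to inf, which becomes NaN once softmax
# computes inf - inf.
scores_fp16 = q @ k

# Casting the operands to fp32 before the matmul keeps the score finite.
scores_fp32 = q.astype(np.float32) @ k.astype(np.float32)

print(scores_fp16)  # [[inf]]
print(scores_fp32)  # [[5760000.]]
```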

@MasterJH5574
Contributor Author

@tvm-bot rerun

@yongwww merged commit ad1da4e into apache:main Mar 4, 2024
Lunderberg pushed a commit to Lunderberg/tvm that referenced this pull request Mar 12, 2024
thaisacs pushed a commit to thaisacs/tvm that referenced this pull request Apr 3, 2024