[Kernel] Support Fp8 Checkpoints for Mixtral (Dynamic + Static) #4436
Closed
Changes from all commits (94 commits)
79c94a1  fixed fp8 conflict with aqlm
f8b57e4  added quantization tests to buildkite
7175e5b  removed commented out piece
7a7520d  model loaded!
e0b4d72  renamed
f96428e  stash
88ba83b  added static fp8
0848b25  to try with torch.scaled_mm
15882ea  stash
7e6b675  added way to do weight quantization
cc959ea  working!
8d68dbc  fixed llama
881fc65  fixed llama again
e6dd46f  updated names
7e3933b  nit
453a236  cleanup
310e0a7  cleanup
ab4cb02  missed file :)
2edd93a  Update fp8.py
robertgshaw2-redhat ccee5d3  Implement static scaling for Mixtral
pcmoritz 8f71c79  fix
pcmoritz 6eb01e0  update
pcmoritz dc89cbc  fix
pcmoritz be60845  update
pcmoritz 4613cb5  update
pcmoritz 3d95d86  fix
pcmoritz 642763f  move
pcmoritz 706e931  update
pcmoritz 9a3c78c  lol
pcmoritz 1b6f020  fix cuda graph
pcmoritz b09bcec  fix
pcmoritz 052e2b3  update
pcmoritz b33c6d7  update
pcmoritz 475f58d  refactor
pcmoritz 56b4880  update
pcmoritz be37154  revert
pcmoritz 9c54d19  format
pcmoritz c5155ea  Update vllm/_custom_ops.py
pcmoritz 948cca7  Update vllm/model_executor/layers/fused_moe/fused_moe.py
pcmoritz 3feb887  Update vllm/model_executor/models/mixtral.py
pcmoritz df16316  format
pcmoritz 7b6b0fa  support static scales
1a3b2e1  fixed example
63ad2ef  Delete quantize.ipynb
robertgshaw2-redhat 794f1a1  Update vllm/_custom_ops.py
pcmoritz c13b6a4  update
pcmoritz 5a230ed  update
pcmoritz 80069c9  format
pcmoritz 5ce17d0  activation_scale -> act_scale
pcmoritz 5fc0335  Update scheme->activation_scheme
mgoin 92d5162  fix dynamic scaling -- need init to zero due to atomic update
pcmoritz e1bfe10  Format
mgoin 7242600  Fix tuple type
mgoin 8512513  Merge remote-tracking branch 'pcmoritz/mixtral-fp8-static' into fp8-s…
21ddbb4  stash tyler's state
d27015c  stash
1111f87  cutlass working, but slow jitting on hotpath
f5d32ae  first end to end run with mixtral
924e8ce  added missed file
823a2e7  Update run_fp8.py
mgoin 81f42be  Dynamic FP8 works, but static does not (#213)
robertgshaw2-redhat 1a4fd8a  static correctness
e48c981  static fp8 loading
02f683e  working for dense models
81b73ef  Update weight_utils.py
robertgshaw2-redhat 58dbe0f  moving mixtral updates to separate pr
6068dc5  Merge branch 'main' into fp8-static
robertgshaw2-redhat a8d4b33  make ./format pass
5be0970  better comments in linear.py
ef7992b  better comments in linear.py
0667791  fixed opt-125
d8adf14  removed run_fp8.py
9bb1a2b  format
169c9ed  Cleanup opt.py
mgoin 8ef9c7d  added testing
c7d6dd6  ./format.sh
50b5823  fixed typing
4156ca9  fixed typing
3148fc9  added warning format
7846d67  Update opt.py
robertgshaw2-redhat ba408c6  formatted
04617fd  Update vllm/model_executor/layers/quantization/fp8.py
robertgshaw2-redhat cc3d395  Update vllm/model_executor/layers/quantization/fp8.py
robertgshaw2-redhat 6005ed2  baseline mixtral loading but not correct
5ca78f1  Merge branch 'fp8-static' into fp8-mixtral
51a686b  mixtral working end-to-end
e01833c  added test
03312e4  added test
171fcc9  format. Codespell not happy
b74b0a4  removed test b/c cannot get codespell to pass
233963b  Update format.sh
robertgshaw2-redhat f60aa36  formatted
82a8736  Merge remote-tracking branch 'upstream/main' into fp8-mixtral
mgoin c5a68fb  Fix mixtral definition
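Several of the commits above ("added static fp8", "to try with torch.scaled_mm", "activation_scale -> act_scale", "fix dynamic scaling") revolve around the PR's core idea: per-tensor FP8 quantization of activations with either a dynamic scale computed at runtime or a static scale loaded from the checkpoint. The sketch below only illustrates that distinction in plain PyTorch and is not the kernel code added by this PR; the function names and the example scale value are made up.

```python
import torch

# Largest representable magnitude in the e4m3 FP8 format (448.0).
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max


def quantize_fp8_dynamic(x: torch.Tensor):
    """Dynamic scheme: derive the per-tensor scale from the activation's
    observed absolute maximum at runtime, then cast to FP8."""
    scale = x.abs().max().clamp(min=1e-12) / FP8_MAX
    x_fp8 = (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale


def quantize_fp8_static(x: torch.Tensor, act_scale: torch.Tensor):
    """Static scheme: reuse a scale calibrated offline and shipped in the
    FP8 checkpoint (the act_scale tensors referenced in the commits),
    so no runtime max-reduction is needed."""
    x_fp8 = (x / act_scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return x_fp8, act_scale


x = torch.randn(4, 16)
x_dyn, s_dyn = quantize_fp8_dynamic(x)                     # scale depends on x
x_sta, s_sta = quantize_fp8_static(x, torch.tensor(0.02))  # scale fixed ahead of time
```

In both schemes the FP8 tensors and their scales would then feed a scaled matmul (e.g. torch._scaled_mm, whose exact signature varies across PyTorch versions); static scaling avoids the extra reduction over the activations on the hot path, which is why the checkpoint format carries act_scale.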
mgoin File filter
Filter by extension
Conversations
Remove all device="cuda" in this or the next PR.
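For illustration only (the variable name here is hypothetical, not the PR's actual code), the request amounts to dropping hardcoded device arguments like this:

```python
import torch

# Before (hardcoded device, flagged in the review):
#   act_scale = torch.nn.Parameter(torch.zeros(1, device="cuda"), requires_grad=False)

# After: leave the device unspecified and rely on the usual model.to(device)
# placement, or inherit the device from an existing weight tensor.
act_scale = torch.nn.Parameter(torch.zeros(1), requires_grad=False)
```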