Commit f16af8e: Update README.md

Correcting the old numbers in the main README with the correct numbers from the quantization README.

1 parent 96d49cd

File tree: 1 file changed (+11, -9 lines)


README.md

Lines changed: 11 additions & 9 deletions
```diff
@@ -29,15 +29,17 @@ The models used were `meta-llama/Llama-2-7b-chat-hf` and `meta-llama/Meta-Llama-
 
 | Model      | Technique       | wikitext-perplexity | Tokens/Second | Memory Bandwidth (GB/s) | Peak Memory (GB) | Model Size (GB) |
 | ---------- | --------------- | ------------------- | ------------- | ----------------------- | ---------------- | --------------- |
-| Llama-2-7B | Base (bfloat16) | 12.212              | 105.02        | 1387.78                 | 13.21            | 13.90           |
-|            | int8dq          | 12.262              | 9.40          | 62.26                   | 6.62             | 8.61            |
-|            | int8wo          | 12.204              | 147.03        | 973.54                  | 6.62             | 8.95            |
-|            | int4wo-64       | 12.843              | 199.81        | 746.45                  | 3.74             | 4.75            |
-|            | int4wo-64-GPTQ  | 12.489              | 199.81        | 746.45                  | 3.74             | 4.75            |
-| Llama-3-8B | Base (bfloat16) |                     | 94.91         | 1424.58                 | 15.01            | 16.43           |
-|            | int8dq          |                     | 8.41          | 63.23                   | 7.52             | 9.24            |
-|            | int8wo          |                     | 136.75        | 1028.38                 | 7.52             | 10.42           |
-|            | int4wo-64       |                     | 179.41        | 757.45                  | 4.22             | 6.88            |
+| Llama-2-7B | Base (bfloat16) | 12.212              | 105.14        | 1389.35                 | 13.88            | 13.21           |
+|            | int8dq          | 12.262              | 9.20          | 60.93                   | 8.33             | 6.62            |
+|            | int8wo          | 12.204              | 150.18        | 994.40                  | 8.95             | 6.62            |
+|            | int4wo-64       | 12.843              | 199.86        | 746.66                  | 4.50             | 3.74            |
+|            | int4wo-64-GPTQ  | 12.489              | 199.86        | 746.66                  | 4.50             | 3.74            |
+|            | autoquant       | 12.204              | 159.22        | 1069.87                 | 8.91             | 6.72            |
+| Llama-3-8B | Base (bfloat16) | N/A                 | 94.97         | 1425.55                 | 16.43            | 15.01           |
+|            | int8dq          | N/A                 | 8.44          | 63.45                   | 8.98             | 7.52            |
+|            | int8wo          | N/A                 | 139.76        | 1051.02                 | 10.42            | 7.52            |
+|            | int4wo-64       | N/A                 | 179.44        | 757.60                  | 6.62             | 4.22            |
+|            | autoquant       | N/A                 | 137.71        | 1037.74                 | 11.08            | 7.54            |
 
 note: Int8 dynamic quantization works best on compute bound as opposed to memory bound models. Some relatable examples might be [SAM](https://github.com/pytorch-labs/segment-anything-fast) which is compute bound vs Llama at batchsize=1 which is memory bound.
 
```
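The technique labels in the table map to torchao quantization APIs. As a rough illustration (not part of this commit), here is a minimal sketch of how each technique might be applied, assuming a recent `torchao.quantization` release that exposes `quantize_`, `int8_weight_only`, `int4_weight_only`, `int8_dynamic_activation_int8_weight`, and `autoquant`; older releases used different entry points, so treat the names as assumptions rather than the exact API this commit benchmarked:

```python
# Minimal sketch: applying the table's quantization techniques with torchao.
# Assumes a recent torchao release exposing quantize_()/autoquant(); earlier
# releases used different entry points, so these names are illustrative.
import torch
from torchao.quantization import (
    quantize_,
    int8_weight_only,                     # "int8wo" rows
    int4_weight_only,                     # "int4wo-64" rows (group_size=64)
    int8_dynamic_activation_int8_weight,  # "int8dq" rows
    autoquant,                            # "autoquant" rows
)

# Stand-in for a Llama checkpoint: any module with nn.Linear layers works.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.bfloat16).cuda()

# Weight-only int8 quantization, applied in place:
quantize_(model, int8_weight_only())

# Alternatively, weight-only int4 with group size 64:
#   quantize_(model, int4_weight_only(group_size=64))
# or dynamically quantized int8 activations with int8 weights:
#   quantize_(model, int8_dynamic_activation_int8_weight())
# or let autoquant benchmark options per layer and pick the fastest:
#   model = autoquant(torch.compile(model))
#   model(torch.randn(1, 1024, dtype=torch.bfloat16, device="cuda"))
```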
