forked from ggml-org/llama.cpp
Opening this as a ticket, since it is quite a large thing to solve: at long context we still suffer a significant slowdown compared to the fast speed of the first 1-2k of context. The slowest nodes from the per-node timing dump (~18.7k context):
- 2144: [ 18725, 64, 8]x[ 47, 47, 47]=[ 18725, 64, 8] CONT ( 1) cpu = 44.000 / 44.000 ms, wall = 43.976 / 43.976 ms [ 60 V] [CPU] (Slow)
- 2150: [ 64, 18725, 8]x[ 64, 1, 128]=[ 18725, 1, 128] MUL_MAT ( 4) cpu = 29.000 / 7.250 ms, wall = 27.057 / 6.764 ms [ 60 KQ] [CPU]
- 2154: [ 18725, 64, 8]x[ 18725, 1, 128]=[ 64, 1, 128] MUL_MAT ( 4) cpu = 8.000 / 2.000 ms, wall = 11.296 / 2.824 ms [ 60 KQV] [CPU]
- 2164: [ 8192, 65040, 1]x[ 8192, 1, 1]=[ 65040, 1, 1] MUL_MAT ( 4) cpu = 7.000 / 1.750 ms, wall = 7.280 / 1.820 ms [ 0 result_lm_head] [GPUxQ]
- 2153: [ 18725, 1, 128]x[ 47, 47, 47]=[ 18725, 1, 128] SOFT_MAX ( 4) cpu = 5.000 / 1.250 ms, wall = 5.425 / 1.356 ms [ 60 KQ_soft_max] [CPU]
The biggest hit is materializing V straight after extracting it from the cache, and that should be something that can be avoided:
struct ggml_tensor * V =
    ggml_permute(ctx0,
        ggml_view_3d(ctx0, kv_self.v,
            head_dim, n_head_kv, n_past + N,                // ne0, ne1, ne2
            head_dim * sizeof_wtype,                        // nb1: stride between heads
            head_dim * n_head_kv * sizeof_wtype,            // nb2: stride between positions
            il * n_ctx * ggml_element_size(kv_self.v) * n_head_kv * head_dim), // per-layer offset
        1, 2, 0, 3); // [head_dim, n_head_kv, n_kv] -> [n_kv, head_dim, n_head_kv]
// The permuted view is not contiguous, so this copies the entire
// per-layer V cache (all n_past + N positions) on every token:
V = ggml_cont(ctx0, V);
Per-op totals for one generated token:
perf_total_per_op_us[ ADD] = 1.483 ms
perf_total_per_op_us[ MUL] = 1.183 ms
perf_total_per_op_us[ GELU] = 1.878 ms
perf_total_per_op_us[ NORM] = 1.800 ms
perf_total_per_op_us[ MUL_MAT] = 2913.213 ms
perf_total_per_op_us[ SCALE] = 21.552 ms
perf_total_per_op_us[ CPY] = 1.307 ms
perf_total_per_op_us[ CONT] = 2676.875 ms
perf_total_per_op_us[ VIEW] = 0.440 ms
perf_total_per_op_us[ PERMUTE] = 0.240 ms
perf_total_per_op_us[ GET_ROWS] = 0.008 ms
perf_total_per_op_us[ DIAG_MASK_INF] = 0.331 ms
perf_total_per_op_us[ SOFT_MAX] = 335.385 ms
perf_total_per_op_us[ ROPE] = 2.865 ms
Last layer:
- 2125: [ 8192, 1, 1]x[ 47, 47, 47]=[ 8192, 1, 1] NORM ( 4) cpu = 0.000 / 0.000 ms, wall = 0.037 / 0.009 ms [ 60 node_2125] [CPU]
- 2126: [ 8192, 1, 1]x[ 8192, 1, 1]=[ 8192, 1, 1] MUL ( 4) cpu = 0.000 / 0.000 ms, wall = 0.004 / 0.001 ms [ 60 node_2126] [CPU]
- 2127: [ 8192, 1, 1]x[ 8192, 1, 1]=[ 8192, 1, 1] ADD ( 4) cpu = 0.000 / 0.000 ms, wall = 0.002 / 0.001 ms [ 60 node_2127] [CPU]
- 2128: [ 8192, 9216, 1]x[ 8192, 1, 1]=[ 9216, 1, 1] MUL_MAT ( 4) cpu = 1.000 / 0.250 ms, wall = 1.037 / 0.259 ms [ 60 node_2128] [GPUxQ]
- 2129: [ 9216, 1, 1]x[ 47, 47, 47]=[ 64, 8, 1] VIEW ( 1) cpu = 0.000 / 0.000 ms, wall = 0.001 / 0.001 ms [ 60 Kcur] [CPU]
- 2130: [ 64, 8, 1]x[ 4, 1, 1]=[ 64, 8, 1] ROPE ( 4) cpu = 0.000 / 0.000 ms, wall = 0.007 / 0.002 ms [ 60 Kcur (view)] [CPU]
- 2131: [614400000, 1, 1]x[ 47, 47, 47]=[ 512, 1, 1] VIEW ( 1) cpu = 0.000 / 0.000 ms, wall = 0.001 / 0.001 ms [ 60 k] [GPU]
- 2132: [ 64, 8, 1]x[ 512, 1, 1]=[ 512, 1, 1] CPY ( 4) cpu = 0.000 / 0.000 ms, wall = 0.004 / 0.001 ms [ 60 k (copy of Kcur (view))] [CPU]
- 2133: [ 9216, 1, 1]x[ 47, 47, 47]=[ 64, 8, 1] VIEW ( 1) cpu = 0.000 / 0.000 ms, wall = 0.001 / 0.001 ms [ 60 Vcur] [CPU]
- 2134: [614400000, 1, 1]x[ 47, 47, 47]=[ 512, 1, 1] VIEW ( 1) cpu = 0.000 / 0.000 ms, wall = 0.001 / 0.001 ms [ 60 v] [GPU]
- 2135: [ 64, 8, 1]x[ 512, 1, 1]=[ 512, 1, 1] CPY ( 4) cpu = 0.000 / 0.000 ms, wall = 0.001 / 0.000 ms [ 60 v (copy of Vcur)] [CPU]
- 2136: [ 8192, 1, 1]x[ 47, 47, 47]=[ 8192, 1, 1] NORM ( 4) cpu = 0.000 / 0.000 ms, wall = 0.010 / 0.003 ms [ 60 node_2136] [CPU]
- 2137: [ 8192, 1, 1]x[ 8192, 1, 1]=[ 8192, 1, 1] MUL ( 4) cpu = 0.000 / 0.000 ms, wall = 0.003 / 0.001 ms [ 60 node_2137] [CPU]
- 2138: [ 8192, 1, 1]x[ 8192, 1, 1]=[ 8192, 1, 1] ADD ( 3) cpu = 0.000 / 0.000 ms, wall = 0.002 / 0.001 ms [ 60 inpFF] [CPU]
- 2139: [ 8192, 32768, 1]x[ 8192, 1, 1]=[ 32768, 1, 1] MUL_MAT ( 4) cpu = 4.000 / 1.000 ms, wall = 3.851 / 0.963 ms [ 60 inpFF*ff_up] [GPUxQ]
- 2140: [ 32768, 1, 1]x[ 47, 47, 47]=[ 32768, 1, 1] GELU ( 4) cpu = 0.000 / 0.000 ms, wall = 0.033 / 0.008 ms [ 60 inpFF*ff_up (view)] [CPU]
- 2141: [ 32768, 8192, 1]x[ 32768, 1, 1]=[ 8192, 1, 1] MUL_MAT ( 4) cpu = 3.000 / 0.750 ms, wall = 3.570 / 0.892 ms [ 60 gelu_cur*ff_down] [GPUxQ]
- 2142: [614400000, 1, 1]x[ 47, 47, 47]=[ 64, 8,18725] VIEW ( 1) cpu = 0.000 / 0.000 ms, wall = 0.001 / 0.001 ms [ 60 cache_v (view)] [GPU]
- 2143: [ 64, 8,18725]x[ 47, 47, 47]=[ 18725, 64, 8] PERMUTE ( 1) cpu = 0.000 / 0.000 ms, wall = 0.001 / 0.001 ms [ 60 cache_v (view) (permuted)] [CPU]
- 2144: [ 18725, 64, 8]x[ 47, 47, 47]=[ 18725, 64, 8] CONT ( 1) cpu = 44.000 / 44.000 ms, wall = 43.976 / 43.976 ms [ 60 V] [CPU] (Slow)
- 2145: [614400000, 1, 1]x[ 47, 47, 47]=[ 64, 8,18725] VIEW ( 1) cpu = 0.000 / 0.000 ms, wall = 0.001 / 0.001 ms [ 60 cache_k (view)] [GPU]
- 2146: [ 64, 8,18725]x[ 47, 47, 47]=[ 64, 18725, 8] PERMUTE ( 1) cpu = 0.000 / 0.000 ms, wall = 0.001 / 0.001 ms [ 60 K] [CPU]
- 2147: [ 9216, 1, 1]x[ 47, 47, 47]=[ 64, 128, 1] VIEW ( 1) cpu = 0.000 / 0.000 ms, wall = 0.001 / 0.001 ms [ 60 Qcur] [CPU]
- 2148: [ 64, 128, 1]x[ 4, 1, 1]=[ 64, 128, 1] ROPE ( 4) cpu = 0.000 / 0.000 ms, wall = 0.036 / 0.009 ms [ 60 Qcur (view)] [CPU]
- 2149: [ 64, 128, 1]x[ 47, 47, 47]=[ 64, 1, 128] PERMUTE ( 1) cpu = 0.000 / 0.000 ms, wall = 0.001 / 0.001 ms [ 60 Q] [CPU]
- 2150: [ 64, 18725, 8]x[ 64, 1, 128]=[ 18725, 1, 128] MUL_MAT ( 4) cpu = 29.000 / 7.250 ms, wall = 27.057 / 6.764 ms [ 60 KQ] [CPU]
- 2151: [ 18725, 1, 128]x[ 1, 1, 1]=[ 18725, 1, 128] SCALE ( 1) cpu = 0.000 / 0.000 ms, wall = 0.370 / 0.370 ms [ 60 KQ_scaled] [CPU]
- 2152: [ 18725, 1, 128]x[ 2, 1, 1]=[ 18725, 1, 128] DIAG_MASK_INF ( 4) cpu = 0.000 / 0.000 ms, wall = 0.004 / 0.001 ms [ 60 KQ_masked] [CPU]
- 2153: [ 18725, 1, 128]x[ 47, 47, 47]=[ 18725, 1, 128] SOFT_MAX ( 4) cpu = 5.000 / 1.250 ms, wall = 5.425 / 1.356 ms [ 60 KQ_soft_max] [CPU]
- 2154: [ 18725, 64, 8]x[ 18725, 1, 128]=[ 64, 1, 128] MUL_MAT ( 4) cpu = 8.000 / 2.000 ms, wall = 11.296 / 2.824 ms [ 60 KQV] [CPU]
- 2155: [ 64, 1, 128]x[ 47, 47, 47]=[ 64, 128, 1] PERMUTE ( 1) cpu = 0.000 / 0.000 ms, wall = 0.001 / 0.001 ms [ 60 KQV_merged] [CPU]
- 2156: [ 64, 128, 1]x[ 8192, 1, 1]=[ 8192, 1, 1] CPY ( 4) cpu = 0.000 / 0.000 ms, wall = 0.008 / 0.002 ms [ 60 KQV_merged (copy)] [CPU]
- 2157: [ 8192, 8192, 1]x[ 8192, 1, 1]=[ 8192, 1, 1] MUL_MAT ( 4) cpu = 2.000 / 0.500 ms, wall = 1.139 / 0.285 ms [ 60 result_wo] [GPUxQ]
- 2158: [ 8192, 1, 1]x[ 8192, 1, 1]=[ 8192, 1, 1] CPY ( 4) cpu = 0.000 / 0.000 ms, wall = 0.006 / 0.002 ms [ 60 attn_out] [CPU]
- 2159: [ 8192, 1, 1]x[ 8192, 1, 1]=[ 8192, 1, 1] ADD ( 4) cpu = 0.000 / 0.000 ms, wall = 0.003 / 0.001 ms [ 60 node_2159] [CPU]
- 2160: [ 8192, 1, 1]x[ 8192, 1, 1]=[ 8192, 1, 1] ADD ( 4) cpu = 0.000 / 0.000 ms, wall = 0.017 / 0.004 ms [ 60 inpFF_+_result_attn_out] [CPU]
- 2161: [ 8192, 1, 1]x[ 47, 47, 47]=[ 8192, 1, 1] NORM ( 4) cpu = 0.000 / 0.000 ms, wall = 0.006 / 0.002 ms [ 0 norm_cur] [CPU]
- 2162: [ 8192, 1, 1]x[ 8192, 1, 1]=[ 8192, 1, 1] MUL ( 2) cpu = 0.000 / 0.000 ms, wall = 0.003 / 0.002 ms [ 0 node_2162] [CPU]
- 2163: [ 8192, 1, 1]x[ 8192, 1, 1]=[ 8192, 1, 1] ADD ( 3) cpu = 0.000 / 0.000 ms, wall = 0.002 / 0.001 ms [ 0 result_norm] [CPU]
- 2164: [ 8192, 65040, 1]x[ 8192, 1, 1]=[ 65040, 1, 1] MUL_MAT ( 4) cpu = 7.000 / 1.750 ms, wall = 7.280 / 1.820 ms [ 0 result_lm_head] [GPUxQ]