Performance at high context (18k+) #56

Opening this as a ticket because it is a fairly large problem to solve.
At high context, generation is still significantly slower than the speed we get for the first 1-2k tokens of context.
The slowest nodes in the last layer at 18725 tokens of context:

 - 2144: [ 18725,    64,   8]x[    47,    47,  47]=[ 18725,    64,   8]             CONT   (  1) cpu =  44.000 /  44.000 ms, wall =  43.976 /  43.976 ms [ 60 V] [CPU]  (Slow)
 - 2150: [    64, 18725,   8]x[    64,     1, 128]=[ 18725,     1, 128]          MUL_MAT   (  4) cpu =  29.000 /   7.250 ms, wall =  27.057 /   6.764 ms [ 60 KQ] [CPU]
 - 2154: [ 18725,    64,   8]x[ 18725,     1, 128]=[    64,     1, 128]          MUL_MAT   (  4) cpu =   8.000 /   2.000 ms, wall =  11.296 /   2.824 ms [ 60 KQV] [CPU]
 - 2164: [  8192, 65040,   1]x[  8192,     1,   1]=[ 65040,     1,   1]          MUL_MAT   (  4) cpu =   7.000 /   1.750 ms, wall =   7.280 /   1.820 ms [  0 result_lm_head] [GPUxQ]
 - 2153: [ 18725,     1, 128]x[    47,    47,  47]=[ 18725,     1, 128]         SOFT_MAX   (  4) cpu =   5.000 /   1.250 ms, wall =   5.425 /   1.356 ms [ 60 KQ_soft_max] [CPU]

The biggest hit is the CONT that makes V contiguous straight after it is extracted from the cache (node 2144: about 44 ms per layer, on a single thread). That copy should be avoidable.
```
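            // View this layer's slice of the V cache as [head_dim, n_head_kv, n_past + N],
            // then permute it so the token dimension comes first.
            struct ggml_tensor* V = ggml_permute(
                ctx0,
                ggml_view_3d(
                    ctx0,
                    kv_self.v,
                    head_dim, n_head_kv, n_past + N,
                    head_dim * sizeof_wtype,
                    head_dim * n_head_kv * sizeof_wtype,
                    il * n_ctx * ggml_element_size(kv_self.v) * n_head_kv * head_dim),
                1, 2, 0, 3);
            // ggml_cont copies the permuted view into contiguous memory;
            // this is the slow CONT node in the profile.
            V = ggml_cont(ctx0, V);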
```
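One possible way to avoid the copy, sketched here in plain C rather than against the ggml API: store V in the cache already transposed, so that each (head, dim) row is contiguous along the token axis and the attention product can read it in place. Everything in this sketch is an assumption for illustration: the toy sizes, the helper names (`idx_token_major`, `idx_transposed`, `attend_*`, `compare_layouts`), and the simplification of one probability row per KV head. It only shows that both layouts give the same result while the transposed one copies nothing per token.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy sizes, chosen for illustration only (not the model's real dimensions). */
enum { HEAD_DIM = 4, N_HEAD_KV = 2, N_CTX = 16 };
#define CACHE_ELEMS (N_CTX * N_HEAD_KV * HEAD_DIM)

/* Layout A (as in the issue): token-major, v[t][h][d]. */
static size_t idx_token_major(int t, int h, int d) {
    return ((size_t)t * N_HEAD_KV + h) * HEAD_DIM + d;
}

/* Layout B (hypothetical): transposed, vt[h][d][t], each row N_CTX long. */
static size_t idx_transposed(int t, int h, int d) {
    return ((size_t)h * HEAD_DIM + d) * N_CTX + t;
}

/* Layout A: gather the cache into a contiguous [h][d][t] buffer first
 * (the equivalent of PERMUTE + CONT), then take the dot products.
 * Returns the number of elements copied. */
static size_t attend_token_major(const float *v, const float *p, int n_tok, float *out) {
    float *tmp = malloc(sizeof(float) * N_HEAD_KV * HEAD_DIM * (size_t)n_tok);
    size_t copied = 0;
    assert(tmp != NULL);
    for (int h = 0; h < N_HEAD_KV; h++)
        for (int d = 0; d < HEAD_DIM; d++)
            for (int t = 0; t < n_tok; t++) {
                tmp[((size_t)h * HEAD_DIM + d) * n_tok + t] = v[idx_token_major(t, h, d)];
                copied++;
            }
    for (int h = 0; h < N_HEAD_KV; h++)
        for (int d = 0; d < HEAD_DIM; d++) {
            const float *row = tmp + ((size_t)h * HEAD_DIM + d) * n_tok;
            float acc = 0.0f;
            for (int t = 0; t < n_tok; t++) acc += p[h * n_tok + t] * row[t];
            out[h * HEAD_DIM + d] = acc;
        }
    free(tmp);
    return copied;
}

/* Layout B: rows are already contiguous along t, so read them in place.
 * Returns the number of elements copied, which is always zero. */
static size_t attend_transposed(const float *vt, const float *p, int n_tok, float *out) {
    for (int h = 0; h < N_HEAD_KV; h++)
        for (int d = 0; d < HEAD_DIM; d++) {
            const float *row = vt + idx_transposed(0, h, d);
            float acc = 0.0f;
            for (int t = 0; t < n_tok; t++) acc += p[h * n_tok + t] * row[t];
            out[h * HEAD_DIM + d] = acc;
        }
    return 0;
}

/* Fills both layouts with the same values, runs both paths, and returns
 * the largest absolute difference between the two outputs. */
static float compare_layouts(int n_tok, size_t *copied_a, size_t *copied_b) {
    static float va[CACHE_ELEMS], vb[CACHE_ELEMS];
    float p[N_HEAD_KV * N_CTX];
    float oa[N_HEAD_KV * HEAD_DIM], ob[N_HEAD_KV * HEAD_DIM];
    float max_diff = 0.0f;
    assert(n_tok > 0 && n_tok <= N_CTX);
    for (int t = 0; t < n_tok; t++)
        for (int h = 0; h < N_HEAD_KV; h++)
            for (int d = 0; d < HEAD_DIM; d++) {
                float val = (float)(t * 100 + h * 10 + d);
                va[idx_token_major(t, h, d)] = val; /* one contiguous write per token */
                vb[idx_transposed(t, h, d)] = val;  /* strided write, stride N_CTX */
            }
    for (int h = 0; h < N_HEAD_KV; h++)
        for (int t = 0; t < n_tok; t++)
            p[h * n_tok + t] = 1.0f / (float)(1 + t + h);
    *copied_a = attend_token_major(va, p, n_tok, oa);
    *copied_b = attend_transposed(vb, p, n_tok, ob);
    for (int i = 0; i < N_HEAD_KV * HEAD_DIM; i++) {
        float diff = oa[i] - ob[i];
        if (diff < 0.0f) diff = -diff;
        if (diff > max_diff) max_diff = diff;
    }
    return max_diff;
}
```

Calling `compare_layouts(n_tok, &copied_a, &copied_b)` from a small `main` shows the trade-off: the token-major path copies `n_tok * N_HEAD_KV * HEAD_DIM` elements on every evaluation, while the transposed path pays only a strided write of `N_HEAD_KV * HEAD_DIM` elements when a token is appended. Whether this maps cleanly onto the real cache code would need checking against the actual view and stride handling.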

Per-op totals for one token:
```
perf_total_per_op_us[             ADD] =   1.483 ms
perf_total_per_op_us[             MUL] =   1.183 ms
perf_total_per_op_us[            GELU] =   1.878 ms
perf_total_per_op_us[            NORM] =   1.800 ms
perf_total_per_op_us[         MUL_MAT] = 2913.213 ms
perf_total_per_op_us[           SCALE] =  21.552 ms
perf_total_per_op_us[             CPY] =   1.307 ms
perf_total_per_op_us[            CONT] = 2676.875 ms
perf_total_per_op_us[            VIEW] =   0.440 ms
perf_total_per_op_us[         PERMUTE] =   0.240 ms
perf_total_per_op_us[        GET_ROWS] =   0.008 ms
perf_total_per_op_us[   DIAG_MASK_INF] =   0.331 ms
perf_total_per_op_us[        SOFT_MAX] = 335.385 ms
perf_total_per_op_us[            ROPE] =   2.865 ms
```
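To put those totals in proportion, here is a small sketch that computes each op's share of the summed time. It takes the figures above at face value; the table and helper names (`OPS`, `total_ms`, `share_percent`) are made up for this example. By this arithmetic CONT is roughly 45% of the total, MUL_MAT roughly 49%, and SOFT_MAX roughly 6%.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Per-op totals copied from the listing above, in ms. */
struct op_time { const char *name; double ms; };

static const struct op_time OPS[] = {
    {"ADD", 1.483},      {"MUL", 1.183},      {"GELU", 1.878},
    {"NORM", 1.800},     {"MUL_MAT", 2913.213}, {"SCALE", 21.552},
    {"CPY", 1.307},      {"CONT", 2676.875},  {"VIEW", 0.440},
    {"PERMUTE", 0.240},  {"GET_ROWS", 0.008}, {"DIAG_MASK_INF", 0.331},
    {"SOFT_MAX", 335.385}, {"ROPE", 2.865},
};
enum { N_OPS = sizeof(OPS) / sizeof(OPS[0]) };

/* Sum of all per-op totals. */
static double total_ms(void) {
    double sum = 0.0;
    for (size_t i = 0; i < N_OPS; i++) sum += OPS[i].ms;
    return sum;
}

/* Share of the summed time spent in the named op, in percent.
 * Returns -1.0 if the op is not in the table. */
static double share_percent(const char *name) {
    double total = total_ms();
    assert(total > 0.0);
    for (size_t i = 0; i < N_OPS; i++)
        if (strcmp(OPS[i].name, name) == 0)
            return 100.0 * OPS[i].ms / total;
    return -1.0;
}
```

For example, `share_percent("CONT")` gives the fraction of per-token time that removing the V copy could recover at best, assuming nothing else changes.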

Per-node timings for the last layer:
```
 - 2125: [  8192,     1,   1]x[    47,    47,  47]=[  8192,     1,   1]             NORM   (  4) cpu =   0.000 /   0.000 ms, wall =   0.037 /   0.009 ms [ 60 node_2125] [CPU]
 - 2126: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              MUL   (  4) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.001 ms [ 60 node_2126] [CPU]
 - 2127: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              ADD   (  4) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.001 ms [ 60 node_2127] [CPU]
 - 2128: [  8192,  9216,   1]x[  8192,     1,   1]=[  9216,     1,   1]          MUL_MAT   (  4) cpu =   1.000 /   0.250 ms, wall =   1.037 /   0.259 ms [ 60 node_2128] [GPUxQ]
 - 2129: [  9216,     1,   1]x[    47,    47,  47]=[    64,     8,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 Kcur] [CPU]
 - 2130: [    64,     8,   1]x[     4,     1,   1]=[    64,     8,   1]             ROPE   (  4) cpu =   0.000 /   0.000 ms, wall =   0.007 /   0.002 ms [ 60 Kcur (view)] [CPU]
 - 2131: [614400000,     1,   1]x[    47,    47,  47]=[   512,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 k] [GPU]
 - 2132: [    64,     8,   1]x[   512,     1,   1]=[   512,     1,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.001 ms [ 60 k (copy of Kcur (view))] [CPU]
 - 2133: [  9216,     1,   1]x[    47,    47,  47]=[    64,     8,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 Vcur] [CPU]
 - 2134: [614400000,     1,   1]x[    47,    47,  47]=[   512,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 v] [GPU]
 - 2135: [    64,     8,   1]x[   512,     1,   1]=[   512,     1,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.000 ms [ 60 v (copy of Vcur)] [CPU]
 - 2136: [  8192,     1,   1]x[    47,    47,  47]=[  8192,     1,   1]             NORM   (  4) cpu =   0.000 /   0.000 ms, wall =   0.010 /   0.003 ms [ 60 node_2136] [CPU]
 - 2137: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              MUL   (  4) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.001 ms [ 60 node_2137] [CPU]
 - 2138: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              ADD   (  3) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.001 ms [ 60 inpFF] [CPU]
 - 2139: [  8192, 32768,   1]x[  8192,     1,   1]=[ 32768,     1,   1]          MUL_MAT   (  4) cpu =   4.000 /   1.000 ms, wall =   3.851 /   0.963 ms [ 60 inpFF*ff_up] [GPUxQ]
 - 2140: [ 32768,     1,   1]x[    47,    47,  47]=[ 32768,     1,   1]             GELU   (  4) cpu =   0.000 /   0.000 ms, wall =   0.033 /   0.008 ms [ 60 inpFF*ff_up (view)] [CPU]
 - 2141: [ 32768,  8192,   1]x[ 32768,     1,   1]=[  8192,     1,   1]          MUL_MAT   (  4) cpu =   3.000 /   0.750 ms, wall =   3.570 /   0.892 ms [ 60 gelu_cur*ff_down] [GPUxQ]
 - 2142: [614400000,     1,   1]x[    47,    47,  47]=[    64,     8,18725]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 cache_v (view)] [GPU]
 - 2143: [    64,     8,18725]x[    47,    47,  47]=[ 18725,    64,   8]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 cache_v (view) (permuted)] [CPU]
 - 2144: [ 18725,    64,   8]x[    47,    47,  47]=[ 18725,    64,   8]             CONT   (  1) cpu =  44.000 /  44.000 ms, wall =  43.976 /  43.976 ms [ 60 V] [CPU]  (Slow)
 - 2145: [614400000,     1,   1]x[    47,    47,  47]=[    64,     8,18725]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 cache_k (view)] [GPU]
 - 2146: [    64,     8,18725]x[    47,    47,  47]=[    64, 18725,   8]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 K] [CPU]
 - 2147: [  9216,     1,   1]x[    47,    47,  47]=[    64,   128,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 Qcur] [CPU]
 - 2148: [    64,   128,   1]x[     4,     1,   1]=[    64,   128,   1]             ROPE   (  4) cpu =   0.000 /   0.000 ms, wall =   0.036 /   0.009 ms [ 60 Qcur (view)] [CPU]
 - 2149: [    64,   128,   1]x[    47,    47,  47]=[    64,     1, 128]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 Q] [CPU]
 - 2150: [    64, 18725,   8]x[    64,     1, 128]=[ 18725,     1, 128]          MUL_MAT   (  4) cpu =  29.000 /   7.250 ms, wall =  27.057 /   6.764 ms [ 60 KQ] [CPU]
 - 2151: [ 18725,     1, 128]x[     1,     1,   1]=[ 18725,     1, 128]            SCALE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.370 /   0.370 ms [ 60 KQ_scaled] [CPU]
 - 2152: [ 18725,     1, 128]x[     2,     1,   1]=[ 18725,     1, 128]    DIAG_MASK_INF   (  4) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.001 ms [ 60 KQ_masked] [CPU]
 - 2153: [ 18725,     1, 128]x[    47,    47,  47]=[ 18725,     1, 128]         SOFT_MAX   (  4) cpu =   5.000 /   1.250 ms, wall =   5.425 /   1.356 ms [ 60 KQ_soft_max] [CPU]
 - 2154: [ 18725,    64,   8]x[ 18725,     1, 128]=[    64,     1, 128]          MUL_MAT   (  4) cpu =   8.000 /   2.000 ms, wall =  11.296 /   2.824 ms [ 60 KQV] [CPU]
 - 2155: [    64,     1, 128]x[    47,    47,  47]=[    64,   128,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 KQV_merged] [CPU]
 - 2156: [    64,   128,   1]x[  8192,     1,   1]=[  8192,     1,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.008 /   0.002 ms [ 60 KQV_merged (copy)] [CPU]
 - 2157: [  8192,  8192,   1]x[  8192,     1,   1]=[  8192,     1,   1]          MUL_MAT   (  4) cpu =   2.000 /   0.500 ms, wall =   1.139 /   0.285 ms [ 60 result_wo] [GPUxQ]
 - 2158: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.006 /   0.002 ms [ 60 attn_out] [CPU]
 - 2159: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              ADD   (  4) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.001 ms [ 60 node_2159] [CPU]
 - 2160: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              ADD   (  4) cpu =   0.000 /   0.000 ms, wall =   0.017 /   0.004 ms [ 60 inpFF_+_result_attn_out] [CPU]
 - 2161: [  8192,     1,   1]x[    47,    47,  47]=[  8192,     1,   1]             NORM   (  4) cpu =   0.000 /   0.000 ms, wall =   0.006 /   0.002 ms [  0 norm_cur] [CPU]
 - 2162: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              MUL   (  2) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.002 ms [  0 node_2162] [CPU]
 - 2163: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              ADD   (  3) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.001 ms [  0 result_norm] [CPU]
 - 2164: [  8192, 65040,   1]x[  8192,     1,   1]=[ 65040,     1,   1]          MUL_MAT   (  4) cpu =   7.000 /   1.750 ms, wall =   7.280 /   1.820 ms [  0 result_lm_head] [GPUxQ]
```
