Performance at high context (18k+) #56

@cmp-nct

Opening this as a ticket because it is a fairly large problem to solve.
At high context (18k+ tokens), generation is still significantly slower than it is for the first 1-2k tokens of context.
The slowest ops in the last layer at a context of 18725 tokens:

  • 2144: [ 18725, 64, 8]x[ 47, 47, 47]=[ 18725, 64, 8] CONT ( 1) cpu = 44.000 / 44.000 ms, wall = 43.976 / 43.976 ms [ 60 V] [CPU] (Slow)
  • 2150: [ 64, 18725, 8]x[ 64, 1, 128]=[ 18725, 1, 128] MUL_MAT ( 4) cpu = 29.000 / 7.250 ms, wall = 27.057 / 6.764 ms [ 60 KQ] [CPU]
  • 2154: [ 18725, 64, 8]x[ 18725, 1, 128]=[ 64, 1, 128] MUL_MAT ( 4) cpu = 8.000 / 2.000 ms, wall = 11.296 / 2.824 ms [ 60 KQV] [CPU]
  • 2164: [ 8192, 65040, 1]x[ 8192, 1, 1]=[ 65040, 1, 1] MUL_MAT ( 4) cpu = 7.000 / 1.750 ms, wall = 7.280 / 1.820 ms [ 0 result_lm_head] [GPUxQ]
  • 2153: [ 18725, 1, 128]x[ 47, 47, 47]=[ 18725, 1, 128] SOFT_MAX ( 4) cpu = 5.000 / 1.250 ms, wall = 5.425 / 1.356 ms [ 60 KQ_soft_max] [CPU]

The biggest hit is the CONT on V right after it is extracted from the cache (node 2144, about 44 ms on a single thread), and that should be avoidable:

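            // view of this layer's V cache: [head_dim, n_head_kv, n_past + N]
            struct ggml_tensor* V = ggml_permute(
                ctx0,
                ggml_view_3d(
                    ctx0,
                    kv_self.v,
                    head_dim, n_head_kv, n_past + N,
                    head_dim * sizeof_wtype,
                    head_dim * n_head_kv * sizeof_wtype,
                    il * n_ctx * ggml_element_size(kv_self.v) * n_head_kv * head_dim),
                1, 2, 0, 3);
            // permuted to [n_past + N, head_dim, n_head_kv]; ggml_cont then copies the whole history (the slow CONT)
            V = ggml_cont(ctx0, V);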
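
One possible way around the copy, sketched in plain C rather than with ggml calls: write each new token into the V cache already transposed, so the cache has the layout the KQV matmul wants and no per-step CONT over the full history is needed. Everything below (function names, toy sizes, `float` storage) is hypothetical and only illustrates the layout idea; it is not the code in this repo.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Current scheme: cache is [token][head][dim]. Every step copies the whole
 * history into [head][dim][token] order, which is what PERMUTE + CONT does.
 * Cost per step: n_tok * n_head_kv * head_dim element copies. */
void v_permute_cont(const float *cache, float *out,
                    size_t n_tok, size_t n_head_kv, size_t head_dim) {
    for (size_t h = 0; h < n_head_kv; h++)
        for (size_t d = 0; d < head_dim; d++)
            for (size_t t = 0; t < n_tok; t++)
                out[(h * head_dim + d) * n_tok + t] =
                    cache[(t * n_head_kv + h) * head_dim + d];
}

/* Alternative: store the new token already transposed, cache is
 * [head][dim][n_ctx]. Cost per step: n_head_kv * head_dim strided writes,
 * independent of how many tokens are already cached. */
void v_store_transposed(float *cache_t, const float *v_cur, size_t pos,
                        size_t n_ctx, size_t n_head_kv, size_t head_dim) {
    assert(pos < n_ctx);
    for (size_t h = 0; h < n_head_kv; h++)
        for (size_t d = 0; d < head_dim; d++)
            cache_t[(h * head_dim + d) * n_ctx + pos] = v_cur[h * head_dim + d];
}

/* Row (h, d) of the transposed cache: the first n_tok floats are the values
 * for all cached tokens, contiguous in memory, with a row stride of n_ctx. */
const float *v_row(const float *cache_t, size_t h, size_t d,
                   size_t n_ctx, size_t head_dim) {
    return cache_t + (h * head_dim + d) * n_ctx;
}

/* Self-check on toy sizes: both schemes must expose identical values. */
int v_layouts_agree(void) {
    enum { N_CTX = 16, N_HEAD_KV = 2, HEAD_DIM = 4, N_TOK = 5 };
    const size_t n_embd_kv = (size_t)N_HEAD_KV * HEAD_DIM;
    float cache[N_CTX * N_HEAD_KV * HEAD_DIM] = {0};
    float cache_t[N_HEAD_KV * HEAD_DIM * N_CTX] = {0};
    float out[N_HEAD_KV * HEAD_DIM * N_TOK];
    float v_cur[N_HEAD_KV * HEAD_DIM];

    for (size_t t = 0; t < N_TOK; t++) {
        for (size_t i = 0; i < n_embd_kv; i++)
            v_cur[i] = (float)(t * 100 + i);
        memcpy(cache + t * n_embd_kv, v_cur, sizeof v_cur);           /* old layout */
        v_store_transposed(cache_t, v_cur, t, N_CTX, N_HEAD_KV, HEAD_DIM);
    }

    v_permute_cont(cache, out, N_TOK, N_HEAD_KV, HEAD_DIM);
    for (size_t h = 0; h < N_HEAD_KV; h++)
        for (size_t d = 0; d < HEAD_DIM; d++) {
            const float *row = v_row(cache_t, h, d, N_CTX, HEAD_DIM);
            for (size_t t = 0; t < N_TOK; t++)
                if (out[(h * HEAD_DIM + d) * N_TOK + t] != row[t])
                    return 0;
        }
    return 1;
}
```

In ggml terms this would presumably mean a strided view on `kv_self.v` with a row stride of `n_ctx` elements instead of permute + cont. Whether the MUL_MAT path used here accepts such a source without copying internally would still need to be checked.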

Per-op totals for one token (values are in ms, despite the `_us` in the label):

perf_total_per_op_us[             ADD] =   1.483 ms
perf_total_per_op_us[             MUL] =   1.183 ms
perf_total_per_op_us[            GELU] =   1.878 ms
perf_total_per_op_us[            NORM] =   1.800 ms
perf_total_per_op_us[         MUL_MAT] = 2913.213 ms
perf_total_per_op_us[           SCALE] =  21.552 ms
perf_total_per_op_us[             CPY] =   1.307 ms
perf_total_per_op_us[            CONT] = 2676.875 ms
perf_total_per_op_us[            VIEW] =   0.440 ms
perf_total_per_op_us[         PERMUTE] =   0.240 ms
perf_total_per_op_us[        GET_ROWS] =   0.008 ms
perf_total_per_op_us[   DIAG_MASK_INF] =   0.331 ms
perf_total_per_op_us[        SOFT_MAX] = 335.385 ms
perf_total_per_op_us[            ROPE] =   2.865 ms
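
To put those totals in proportion, here is a small sketch that sums them and computes each op's share. The numbers are copied from the log above; the struct and function names are made up for illustration. By this arithmetic CONT is roughly 45% of the per-token time, nearly as much as all MUL_MATs together. The CONT total is also close to 60 times the 44 ms seen for a single layer, which would fit one such CONT per layer if the model has about 60 layers (an assumption based on the layer index in the log).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Per-op totals for one token, in ms, copied from the profiler output. */
struct op_time { const char *name; double ms; };

static const struct op_time perf_ops[] = {
    {"ADD", 1.483},        {"MUL", 1.183},      {"GELU", 1.878},
    {"NORM", 1.800},       {"MUL_MAT", 2913.213}, {"SCALE", 21.552},
    {"CPY", 1.307},        {"CONT", 2676.875},  {"VIEW", 0.440},
    {"PERMUTE", 0.240},    {"GET_ROWS", 0.008}, {"DIAG_MASK_INF", 0.331},
    {"SOFT_MAX", 335.385}, {"ROPE", 2.865},
};
#define N_PERF_OPS (sizeof perf_ops / sizeof perf_ops[0])

/* Sum of all per-op totals, in ms. */
double perf_total_ms(void) {
    double sum = 0.0;
    for (size_t i = 0; i < N_PERF_OPS; i++)
        sum += perf_ops[i].ms;
    return sum;
}

/* Fraction of the total spent in one op, or -1.0 if the name is unknown. */
double perf_op_share(const char *name) {
    assert(name != NULL);
    for (size_t i = 0; i < N_PERF_OPS; i++)
        if (strcmp(perf_ops[i].name, name) == 0)
            return perf_ops[i].ms / perf_total_ms();
    return -1.0;
}
```

For example, `perf_op_share("CONT")` gives the fraction of the token time that removing the V copy could save at best.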

Last layer:

 - 2125: [  8192,     1,   1]x[    47,    47,  47]=[  8192,     1,   1]             NORM   (  4) cpu =   0.000 /   0.000 ms, wall =   0.037 /   0.009 ms [ 60 node_2125] [CPU]
 - 2126: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              MUL   (  4) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.001 ms [ 60 node_2126] [CPU]
 - 2127: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              ADD   (  4) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.001 ms [ 60 node_2127] [CPU]
 - 2128: [  8192,  9216,   1]x[  8192,     1,   1]=[  9216,     1,   1]          MUL_MAT   (  4) cpu =   1.000 /   0.250 ms, wall =   1.037 /   0.259 ms [ 60 node_2128] [GPUxQ]
 - 2129: [  9216,     1,   1]x[    47,    47,  47]=[    64,     8,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 Kcur] [CPU]
 - 2130: [    64,     8,   1]x[     4,     1,   1]=[    64,     8,   1]             ROPE   (  4) cpu =   0.000 /   0.000 ms, wall =   0.007 /   0.002 ms [ 60 Kcur (view)] [CPU]
 - 2131: [614400000,     1,   1]x[    47,    47,  47]=[   512,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 k] [GPU]
 - 2132: [    64,     8,   1]x[   512,     1,   1]=[   512,     1,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.001 ms [ 60 k (copy of Kcur (view))] [CPU]
 - 2133: [  9216,     1,   1]x[    47,    47,  47]=[    64,     8,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 Vcur] [CPU]
 - 2134: [614400000,     1,   1]x[    47,    47,  47]=[   512,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 v] [GPU]
 - 2135: [    64,     8,   1]x[   512,     1,   1]=[   512,     1,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.000 ms [ 60 v (copy of Vcur)] [CPU]
 - 2136: [  8192,     1,   1]x[    47,    47,  47]=[  8192,     1,   1]             NORM   (  4) cpu =   0.000 /   0.000 ms, wall =   0.010 /   0.003 ms [ 60 node_2136] [CPU]
 - 2137: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              MUL   (  4) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.001 ms [ 60 node_2137] [CPU]
 - 2138: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              ADD   (  3) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.001 ms [ 60 inpFF] [CPU]
 - 2139: [  8192, 32768,   1]x[  8192,     1,   1]=[ 32768,     1,   1]          MUL_MAT   (  4) cpu =   4.000 /   1.000 ms, wall =   3.851 /   0.963 ms [ 60 inpFF*ff_up] [GPUxQ]
 - 2140: [ 32768,     1,   1]x[    47,    47,  47]=[ 32768,     1,   1]             GELU   (  4) cpu =   0.000 /   0.000 ms, wall =   0.033 /   0.008 ms [ 60 inpFF*ff_up (view)] [CPU]
 - 2141: [ 32768,  8192,   1]x[ 32768,     1,   1]=[  8192,     1,   1]          MUL_MAT   (  4) cpu =   3.000 /   0.750 ms, wall =   3.570 /   0.892 ms [ 60 gelu_cur*ff_down] [GPUxQ]
 - 2142: [614400000,     1,   1]x[    47,    47,  47]=[    64,     8,18725]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 cache_v (view)] [GPU]
 - 2143: [    64,     8,18725]x[    47,    47,  47]=[ 18725,    64,   8]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 cache_v (view) (permuted)] [CPU]
 - 2144: [ 18725,    64,   8]x[    47,    47,  47]=[ 18725,    64,   8]             CONT   (  1) cpu =  44.000 /  44.000 ms, wall =  43.976 /  43.976 ms [ 60 V] [CPU]  (Slow)
 - 2145: [614400000,     1,   1]x[    47,    47,  47]=[    64,     8,18725]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 cache_k (view)] [GPU]
 - 2146: [    64,     8,18725]x[    47,    47,  47]=[    64, 18725,   8]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 K] [CPU]
 - 2147: [  9216,     1,   1]x[    47,    47,  47]=[    64,   128,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 Qcur] [CPU]
 - 2148: [    64,   128,   1]x[     4,     1,   1]=[    64,   128,   1]             ROPE   (  4) cpu =   0.000 /   0.000 ms, wall =   0.036 /   0.009 ms [ 60 Qcur (view)] [CPU]
 - 2149: [    64,   128,   1]x[    47,    47,  47]=[    64,     1, 128]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 Q] [CPU]
 - 2150: [    64, 18725,   8]x[    64,     1, 128]=[ 18725,     1, 128]          MUL_MAT   (  4) cpu =  29.000 /   7.250 ms, wall =  27.057 /   6.764 ms [ 60 KQ] [CPU]
 - 2151: [ 18725,     1, 128]x[     1,     1,   1]=[ 18725,     1, 128]            SCALE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.370 /   0.370 ms [ 60 KQ_scaled] [CPU]
 - 2152: [ 18725,     1, 128]x[     2,     1,   1]=[ 18725,     1, 128]    DIAG_MASK_INF   (  4) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.001 ms [ 60 KQ_masked] [CPU]
 - 2153: [ 18725,     1, 128]x[    47,    47,  47]=[ 18725,     1, 128]         SOFT_MAX   (  4) cpu =   5.000 /   1.250 ms, wall =   5.425 /   1.356 ms [ 60 KQ_soft_max] [CPU]
 - 2154: [ 18725,    64,   8]x[ 18725,     1, 128]=[    64,     1, 128]          MUL_MAT   (  4) cpu =   8.000 /   2.000 ms, wall =  11.296 /   2.824 ms [ 60 KQV] [CPU]
 - 2155: [    64,     1, 128]x[    47,    47,  47]=[    64,   128,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 KQV_merged] [CPU]
 - 2156: [    64,   128,   1]x[  8192,     1,   1]=[  8192,     1,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.008 /   0.002 ms [ 60 KQV_merged (copy)] [CPU]
 - 2157: [  8192,  8192,   1]x[  8192,     1,   1]=[  8192,     1,   1]          MUL_MAT   (  4) cpu =   2.000 /   0.500 ms, wall =   1.139 /   0.285 ms [ 60 result_wo] [GPUxQ]
 - 2158: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.006 /   0.002 ms [ 60 attn_out] [CPU]
 - 2159: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              ADD   (  4) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.001 ms [ 60 node_2159] [CPU]
 - 2160: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              ADD   (  4) cpu =   0.000 /   0.000 ms, wall =   0.017 /   0.004 ms [ 60 inpFF_+_result_attn_out] [CPU]
 - 2161: [  8192,     1,   1]x[    47,    47,  47]=[  8192,     1,   1]             NORM   (  4) cpu =   0.000 /   0.000 ms, wall =   0.006 /   0.002 ms [  0 norm_cur] [CPU]
 - 2162: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              MUL   (  2) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.002 ms [  0 node_2162] [CPU]
 - 2163: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              ADD   (  3) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.001 ms [  0 result_norm] [CPU]
 - 2164: [  8192, 65040,   1]x[  8192,     1,   1]=[ 65040,     1,   1]          MUL_MAT   (  4) cpu =   7.000 /   1.750 ms, wall =   7.280 /   1.820 ms [  0 result_lm_head] [GPUxQ]
