
imatrix: calculate activation-based statistics for new format (GGUF) imatrices #14891


Draft · wants to merge 1 commit into master

Conversation

EAddario
Contributor

Following up from #9400 and #12718, I've started tinkering with activation-based statistics, in addition to what's currently available via --show-statistics.

At the moment, I'm exploring three options, ranging from easy to implement and an OK approximation, to some assembly required but fairly accurate:

  1. L2 norm of activation difference: where larger values would suggest the tensor has significantly transformed the input with respect to the previous layer (see the sketch after this list).
  2. KL divergence reduction against a pre-computed logit file: a similar approach to the one described by nostalgebraist in logit lens, using logits saved beforehand (e.g. from a previous llama-perplexity --save-all-logits run).
  3. Given that llama-imatrix already generates the actual logits to compute PPL, use Thông T. Nguyễn's logit prism approach to calculate the exact contribution of each layer to the final logit scores.
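
As a rough illustration of option 1, here's a minimal NumPy sketch (my illustration, not llama-imatrix code; `x_in` and `x_out` stand for hypothetical captures of one layer's input and output activations):

```python
import numpy as np

def l2_activation_difference(x_in: np.ndarray, x_out: np.ndarray):
    """Option 1 sketch: per-token L2 norm of the activation difference.

    x_in, x_out: hypothetical captures of one layer's input and output
    activations, each of shape (n_tokens, n_embd).
    Returns the per-token norms plus their mean and variance, which could
    be reported alongside the existing --show-statistics output.
    """
    diff = x_out - x_in                    # how far the layer moved each activation
    norms = np.linalg.norm(diff, axis=-1)  # Euclidean distance per token
    return norms, norms.mean(), norms.var()
```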

Sharing with the readers, and in particular @compilade and @jukofyork, in case anyone's willing to double check assumptions and/or suggest alternative approaches I haven't considered.

EAddario marked this pull request as draft July 26, 2025 16:47
if (!stat.activations.empty()) {
const int32_t nact = (int32_t) stat.activations.size();
struct ggml_tensor * in_sum = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, nact / nmat, nmat);
ggml_format_name(in_sum, "%s.in_sum", name.c_str()); // ToDo: consider a better name. 'in_act' maybe?
Collaborator

I think in_sum is fine; it fits with the intention of in_sum2.

Comment on lines +41 to 42
std::vector<float> activations;
std::vector<float> values;
Collaborator

It might make sense to rename Stats.values to Stats.in_sum2, and Stats.activations to Stats.in_sum.

It should make it more obvious what maps to what in the resulting GGUF.

@jukofyork
Collaborator

jukofyork commented Jul 29, 2025

L2 norm of activation difference: where larger values would suggest the tensor has significantly transformed the input with respect to the previous layer.

If we had access to some numerical linear algebra routines then it would likely be possible to get much more interesting stats from this.

If you think about it:

  • The L2 norm of the activation difference is just measuring the Euclidean distance of the tip of the input vector vs the tip of the output vector.
  • The mean of these norms probably isn't that interesting (but could be used to test if a quant is systematically biasing or scaling the activations).
  • The variance of these norms is likely much more interesting and tells you about the "richness" of the transformation (indirectly - see below).

If instead of using the L2 norms of the differences, we construct the cross-covariance matrix of the paired samples, and then take the SVD of this:

  • The "richness" of the transformation (measured indirectly above) is actually to do with the distribution of the singular values, eg: there are many sets of activation differences with the same L2-norm, but those with a flat(er) distribution of singular values (vs a couple of large singular values) are likely to be much more important and interesting.
  • If you convert the SVD into a polar decomposition, then the scaling and rotational components will likely lead to other interesting insights, eg:

I suspect that the scaling part of the transformation is quite well handled by the current scalar quants, but the rotational component is likely not.

IIRC, some of the 1-2bit quants use vector quantization, and if so, these will likely handle the rotational components better and/or show quite different properties.
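
A rough NumPy sketch of that idea (my illustration, not code from this PR; X and Y are paired input/output activation samples for one tensor):

```python
import numpy as np

def cross_cov_analysis(X: np.ndarray, Y: np.ndarray):
    """Cross-covariance of paired activations X (inputs) and Y (outputs),
    both of shape (n_samples, n_embd), followed by SVD and a polar split.

    flatness ~ 1.0 means a flat singular-value spectrum ("rich" transform),
    flatness ~ 0.0 means a couple of directions dominate.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    C = Xc.T @ Yc / (len(X) - 1)           # cross-covariance matrix

    U, S, Vh = np.linalg.svd(C)
    p = S / S.sum()
    p = p[p > 0]
    flatness = float(-(p * np.log(p)).sum() / np.log(len(S)))  # normalised spectral entropy

    R = U @ Vh                             # rotational component (orthogonal)
    P = Vh.T @ np.diag(S) @ Vh             # scaling component (symmetric PSD), so C = R @ P
    return flatness, R, P
```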

I'm on my phone ATM so can't easily link them, but there have been several papers showing:

  1. Outlier activations in LLMs matter much more than simple rate–distortion theory would suggest/measure. This is likely related to the "flatness" of the singular values, where only rarely do some singular vector directions give a high dot-product with an input activation, but when they do, they add a significant/important contribution to the output.
  2. LLMs are much more rotational than people first realised, eg: there was [IIRC] a Microsoft paper where they constrained everything to be on the surface of a unit ball, and there are several PEFT methods that purely alter the rotational directions via orthogonal transformations.

@jukofyork
Collaborator

jukofyork commented Jul 29, 2025

If it's any use, there is code here to analyse the symmetrised cross-covariance matrix I used for the control vectors:

https://github.com/jukofyork/control-vectors/blob/main/direction_analyzer.py

The symmetrised version deliberately gets rid of the rotational components, as they can't be made use of if we are just looking for a single direction... You can actually do the same on the anti-symmetrised version (to look at the rotational components only), but eigendecomposition is less useful for this as it will return all-complex eigenvectors (hence why SVD makes more sense).
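
For reference, that split can be sketched like this (again my illustration, not code from direction_analyzer.py; C is a square cross-covariance matrix as above):

```python
import numpy as np

def sym_antisym_split(C: np.ndarray):
    """Split a square cross-covariance matrix into a symmetric part
    (scaling/stretching, no rotation) and an antisymmetric part
    (purely rotational)."""
    C_sym  = 0.5 * (C + C.T)
    C_anti = 0.5 * (C - C.T)

    evals, evecs = np.linalg.eigh(C_sym)             # real spectrum: candidate single directions
    S_rot = np.linalg.svd(C_anti, compute_uv=False)  # eigenvalues would be purely imaginary, so use SVD
    return evals, evecs, S_rot
```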

I should also add that, from my experiments using SVD on the tensors of LLMs (ie: ignoring the activations!), the early/final tensors (which appear to be very important and are bumped in bits in the quant routines here!) actually tend to have a less flat distribution of singular values themselves! So when you ignore the distribution of input activations, they generally appear to be doing something inherently "lower dimensional" than the middle tensors!? It would be interesting to investigate this whilst also looking at the activations...
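
One way to quantify that "flatness" on the weights alone (a hypothetical helper, not part of llama.cpp):

```python
import numpy as np

def spectral_flatness(W: np.ndarray) -> float:
    """Normalised entropy of a weight matrix's singular values:
    close to 1.0 = flat spectrum (high effective rank),
    close to 0.0 = a few directions dominate (low effective rank).
    Comparing early/final tensors against middle ones with this metric
    is one way to probe the observation above."""
    S = np.linalg.svd(W, compute_uv=False)
    p = S / S.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(len(S)))
```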
