imatrix: calculate activation-based statistics for new format (GGUF) imatrices #14891
if (!stat.activations.empty()) {
    const int32_t nact = (int32_t) stat.activations.size();
    struct ggml_tensor * in_sum = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, nact / nmat, nmat);
    ggml_format_name(in_sum, "%s.in_sum", name.c_str()); // ToDo: consider a better name. 'in_act' maybe?
I think in_sum is fine; this fits with the intention of in_sum2.
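For readers following the shape in the snippet above, here is a minimal sketch of how a flat buffer of nact values lines up with a 2D tensor of shape (nact / nmat, nmat). It assumes the buffer is laid out one matrix after another and that nmat counts the stacked matrices; flat_at is a hypothetical helper for illustration, not a function from llama.cpp or ggml.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of llama.cpp): read one value from a flat
// buffer of accumulated statistics, viewed as `nmat` rows of
// `flat.size() / nmat` channels each. This mirrors a 2D tensor whose first
// dimension (ne0 = nact / nmat) is the contiguous one.
float flat_at(const std::vector<float> & flat, size_t nmat, size_t mat, size_t channel) {
    assert(nmat > 0 && flat.size() % nmat == 0);
    const size_t row_size = flat.size() / nmat; // channels per matrix (ne0)
    assert(mat < nmat && channel < row_size);
    return flat[mat * row_size + channel];      // row-major: matrix index is the slow dimension
}
```

With six values and nmat = 2, each matrix owns three consecutive channels, so flat_at(v, 2, 1, 0) reads the fourth element of the buffer.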
std::vector<float> activations;
std::vector<float> values;
It might make sense to rename Stats.values to Stats.in_sum2, and Stats.activations to Stats.in_sum.
It should make it more obvious what maps to what in the resulting GGUF.
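To illustrate why the two names pair up, here is a minimal sketch, assuming in_sum holds per-channel sums of activations and in_sum2 holds per-channel sums of squared activations. ChannelStats and its methods are hypothetical and are not the Stats struct from the PR; only the two field names come from the suggestion above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative only: the field names follow the suggested rename,
// everything else is an assumption made for this sketch.
struct ChannelStats {
    std::vector<float> in_sum;   // per-channel sum of activations
    std::vector<float> in_sum2;  // per-channel sum of squared activations
    size_t count = 0;            // number of accumulated activation vectors

    explicit ChannelStats(size_t n_channels) : in_sum(n_channels, 0.0f), in_sum2(n_channels, 0.0f) {}

    // Accumulate one activation vector.
    void add(const std::vector<float> & x) {
        assert(x.size() == in_sum.size());
        for (size_t i = 0; i < x.size(); ++i) {
            in_sum[i]  += x[i];
            in_sum2[i] += x[i] * x[i];
        }
        ++count;
    }

    // Mean activation of channel i: E[x].
    float mean(size_t i) const {
        assert(count > 0 && i < in_sum.size());
        return in_sum[i] / static_cast<float>(count);
    }

    // Population variance of channel i: E[x^2] - E[x]^2.
    float variance(size_t i) const {
        const float m = mean(i);
        return in_sum2[i] / static_cast<float>(count) - m * m;
    }
};
```

Keeping both running sums is what makes the mean and the variance recoverable afterwards without storing the individual activations.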
If we had access to some numerical linear algebra routines then it would likely be possible to get much more interesting stats from this. If you think about it:
If instead of using the L2 norms of the differences, we construct the cross-covariance matrix of the paired samples, and then take the SVD of this:
I suspect that the scaling part of the transformation is quite well handled by the current scalar quants, but the rotational component is likely not. IIRC, some of the 1-2bit quants use vector quantization, and if so, these will likely handle the rotational components better and/or show quite different properties. I'm on my phone ATM so can't easily link them, but there have been several papers showing:
If it's any use, there is code here to analyse the symmetrised cross-covariance matrix I used for the control vectors: https://github.com/jukofyork/control-vectors/blob/main/direction_analyzer.py

The symmetrised version deliberately gets rid of the rotational components, as they can't be made use of if we are just looking for a single direction... You can actually do the same on the anti-symmetrised version (to look at the rotational components only), but eigendecomposition is less useful for this as it will return all complex vectors (hence why SVD makes more sense).

I should also add that from my experiments using SVD on the tensors (ie: ignoring the activations!) of LLMs, it often appears that the early/final tensors (which actually appear to be very important and are bumped in bits in the quant routines here!) tend to have a less flat distribution of singular values themselves! So when you ignore the distribution of input activations, they generally appear to be doing something inherently "lower dimensional" than the middle tensors!? It would be interesting to investigate this whilst also looking at the activations...
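As a rough illustration of the symmetrised / anti-symmetrised split described above, here is a minimal self-contained sketch. It builds the cross-covariance matrix of paired samples and splits it into its symmetric part (C + C^T) / 2 and its anti-symmetric part (C - C^T) / 2. The names Matrix, cross_covariance and split_sym_antisym are hypothetical, the exact symmetrisation used in the linked script may differ, and the SVD step is left out because it would need a linear algebra library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Cross-covariance of paired samples (one sample per row):
// C[i][j] = mean_k (x_k[i] - mean_x[i]) * (y_k[j] - mean_y[j])
Matrix cross_covariance(const Matrix & xs, const Matrix & ys) {
    assert(!xs.empty() && xs.size() == ys.size());
    const size_t n  = xs.size();
    const size_t dx = xs[0].size();
    const size_t dy = ys[0].size();
    const double inv_n = 1.0 / static_cast<double>(n);

    std::vector<double> mx(dx, 0.0), my(dy, 0.0);
    for (size_t k = 0; k < n; ++k) {
        assert(xs[k].size() == dx && ys[k].size() == dy);
        for (size_t i = 0; i < dx; ++i) mx[i] += xs[k][i] * inv_n;
        for (size_t j = 0; j < dy; ++j) my[j] += ys[k][j] * inv_n;
    }

    Matrix c(dx, std::vector<double>(dy, 0.0));
    for (size_t k = 0; k < n; ++k)
        for (size_t i = 0; i < dx; ++i)
            for (size_t j = 0; j < dy; ++j)
                c[i][j] += (xs[k][i] - mx[i]) * (ys[k][j] - my[j]) * inv_n;
    return c;
}

// Split a square matrix into symmetric and anti-symmetric parts, so that
// c = sym + anti, sym = sym^T and anti = -anti^T.
void split_sym_antisym(const Matrix & c, Matrix & sym, Matrix & anti) {
    const size_t d = c.size();
    sym.assign(d, std::vector<double>(d, 0.0));
    anti.assign(d, std::vector<double>(d, 0.0));
    for (size_t i = 0; i < d; ++i) {
        assert(c[i].size() == d); // only defined for square matrices
        for (size_t j = 0; j < d; ++j) {
            sym[i][j]  = 0.5 * (c[i][j] + c[j][i]);
            anti[i][j] = 0.5 * (c[i][j] - c[j][i]);
        }
    }
}
```

If the second set of samples is just the first set rotated by 90 degrees in 2D, the symmetric part comes out as all zeros and the whole relationship sits in the anti-symmetric part, which is the sense in which symmetrising discards the rotational component.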
Following up from #9400 and #12718, I've started tinkering with activation-based statistics, in addition to what's currently available via --show-statistics.
At the moment, I'm exploring three options, going from easy to implement with an OK approximation, to some assembly required but fairly accurate:
llama-perplexity --save-all-logits run)
llama-imatrix already generates the actual logits to compute PPL, use Thông T. Nguyễn's logit prism approach to calculate the exact contribution of each layer to the final logit scores
Sharing with the readers, and in particular @compilade and @jukofyork, in case anyone's willing to double check assumptions and/or suggest alternative approaches I haven't considered.