I've been really enjoying using both llama-cpp-python and the original llama.cpp. These are amazing projects, especially for folks without massively powerful GPUs.
There's a really nice feature that was implemented in llama.cpp in January to allow self-extend (à la LongLM's approach). It works well in both the main and server examples, and plenty of folks have noted self-extend is especially useful with Mistral/Mixtral, Gemma, and Phi-2.
It appears someone else might have been asking about this earlier here. Right now I have to move in and out of Python whenever I want to run summarization on a 'just-slightly-too-long' article with self-extend. Would you consider implementing self-extend as an option in llama-cpp-python?
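
For reference, this is roughly the workaround I'm using today: shelling out from Python to the compiled llama.cpp main binary and passing its self-extend (group-attention) options. The binary path, model file, and flag names (`--grp-attn-n` / `--grp-attn-w`) are from my local build, so please double-check them against your llama.cpp version; the imagined llama-cpp-python parameters at the end are purely hypothetical.

```python
import subprocess

# Current workaround: call the llama.cpp `main` binary directly so I can use
# its self-extend (group-attention) options from a Python workflow.
# Paths and flag names below match my local build; adjust for yours.
LLAMA_MAIN = "./llama.cpp/main"  # path to the compiled main example
MODEL_PATH = "./models/mistral-7b-instruct.Q4_K_M.gguf"  # example model file

def summarize_with_self_extend(article_path: str) -> str:
    """Summarize a slightly-too-long article via llama.cpp with self-extend."""
    cmd = [
        LLAMA_MAIN,
        "-m", MODEL_PATH,
        "-c", "8192",            # extended context to run the model at
        "--grp-attn-n", "2",     # group-attention factor (self-extend)
        "--grp-attn-w", "2048",  # group-attention width (self-extend)
        "-n", "512",             # max tokens to generate
        "-f", article_path,      # prompt file: article text plus instruction
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout

# What I'd love instead is something first-class in llama-cpp-python,
# e.g. (purely hypothetical parameter names):
#   llm = Llama(model_path=MODEL_PATH, grp_attn_n=2, grp_attn_w=2048)
```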