Motivation
Currently, cached tokens are reused in the server by doing `common_part(new_tokens, cached_tokens)` (a sketch of this prefix matching follows the examples below).
This is good in the situation where all incoming requests have the same prefix:
```
cached_tokens  a b c d e f g h i
new_tokens     a b c d e f x y z
reused_tokens  x x x x x x
```
However, if the input is shifted (for example, when old messages in the conversation are dropped), the number of reused tokens is reduced:
```
cached_tokens  a b c d e f g h i
new_tokens     a b c g h i k l m
reused_tokens  x x x
```
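For reference, here is a minimal sketch of the prefix matching that `common_part` performs; the signature is assumed for illustration and may not match the server code exactly:

```cpp
#include <vector>
#include "llama.h"

// number of leading tokens shared by the cached prompt and the new prompt
static size_t common_part(const std::vector<llama_token> & cached,
                          const std::vector<llama_token> & new_tokens) {
    size_t n = 0;
    while (n < cached.size() && n < new_tokens.size() && cached[n] == new_tokens[n]) {
        n++;
    }
    return n; // only these n tokens are reused; everything after is re-evaluated
}
```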
Proposal
My proposal is to detect such cases and use `llama_kv_cache_seq_rm` + `llama_kv_cache_seq_add` to shift the tokens in the cache accordingly:
```
cached_tokens  a b c d e f g h i
shifted_cache  a b c g h i
new_tokens     a b c g h i k l m
reused_tokens  x x x x x x
```
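A rough sketch of the idea, assuming a single contiguous chunk was dropped after the common prefix and the cache holds a single sequence; the function name and the naive search loop are illustrative only, not the actual server implementation (it reuses the `common_part` sketch above):

```cpp
static void try_shift_cache(llama_context * ctx, llama_seq_id seq_id,
                            const std::vector<llama_token> & cached,
                            const std::vector<llama_token> & new_tokens) {
    const size_t n_prefix = common_part(cached, new_tokens);

    // find the size of the dropped chunk: look for the longest match between
    // the cache tail and the new tokens right after the common prefix
    size_t skip = 0, best = 0;
    for (size_t s = 1; n_prefix + s < cached.size(); s++) {
        size_t m = 0;
        while (n_prefix + s + m < cached.size() &&
               n_prefix + m     < new_tokens.size() &&
               cached[n_prefix + s + m] == new_tokens[n_prefix + m]) {
            m++;
        }
        if (m > best) { best = m; skip = s; }
    }

    if (best == 0) {
        return; // nothing beyond the plain prefix can be reused
    }

    const llama_pos p0 = (llama_pos) n_prefix;          // start of dropped chunk
    const llama_pos p1 = (llama_pos) (n_prefix + skip); // end of dropped chunk

    llama_kv_cache_seq_rm (ctx, seq_id, p0, p1);             // drop [p0, p1)
    llama_kv_cache_seq_add(ctx, seq_id, p1, -1, -(p1 - p0)); // shift the tail left
}
```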
I have already tested this kind of behavior on my side. It works well, but the catch is that it only works with one single "conversation". Also, I have no idea if it has negative impacts when done frequently (i.e. fragmenting the cache?). @ggerganov