Motivation
Currently, cached tokens are reused in the server by doing `common_part(new_tokens, cached_tokens)` (a sketch of this prefix matching follows the examples below).
This is good in the situation where all incoming requests have the same prefix:
```
cached_tokens  a b c d e f g h i
new_tokens     a b c d e f x y z
reused_tokens  x x x x x x
```
However, if the input is shifted (for example, when old messages in the conversation are dropped), the number of reused tokens is reduced:
```
cached_tokens  a b c d e f g h i
new_tokens     a b c g h i k l m
reused_tokens  x x x
```
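For reference, here is a minimal sketch of the prefix matching that `common_part` performs; the signature is assumed for illustration and may not match the server code exactly:

```cpp
#include <vector>
#include "llama.h"

// number of leading tokens shared by the cached prompt and the new prompt
static size_t common_part(const std::vector<llama_token> & cached,
                          const std::vector<llama_token> & new_tokens) {
    size_t n = 0;
    while (n < cached.size() && n < new_tokens.size() && cached[n] == new_tokens[n]) {
        n++;
    }
    return n; // only these n tokens are reused; everything after is re-evaluated
}
```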
Proposal
My proposal is to detect such cases and use `llama_kv_cache_seq_rm` + `llama_kv_cache_seq_add` to shift the tokens in the cache accordingly:
```
cached_tokens  a b c d e f g h i
shifted_cache  a b c g h i
new_tokens     a b c g h i k l m
reused_tokens  x x x x x x
```
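A rough sketch of the idea, assuming a single contiguous chunk was dropped after the common prefix and the cache holds a single sequence; the function name and the naive search loop are illustrative only, not the actual server implementation (it reuses the `common_part` sketch above):

```cpp
static void try_shift_cache(llama_context * ctx, llama_seq_id seq_id,
                            const std::vector<llama_token> & cached,
                            const std::vector<llama_token> & new_tokens) {
    const size_t n_prefix = common_part(cached, new_tokens);

    // find the size of the dropped chunk: look for the longest match between
    // the cache tail and the new tokens right after the common prefix
    size_t skip = 0, best = 0;
    for (size_t s = 1; n_prefix + s < cached.size(); s++) {
        size_t m = 0;
        while (n_prefix + s + m < cached.size() &&
               n_prefix + m     < new_tokens.size() &&
               cached[n_prefix + s + m] == new_tokens[n_prefix + m]) {
            m++;
        }
        if (m > best) { best = m; skip = s; }
    }

    if (best == 0) {
        return; // nothing beyond the plain prefix can be reused
    }

    const llama_pos p0 = (llama_pos) n_prefix;          // start of dropped chunk
    const llama_pos p1 = (llama_pos) (n_prefix + skip); // end of dropped chunk

    llama_kv_cache_seq_rm (ctx, seq_id, p0, p1);             // drop [p0, p1)
    llama_kv_cache_seq_add(ctx, seq_id, p1, -1, -(p1 - p0)); // shift the tail left
}
```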
I have already tested this kind of behavior on my side. It works well, but the catch is that it only works with one single "conversation". Also, I have no idea if it has negative impacts when done frequently (i.e. fragmenting the cache?). @ggerganov