A version of the llm-claude-3 plugin that supports prompt caching. Because it is forked from an older version of the plugin, this branch does not support images or attachments.
Install this plugin in the same environment as LLM.
git clone -b prompt-caching https://github.com/irthomasthomas/llm-claude-3-caching.git
cd llm-claude-3-caching
llm install -e .
First, set an API key for Claude 3:
llm keys set claude
# Paste key here
Run `llm models` to list the available models, and `llm models --options` to include a list of their options.
Run prompts like this:
llm -m claude-3.5-sonnet-cache 'Fun facts about pelicans' -o cache_prompt 1
llm -m claude-3-opus-cache 'Fun facts about squirrels' -o cache_prompt 1
llm -m claude-3-sonnet-cache 'Fun facts about walruses' -o cache_prompt 1
llm -m claude-3-haiku-cache 'Fun facts about armadillos' -o cache_prompt 1
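If you use LLM's Python API rather than the CLI, the same option can be passed as a keyword argument. The following is a minimal sketch, assuming the plugin's `cache_prompt` option maps to a keyword argument (as LLM plugin options normally do) and that a key has already been stored with `llm keys set claude`:

```python
import llm

# Assumes the key was stored with `llm keys set claude`; otherwise set model.key explicitly.
model = llm.get_model("claude-3.5-sonnet-cache")

# Plugin options are passed as keyword arguments in LLM's Python API.
response = model.prompt("Fun facts about pelicans", cache_prompt=True)
print(response.text())
```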
This plugin now supports Anthropic's Prompt Caching feature, which can significantly improve performance and reduce costs for certain types of queries.
Prompt Caching allows you to store and reuse context within your prompt. This is especially useful for:
- Prompts with many examples
- Large amounts of context or background information
- Repetitive tasks with consistent instructions
- Long multi-turn conversations
The cache has a 5-minute lifetime, refreshed each time the cached content is used.
To enable Prompt Caching, use the following options:
- `-o cache_prompt 1`: Enables caching for the user prompt.
- `-o cache_system 1`: Enables caching for the system prompt.
Example:
llm -m claude-3-sonnet -o cache_prompt 1 'Analyze this text: [long text here]'
llm -m claude-3-sonnet -o cache_prompt 1 -o cache_system 1 'Analyze this text: [long text here]' --system '[long system prompt here]'
llm -c # continues from cached prompt, if available
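The `-c` flag continues the most recent conversation. A rough Python-API equivalent is a conversation object, sketched below under the same assumptions as the earlier snippet; cache reuse still depends on the follow-up arriving within the cache lifetime:

```python
import llm

model = llm.get_model("claude-3.5-sonnet-cache")
conversation = model.conversation()

# The first prompt writes the long prefix to the cache.
conversation.prompt("Analyze this text: [long text here]", cache_prompt=True).text()

# A follow-up inside the ~5-minute window can be served from the cached prefix.
print(conversation.prompt("Now summarize the key points.", cache_prompt=True).text())
```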
Based on comprehensive testing across all models:
| Model | Cost Reduction Range | Average Reduction |
|---|---|---|
| Claude 3 Haiku | 78.1% - 99.1% | 92.0% |
| Claude 3 Opus | 78.1% - 99.0% | 91.9% |
| Claude 3.5 Sonnet | 91.2% - 99.0% | 95.2% |
Example cost reductions (a rough cost model sketch follows this list):
- Short queries (e.g., "What is the capital of France?")
  - Haiku: $0.000016 → $0.000003 (78.1% reduction)
  - Opus: $0.000960 → $0.000210 (78.1% reduction)
  - Sonnet: $0.000477 → $0.000042 (91.2% reduction)
- Detailed queries (e.g., "Tell me about the Eiffel Tower")
  - Haiku: $0.000428 → $0.000004 (99.1% reduction)
  - Opus: $0.024840 → $0.000240 (99.0% reduction)
  - Sonnet: $0.004653 → $0.000048 (99.0% reduction)
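These figures are consistent with Anthropic's published cache pricing at the time of writing: cache writes are billed at roughly 1.25x the base input-token price, and cache reads at roughly 0.1x. The sketch below is an illustrative cost model only; the per-token price is a placeholder and output-token costs are ignored:

```python
# Illustrative cost model for prompt caching (input tokens only).
BASE_INPUT_PRICE = 3.00 / 1_000_000  # placeholder: USD per input token (e.g. a Sonnet-class model)
CACHE_WRITE_MULTIPLIER = 1.25        # cache writes cost ~25% more than normal input tokens
CACHE_READ_MULTIPLIER = 0.10         # cache hits cost ~10% of normal input tokens

def uncached_cost(prompt_tokens: int) -> float:
    """Cost of sending the full prompt with no caching."""
    return prompt_tokens * BASE_INPUT_PRICE

def cached_cost(cached_prefix_tokens: int, fresh_tokens: int, cache_hit: bool) -> float:
    """Cost with a cached prefix: the first call pays the write premium, later calls the read rate."""
    multiplier = CACHE_READ_MULTIPLIER if cache_hit else CACHE_WRITE_MULTIPLIER
    return cached_prefix_tokens * BASE_INPUT_PRICE * multiplier + fresh_tokens * BASE_INPUT_PRICE

full = uncached_cost(1500)
hit = cached_cost(1450, 50, cache_hit=True)
print(f"uncached: ${full:.6f}  cache hit: ${hit:.6f}  reduction: {100 * (1 - hit / full):.1f}%")
```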
Additional benefits:
- Reduced latency: Improved response times by over 2x
- Improved consistency: Maintained response quality across cached queries
- Zero output token costs for cached responses
How it works (sketched in code below):
- The system checks whether the prompt prefix is already cached from a recent query.
- If it is, the cached version is used, reducing processing time and costs.
- Otherwise, the full prompt is processed and the prefix is cached for future use.
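Under the hood this relies on Anthropic's prompt-caching API, where a `cache_control` breakpoint marks the reusable prefix. The sketch below shows the raw `anthropic` SDK call rather than this plugin's actual code, with a hypothetical `reference.txt` standing in for the large reusable context:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

long_context = open("reference.txt").read()  # hypothetical large, reusable context

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=512,
    system=[
        {"type": "text", "text": "You answer questions about the reference document."},
        {
            "type": "text",
            "text": long_context,
            # Everything up to and including this block becomes the cacheable prefix.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "Summarize section 2."}],
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
)
print(response.content[0].text)
```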
Prompt Caching is currently supported on:
- Claude 3.5 Sonnet
- Claude 3.5 Haiku
- Claude 3 Haiku
- Claude 3 Opus
You can monitor cache performance using these fields in the API response (see the helper sketch below):
- `cache_creation_input_tokens`: Number of tokens written to the cache when creating a new entry.
- `cache_read_input_tokens`: Number of tokens retrieved from the cache for this request.
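For example, a hypothetical helper like the one below could pull these fields off an `anthropic` SDK response object, where they are exposed on `response.usage`:

```python
def report_cache_usage(response) -> None:
    """Print the prompt-caching metrics from an Anthropic Messages API response."""
    usage = response.usage
    print("cache_creation_input_tokens:", getattr(usage, "cache_creation_input_tokens", None))
    print("cache_read_input_tokens:", getattr(usage, "cache_read_input_tokens", None))
```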
To set up this plugin locally, first check out the code. Then create a new virtual environment:
cd llm-claude-3-caching
python3 -m venv venv
source venv/bin/activate
Now install the dependencies and test dependencies:
llm install -e '.[test]'
To run the tests:
pytest