This repository was archived by the owner on Jun 24, 2024. It is now read-only.

Conversation

@LLukas22
Contributor

Closes #378.

Adds custom context scaling to llama, falcon, gpt-j, gpt-neox.

Adds an Option<ggml::CustomRoPEArguments> parameter to the ModelParameters.

Adds the optional --rope-base and --rope-scaling cli parameters.
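
To make the new surface concrete, here is a minimal, self-contained sketch of the shapes described above; the struct and field names are illustrative stand-ins for the `ggml::CustomRoPEArguments` / `ModelParameters` additions, not the crate's exact definitions, and the 4k→8k values are just an example:

```rust
// Illustrative stand-ins for the shapes this PR describes; the real
// `CustomRoPEArguments` lives in the `ggml` crate and its exact fields may differ.
#[derive(Debug, Clone, Copy)]
struct CustomRoPEArguments {
    frequency_base: usize, // corresponds to --rope-base (LLaMA default: 10_000)
    frequency_scale: f32,  // corresponds to --rope-scaling
}

#[derive(Debug, Default)]
struct ModelParameters {
    context_size: usize,
    rope_arguments: Option<CustomRoPEArguments>, // None => use the model's defaults
}

fn main() {
    // Request an 8k context from a model trained on 4k by halving the rotation speed.
    let params = ModelParameters {
        context_size: 8192,
        rope_arguments: Some(CustomRoPEArguments {
            frequency_base: 10_000,
            frequency_scale: 0.5,
        }),
    };
    println!("{params:?}");
}
```

Omitting the flags would presumably leave the option as `None` and keep the model's stock RoPE behaviour.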

Collaborator

@philpax philpax left a comment


Code looks good. What's the easiest way to test it?

@LLukas22
Contributor Author

  1. Sample command for an 8k context with LLaMA 2:
    cargo run --release --features cublas -- infer -a llama -m "C:\Users\lkreu\Downloads\llama-2-13b-chat.ggmlv3.q5_K_M.bin" -p "A llama riding a crab" --use-gpu --rope-scaling 0.5 --num-ctx-tokens 8192 --ignore-eos --stats

  2. Sit back and get some coffee ☕ (8192 tokens is a lot of tokens to generate)

A 16k context is also possible by setting rope-scaling to 0.25, but then I don't have enough VRAM to run inference on my GPU.
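
For reference, the scale factor appears to act as linear (SuperHOT-style) position interpolation: positions are multiplied by the factor before the rotary angles are computed, so 0.5 stretches LLaMA 2's 4096-token training window over 8192 positions and 0.25 over 16384. A small self-contained sketch of that arithmetic (head dimension and positions are illustrative):

```rust
// Minimal illustration of linear (SuperHOT-style) RoPE scaling: the rotary angle
// for position m and dimension pair i is theta_i = base^(-2i/d); scaling multiplies
// the position by the factor, so positions beyond the trained context map back
// into the range the model was trained on.
fn rope_angle(position: usize, pair_index: usize, head_dim: usize, base: f32, scale: f32) -> f32 {
    let theta = base.powf(-2.0 * pair_index as f32 / head_dim as f32);
    scale * position as f32 * theta
}

fn main() {
    let (base, head_dim) = (10_000.0_f32, 128);
    // With --rope-scaling 0.5, position 8191 lands at angle 4095.5 * theta_i,
    // i.e. inside the 0..4096 range a 4k-trained LLaMA 2 model has seen.
    let scaled = rope_angle(8191, 0, head_dim, base, 0.5);
    let unscaled = rope_angle(4095, 0, head_dim, base, 1.0);
    println!("scaled angle at 8191: {scaled:.1}, unscaled angle at 4095: {unscaled:.1}");
}
```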

@LLukas22
Contributor Author

The generated text gets repetitive after some time, but I guess that's a sampler/setting issue.
lama_story.txt

@philpax
Collaborator

philpax commented Jul 28, 2023

Great work! I just tested it with LLongMa-2; it's a bit finicky, but that shouldn't be a problem for us. I've revised the names a little to match llama.cpp / refer to frequency, but the rest is the same. Will merge once CI passes 🚀

@philpax philpax merged commit 9fe9f19 into rustformers:main Jul 28, 2023
@hhamud hhamud mentioned this pull request Aug 7, 2023


Successfully merging this pull request may close these issues.

Implement SuperHOT/interpolated RoPE support
