Conversation

@gante
Contributor

@gante gante commented Jul 22, 2024

What does this PR do?

Same as #31999, but with llama being the only changed model.


Confirmed: slow tests are "passing" (same failures as main)
👉 RUN_SLOW=1 py.test -vv tests/models/llama/test_modeling_llama.py
👉 RUN_SLOW=1 py.test -vv tests/utils/test_cache_utils.py
👉 RUN_SLOW=1 py.test -vv tests/utils/test_modeling_rope_utils.py (new tests)


Throughput benchmarks: No changes vs previous main 💔

Comment on lines +83 to +84
Contributor Author

#31999, which propagates the changes to all models, will fix this.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Contributor

@amyeroberts amyeroberts left a comment

Thanks for all the work consolidating the rope logic!

Mostly some small questions and nits. My main comment is about the testing for all the compute functions.

Contributor

Are all of the arguments expected, even if optional?

Contributor Author

No, not at all :) The validation function exists to (among other things) detect incorrect parameter configurations.
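For illustration, a minimal sketch of what such a validation step can look like; the helper name, the accepted-key table, and the key sets below are assumptions for this example, not the PR's exact API:

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical table of the keys each rope type accepts (illustrative values only).
EXPECTED_ROPE_KEYS = {
    "default": {"rope_type"},
    "linear": {"rope_type", "factor"},
    "dynamic": {"rope_type", "factor"},
}


def validate_rope_parameters(rope_scaling: dict) -> None:
    """Warn on unknown keys and fail on unknown rope types instead of silently ignoring them."""
    rope_type = rope_scaling.get("rope_type", "default")
    allowed_keys = EXPECTED_ROPE_KEYS.get(rope_type)
    if allowed_keys is None:
        raise ValueError(f"Unknown rope_type: {rope_type!r}")
    unexpected_keys = set(rope_scaling) - allowed_keys
    if unexpected_keys:
        logger.warning(f"Unrecognized keys for rope_type={rope_type!r}: {sorted(unexpected_keys)}")
```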

Comment on lines +323 to +304
Contributor

All of these should be tested in a rope utils test module, including checks that they accept both rope_kwargs and a config, and that the two are equivalent.

Contributor Author

Added "rope_kwargs and config and their equivalence" ✅

Numerical checks will be a todo for the post-release follow-up PR (#31999)
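As a toy illustration of the equivalence check mentioned above (the init function, config fields, and values below are made up for this sketch and are not the actual test code):

```python
import torch


def _toy_linear_rope_init(base: float, dim: int, factor: float) -> torch.Tensor:
    # Stand-in for a rope init function: inverse frequencies divided by the scaling factor.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return inv_freq / factor


def test_config_and_kwargs_equivalence():
    config = {"rope_theta": 10000.0, "head_dim": 64, "factor": 2.0}
    rope_kwargs = {"base": 10000.0, "dim": 64, "factor": 2.0}
    from_config = _toy_linear_rope_init(config["rope_theta"], config["head_dim"], config["factor"])
    from_kwargs = _toy_linear_rope_init(**rope_kwargs)
    # Building from a config or from equivalent rope_kwargs must give identical results.
    torch.testing.assert_close(from_config, from_kwargs)
```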

Comment on lines 477 to 490
Contributor

This works and is consistent with the other checks above. We should really make sure to check the rescaling against specific numerical values in tests for the compute methods as well. This test tells us that things have changed, but not whether the change is in the right direction or of the right magnitude.

Contributor Author

Fair, but that is a test that requires some numerical diving. Given our release goals -- would it be okay for me to add a todo/open an issue?

Contributor

As long as it's actually done, then yes ;)
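For the record, the kind of numerical check being deferred might look roughly like this; the helper is a hypothetical stand-in, and a real test would pin down independently computed expected values for each compute function:

```python
import torch


def _toy_linear_scaling(inv_freq: torch.Tensor, factor: float) -> torch.Tensor:
    # Hypothetical stand-in for the compute function under test.
    return inv_freq / factor


def test_linear_scaling_direction_and_magnitude():
    base, dim, factor = 10000.0, 64, 4.0
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    scaled = _toy_linear_scaling(inv_freq, factor)
    # Direction: positional interpolation must lower every frequency...
    assert torch.all(scaled < inv_freq)
    # ...and magnitude: each frequency shrinks by exactly `factor`.
    torch.testing.assert_close(scaled, inv_freq / factor)
```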

Collaborator

@ArthurZucker ArthurZucker left a comment

LGTM

Collaborator

Suggested change
self.rope_init_fn = ROPE_INIT_FUNCTIONS[self.rope_type]

Should it be rope scaling rather than rope init? Nit!

Contributor Author

I'd rather go with init -- the default rope (i.e. not scaled) uses this path as well
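A rough sketch of why "init" fits: both the unscaled default and the scaled variants resolve through the same dispatch table. The entries and signatures below are illustrative, not the merged code:

```python
import torch


def _default_inv_freq(base: float, dim: int) -> torch.Tensor:
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))


def _linear_inv_freq(base: float, dim: int, factor: float) -> torch.Tensor:
    return _default_inv_freq(base, dim) / factor


ROPE_INIT_FUNCTIONS = {
    "default": _default_inv_freq,  # plain, unscaled RoPE also goes through this lookup
    "linear": _linear_inv_freq,
}

rope_type = "default"
rope_init_fn = ROPE_INIT_FUNCTIONS[rope_type]
inv_freq = rope_init_fn(base=10000.0, dim=64)
```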

Comment on lines 84 to 112
Collaborator

OK, this should leave enough freedom.

Collaborator

Though the fact that we don't have a nested config makes it simpler; the checks are run somewhere else, so it's pretty much equivalent.

Comment on lines 173 to 202
Collaborator

Nice to see that go away!

Contributor

@amyeroberts amyeroberts left a comment

Beautiful - thanks for adding and iterating!

Contributor

🤗

mig-mfreitas and others added 21 commits July 23, 2024 08:58
YaRN (Yet another RoPE extension method) combines the NTK-By-Parts
Interpolation and Attention Scaling methods, improving upon existing
RoPE interpolation methods for longer context window sizes.

Fine-tuned models maintain their original performance across benchmarks
while enabling efficient extrapolation and transfer learning for
quicker convergence, especially in compute-limited environments.

We implement YaRN and Dynamic-YaRN for the following list of models:

 - LLaMA
 - Falcon
 - GPT-NeoX
 - Olmo
 - Persimmon
 - Phi
 - StableLM
 - OpenLLaMA

New unit tests are added to assert YaRN's correct behavior on both
short and long sequence inputs.

For more details, please refer to https://arxiv.org/abs/2309.00071.

Co-authored-by: Miguel Almeida <[email protected]>
Iterate on YaRN implementation for LLaMA and remove diff from remaining
models for increased PR modularity.

This commit includes the following changes:
- Merge 'yarn_rope_scaling' and 'rope_scaling' dictionaries
- Remove unnecessary attributes ('extrapolation_factor' and 'finetuned')
  from YaRN classes
- Inherit 'forward' method in YaRN classes from superclass
- Rename 'yarn' method to 'compute_yarn_scaling'
- Extend YaRN tests with further assertions
- Fix style inconsistencies

Co-authored-by: Miguel Monte e Freitas <[email protected]>
- Comply with the tensor building logic introduced in huggingface#30743
- Add referencing to the optimized Attention Factor equation
- Remove Dynamic YaRN for a more agile deployment

Co-authored-by: mig-mfreitas <[email protected]>
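
For context, a simplified sketch of the YaRN recipe described in the commit message above: NTK-by-parts blends interpolated and unscaled inverse frequencies per dimension, and an attention ("temperature") scaling factor of 0.1 * ln(s) + 1 from the paper is applied on top. Names, defaults, and the exact blending below are illustrative and follow the paper rather than the merged implementation:

```python
import math

import torch


def yarn_inv_freq_sketch(base=10000.0, dim=128, factor=8.0,
                         original_max_position=4096, beta_fast=32, beta_slow=1):
    # Unscaled RoPE frequencies (extrapolation) and their fully interpolated counterpart.
    inv_freq_extrapolation = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    inv_freq_interpolation = inv_freq_extrapolation / factor

    # NTK-by-parts: find the dimension range whose wavelengths correspond to the
    # beta_fast / beta_slow rotation counts over the original context window.
    def correction_dim(num_rotations):
        return (dim * math.log(original_max_position / (num_rotations * 2 * math.pi))) / (2 * math.log(base))

    low = max(math.floor(correction_dim(beta_fast)), 0)
    high = min(math.ceil(correction_dim(beta_slow)), dim // 2 - 1)

    # Ramp from 0 (high-frequency dims, left untouched) to 1 (low-frequency dims, interpolated).
    ramp = torch.clamp((torch.arange(dim // 2).float() - low) / max(high - low, 1), 0, 1)
    inv_freq = inv_freq_interpolation * ramp + inv_freq_extrapolation * (1 - ramp)

    # Attention scaling from the YaRN paper: 0.1 * ln(s) + 1.
    attention_scaling = 0.1 * math.log(factor) + 1.0
    return inv_freq, attention_scaling
```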
@gante gante force-pushed the llama_rope_refactor branch from 1416972 to c824be0 on July 23, 2024 at 08:58
@gante
Contributor Author

gante commented Jul 23, 2024

Merged the YaRN PR (precursor); now merging this one as soon as CI goes green.

@amyeroberts
Contributor

The YaRN PR is failing code quality checks on main. Could you make sure to rebase and then run make fix-copies etc. here before merging?
