Skip to content

Conversation

CokeDong
Copy link
Contributor

@CokeDong CokeDong commented Aug 30, 2023

add tgs metrics for trainer. the motivation is that: current speed_metrics only consider train_samples_per_second. but the length of each example is not the same(especailly cutting_off increase). this pr introduce tgs metrics, which take tokens into considerations.

@CokeDong CokeDong changed the title Add tgs metrics Add tgs speed metrics Aug 30, 2023
@amyeroberts
Copy link
Contributor

cc @muellerzr @pacman100

Comment on lines 1161 to 1168
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better way we can perhaps do this by checking if max_steps is none and its streaming dataloader for instance? I worry about speed slowdowns if users for instance have a large dataset across the dataloaders.

I'd prefer this as perhaps an opt-in via TrainingArguments, noting how it will iterate over the dataloader to get these.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
tks = batch["input_ids"].size(0) * batch["input_ids"].size(1)
tokens = batch["input_ids"].size(0) * batch["input_ids"].size(1)

Full words please :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

Copy link
Contributor

@muellerzr muellerzr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! having tokens per second would be quite handy, though I did make a nit on trying to save on time/streaming datasets. We may want this as an opt-in instead.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

Copy link
Contributor

@pacman100 pacman100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @CokeDong, Thank for you adding tokens_per_second log metric, helpful. Left a comment. I agree with Zach that iterating over the dataloader needs to be avoided as it will slow down the whole training.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are calculating max_steps on line 1614. Why can't we use that to avoid iterating through the entire dataloader?

Copy link
Contributor Author

@CokeDong CokeDong Aug 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

max_steps doesnot contains the num of tokens infomation, so we call num_tokens on line 1617 .

yes, the iteration will slow down the whole training ,especally big datasets. Currently i am trying moving the approximation of tokens per second into each training step so as to avoild duplicated dataloading process. maybe that will solve the common concerns.

Copy link
Contributor

@muellerzr muellerzr Aug 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that's not possible (which I don't think it is without some effort) I'd rather see this as an opt-in. Because the other issue with this is users can use the Trainer to train any dataset, not just text (though that's the most common). So again, I'd rather see this as an opt-in specifying more in TrainingArguments.

Copy link
Contributor Author

@CokeDong CokeDong Aug 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah.i got u:-) an opt-in maybe the most direct way for user to choose open tgs or not. will do that.

@CokeDong CokeDong changed the title Add tgs speed metrics [WIP] Add tgs speed metrics Sep 1, 2023
@CokeDong
Copy link
Contributor Author

CokeDong commented Sep 4, 2023

@muellerzr @pacman100 PTAL, thx

@CokeDong CokeDong changed the title [WIP] Add tgs speed metrics Add tgs speed metrics Sep 4, 2023
Copy link
Contributor

@muellerzr muellerzr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! We're getting much, much closer. I've added a naming nit, cc @amyeroberts if you have some ideas on other naming conventions, and how this looks to you :) Let's avoid acronyms/shorthand as much as possible.

Copy link
Contributor

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work adding this @CokeDong!

Overall code looks good - as @muellerzr suggests we should rename some of the variables, and slightly rework the logic to make the code clearer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

args.tgs_metrics shouldn't be None - it should be either True or False

Suggested change
num_tokens=None if args.tgs_metrics is None else num_train_tokens / args.world_size,
num_tokens=None if not args.tgs_metrics else num_train_tokens / args.world_size,

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we set this to None here, then we don't need the conditional logic on L2016

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, refacted

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make this a bit clearer

Suggested change
* self.num_tokens(train_dataloader, True)
* self.num_tokens(train_dataloader, max_steps=True)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* self.num_tokens(train_dataloader, True)
* self.num_tokens(train_dataloader, max_steps=True)

@CokeDong
Copy link
Contributor Author

CokeDong commented Sep 7, 2023

https://app.circleci.com/jobs/github/huggingface/transformers/913233 test_hub failed, seems not related with current PR :(

@muellerzr
Copy link
Contributor

@CokeDong please rebase with the main branch of transformers, this should ensure it passes :)

@CokeDong CokeDong force-pushed the dkx_add_tgs_metrics branch from b2e1d81 to 3c97d2d Compare September 7, 2023 12:25
@muellerzr
Copy link
Contributor

BTW @CokeDong, you don't have to do force-pushes if you're worried about the commit-bloat post-merge, in transformers we squash when merging.

@CokeDong
Copy link
Contributor Author

CokeDong commented Sep 7, 2023

BTW @CokeDong, you don't have to do force-pushes if you're worried about the commit-bloat post-merge, in transformers we squash when merging.

Got that

Copy link
Contributor

@muellerzr muellerzr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for iterating! LG2M

Copy link
Contributor

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding and iterating on this!

@amyeroberts amyeroberts merged commit 3744126 into huggingface:main Sep 7, 2023
parambharat pushed a commit to parambharat/transformers that referenced this pull request Sep 26, 2023
* Add tgs metrics

* bugfix and black formatting

* workaround for tokens counting

* formating and bugfix

* Fix

* Add opt-in for tgs metrics

* make style and fix error

* Fix doc

* fix docbuild

* hf-doc-build

* fix

* test

* Update src/transformers/training_args.py

renaming

Co-authored-by: Zach Mueller <[email protected]>

* Update src/transformers/training_args.py

renaming

Co-authored-by: Zach Mueller <[email protected]>

* Fix some symbol

* test

* Update src/transformers/trainer_utils.py

match nameing patterns

Co-authored-by: amyeroberts <[email protected]>

* Update src/transformers/training_args.py

Co-authored-by: amyeroberts <[email protected]>

* Update src/transformers/trainer.py

nice

Co-authored-by: amyeroberts <[email protected]>

* Fix reviews

* Fix

* Fix black

---------

Co-authored-by: Zach Mueller <[email protected]>
Co-authored-by: amyeroberts <[email protected]>
@geronimi73
Copy link

thanks for this great feature!
quickly tried it with accelerate on 1-3 GPUs and the results confuse me 🤔
t/s should be higher with 3 GPUs vs 1 GPU, right?

W B Chart 13_11_2023, 06_25_18
W B Chart 13_11_2023, 06_26_15

@CokeDong
Copy link
Contributor Author

thanks for this great feature! quickly tried it with accelerate on 1-3 GPUs and the results confuse me 🤔 t/s should be higher with 3 GPUs vs 1 GPU, right?

W B Chart 13_11_2023, 06_25_18 W B Chart 13_11_2023, 06_26_15

hi, tokens/sec/gpu(tgs) meatures tokens throughput capability per device.

@geronimi73
Copy link

per device

now it makes sense, thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants