Add `tgs` speed metrics #25858

CokeDong · 2023-08-30T09:02:14Z

Add tgs metrics for the Trainer. The motivation: the current speed_metrics only considers train_samples_per_second, but examples are not all the same length (especially as cutting_off increases). This PR introduces tgs metrics, which take tokens into consideration.

amyeroberts · 2023-08-30T09:07:06Z

cc @muellerzr @pacman100

muellerzr · 2023-08-30T16:47:10Z

src/transformers/trainer.py

Is there a better way we can do this, perhaps by checking whether max_steps is None and whether it's a streaming dataloader, for instance? I worry about slowdowns if users have, for instance, a large dataset across the dataloaders.

I'd prefer this as perhaps an opt-in via TrainingArguments, noting how it will iterate over the dataloader to get these.
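The concern here is that counting tokens up front costs a full pass over the training dataloader. Below is a minimal sketch of what an opt-in guard could look like, assuming each batch is a dictionary with an `input_ids` entry shaped `[batch_size, sequence_length]`. Plain nested lists are used so the snippet runs without PyTorch, and `count_train_tokens` and `include_tokens` are hypothetical names for illustration, not the Trainer's API.

```python
from typing import Iterable, Mapping, Optional, Sequence


def count_train_tokens(
    batches: Iterable[Mapping[str, Sequence[Sequence[int]]]],
    include_tokens: bool = False,
) -> Optional[int]:
    """Return the total token count, or None when the metric is not requested."""
    if not include_tokens:
        # Opt-out path: the batches are never touched, so no extra pass is paid for.
        return None
    total = 0
    for batch in batches:
        input_ids = batch.get("input_ids")
        if input_ids is None:
            # Non-text datasets may have no `input_ids`; skip rather than fail.
            continue
        # For a padded, rectangular batch this equals size(0) * size(1).
        total += sum(len(row) for row in input_ids)
    return total


batches = [
    {"input_ids": [[1, 2, 3], [4, 5, 0]]},
    {"input_ids": [[6, 7, 8, 9]]},
]
print(count_train_tokens(batches))                       # None
print(count_train_tokens(batches, include_tokens=True))  # 10
```

With the flag off the function returns immediately, which is the behaviour the opt-in is meant to guarantee for large or streaming datasets.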

muellerzr · 2023-08-30T16:47:37Z

src/transformers/trainer.py

Suggested change

tks = batch["input_ids"].size(0) * batch["input_ids"].size(1)

tokens = batch["input_ids"].size(0) * batch["input_ids"].size(1)

Full words please :)

muellerzr

Thanks! Having tokens per second would be quite handy, though I did leave a nit about saving time with large or streaming datasets. We may want this as an opt-in instead.

HuggingFaceDocBuilderDev · 2023-08-30T17:09:29Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

pacman100

Hello @CokeDong, thank you for adding the tokens_per_second log metric, it is helpful. Left a comment. I agree with Zach that iterating over the dataloader needs to be avoided, as it will slow down the whole training.

pacman100 · 2023-08-31T10:49:35Z

src/transformers/trainer.py

We are calculating max_steps on line 1614. Why can't we use that to avoid iterating through the entire dataloader?

max_steps does not contain the number-of-tokens information, so we call num_tokens on line 1617.

Yes, the iteration will slow down the whole training, especially on big datasets. I am currently trying to move the tokens-per-second approximation into each training step to avoid a duplicated data-loading pass. Maybe that will address the common concerns.

If that's not possible (which I don't think it is without some effort) I'd rather see this as an opt-in. Because the other issue with this is users can use the Trainer to train any dataset, not just text (though that's the most common). So again, I'd rather see this as an opt-in specifying more in TrainingArguments.

Yeah, I got you :-) An opt-in may be the most direct way for users to choose whether to enable tgs. Will do that.
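The per-step alternative mentioned in this thread (counting tokens as each batch is consumed instead of in a separate pass) could look roughly like the sketch below. It is not the code from this PR: `train_with_token_counter` is a hypothetical name, the training step is a placeholder, and batches are plain nested lists standing in for `input_ids` tensors.

```python
import time
from typing import Dict, Iterable, Sequence


def train_with_token_counter(batches: Iterable[Sequence[Sequence[int]]]) -> Dict[str, float]:
    """Run a dummy loop, counting tokens while each batch is consumed."""
    tokens_seen = 0
    start = time.perf_counter()
    for input_ids in batches:
        # A real training step (forward, backward, optimizer) would go here.
        tokens_seen += sum(len(row) for row in input_ids)
    runtime = time.perf_counter() - start
    metrics: Dict[str, float] = {"tokens_seen": tokens_seen, "runtime": runtime}
    if runtime > 0:
        # Only report throughput when the elapsed time is measurable.
        metrics["tokens_per_second"] = tokens_seen / runtime
    return metrics


metrics = train_with_token_counter([[[1, 2], [3, 4]], [[5, 6, 7]]])
print(metrics["tokens_seen"])  # 7
```

Counting inside the loop avoids a second pass over the data, at the cost of only knowing the total once training has finished.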

CokeDong · 2023-09-04T08:11:31Z

@muellerzr @pacman100 PTAL, thx

muellerzr

Thanks! We're getting much, much closer. I've added a naming nit, cc @amyeroberts if you have some ideas on other naming conventions, and how this looks to you :) Let's avoid acronyms/shorthand as much as possible.

src/transformers/training_args.py

amyeroberts

Thanks for the work adding this @CokeDong!

Overall code looks good - as @muellerzr suggests we should rename some of the variables, and slightly rework the logic to make the code clearer.

src/transformers/trainer_utils.py

src/transformers/training_args.py

amyeroberts · 2023-09-05T15:46:05Z

src/transformers/trainer.py

args.tgs_metrics shouldn't be None - it should be either True or False

Suggested change

num_tokens=None if args.tgs_metrics is None else num_train_tokens / args.world_size,

num_tokens=None if not args.tgs_metrics else num_train_tokens / args.world_size,
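The two conditions differ only when the flag is `False`. A small sketch of that difference follows, with two hypothetical helpers standing in for the expressions above; the argument names mirror the snippet but the functions are for illustration only.

```python
from typing import Optional


def per_device_tokens_is_none(flag: Optional[bool], num_train_tokens: int, world_size: int) -> Optional[float]:
    # Original condition: only a literal None disables the metric.
    return None if flag is None else num_train_tokens / world_size


def per_device_tokens_not(flag: Optional[bool], num_train_tokens: int, world_size: int) -> Optional[float]:
    # Suggested condition: any falsy flag (False or None) disables the metric.
    return None if not flag else num_train_tokens / world_size


print(per_device_tokens_is_none(False, 1000, 4))  # 250.0
print(per_device_tokens_not(False, 1000, 4))      # None
print(per_device_tokens_not(True, 1000, 4))       # 250.0
```

Because the flag is a boolean that defaults to off, the `is None` check never disables the metric, whereas `not flag` does.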

amyeroberts · 2023-09-05T15:48:54Z

src/transformers/trainer.py

If we set this to None here, then we don't need the conditional logic on L2016

Yep, refactored.

amyeroberts · 2023-09-05T15:49:17Z

src/transformers/trainer.py

Let's make this a bit clearer

Suggested change

* self.num_tokens(train_dataloader, True)

* self.num_tokens(train_dataloader, max_steps=True)

amyeroberts · 2023-09-05T15:49:39Z

src/transformers/trainer.py

Suggested change

* self.num_tokens(train_dataloader, True)

* self.num_tokens(train_dataloader, max_steps=True)

src/transformers/trainer.py

CokeDong · 2023-09-07T11:37:30Z

https://app.circleci.com/jobs/github/huggingface/transformers/913233 test_hub failed; it seems unrelated to the current PR :(

muellerzr · 2023-09-07T11:40:52Z

@CokeDong please rebase with the main branch of transformers, this should ensure it passes :)

muellerzr · 2023-09-07T12:35:51Z

BTW @CokeDong, you don't have to do force-pushes if you're worried about the commit-bloat post-merge, in transformers we squash when merging.

CokeDong · 2023-09-07T12:44:39Z

BTW @CokeDong, you don't have to do force-pushes if you're worried about the commit-bloat post-merge, in transformers we squash when merging.

Got that

muellerzr

Thanks for iterating! LG2M

amyeroberts

Thanks for adding and iterating on this!

* Add tgs metrics * bugfix and black formatting * workaround for tokens counting * formating and bugfix * Fix * Add opt-in for tgs metrics * make style and fix error * Fix doc * fix docbuild * hf-doc-build * fix * test * Update src/transformers/training_args.py renaming Co-authored-by: Zach Mueller <[email protected]> * Update src/transformers/training_args.py renaming Co-authored-by: Zach Mueller <[email protected]> * Fix some symbol * test * Update src/transformers/trainer_utils.py match nameing patterns Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/training_args.py Co-authored-by: amyeroberts <[email protected]> * Update src/transformers/trainer.py nice Co-authored-by: amyeroberts <[email protected]> * Fix reviews * Fix * Fix black --------- Co-authored-by: Zach Mueller <[email protected]> Co-authored-by: amyeroberts <[email protected]>

geronimi73 · 2023-11-13T14:41:39Z

thanks for this great feature!
quickly tried it with accelerate on 1-3 GPUs and the results confuse me 🤔
t/s should be higher with 3 GPUs vs 1 GPU, right?

CokeDong · 2023-11-15T03:35:43Z

thanks for this great feature! quickly tried it with accelerate on 1-3 GPUs and the results confuse me 🤔 t/s should be higher with 3 GPUs vs 1 GPU, right?

Hi, tokens/sec/GPU (tgs) measures token throughput per device.

geronimi73 · 2023-11-15T05:12:05Z

per device

now it makes sense, thank you!
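To make the per-device point concrete, here is a small worked example. The numbers are invented for illustration and are not measurements from this thread, and `tokens_per_second_per_device` is a hypothetical helper that divides the total token count by the number of devices before dividing by the runtime.

```python
def tokens_per_second_per_device(total_tokens: int, world_size: int, runtime_seconds: float) -> float:
    """Per-device throughput: each device's share of the tokens over the wall-clock time."""
    return (total_tokens / world_size) / runtime_seconds


# Made-up run: the same 1.2M tokens, trained on 1 GPU and then on 3 GPUs.
one_gpu = tokens_per_second_per_device(1_200_000, world_size=1, runtime_seconds=600.0)
three_gpus = tokens_per_second_per_device(1_200_000, world_size=3, runtime_seconds=220.0)

print(one_gpu)                   # 2000.0
print(round(three_gpus, 1))      # 1818.2
print(round(three_gpus * 3, 1))  # 5454.5 (aggregate across the 3 devices)
```

In this made-up case the per-device figure drops slightly with more GPUs because of communication overhead, even though the aggregate throughput is much higher.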

CokeDong changed the title ~~Add tgs metrics~~ Add tgs speed metrics Aug 30, 2023

muellerzr reviewed Aug 30, 2023

pacman100 reviewed Aug 31, 2023

CokeDong changed the title ~~Add tgs speed metrics~~ [WIP] Add tgs speed metrics Sep 1, 2023

CokeDong changed the title ~~[WIP] Add tgs speed metrics~~ Add tgs speed metrics Sep 4, 2023

muellerzr reviewed Sep 5, 2023

amyeroberts reviewed Sep 6, 2023

CokeDong and others added 16 commits September 7, 2023 12:25

Add tgs metrics

6dc2d40

bugfix and black formatting

375c582

workaround for tokens counting

115b5b9

formating and bugfix

ac3a08f

Fix

0aed860

Add opt-in for tgs metrics

5a2ec4e

make style and fix error

ea07388

Fix doc

d0f8f1c

fix docbuild

f1f0882

hf-doc-build

c8349da

fix

07f9286

test

0c36c17

Update src/transformers/training_args.py

e559ca4

renaming Co-authored-by: Zach Mueller <[email protected]>

Update src/transformers/training_args.py

4122bd3

renaming Co-authored-by: Zach Mueller <[email protected]>

Fix some symbol

7592b9c

test

5992dd9

CokeDong and others added 6 commits September 7, 2023 12:25

Update src/transformers/trainer_utils.py

945e4cb

match nameing patterns Co-authored-by: amyeroberts <[email protected]>

Update src/transformers/training_args.py

66c737e

Co-authored-by: amyeroberts <[email protected]>

Update src/transformers/trainer.py

7667324

nice Co-authored-by: amyeroberts <[email protected]>

Fix reviews

5e9ed5e

Fix

d43e87c

Fix black

3c97d2d

CokeDong force-pushed the dkx_add_tgs_metrics branch from b2e1d81 to 3c97d2d Compare September 7, 2023 12:25

muellerzr approved these changes Sep 7, 2023

muellerzr requested a review from amyeroberts September 7, 2023 15:32

amyeroberts approved these changes Sep 7, 2023

amyeroberts merged commit 3744126 into huggingface:main Sep 7, 2023

muellerzr mentioned this pull request Oct 24, 2023

Count of tokens seen during training in Trainer #27027

Closed
