Conversation


@HeyangQin HeyangQin commented Jan 6, 2024

Previously we used a series of forward/backward flags to control whether hpz should be enabled on a given allgather call. This PR simplifies that by enabling hpz only when its secondary tensor exists (and invalidating the secondary tensor whenever the master weights change). This should:

  1. Prevent potential out-of-sync issues compared with our current approach of overwriting the secondary tensor
  2. Improve throughput, because hpz will now be enabled in many more scenarios, including i) activation checkpointing, ii) gradient accumulation, iii) `torch.no_grad` context, iv) `model.eval()` mode, v) LoRA frozen weights, vi) gradient overflow

This fixes #4851.
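A minimal sketch of the gating logic described above, for illustration only; the class and helper names are hypothetical stand-ins, not DeepSpeed's actual internals:

```python
# Illustrative sketch of "enable hpz only when the secondary tensor exists";
# names are hypothetical stand-ins, not DeepSpeed's real internals.

class ShardedParam:
    def __init__(self, primary_shard, secondary_shard=None):
        self.ds_tensor = primary_shard              # regular ZeRO-3 partition
        self.ds_secondary_tensor = secondary_shard  # node-local hpz copy, may be None


def use_secondary_tensor(params):
    """Take the hpz allgather path only when every parameter still holds a
    valid node-local secondary copy; otherwise fall back to the primary shards."""
    return all(p.ds_secondary_tensor is not None for p in params)


def invalidate_secondary_tensors(params):
    """Called whenever master weights change (e.g. after an optimizer step):
    cached node-local copies are stale, so drop them and let the next full
    gather repopulate them."""
    for p in params:
        p.ds_secondary_tensor = None
```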

Convergence test:

  • llama-2-7b random weights, using https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/llama2/run_llama2_7b.sh

zero-3 Baseline: Evaluating perplexity, Epoch 4/4: ppl: 5.151907920837402, loss: 1.6393671035766602
hpz with this PR: ppl: 5.081737518310547, loss: 1.6256532669067383

  • llama-2-7b pretrained weights with LoRA, using https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/llama2/run_llama2_7b_lora.sh

zero-3 Baseline: Evaluating perplexity, Epoch 4/4: ppl: 1.8326854705810547, loss: 0.6057823896408081
hpz with this PR: ppl: 1.8326854705810547, loss: 0.6057823896408081

Performance test on 32 V100 GPUs, still using https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/llama2/run_llama2_7b.sh (a rough config sketch follows the numbers below):

  • gradient accumulation step = 8

master branch with hpz: SamplesPerSec=17.567813158654847
this patch with hpz: SamplesPerSec=24.121657876029225

  • lora

master branch with hpz: SamplesPerSec=33.88883430864484
this patch with hpz: SamplesPerSec=43.39463460004735
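
For reference, a rough sketch of how these runs map onto a DeepSpeed config: `zero_hpz_partition_size` under ZeRO stage 3 is the knob that enables hpz, and gradient accumulation is a plain top-level setting. The values below are illustrative placeholders, not the exact settings from run_llama2_7b.sh:

```python
# Illustrative config sketch only; numbers are placeholders, not the exact
# values used by run_llama2_7b.sh.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,   # the "gradient accumulation step = 8" run
    "fp16": {"enabled": True},          # V100s use fp16 rather than bf16
    "zero_optimization": {
        "stage": 3,
        "zero_hpz_partition_size": 8,   # enable hpz: node-local secondary partition
    },
}
```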

@tjruwase tjruwase requested a review from samadejacobs January 6, 2024 03:44
@HeyangQin HeyangQin marked this pull request as ready for review January 8, 2024 16:49

@samadejacobs samadejacobs left a comment


LGTM, good job @HeyangQin

@HeyangQin HeyangQin requested a review from loadams as a code owner January 14, 2024 14:07
@mrwyattii

Manually running nightly tests here: https://github.com/microsoft/DeepSpeed/actions/runs/7658819103

@mrwyattii mrwyattii enabled auto-merge January 25, 2024 19:04
@mrwyattii mrwyattii added this pull request to the merge queue Jan 25, 2024
Merged via the queue into master with commit 75ed63c Jan 25, 2024
@mrwyattii mrwyattii deleted the HeyangQin/mixz_hpz_fix branch January 25, 2024 22:53
@siddharth9820

Is hpz safe to use now?

mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Feb 17, 2024
rraminen pushed a commit to ROCm/DeepSpeed that referenced this pull request May 9, 2024
Successfully merging this pull request may close these issues.

[BUG] convergence issues with zero_hpz_partition_size