
Commit d550814

Merge branch 'master' into zenflow_zero3
2 parents 0745b3b + e04fa3e commit d550814

File tree

17 files changed: +719 additions, -129 deletions


COMMITTERS.md

Lines changed: 3 additions & 2 deletions
@@ -2,11 +2,12 @@
 | Name | GitHub ID | Affiliation
 |--- | ---- | --- |
-| Olatunji Ruwase | [tjruwase](https://github.com/tjruwase) | Microsoft |
+| Olatunji Ruwase | [tjruwase](https://github.com/tjruwase) | SnowFlake |
 | Logan Adams | [loadams](https://github.com/loadams) | Microsoft |
-| Masahiro Tanaka | [tohtana](https://github.com/tohtana) | Microsoft |
+| Masahiro Tanaka | [tohtana](https://github.com/tohtana) | Anyscale |
 | Jeff Rasley | [jeffra](https://github.com/jeffra) | SnowFlake |
 | Minjia Zhang | [minjiazhang](https://github.com/minjiazhang) | UIUC |
 | Ashwin Aji | [ashwinma](https://github.com/ashwinma) | AMD |
 | Sam Foreman | [saforem2](https://github.com/saforem2) | Argonne National Laboratory |
 | Zhipeng Wang | [PKUWZP](https://github.com/PKUWZP) | LinkedIn |
+| Guokai Ma | [delock](https://github.com/delock) | Intel |

README.md

Lines changed: 17 additions & 11 deletions
@@ -16,26 +16,32 @@

 ## Latest News
 <b> <span style="color:orange" > DeepSpeed empowers ChatGPT-like model training with a single click, offering 15x speedup over SOTA RLHF systems with unprecedented cost reduction at all scales; [learn how](https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/deepspeed-chat)</span>.</b>
+
+* [2025/08] [ZenFlow: Stall-Free Offloading Engine for LLM Training](https://pytorch.org/blog/zenflow-stall-free-offloading-engine-for-llm-training/)
+
 * [2025/06] [Arctic Long Sequence Training (ALST) with DeepSpeed: Scalable And Efficient Training For Multi-Million Token Sequences](https://www.snowflake.com/en/engineering-blog/arctic-long-sequence-training-multi-million-token-ai/)
+
+* [2025/06] [DeepNVMe: Affordable I/O scaling for Deep Learning Applications](https://github.com/deepspeedai/DeepSpeed/blob/master/blogs/deepnvme/06-2025/README.md)
+
 * [2025/04] [DeepCompile: Unlocking Compiler Optimization for Distributed Training](https://github.com/deepspeedai/DeepSpeed/blob/master/blogs/deepcompile/README.md)
-* [2025/03] [DeepSpeed-AutoTP: Automatic Tensor Parallel Training of Hugging Face models](https://github.com/deepspeedai/DeepSpeed/blob/master/blogs/huggingface-tp/README.md)
-* [2024/12] [Ulysses-Offload: Democratizing Long Context LLM Training ](https://github.com/deepspeedai/DeepSpeed/blob/master/blogs/ulysses-offload/README.md)
-* [2024/12] [DeepSpeed-Domino: Communication-Free LLM Training Engine](https://github.com/deepspeedai/DeepSpeed/blob/master/blogs/deepspeed-domino/README.md)
-* [2024/08] [DeepSpeed on Windows](https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/windows/08-2024/README.md) [[日本語](https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/windows/08-2024/japanese/README.md)] [[中文](https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/windows/08-2024/chinese/README.md)]
+
+* [2025/03] [DeepSpeed AutoTP: Automatic Tensor Parallel Training of Hugging Face models](https://github.com/deepspeedai/DeepSpeed/blob/master/blogs/huggingface-tp/README.md)
+

 <!-- NOTE: we must use html for news items otherwise links will be broken in the 'more news' section -->
 <details>
+<!-- NOTE: Maintain only three items in 'more news' section -->
 <summary>More news</summary>
 <ul>
-<li> [2024/08] <a href="https://github.com/deepspeedai/DeepSpeed/blob/master/blogs/deepspeed-gds/README.md"> DeepNVMe: Improving DL Applications through I/O Optimizations</a> [<a href="ttps://github.com/deepspeedai/DeepSpeed/blob/master/blogs/deepspeed-gds/japanese/README.md"> 日本語 </a>] [<a href="https://github.com/deepspeedai/DeepSpeed/blob/master/blogs/deepspeed-gds/japanese/README.md"> 中文 </a>]</li>
-
-<li> [2024/07] <a href="https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-ucp/README.md"> DeepSpeed Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training</a> [<a href="https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-ucp/japanese/README.md"> 日本語 </a>] </li>
-
-<li> [2024/03] <a href="https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fp6/03-05-2024/README.md"> DeepSpeed-FP6: The Power of FP6-Centric Serving for Large Language Models</a> [<a href="https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fp6/03-05-2024/README-Chinese.md"> 中文 </a>] </li>
-
+<li>[2024/12] <a href="https://github.com/deepspeedai/DeepSpeed/blob/master/blogs/ulysses-offload/README.md">Ulysses-Offload: Democratizing Long Context LLM Training</a></li>
+<li>[2024/12] <a href="https://github.com/deepspeedai/DeepSpeed/blob/master/blogs/deepspeed-domino/README.md">DeepSpeed-Domino: Communication-Free LLM Training Engine</a></li>
+<li>[2024/08] <a href="https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/windows/08-2024/README.md">DeepSpeed on Windows</a>
+[<a href="https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/windows/08-2024/japanese/README.md">日本語</a>]
+[<a href="https://github.com/deepspeedai/DeepSpeed/tree/master/blogs/windows/08-2024/chinese/README.md">中文</a>]</li>
 </ul>
 </details>

+
 ---

 # Extreme Speed and Scale for DL Training and Inference
@@ -277,7 +283,7 @@ Conduct](https://opensource.microsoft.com/codeofconduct/). For more information
 32. Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Reza Yazdani Aminadabi, Shuaiwen Leon Song, Samyam Rajbhandari, Yuxiong He. (2024) [System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models](https://dl.acm.org/doi/10.1145/3662158.3662806)
 33. Xinyu Lian, Sam Ade Jacobs, Lev Kurilenko, Masahiro Tanaka, Stas Bekman, Olatunji Ruwase, Minjia Zhang. (2024) Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training [arXiv:2406.18820](https://arxiv.org/abs/2406.18820)
 34. Stas Bekman, Samyam Rajbhandari, Michael Wyatt, Jeff Rasley, Tunji Ruwase, Zhewei Yao, Aurick Qiao, Yuxiong He. (2025) Arctic Long Sequence Training: Scalable And Efficient Training For Multi-Million Token Sequences [arXiv:2506.13996](https://arxiv.org/abs/2506.13996)
-
+35. Tingfeng Lan, Yusen Wu, Bin Ma, Zhaoyuan Su, Rui Yang, Tekin Bicer, Masahiro Tanaka, Olatunji Ruwase, Dong Li, Yue Cheng. (2025) ZenFlow: Enabling Stall-Free Offloading Training via Asynchronous Updates [arXiv:2505.12242](https://arxiv.org/abs/2505.12242)

 # Videos
 1. DeepSpeed KDD 2020 Tutorial

ci/torch_latest.py

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@
 def pytest():
     import subprocess
     subprocess.run(
-        "pytest -n 4 --verbose tests/unit/runtime/zero/test_zero.py tests/unit/runtime/half_precision/test_bf16.py --torch_ver=2.6 --cuda_ver=12.4".split(),
+        "pytest -n 4 --verbose tests/unit/runtime/zero/test_zero.py tests/unit/runtime/half_precision/test_bf16.py tests/unit/runtime/zero/test_zero_autocast.py --torch_ver=2.6 --cuda_ver=12.4".split(),
         check=True,
         cwd=ROOT_PATH / ".",
     )
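
The torch-latest CI job now also exercises the new ZeRO autocast unit tests. A minimal local sketch of running just the added file, assuming a source checkout with the repo's pytest plugins (pytest-xdist for -n, plus the custom --torch_ver/--cuda_ver options used above) installed:

# Hypothetical local invocation mirroring the CI change; run from the repo root.
import subprocess

subprocess.run(
    "pytest -n 4 --verbose tests/unit/runtime/zero/test_zero_autocast.py --torch_ver=2.6 --cuda_ver=12.4".split(),
    check=True,
)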

deepspeed/runtime/config.py

Lines changed: 3 additions & 1 deletion
@@ -77,9 +77,11 @@
 MUADAMW_OPTIMIZER = 'muadamw'
 MUSGD_OPTIMIZER = 'musgd'
 LION_OPTIMIZER = 'lion'
+MUON_OPTIMIZER = 'muon'
+
 DEEPSPEED_OPTIMIZERS = [
     ADAGRAD_OPTIMIZER, ADAM_OPTIMIZER, ADAMW_OPTIMIZER, LAMB_OPTIMIZER, ONEBIT_ADAM_OPTIMIZER, ONEBIT_LAMB_OPTIMIZER,
-    ZERO_ONE_ADAM_OPTIMIZER, MUADAM_OPTIMIZER, MUADAMW_OPTIMIZER, MUSGD_OPTIMIZER, LION_OPTIMIZER
+    ZERO_ONE_ADAM_OPTIMIZER, MUADAM_OPTIMIZER, MUADAMW_OPTIMIZER, MUSGD_OPTIMIZER, LION_OPTIMIZER, MUON_OPTIMIZER
 ]

 # extra optimizer parameters for adam/adamw
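
With 'muon' registered in DEEPSPEED_OPTIMIZERS, the optimizer can be selected from the DeepSpeed config like any other built-in. A minimal, hypothetical config sketch (values are illustrative; the hyperparameter keys shown are the ones the engine forwards to the Muon and auxiliary Adam parameter groups in the engine.py hunk further down):

ds_config = {
    "train_batch_size": 8,
    "optimizer": {
        "type": "muon",              # matches MUON_OPTIMIZER = 'muon'
        "params": {
            "lr": 2e-4,              # accepted by both parameter groups
            "momentum": 0.95,        # forwarded to the Muon group only
            "betas": [0.9, 0.95],    # forwarded to the auxiliary Adam group only
            "eps": 1e-8,             # auxiliary Adam group only
            "weight_decay": 0.01,    # accepted by both parameter groups
        },
    },
    "bf16": {"enabled": True},
}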

deepspeed/runtime/constants.py

Lines changed: 3 additions & 0 deletions
@@ -137,6 +137,9 @@
 BFLOAT16_IMMEDIATE_GRAD_UPDATE = "immediate_grad_update"
 BFLOAT16_IMMEDIATE_GRAD_UPDATE_DEFAULT = True

+# DDP variant of BFLOAT16
+DDP_BFLOAT16 = "bf16"
+
 #########################################
 # FP16 support
 #########################################

deepspeed/runtime/engine.py

Lines changed: 43 additions & 25 deletions
@@ -44,16 +44,17 @@
 from deepspeed.runtime.config import DEEPSPEED_OPTIMIZERS, \
     ADAGRAD_OPTIMIZER, ADAM_OPTIMIZER, ADAMW_OPTIMIZER, LAMB_OPTIMIZER, ONEBIT_ADAM_OPTIMIZER, ONEBIT_LAMB_OPTIMIZER, \
     TORCH_ADAM_PARAM, ADAM_W_MODE, ADAM_W_MODE_DEFAULT, ZERO_ONE_ADAM_OPTIMIZER, MUADAM_OPTIMIZER, MUADAMW_OPTIMIZER, \
-    MUSGD_OPTIMIZER, LION_OPTIMIZER
+    MUSGD_OPTIMIZER, LION_OPTIMIZER, MUON_OPTIMIZER

 from deepspeed.runtime.model_checkpointing.constants import ValidationMode, \
     CHECKPOINT_TAG_VALIDATION, CHECKPOINT_WRITER, CHECKPOINT_SERIALIZATION

 from deepspeed.runtime.dataloader import DeepSpeedDataLoader
+from deepspeed.runtime.zero.muon.muon_optimizer import MuonWithAuxAdam
 from deepspeed.runtime.constants import \
     ROUTE_TRAIN, ROUTE_PREDICT, ROUTE_EVAL, \
     PLD_THETA, PLD_GAMMA, BFLOAT16, FP16, AMP, GRADIENT_ACCUMULATION_STEPS, \
-    DATA_PARALLEL_GROUP, GLOBAL_RANK
+    DATA_PARALLEL_GROUP, GLOBAL_RANK, DDP_BFLOAT16
 from deepspeed.runtime.zero.config import ZeroStageEnum
 from deepspeed.compression import compression_scheduler
 from deepspeed.compression.constants import \
@@ -1090,13 +1091,9 @@ def get_data_types(self):
             model_dtype = torch.bfloat16

         if self._config.grad_accum_dtype is None:
-            if model_dtype == torch.bfloat16 and not self.zero_optimization():
-                grad_accum_dtype = torch.float32
-            else:
-                grad_accum_dtype = model_dtype
+            grad_accum_dtype = model_dtype
         else:
             grad_accum_dtype = DtypeEnum(self._config.grad_accum_dtype).value
-
         return (model_dtype, grad_accum_dtype)

     def _optimizer_has_ckpt_event_prologue(self):
@@ -1138,7 +1135,7 @@ def _configure_checkpointing(self):
                 or (self.zero_optimization_partition_weights() and self.is_first_weights_partition_group()):
             self.save_non_zero_checkpoint = True

-        if self.zero_optimization() or self.bfloat16_enabled():
+        if hasattr(self.optimizer, 'dp_process_group'):
             param_rank = dist.get_rank(group=self.optimizer.dp_process_group)

             # Only the first parameter parallel process needs to store the
@@ -1406,23 +1403,18 @@ def _do_optimizer_sanity_check(self, basic_optimizer):
             return AMP
         # data type checks
         elif model_dtype == grad_accum_dtype:
-            if model_dtype == torch.bfloat16:
-                if self.pipeline_parallelism:
-                    logger.warning(
-                        "**** BF16 gradient accumulation is not safe numerically with large number of accumulation steps, proceed with caution *****"
-                    )
-                    return BFLOAT16
-                else:
-                    raise NotImplementedError(
-                        "Bfloat16 wrapper must use a gradient accumulation type of fp32, enable ZeRO to use Bfloat16 gradient accumulation"
-                    )
-            if model_dtype == torch.float16:
-                return FP16
-            # else optimizer_wrapper = None
+            if model_dtype == torch.float32:
+                return None
+            if model_dtype == torch.bfloat16 and self.pipeline_parallelism:
+                logger.warning(
+                    "**** BF16 gradient accumulation is not safe numerically with large number of accumulation steps, proceed with caution *****"
+                )
+                return BFLOAT16
+            return FP16 if model_dtype == torch.float16 else DDP_BFLOAT16
         elif model_dtype == torch.bfloat16 and grad_accum_dtype == torch.float32:
             return BFLOAT16
         else:
-            raise NotImplementedError("unsupported mix of model dtype and gradient accumulation type")
+            raise NotImplementedError(f"unsupported mix of {model_dtype=} and {grad_accum_dtype=}")

         return None

@@ -1465,8 +1457,9 @@ def _configure_optimizer(self, client_optimizer, model_parameters):
             self._set_client_model(model)
             self._broadcast_model()
             # TODO: maybe need to broadcast experts differently?
-        elif optimizer_wrapper == FP16:
-            self.optimizer = self._configure_fp16_optimizer(basic_optimizer)
+        elif optimizer_wrapper in [FP16, DDP_BFLOAT16]:
+            lp_dtype = torch.float16 if optimizer_wrapper == FP16 else torch.bfloat16
+            self.optimizer = self._configure_fp16_optimizer(basic_optimizer, lp_dtype)
         elif optimizer_wrapper == BFLOAT16:
             self.optimizer = self._configure_bf16_optimizer(basic_optimizer)
         else:
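
The wrapper chosen by _do_optimizer_sanity_check (two hunks above) now distinguishes a plain-DDP bf16 path (DDP_BFLOAT16) from the BF16_Optimizer path (BFLOAT16), and this hunk routes both FP16 and DDP_BFLOAT16 through FP16_Optimizer with the matching low-precision dtype. A standalone sketch of the mapping, for illustration only (the function name and string labels are stand-ins, not DeepSpeed API):

import torch

def select_wrapper(model_dtype, grad_accum_dtype, pipeline_parallelism=False):
    # mirrors the new non-ZeRO dtype branch of _do_optimizer_sanity_check shown above
    if model_dtype == grad_accum_dtype:
        if model_dtype == torch.float32:
            return None  # fp32 everywhere: no low-precision wrapper
        if model_dtype == torch.bfloat16 and pipeline_parallelism:
            return "BFLOAT16"  # BF16_Optimizer path; warned as numerically risky
        return "FP16" if model_dtype == torch.float16 else "DDP_BFLOAT16"
    if model_dtype == torch.bfloat16 and grad_accum_dtype == torch.float32:
        return "BFLOAT16"
    raise NotImplementedError(f"unsupported mix of {model_dtype=} and {grad_accum_dtype=}")

# bf16 weights with bf16 gradient accumulation (no pipeline parallelism) now map to the
# DDP bf16 path, which reuses FP16_Optimizer with a bf16 low-precision dtype.
print(select_wrapper(torch.bfloat16, torch.bfloat16))  # -> "DDP_BFLOAT16"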
@@ -1574,6 +1567,29 @@ def _configure_basic_optimizer(self, model_parameters):
             except ImportError:
                 logger.error("Install mup to use MuSGD optimizer")
             optimizer = MuSGD(model_parameters, **optimizer_parameters)
+        elif self.optimizer_name() == MUON_OPTIMIZER:
+            zero_stage = self.zero_optimization_stage()
+            assert zero_stage <= ZeroStageEnum.gradients, "Muon optimizer is not yet compatible with ZeRO Stage 3"
+            if not all([hasattr(p, 'use_muon') for p in model_parameters]):
+                msg = "Muon optimizer is used, but the use_muon attribute is NOT configured for some of the model parameters, " \
+                      "please set by `param.use_muon = True / False` for all params"
+                logger.error(msg)
+            muon_params = [p for p in model_parameters if p.use_muon]
+            non_muon_params = [p for p in model_parameters if not p.use_muon]
+            param_groups = []
+            if muon_params:
+                accepted_parameters = dict()
+                for key in ["lr", "momentum", "weight_decay"]:
+                    if key in optimizer_parameters:
+                        accepted_parameters[key] = optimizer_parameters[key]
+                param_groups.append(dict(params=muon_params, use_muon=True, **accepted_parameters))
+            if non_muon_params:
+                accepted_parameters = dict()
+                for key in ["lr", "betas", "eps", "weight_decay"]:
+                    if key in optimizer_parameters:
+                        accepted_parameters[key] = optimizer_parameters[key]
+                param_groups.append(dict(params=non_muon_params, use_muon=False, **accepted_parameters))
+            optimizer = MuonWithAuxAdam(param_groups)
         else:
             torch_optimizer = getattr(torch.optim, self.optimizer_name())
             optimizer = torch_optimizer(model_parameters, **optimizer_parameters)
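
The Muon branch above expects every parameter to carry a use_muon flag before the engine builds the param groups. A hedged end-to-end sketch, assuming a DeepSpeed build that includes this commit; the model, the "2-D weights to Muon, everything else to the auxiliary Adam" policy, and the config values are illustrative, not mandated by this change:

import torch
import deepspeed

model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.Linear(512, 10))
for p in model.parameters():
    p.use_muon = p.ndim == 2  # example policy: weight matrices -> Muon, biases -> aux Adam

ds_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "muon", "params": {"lr": 2e-4, "momentum": 0.95}},
    "zero_optimization": {"stage": 2},  # ZeRO stage 3 is asserted against above
}

engine, optimizer, _, _ = deepspeed.initialize(model=model,
                                               model_parameters=list(model.parameters()),
                                               config=ds_config)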
@@ -1617,7 +1633,7 @@ def _configure_quantization(self):
             )
         return quantizer

-    def _configure_fp16_optimizer(self, optimizer):
+    def _configure_fp16_optimizer(self, optimizer, low_precision_dtype):
         initial_dynamic_scale = self.initial_dynamic_scale()
         dynamic_loss_args = self.dynamic_loss_scale_args()
         clip_grad = self.gradient_clipping()
@@ -1635,6 +1651,7 @@ def _configure_fp16_optimizer(self, optimizer):
             optimizer = FP16_Optimizer(
                 optimizer,
                 deepspeed=self,
+                low_precision_dtype=low_precision_dtype,
                 dynamic_loss_scale=True,
                 initial_dynamic_scale=initial_dynamic_scale,
                 dynamic_loss_args=dynamic_loss_args,
@@ -1650,6 +1667,7 @@ def _configure_fp16_optimizer(self, optimizer):
             optimizer = FP16_Optimizer(
                 optimizer,
                 deepspeed=self,
+                low_precision_dtype=low_precision_dtype,
                 static_loss_scale=self.loss_scale(),
                 mpu=self.mpu,
                 clip_grad=clip_grad,
