JoshWoo2003
Contributor

This PR completes the ZenFlow integration for DeepSpeed ZeRO Stage 3.

Highlights:

  • ZenFlowSelectiveAdamW_stage3: Optimizer with importance-aware selective parameter updates for ZeRO Stage 3.
  • ZenFlowZeroOptimizer_Stage3: Full Stage 3 optimizer integration with partitioned parameters and CPU offload.
  • Configurable via ZenFlowConfig, fully integrated with DeepSpeedZeroConfig for Stage 3.
  • Unit tests for Stage 3 cases ensuring correctness and compatibility.

Note: Integration with ZeRO Stages 1 and 2 was introduced in #7391.
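
For readers unfamiliar with the phrase "importance-aware selective parameter updates" in the Highlights, here is a minimal sketch of the general idea. It is not the code in this PR: the function names, the `topk_ratio` parameter, the column-wise squared-norm score, and the plain gradient step (instead of AdamW) are all assumptions made for illustration.

```python
import numpy as np


def select_important_columns(grad, topk_ratio):
    """Return sorted indices of the columns with the largest squared gradient norm.

    Hypothetical helper: the importance score used by the real optimizer may differ.
    """
    col_scores = (grad ** 2).sum(axis=0)
    k = max(1, int(round(topk_ratio * grad.shape[1])))
    # argpartition places the k largest scores in the last k positions.
    top = np.argpartition(col_scores, -k)[-k:]
    return np.sort(top)


def selective_step(param, grad, topk_ratio, lr):
    """Update only the selected columns in place and return their indices.

    Uses a plain gradient step for brevity; this is a stand-in, not AdamW.
    """
    cols = select_important_columns(grad, topk_ratio)
    param[:, cols] -= lr * grad[:, cols]
    return cols


param = np.ones((2, 4))
grad = np.array([[0.1, 2.0, 0.0, 0.3],
                 [0.1, 1.0, 0.0, 0.2]])
updated = selective_step(param, grad, topk_ratio=0.25, lr=0.1)
```

In this toy case only the column with the largest gradient norm is updated and the other columns are left untouched. A real implementation would also have to decide what happens to the gradients of the unselected columns, which this sketch leaves out.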

@JoshWoo2003
Contributor Author

Hi @tohtana @sfc-gh-truwase @Antlera, when you have some time, could you please take a look at this PR? Thanks!

@loadams
Collaborator

loadams commented Sep 19, 2025

@JoshWoo2003 - could you please resolve merge conflicts?

@JoshWoo2003
Contributor Author

Sorry for the very late reply! I’ve resolved the merge conflicts and updated the affinity setting as suggested.
@loadams @sfc-gh-truwase @tohtana @delock @Antlera — could you please review the code when you have some time?

@delock
Collaborator

delock commented Sep 28, 2025

Hi @JoshWoo2003, the affinity part looks good to me. Thanks for the change! Can you also fix formatting? Thanks!

JoshWoo2003 and others added 5 commits September 28, 2025 16:01
- Introduced a new file: zenflow/engine_stage3.py to implement ZenFlow-specific Stage 3 logic.
- Modified zero/stage3.py to ensure compatibility with ZenFlow's execution flow.
- Updated zero/parameter_offload.py to support the integration of ZenFlow with ZeRO Stage 3.

Signed-off-by: Yusen Wu <[email protected]>

- Add ZenFlowSelectiveAdamW_stage3 to support ZeRO Stage 3
- Update unit tests for ZeRO Stage 3 with ZenFlow

Signed-off-by: Yusen Wu <[email protected]>

Signed-off-by: Yusen Wu <[email protected]>

- Add default value (`zenflow=False`) in DeepSpeedZeROOffload.__init__
- Prevents TypeError when instantiating optimizer without zenflow

Signed-off-by: Yusen Wu <[email protected]>

- Resolved merge conflicts with upstream changes
- Unified ZenFlow affinity behavior for Stage 3 with Stage 1 and Stage 2

Signed-off-by: Yusen Wu <[email protected]>
Co-authored-by: Ma, Guokai <[email protected]>

@JoshWoo2003
Contributor Author

Thanks for the review, @delock! The formatting issues were due to my branch being behind the base. I’ve rebased onto upstream/master and the latest push should fix them. Please take another look when you have a chance—thanks! @loadams @sfc-gh-truwase @tohtana @Antlera

3 participants