zhengchenyu
Contributor

When a checkpoint saved with world size 2 is converted to a universal checkpoint and then loaded with world size 4, the newly added ranks look for model-state files that do not exist.
For example, rank 3 tries to load zero_pp_rank_3_mp_rank_00_model_states.pt, but the previous run (world size 2) never produced that file.
For ZeRO stage 3, the fix is to load the first file instead, that is, zero_pp_rank_0_mp_rank_00_model_states.pt.
The existing unit test TestZeROUniversalCheckpointDP::test_dp_world_size_2to4 reproduces this problem.
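
To illustrate the idea, here is a minimal sketch of the file selection described above. It is not the DeepSpeed implementation: the helper names `model_states_name` and `pick_model_states_file` and the `load_universal` parameter are hypothetical, and only the file-name pattern is taken from this description.

```python
def model_states_name(dp_rank: int, mp_rank: int = 0) -> str:
    """Build a model-states file name following the pattern quoted above."""
    return f"zero_pp_rank_{dp_rank}_mp_rank_{mp_rank:02d}_model_states.pt"


def pick_model_states_file(dp_rank: int, zero_stage: int,
                           load_universal: bool, mp_rank: int = 0) -> str:
    """Hypothetical helper: choose which model-states file a rank should read.

    Assumption: when loading a universal checkpoint under ZeRO stage 3,
    every rank reads the first file (rank 0), because a file for a newly
    added rank was never written by the smaller run.
    """
    if load_universal and zero_stage == 3:
        return model_states_name(0, mp_rank)
    return model_states_name(dp_rank, mp_rank)


# Files written by the previous run with world size 2.
saved = {model_states_name(r) for r in range(2)}

# Files each rank of the new run (world size 4) would ask for.
naive = [model_states_name(r) for r in range(4)]
fixed = [pick_model_states_file(r, zero_stage=3, load_universal=True)
         for r in range(4)]

missing_naive = [name for name in naive if name not in saved]
missing_fixed = [name for name in fixed if name not in saved]
```

With the per-rank lookup, ranks 2 and 3 request files that are not in `saved`; with the stage 3 rule, every rank resolves to a file that exists.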

zhengchenyu marked this pull request as draft on September 28, 2025, 06:49
zhengchenyu marked this pull request as ready for review on September 28, 2025, 12:16
zhengchenyu marked this pull request as draft on September 29, 2025, 02:52