@Schwidola0607 Schwidola0607 commented Apr 10, 2025

PR for the HF2UCP feature, which converts a Hugging Face (HF) checkpoint to the DeepSpeed Universal Checkpoint (UCP) format.

Converting a pytorch_model.bin or .safetensors checkpoint to UCP will:

  • zero-initialize the optimizer states (exp_avg.pt and exp_avg_sq.pt)
  • skip copying the _model_states.pt and optimizer_state.pt files, as those are not available in an HF checkpoint
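
For reviewers, the two behaviors listed above can be pictured with a small planning sketch. This is not the code in this PR: the function name `plan_hf_to_ucp`, the `zero/<param_name>/` directory layout, the `fp32.pt` weight file, and the `ucp_out` output directory are assumptions made for illustration. Only the file names exp_avg.pt, exp_avg_sq.pt, _model_states.pt, and optimizer_state.pt come from the description.

```python
from pathlib import PurePosixPath

# File names taken from the PR description.
ZERO_INIT_FILES = ("exp_avg.pt", "exp_avg_sq.pt")           # optimizer states, filled with zeros
SKIPPED_FILES = ("_model_states.pt", "optimizer_state.pt")  # absent from an HF checkpoint


def plan_hf_to_ucp(param_shapes, out_dir="ucp_out"):
    """Return {relative_path: (action, shape)} for a hypothetical HF-to-UCP conversion.

    param_shapes maps a parameter name to its shape, as one would read it
    from a pytorch_model.bin or .safetensors state dict.
    """
    plan = {}
    for name, shape in param_shapes.items():
        # Assumed layout: one folder per parameter under <out_dir>/zero/.
        param_dir = PurePosixPath(out_dir) / "zero" / name
        # The weight itself is the only tensor an HF checkpoint can supply.
        plan[str(param_dir / "fp32.pt")] = ("copy_weight", tuple(shape))
        # Optimizer moments do not exist in the source, so they start at zero.
        for fname in ZERO_INIT_FILES:
            plan[str(param_dir / fname)] = ("zeros", tuple(shape))
    # Nothing is planned for SKIPPED_FILES: there is no source file to copy.
    return plan


if __name__ == "__main__":
    for path, (action, shape) in plan_hf_to_ucp({"embed.weight": (4, 2)}).items():
        print(f"{path}: {action} {shape}")
```

Under these assumptions, every parameter yields three files, and no path ending in _model_states.pt or optimizer_state.pt is ever produced.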

Schwidola0607 and others added 5 commits April 10, 2025 05:08
Signed-off-by: Schwidola0607 <[email protected]>
@Schwidola0607 Schwidola0607 marked this pull request as ready for review April 13, 2025 08:34
Schwidola0607 and others added 4 commits April 15, 2025 03:34
@Schwidola0607
Author

deepspeedai/Megatron-DeepSpeed#477
@xylian86 Here is the PR for the documentation.

@Schwidola0607 Schwidola0607 requested a review from xylian86 June 27, 2025 06:49
@xylian86
Contributor

@sfc-gh-truwase LGTM

@tjruwase
Contributor

> @sfc-gh-truwase LGTM

@xylian86 thanks for the review.

@Schwidola0607, can you please add some unit tests? You can use the following for inspiration:
https://github.com/deepspeedai/DeepSpeed/blob/master/tests/unit/checkpoint/test_universal_checkpoint.py

@tjruwase
Contributor

@Schwidola0607, please let me know if I can help with the UTs. Thanks

@Schwidola0607
Author

> @Schwidola0607, please let me know if I can help with the UTs. Thanks

@tjruwase Sorry, I was busy with other things. I will work on the UTs this week and will let you know if I need any help. Thanks!
