Conversation

mauryaavinash95
Contributor

We are a team at Argonne National Laboratory working on low-overhead asynchronous checkpointing approaches for LLMs and transformers. As part of these efforts, we have developed DataStates-LLM, a library that we would like to contribute to the DeepSpeed community:
https://github.com/datastates/datastates-llm

The key idea we leverage is to allow non-blocking tensor copies from the GPU to the host during the forward and backward passes. We block only if these copies have not finished by the time the update phase starts. Meanwhile, the tensors are flushed asynchronously from host memory to durable storage (parallel file systems, local SSDs, etc.).

To enable this capability, our initial implementation makes the scheduler aware of checkpointing, calling a ckpt.wait() primitive before starting the update phase. We illustrated this with the pipeline scheduler. We are also considering a scheduler-independent solution that integrates with DeepSpeed/Megatron and provides a hook for the start of the update phase, which we can leverage to run ckpt.wait().
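The overlap pattern described above (start the copies early, block only if they have not finished by the update phase) can be sketched in plain Python, with a background thread standing in for the GPU-to-host copy and subsequent flush. All names below are illustrative, not the actual DataStates-LLM API:

```python
import threading

class AsyncCheckpointSketch:
    """Toy model of the non-blocking-copy + wait() pattern.

    In the real system the work is a GPU-to-host tensor copy followed by
    an asynchronous flush to storage; here a background thread stands in
    for both.
    """

    def __init__(self):
        self._done = threading.Event()
        self._done.set()  # no snapshot in flight initially

    def snapshot(self, state, copy_fn):
        # Kick off the "device-to-host" copy without blocking training.
        self._done.clear()

        def _worker():
            copy_fn(state)      # stand-in for the copy and async flush
            self._done.set()

        threading.Thread(target=_worker, daemon=True).start()

    def wait(self):
        # Called right before the update phase: block only if the
        # previous snapshot's copies have not finished yet.
        self._done.wait()

host_copy = []
ckpt = AsyncCheckpointSketch()
ckpt.snapshot({"step": 1}, lambda s: host_copy.append(dict(s)))
ckpt.wait()  # the update phase may now safely mutate the model
assert host_copy == [{"step": 1}]
```

In this sketch `wait()` is the analogue of the `ckpt.wait()` primitive: it is a no-op when the copies finished during forward/backward, and blocks only in the unlucky case.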

We appreciate your feedback and look forward to a collaboration in this space.

@mauryaavinash95 mauryaavinash95 changed the title Add DataStates-LLM: Asynchronous Checkpointing Engine Support #5763 Add DataStates-LLM: Asynchronous Checkpointing Engine Support Mar 21, 2025
@loadams
Collaborator

loadams commented Mar 24, 2025

Hi @mauryaavinash95 - could you please run the pre-commit formatter? That should fix the formatting errors at least.

@mauryaavinash95
Contributor Author

Thanks for the feedback @loadams. I've fixed the pre-commit and DCO issues in 1c701d7.

@loadams
Collaborator

loadams commented Mar 24, 2025

> Thanks for the feedback @loadams. I've fixed the pre-commit and DCO issues in 1c701d7.

Thanks @mauryaavinash95 - the formatting checks look good, but the DCO check is still failing. You can rebase to fix it with the command here, or, if that might cause issues given the complex git history, we can manually approve the DCO check if you let us know.

@mauryaavinash95
Contributor Author

> Thanks for the feedback @loadams. I've fixed the pre-commit and DCO issues in 1c701d7.

> Thanks @mauryaavinash95 - the formatting checks look good, but the DCO check is still failing. You can rebase to fix it with the command here, or, if that might cause issues given the complex git history, we can manually approve the DCO check if you let us know.

I tried using the DCO instructions and this is how I see it on my git log.

commit 1c701d7c61b170eea81dcc637379500a7586b9b2 (HEAD -> dev, origin/dev)
Author: Avinash <[email protected]>
Date:   Mon Mar 24 14:45:11 2025 -0500

    Fix formatting issues for DataStates-LLM
    
    Signed-off-by: Avinash Maurya <[email protected]>

It would be very helpful if you could manually approve the DCO using my email, [email protected].

@mauryaavinash95
Contributor Author

Based on the checks now it looks like only the DCO part is pending @loadams. Please let me know if there's anything I can do to fix this quicker than the DeepSpeed team manually approving the DCO.

@tjruwase
Contributor

@mauryaavinash95, thanks for this great contribution to DeepSpeed. Do you intend to add a tutorial to help users benefit from this feature?

@saforem2, FYI

@mauryaavinash95
Contributor Author

> @mauryaavinash95, thanks for this great contribution to DeepSpeed. Do you intend to add a tutorial to help users benefit from this feature?
>
> @saforem2, FYI

@tjruwase @saforem2 : yes, we'd like to set up a tutorial for this. Currently, there is just a short snippet to enable it in deepspeed/runtime/checkpoint_engine/README.md. Could you please point us to a reference and repository that we can use for the tutorial?

@tjruwase
Contributor

@mauryaavinash95, DeepSpeed tutorials appear on deepspeed.ai:

@mauryaavinash95
Contributor Author

mauryaavinash95 commented Mar 28, 2025

@tjruwase I've added the preserves_storage_sharing function for the checkpointing engine; fixed the unwanted commit in deepspeed/runtime/swap_tensor/pipelined_optimizer_swapper.py; and uploaded a tutorial for using DataStates-LLM with DeepSpeed. Commit: 09858a7. Please let me know what you think.

    # To wait in asynchronous checkpoint engines (e.g. DataStates-LLM) for the previous snapshot to finish
    pass

def preserves_storage_sharing(self):
Contributor

Thanks for adding this API. But I think the meaning is inverted here, in the sense that preserves_storage_sharing is what leads to checkpoint bloat and requires cloning to fix. Please see the following torch docs; it would also be helpful to add the doc link here:
https://pytorch.org/docs/stable/notes/serialization.html#saving-and-loading-tensors-preserves-views

Further reading of the doc on my part makes me feel that preserves_tensor_views() might be a more descriptive name. I am curious to hear your thoughts. Thanks!
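For context on the bloat being discussed: per the linked PyTorch serialization notes, torch.save of a view serializes the view's entire underlying storage, and cloning the view first is the standard fix. A small self-contained illustration (exact byte counts vary with the torch version, but the ordering holds):

```python
import io
import torch

# A 4,000-element tensor and a 10-element view into it.
base = torch.zeros(4000)
view = base[:10]

def saved_size(obj):
    """Size in bytes of torch.save's serialized output."""
    buf = io.BytesIO()
    torch.save(obj, buf)
    return len(buf.getvalue())

# torch.save serializes the view's *entire* underlying storage
# (all 4,000 elements): this is the "checkpoint bloat" above.
bloated = saved_size(view)

# Cloning detaches the view from the shared storage, so only the
# 10 elements it actually covers are written.
debloated = saved_size(view.clone())

assert bloated > debloated
```

This is exactly what DeepSpeed's cloning step guards against when a state dict contains views into larger tensors.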

Contributor Author

@tjruwase: Thanks for pointing this out and for the helpful doc reference!

You're right that preserves_tensor_views aligns closely with PyTorch's terminology and the default serialization behavior. That said, I was considering whether the API name should emphasize the intent to avoid storage sharing (i.e., debloating checkpoints) rather than reflect the PyTorch mechanism directly.

If the broader goal is to clearly signal "avoid capturing shared storage/views," maybe a name like shared_storage_capture or avoid_tensor_storage_bloat might better convey user intent. Alternatively, we could stick with preserves_tensor_views and clarify the expected effects in the docstring.

Curious to hear your thoughts, especially if you foresee other use cases for this API beyond debloating checkpoints.

Contributor

@tjruwase tjruwase Mar 31, 2025

@mauryaavinash95, thanks, I see your points. Based on this, I wonder if it is better to let the individual checkpoint engines handle the decision of whether and how to debloat. Although this will require more code changes (i.e., moving the existing clone_tensor_.... calls to the torch and nebula engines), I think it would be a win in the long run. It simplifies the DeepSpeed code, avoids a new API, and restricts these torch-specific semantics to the torch-compatible checkpoint engines.

What do you think?

Contributor Author

@tjruwase That sounds like a great direction; it would definitely make the codebase more modular and maintainable in the long run. Currently, clone_tensors_for_torch_save also handles the blocking GPU-to-CPU data movement during cloning.

So I was thinking: would it make sense to abstract this entire logic under the checkpoint_engine.save() method? That way, each engine could manage both debloating and device transfer optimizations internally, giving more control to engine-specific implementations. Thoughts?
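A minimal sketch of what folding this logic into the engine could look like, assuming a hypothetical TorchCheckpointEngine with its own recursive clone helper (the real DeepSpeed clone_tensors_for_torch_save utility and engine interfaces may differ):

```python
import io
import torch

class TorchCheckpointEngine:
    """Sketch: the engine's save() owns both debloating and device transfer.

    Hypothetical class for illustration; not the actual DeepSpeed API.
    """

    @staticmethod
    def _clone_for_save(obj):
        # Recursively clone tensors and move them to CPU, so saved views
        # do not drag their full shared storage into the checkpoint file.
        if torch.is_tensor(obj):
            return obj.detach().clone().cpu()
        if isinstance(obj, dict):
            return {k: TorchCheckpointEngine._clone_for_save(v)
                    for k, v in obj.items()}
        if isinstance(obj, (list, tuple)):
            return type(obj)(TorchCheckpointEngine._clone_for_save(v)
                             for v in obj)
        return obj

    def save(self, state_dict, path_or_buf):
        # The engine, not the caller, decides to debloat before writing;
        # an async engine could instead overlap the copy with training.
        torch.save(self._clone_for_save(state_dict), path_or_buf)

# Usage: a 10-element view of a 4,000-element tensor is saved compactly.
buf = io.BytesIO()
TorchCheckpointEngine().save({"w": torch.zeros(4000)[:10]}, buf)
```

The design point is the one discussed above: callers just hand a state dict to save(), and each engine is free to implement debloating and GPU-to-CPU movement in the way that suits it (blocking clone here, overlapped copies in an asynchronous engine).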

Contributor

Awesome. I think we are aligned. Do you mind updating this PR accordingly? Thanks

@mauryaavinash95
Contributor Author

mauryaavinash95 commented Mar 31, 2025

@tjruwase: I have one more question about the way the latest checkpoint version is tracked in the DeepSpeed engine.py and the Megatron-DeepSpeed engine.

Currently, both assume that checkpoints are synchronously flushed to stable storage by the time the function returns, and they immediately update the tracking files for the latest version. However, this assumption doesn't hold for asynchronous checkpointing, where flushes to slower tiers may still be in progress after the function exits.

Do you have thoughts on how best to handle this? One idea could be to move this responsibility into the checkpointing engine itself, allowing it to manage the timing and semantics of when the latest marker is updated.

@tjruwase
Contributor

tjruwase commented Apr 1, 2025

> Do you have thoughts on how best to handle this? One idea could be to move this responsibility into the checkpointing engine itself, allowing it to manage the timing and semantics of when the latest marker is updated.

@mauryaavinash95, good question. We handle this in our upcoming code release of FastPersist. The idea is to add a bool decoupled() API to the checkpoint engine, where decoupled means the same as asynchronous. For decoupled engines, the logic to commit checkpoints, including writing latest, is called in engine.step() before the optimizer step. Coupled engines use the existing logic. If you are not blocked on this, we can revisit sometime next week, when our PR is available, to align the APIs.
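A toy model of the decoupled commit idea described above: save() starts an asynchronous flush, and the latest marker is advanced only by a later commit() call, once the flush has actually finished. Class and method names are illustrative, not the actual FastPersist or DataStates-LLM API:

```python
import os
import tempfile
import threading

class DecoupledEngineSketch:
    """Toy model: 'latest' is written at commit time, not when save() returns."""

    def __init__(self, ckpt_dir):
        self.ckpt_dir = ckpt_dir
        self._flush_done = threading.Event()
        self._flush_done.set()  # nothing in flight initially
        self._pending_tag = None

    def decoupled(self):
        return True  # asynchronous engine: commit is deferred

    def save(self, data, tag):
        # Start the flush asynchronously; training continues meanwhile.
        self._flush_done.clear()
        self._pending_tag = tag

        def _flush():
            # Stand-in for the slow flush to durable storage.
            with open(os.path.join(self.ckpt_dir, tag), "w") as f:
                f.write(data)
            self._flush_done.set()

        threading.Thread(target=_flush, daemon=True).start()

    def commit(self):
        # Called from engine.step() for decoupled engines: wait for the
        # flush, then advance the 'latest' marker.
        self._flush_done.wait()
        if self._pending_tag is not None:
            with open(os.path.join(self.ckpt_dir, "latest"), "w") as f:
                f.write(self._pending_tag)
            self._pending_tag = None

ckpt_dir = tempfile.mkdtemp()
engine = DecoupledEngineSketch(ckpt_dir)
engine.save("model-and-optimizer-bytes", "global_step10")
engine.commit()  # only now does 'latest' point at global_step10
```

The key property is the one raised in the question: if the process dies between save() and commit(), the latest marker still points at the previous, fully durable checkpoint.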

@mauryaavinash95 mauryaavinash95 requested a review from tjruwase April 1, 2025 23:13
@mauryaavinash95
Contributor Author

@tjruwase Thanks for the feedback. I've updated the PR as per our discussion and moved the logic to debloat the tensors inside checkpointing engines.
We can revisit the bool decoupled() API next week once the FastPersist engine PR is in place.

@tjruwase
Contributor

tjruwase commented Apr 9, 2025

@mauryaavinash95, can you please look into the CI failures?

Also, it seems we are unable to update the branch.

@mauryaavinash95
Contributor Author

> @mauryaavinash95, can you please look into the CI failures?
>
> Also, it seems we are unable to update the branch.

@tjruwase Thanks for letting me know.
I'll resync with the latest master branch and update the PR within a week; hopefully that will resolve the CI failures. We don't yet have any DataStates-LLM-specific unit tests, so the checkpointing engine should not cause any other tests to fail, right?

@mauryaavinash95 mauryaavinash95 force-pushed the dev branch 2 times, most recently from 3a82071 to 84f067b Compare April 15, 2025 22:12
@loadams
Collaborator

loadams commented Apr 18, 2025

@mauryaavinash95 - is this ready to be merged?

@mauryaavinash95
Contributor Author

> @mauryaavinash95 - is this ready to be merged?

@loadams: I think it is ready to be merged. The one pending item is the bool decoupled() API for asynchronous commit, which @tjruwase said we can discuss once the FastPersist engine PR is in place.

@sfc-gh-truwase
Collaborator

@mauryaavinash95 apologies for the delay on this. Since the FastPersist PR has been merged, do you want to resume this integration? Thanks!
