Conversation

mauryaavinash95
Contributor

We are a team at Argonne National Laboratory working on low-overhead asynchronous checkpointing approaches for LLMs and transformers. As part of these efforts, we have developed DataStates-LLM, a library that we would like to contribute to the DeepSpeed community:
https://github.com/datastates/datastates-llm

The key idea we leverage is to allow non-blocking tensor copies from the GPU to the host during the forward and backward passes. We block only if these copies have not finished by the time the update phase starts. Meanwhile, the tensors are flushed asynchronously from host memory to durable storage (parallel file systems, local SSDs, etc.).

To enable this capability, our initial implementation makes the scheduler aware of checkpointing, calling a ckpt.wait() primitive before starting the update phase. We illustrated this with the pipeline scheduler. We are also considering a scheduler-independent solution that integrates with DeepSpeed/Megatron and provides a hook for the start of the update phase, which we can leverage to run ckpt.wait().
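The overlap pattern described above (start the copies early, block only if they have not finished by the update phase) can be sketched in plain Python, with a background thread standing in for the GPU-to-host copy and subsequent flush. All names below are illustrative, not the actual DataStates-LLM API:

```python
import threading

class AsyncCheckpointSketch:
    """Toy model of the non-blocking-copy + wait() pattern.

    In the real system the work is a GPU-to-host tensor copy followed by
    an asynchronous flush to storage; here a background thread stands in
    for both.
    """

    def __init__(self):
        self._done = threading.Event()
        self._done.set()  # no snapshot in flight initially

    def snapshot(self, state, copy_fn):
        # Kick off the "device-to-host" copy without blocking training.
        self._done.clear()

        def _worker():
            copy_fn(state)      # stand-in for the copy and async flush
            self._done.set()

        threading.Thread(target=_worker, daemon=True).start()

    def wait(self):
        # Called right before the update phase: block only if the
        # previous snapshot's copies have not finished yet.
        self._done.wait()

host_copy = []
ckpt = AsyncCheckpointSketch()
ckpt.snapshot({"step": 1}, lambda s: host_copy.append(dict(s)))
ckpt.wait()  # the update phase may now safely mutate the model
assert host_copy == [{"step": 1}]
```

In this sketch `wait()` is the analogue of the `ckpt.wait()` primitive: it is a no-op when the copies finished during forward/backward, and blocks only in the unlucky case.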

We appreciate your feedback and look forward to a collaboration in this space.

@mauryaavinash95 mauryaavinash95 changed the title Add DataStates-LLM: Asynchronous Checkpointing Engine Support #5763 Add DataStates-LLM: Asynchronous Checkpointing Engine Support Mar 21, 2025
@loadams
Collaborator

loadams commented Mar 24, 2025

Hi @mauryaavinash95 - could you please run the pre-commit formatter? That should fix the formatting errors at least.

@mauryaavinash95
Contributor Author

Thanks for the feedback @loadams. I've fixed the pre-commit and DCO issues in 1c701d7.

@loadams
Collaborator

loadams commented Mar 24, 2025

> Thanks for the feedback @loadams. I've fixed the pre-commit and DCO issues in 1c701d7.

Thanks @mauryaavinash95 - the formatting checks look good, but the DCO check is still failing. You can rebase to fix it with the command here, or, if that might cause issues given the complex git history, we can manually approve the DCO check if you let us know.

@mauryaavinash95
Contributor Author

> Thanks for the feedback @loadams. I've fixed the pre-commit and DCO issues in 1c701d7.

> Thanks @mauryaavinash95 - the formatting checks look good, but the DCO check is still failing. You can rebase to fix it with the command here, or, if that might cause issues given the complex git history, we can manually approve the DCO check if you let us know.

I tried using the DCO instructions and this is how I see it on my git log.

commit 1c701d7c61b170eea81dcc637379500a7586b9b2 (HEAD -> dev, origin/dev)
Author: Avinash <[email protected]>
Date:   Mon Mar 24 14:45:11 2025 -0500

    Fix formatting issues for DataStates-LLM
    
    Signed-off-by: Avinash Maurya <[email protected]>

It would be very helpful if you could manually approve the DCO using my email, [email protected].

@mauryaavinash95
Contributor Author

Based on the checks now it looks like only the DCO part is pending @loadams. Please let me know if there's anything I can do to fix this quicker than the DeepSpeed team manually approving the DCO.

@tjruwase
Contributor

@mauryaavinash95, thanks for this great contribution to DeepSpeed. Do you intend to add a tutorial to help users benefit from this feature?

@saforem2, FYI

@mauryaavinash95
Contributor Author

> @mauryaavinash95, thanks for this great contribution to DeepSpeed. Do you intend to add a tutorial to help users benefit from this feature?
>
> @saforem2, FYI

@tjruwase @saforem2 : yes, we'd like to set up a tutorial for this. Currently, there is just a short snippet to enable it in deepspeed/runtime/checkpoint_engine/README.md. Could you please point us to a reference and repository that we can use for the tutorial?

@tjruwase
Contributor

@mauryaavinash95, DeepSpeed tutorials appear on deepspeed.ai:

@mauryaavinash95
Contributor Author

mauryaavinash95 commented Mar 28, 2025

@tjruwase I've added the preserves_storage_sharing function for the checkpointing engine; fixed the unwanted commit in deepspeed/runtime/swap_tensor/pipelined_optimizer_swapper.py; and uploaded a tutorial for using DataStates-LLM with DeepSpeed. Commit: 09858a7. Please let me know what you think.

    # To wait in asynchronous checkpoint engines (e.g. DataStates-LLM) for the previous snapshot to finish
    pass

def preserves_storage_sharing(self):
Contributor

Thanks for adding this API. But I think the meaning is inverted here, in the sense that preserves_storage_sharing is what leads to checkpoint bloat and requires cloning to fix. Please see the following torch docs; it would also be helpful to add the doc link here:
https://pytorch.org/docs/stable/notes/serialization.html#saving-and-loading-tensors-preserves-views

Further reading of the doc on my part makes me feel that preserves_tensor_views() might be a more descriptive name. I am curious to hear your thoughts. Thanks!
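For context on the bloat being discussed: per the linked PyTorch serialization notes, torch.save of a view serializes the view's entire underlying storage, and cloning the view first is the standard fix. A small self-contained illustration (exact byte counts vary with the torch version, but the ordering holds):

```python
import io
import torch

# A 4,000-element tensor and a 10-element view into it.
base = torch.zeros(4000)
view = base[:10]

def saved_size(obj):
    """Size in bytes of torch.save's serialized output."""
    buf = io.BytesIO()
    torch.save(obj, buf)
    return len(buf.getvalue())

# torch.save serializes the view's *entire* underlying storage
# (all 4,000 elements): this is the "checkpoint bloat" above.
bloated = saved_size(view)

# Cloning detaches the view from the shared storage, so only the
# 10 elements it actually covers are written.
debloated = saved_size(view.clone())

assert bloated > debloated
```

This is exactly what DeepSpeed's cloning step guards against when a state dict contains views into larger tensors.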

Contributor Author

@tjruwase: Thanks for pointing this out and for the helpful doc reference!

You're right that preserves_tensor_views aligns closely with PyTorch's terminology and the default serialization behavior. That said, I was considering whether the API name should emphasize the intent to avoid storage sharing (i.e., debloating checkpoints) rather than reflect the PyTorch mechanism directly.

If the broader goal is to clearly signal "avoid capturing shared storage/views," maybe a name like shared_storage_capture or avoid_tensor_storage_bloat might better convey user intent. Alternatively, we could stick with preserves_tensor_views and clarify the expected effects in the docstring.

Curious to hear your thoughts, especially if you foresee other use cases for this API beyond debloating checkpoints.

Contributor

@tjruwase tjruwase Mar 31, 2025

@mauryaavinash95, thanks, I see your points. Based on this, I wonder if it is better to let the individual checkpoint engines handle the decision of whether and how to debloat. Although this will require more code changes (i.e., moving the existing clone_tensor_.... calls to the torch and nebula engines), I think it would be a win in the long run. It simplifies the DeepSpeed code, avoids a new API, and restricts these torch-specific semantics to the torch-compatible checkpoint engines.

What do you think?

Contributor Author

@tjruwase That sounds like a great direction; it would definitely make the codebase more modular and maintainable in the long run. Currently, clone_tensors_for_torch_save also handles the blocking GPU-to-CPU data movement during cloning.

So I was thinking: would it make sense to abstract this entire logic under the checkpoint_engine.save() method? That way, each engine could manage both debloating and device transfer optimizations internally, giving more control to engine-specific implementations. Thoughts?
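A minimal sketch of what folding this logic into the engine could look like, assuming a hypothetical TorchCheckpointEngine with its own recursive clone helper (the real DeepSpeed clone_tensors_for_torch_save utility and engine interfaces may differ):

```python
import io
import torch

class TorchCheckpointEngine:
    """Sketch: the engine's save() owns both debloating and device transfer.

    Hypothetical class for illustration; not the actual DeepSpeed API.
    """

    @staticmethod
    def _clone_for_save(obj):
        # Recursively clone tensors and move them to CPU, so saved views
        # do not drag their full shared storage into the checkpoint file.
        if torch.is_tensor(obj):
            return obj.detach().clone().cpu()
        if isinstance(obj, dict):
            return {k: TorchCheckpointEngine._clone_for_save(v)
                    for k, v in obj.items()}
        if isinstance(obj, (list, tuple)):
            return type(obj)(TorchCheckpointEngine._clone_for_save(v)
                             for v in obj)
        return obj

    def save(self, state_dict, path_or_buf):
        # The engine, not the caller, decides to debloat before writing;
        # an async engine could instead overlap the copy with training.
        torch.save(self._clone_for_save(state_dict), path_or_buf)

# Usage: a 10-element view of a 4,000-element tensor is saved compactly.
buf = io.BytesIO()
TorchCheckpointEngine().save({"w": torch.zeros(4000)[:10]}, buf)
```

The design point is the one discussed above: callers just hand a state dict to save(), and each engine is free to implement debloating and GPU-to-CPU movement in the way that suits it (blocking clone here, overlapped copies in an asynchronous engine).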

Contributor

Awesome. I think we are aligned. Do you mind updating this PR accordingly? Thanks

@mauryaavinash95
Contributor Author

mauryaavinash95 commented Mar 31, 2025

@tjruwase: I have one more question about the way the latest checkpoint version is tracked in the DeepSpeed engine.py and the Megatron-DeepSpeed engine.

Currently, both assume that checkpoints are synchronously flushed to stable storage by the time the function returns, and they immediately update the tracking files for the latest version. However, this assumption doesn't hold for asynchronous checkpointing, where flushes to slower tiers may still be in progress after the function exits.

Do you have thoughts on how best to handle this? One idea could be to move this responsibility into the checkpointing engine itself, allowing it to manage the timing and semantics of when the latest marker is updated.

@tjruwase
Contributor

tjruwase commented Apr 1, 2025

> Do you have thoughts on how best to handle this? One idea could be to move this responsibility into the checkpointing engine itself, allowing it to manage the timing and semantics of when the latest marker is updated.

@mauryaavinash95, good question. We handle this in our upcoming code release of FastPersist. The idea is to add a bool decoupled() API to the checkpoint engine, where decoupled means the same as asynchronous. For decoupled engines, the logic to commit checkpoints, including writing latest, is called in engine.step() before the optimizer step. Coupled engines use the existing logic. If you are not blocked on this, we can revisit sometime next week, when our PR is available, to align the APIs.
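A toy model of the decoupled commit idea described above: save() starts an asynchronous flush, and the latest marker is advanced only by a later commit() call, once the flush has actually finished. Class and method names are illustrative, not the actual FastPersist or DataStates-LLM API:

```python
import os
import tempfile
import threading

class DecoupledEngineSketch:
    """Toy model: 'latest' is written at commit time, not when save() returns."""

    def __init__(self, ckpt_dir):
        self.ckpt_dir = ckpt_dir
        self._flush_done = threading.Event()
        self._flush_done.set()  # nothing in flight initially
        self._pending_tag = None

    def decoupled(self):
        return True  # asynchronous engine: commit is deferred

    def save(self, data, tag):
        # Start the flush asynchronously; training continues meanwhile.
        self._flush_done.clear()
        self._pending_tag = tag

        def _flush():
            # Stand-in for the slow flush to durable storage.
            with open(os.path.join(self.ckpt_dir, tag), "w") as f:
                f.write(data)
            self._flush_done.set()

        threading.Thread(target=_flush, daemon=True).start()

    def commit(self):
        # Called from engine.step() for decoupled engines: wait for the
        # flush, then advance the 'latest' marker.
        self._flush_done.wait()
        if self._pending_tag is not None:
            with open(os.path.join(self.ckpt_dir, "latest"), "w") as f:
                f.write(self._pending_tag)
            self._pending_tag = None

ckpt_dir = tempfile.mkdtemp()
engine = DecoupledEngineSketch(ckpt_dir)
engine.save("model-and-optimizer-bytes", "global_step10")
engine.commit()  # only now does 'latest' point at global_step10
```

The key property is the one raised in the question: if the process dies between save() and commit(), the latest marker still points at the previous, fully durable checkpoint.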

@mauryaavinash95 mauryaavinash95 requested a review from tjruwase April 1, 2025 23:13
@mauryaavinash95
Contributor Author

@tjruwase Thanks for the feedback. I've updated the PR as per our discussion and moved the logic to debloat the tensors inside checkpointing engines.
We can revisit the bool decoupled() API next week once the FastPersist engine PR is in place.

@tjruwase
Contributor

tjruwase commented Apr 9, 2025

@mauryaavinash95, can you please look into the CI failures?

Also, it seems we are unable to update the branch.

@mauryaavinash95
Contributor Author

> @mauryaavinash95, can you please look into the CI failures?
>
> Also, it seems we are unable to update the branch.

@tjruwase Thanks for letting me know.
I'll resync with the latest master branch and update the PR within a week; hopefully that will resolve the CI failures. We don't yet have any DataStates-LLM-specific unit tests, so the checkpointing engine should not cause any other tests to fail, right?

@mauryaavinash95 mauryaavinash95 force-pushed the dev branch 2 times, most recently from 3a82071 to 84f067b Compare April 15, 2025 22:12
@loadams
Collaborator

loadams commented Apr 18, 2025

@mauryaavinash95 - is this ready to be merged?

@mauryaavinash95
Contributor Author

> @mauryaavinash95 - is this ready to be merged?

@loadams: I think it is ready to be merged. The one pending item is the bool decoupled() API for asynchronous commit, which @tjruwase said we can discuss once the FastPersist engine PR is in place.

@sfc-gh-truwase
Collaborator

@mauryaavinash95 apologies for the delay on this. Since the FastPersist PR has been merged, do you want to resume this integration? Thanks!
