Add DataStates-LLM: Asynchronous Checkpointing Engine Support #7166
Open
mauryaavinash95
wants to merge
6
commits into
deepspeedai:master
from
DataStates:dev
27de542
Add datastates-llm to runtime/checkpoint_engine/readme
d9df580
Fix JSON format in readme for datastates-llm
12e65a6
Fix formatting issues for DataStates-LLM
59788f8
Add preserves_storage_sharing for checkpoint engines
b1312d1
Update to Apache-2.0 License, move debloating to checkpointing engine
4651ec2
Fix whitespaces
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
# DataStates-LLM checkpointing engine. | ||
|
||
This feature is not enabled by default. To enable it, add a `datastates_ckpt` section to ds_config.json and install the [DataStates-LLM checkpointing library](https://github.com/DataStates/datastates-llm/). A detailed tutorial is available [here](../../docs/_tutorials/datastates-async-checkpointing.md). |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
# Copyright (c) Microsoft Corporation. | ||
# SPDX-License-Identifier: Apache-2.0 | ||
|
||
# Apache-2.0 License Copyright (c) UChicago Argonne LLC, operator of Argonne National Laboratory. | ||
|
||
# DeepSpeed Team |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
# Copyright (c) Microsoft Corporation. | ||
# SPDX-License-Identifier: Apache-2.0 | ||
|
||
# Apache-2.0 License Copyright (c) UChicago Argonne LLC, operator of Argonne National Laboratory. | ||
|
||
# DeepSpeed Team | ||
|
||
from deepspeed.runtime.config_utils import DeepSpeedConfigObject | ||
|
||
|
||
class DeepSpeedDataStatesConfig(DeepSpeedConfigObject): | ||
|
||
def __init__(self, param_dict): | ||
super(DeepSpeedDataStatesConfig, self).__init__() | ||
|
||
self.enabled = None | ||
self.config = {} | ||
|
||
if "datastates_ckpt" in param_dict.keys(): | ||
self.enabled = True | ||
self.config = param_dict["datastates_ckpt"] |
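The behaviour of this config class can be checked without DeepSpeed installed. The sketch below mirrors its logic against a stand-in base class; `ConfigObjectStub` and `DataStatesConfigSketch` are hypothetical names used only for illustration, and `train_batch_size` is just a placeholder option.

```python
# Stand-in for deepspeed.runtime.config_utils.DeepSpeedConfigObject,
# so that this sketch runs without DeepSpeed installed.
class ConfigObjectStub:
    pass


class DataStatesConfigSketch(ConfigObjectStub):
    """Mirrors the logic of DeepSpeedDataStatesConfig in the diff above."""

    def __init__(self, param_dict):
        super().__init__()
        self.enabled = None
        self.config = {}
        # The feature is switched on purely by the presence of the section.
        if "datastates_ckpt" in param_dict:
            self.enabled = True
            self.config = param_dict["datastates_ckpt"]


with_section = DataStatesConfigSketch({"datastates_ckpt": {"host_cache_size": 16, "parser_threads": 8}})
without_section = DataStatesConfigSketch({"train_batch_size": 8})

print(with_section.enabled, with_section.config)
print(without_section.enabled, without_section.config)
```

Note that `enabled` stays `None` rather than `False` when the section is absent, so callers should test it for truthiness instead of comparing against `False`.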
34 changes: 34 additions & 0 deletions
deepspeed/runtime/checkpoint_engine/datastates_checkpoint_engine.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
# Copyright (c) Microsoft Corporation. | ||
# SPDX-License-Identifier: Apache-2.0 | ||
|
||
# Apache-2.0 License Copyright (c) UChicago Argonne LLC, operator of Argonne National Laboratory. | ||
|
||
# DeepSpeed Team | ||
|
||
from deepspeed.utils import log_dist | ||
from deepspeed.runtime.checkpoint_engine.checkpoint_engine import \ | ||
CheckpointEngine | ||
from datastates.llm import Checkpointing | ||
|
||
|
||
class DataStatesCheckpointEngine(CheckpointEngine): | ||
|
||
def __init__(self, deepspeed_config, rank): | ||
super().__init__(deepspeed_config) | ||
self.ckpt_engine = Checkpointing(deepspeed_config, rank) | ||
|
||
def create(self, tag): | ||
log_dist(f"[DataStates] Checkpoint {tag} is about to be saved!", ranks=[0]) | ||
return None | ||
|
||
def save(self, state_dict, path: str): | ||
return self.ckpt_engine.save(state_dict, path) | ||
|
||
def load(self, path: str, map_location=None): | ||
return self.ckpt_engine.load(path, map_location) | ||
|
||
def commit(self, tag): | ||
return self.ckpt_engine.commit(tag) | ||
|
||
def wait(self): | ||
return self.ckpt_engine.wait() |
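To make the delegation pattern concrete, here is a minimal sketch that drives the same four methods against a fake backend. `FakeAsyncBackend` and `EngineSketch` are hypothetical stand-ins, not the real `datastates.llm.Checkpointing` class; the return values, the tag and path strings, and the call order `create`, `save`, `commit`, `wait` are assumptions made for illustration.

```python
import threading


class FakeAsyncBackend:
    """Hypothetical stand-in for datastates.llm.Checkpointing; not the real library."""

    def __init__(self):
        self.written = {}   # path -> state_dict, filled in by background threads
        self._pending = []  # flush threads that have not been joined yet

    def save(self, state_dict, path):
        # Snapshot the dict, start a background "flush", and return immediately.
        snapshot = dict(state_dict)
        worker = threading.Thread(target=self.written.__setitem__, args=(path, snapshot))
        worker.start()
        self._pending.append(worker)
        return None

    def commit(self, tag):
        return True

    def wait(self):
        # Block until every outstanding flush has finished.
        for worker in self._pending:
            worker.join()
        self._pending.clear()
        return True


class EngineSketch:
    """Same delegation pattern as DataStatesCheckpointEngine, minus DeepSpeed imports."""

    def __init__(self, backend):
        self.ckpt_engine = backend

    def create(self, tag):
        print(f"[sketch] Checkpoint {tag} is about to be saved!")

    def save(self, state_dict, path):
        return self.ckpt_engine.save(state_dict, path)

    def commit(self, tag):
        return self.ckpt_engine.commit(tag)

    def wait(self):
        return self.ckpt_engine.wait()


backend = FakeAsyncBackend()
engine = EngineSketch(backend)
engine.create("global_step100")
engine.save({"weight": [1.0, 2.0]}, "ckpt/global_step100/model.pt")
engine.commit("global_step100")
engine.wait()  # once this returns, the fake flush is guaranteed to be complete
```

The point of the sketch is that `save` may return before the data is durable, so a caller that needs the checkpoint to be complete has to call `wait` first.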
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,67 @@ | ||
--- | ||
title: "DataStates-LLM Checkpointing Engine" | ||
tags: asynchronous checkpointing for minimizing I/O overheads. | ||
--- | ||
This tutorial shows how to use [DataStates-LLM](https://github.com/DataStates/datastates-llm) for asynchronous checkpointing with the DeepSpeed framework. DataStates-LLM introduces a lazy asynchronous checkpointing mechanism tailored for LLMs that aims to minimize I/O overhead and improve training efficiency. | ||
|
||
## Overview of DataStates-LLM | ||
|
||
DataStates-LLM is designed to address the challenges of frequent checkpointing in LLM training by introducing a lazy asynchronous multi-level approach. It leverages the immutability of model parameters and optimizer states during forward and backward passes to perform non-blocking data transfers, thereby reducing interference with the training process. This method has demonstrated up to 48x faster checkpointing and 2.2x faster end-to-end training times compared to traditional approaches as outlined in [DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models](https://arxiv.org/abs/2406.10707). | ||
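The overlap described above can be pictured with a toy example. This is only an illustration of the idea that a copy may run in the background while the state is read-only and must finish before the state is updated; it is not the DataStates-LLM implementation, and all names in it are made up for the sketch.

```python
import threading
import time

state = {"w": [0.0, 0.0, 0.0]}  # toy "model parameters"
snapshots = []                  # completed checkpoint copies


def snapshot(source):
    time.sleep(0.01)  # pretend the device-to-host copy takes a while
    snapshots.append({key: list(values) for key, values in source.items()})


for step in range(3):
    copier = threading.Thread(target=snapshot, args=(state,))
    copier.start()  # the copy overlaps with the read-only phase below

    # Stand-in for the forward and backward passes: state is only read here.
    grad = [1.0 for _ in state["w"]]

    copier.join()  # the copy must finish before the parameters are mutated
    state["w"] = [w - 0.1 * g for w, g in zip(state["w"], grad)]

print(len(snapshots))
```

Each snapshot captures the parameters as they were at the start of its step, because the update only runs after the background copy has been joined.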
|
||
## Prerequisites | ||
|
||
Before integrating DataStates-LLM with DeepSpeed, ensure the following: | ||
|
||
- **DeepSpeed Installation**: DeepSpeed should be installed in your environment. If not, refer to the [DeepSpeed Getting Started Guide](https://github.com/microsoft/DeepSpeed/blob/master/docs/_tutorials/getting-started.md) for installation instructions. | ||
|
||
- **DataStates-LLM Repository**: Access the DataStates-LLM source code from its [GitHub repository](https://github.com/DataStates/datastates-llm) and follow the installation instructions provided therein. | ||
|
||
## Configuring DeepSpeed for DataStates-LLM | ||
|
||
To enable DataStates-LLM's asynchronous checkpointing within DeepSpeed, please modify the `deepspeed_config.json` file to include specific configurations under the `datastates_ckpt` section. Below is an example configuration: | ||
|
||
```json | ||
{ | ||
// ... other DeepSpeed configuration options | ||
"datastates_ckpt": { | ||
"host_cache_size": 16, | ||
"parser_threads": 8 | ||
} | ||
} | ||
``` | ||
|
||
### Configuration Parameters | ||
|
||
- **`host_cache_size`**: Specifies the amount of pinned host memory (in gigabytes) reserved for asynchronous checkpoint flushing. Adjust this value based on your system's memory capacity and the size of your model checkpoints. | ||
|
||
- **`parser_threads`**: Determines the number of threads dedicated to parsing checkpoint file requests in parallel. Increasing this value can enhance parsing throughput but may also increase CPU utilization. | ||
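If you generate the configuration programmatically, a small helper can build the `datastates_ckpt` section and serialize it. This is a sketch under assumptions: `build_datastates_section` is a hypothetical helper, the sanity checks are illustrative rather than rules enforced by DataStates-LLM, and `train_batch_size` stands in for the rest of your DeepSpeed options.

```python
import json


def build_datastates_section(host_cache_size_gb, parser_threads):
    """Return a `datastates_ckpt` section after basic sanity checks (assumed rules)."""
    if host_cache_size_gb <= 0:
        raise ValueError("host_cache_size must be a positive number of gigabytes")
    if parser_threads < 1:
        raise ValueError("parser_threads must be at least 1")
    return {"host_cache_size": host_cache_size_gb, "parser_threads": parser_threads}


ds_config = {
    "train_batch_size": 8,  # placeholder for the rest of your DeepSpeed options
    "datastates_ckpt": build_datastates_section(host_cache_size_gb=16, parser_threads=8),
}

# Serialize to strictly valid JSON text that can be written to deepspeed_config.json.
config_text = json.dumps(ds_config, indent=2)
print(config_text)
```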
|
||
## Implementing DataStates-LLM in Your Training Script | ||
|
||
After enabling DataStates-LLM checkpointing in `deepspeed_config.json`, configure the checkpoint frequency with the `--save-interval` command-line parameter, which specifies the number of iterations between checkpoints. | ||
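As a rough illustration of what an iteration-based save interval means, the sketch below lists the iterations at which a checkpoint would be taken. The helper `should_checkpoint`, the 1-based iteration counter, and the value of `train_iters` are assumptions for this example, not part of DeepSpeed or DataStates-LLM.

```python
def should_checkpoint(iteration, save_interval):
    """Return True when a checkpoint is due at this 1-based iteration."""
    return save_interval > 0 and iteration % save_interval == 0


save_interval = 100  # value that would be passed via --save-interval
train_iters = 350    # hypothetical total number of training iterations

checkpoint_iters = [it for it in range(1, train_iters + 1) if should_checkpoint(it, save_interval)]
print(checkpoint_iters)  # [100, 200, 300]
```

In a real training script, the branch guarded by `should_checkpoint` is where the engine's checkpoint call would go.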
|
||
## Performance Results | ||
|
||
The checkpoint acceleration achieved by DataStates-LLM for various models is shown in the figures below (checkpoint throughput and iteration times). | ||
|
||
{: .align-center} | ||
|
||
{: .align-center} | ||
|
||
|
||
## Limitations and Ongoing Work | ||
|
||
1. DataStates-LLM currently supports only the CUDA runtime on NVIDIA GPUs. | ||
|
||
|
||
2. DataStates-LLM has only been tested with ZeRO stage-1 without offloading to any other tiers. | ||
|
||
|
||
3. Although the DataStates-LLM checkpoint layout matches Hugging Face's [safetensors](https://huggingface.co/docs/safetensors/) format, the checkpoints also contain pickled objects that DeepSpeed requires during restart, so they are not yet fully compatible with the safetensors library. | ||
|
||
4. DataStates-LLM does not yet support universal or elastic checkpointing. | ||
|
||
|
||
## Questions and Support | ||
|
||
Please use the [DataStates-LLM GitHub repository](https://github.com/DataStates/datastates-llm) for any questions, issues, or feature requests. |
Binary file added
BIN
+309 KB
docs/assets/images/datastates-async-checkpointing/diff-models-ckpt-throughput.png
Binary file added
BIN
+168 KB
docs/assets/images/datastates-async-checkpointing/diff-models-iter-times.png