
Commit 847b0df

Merge pull request #8 from ProgrammerNeoo/feature/zhdoc
added Chinese documents and fixed some typos and inconsistencies in the English documents
2 parents ccbc918 + 4a37672 commit 847b0df

19 files changed: +1104 -122 lines changed

docs/add_your_parallel.md

Lines changed: 11 additions & 15 deletions
@@ -1,27 +1,26 @@
-# Add Your Own Parallelism
+# Add your own parallelism
 
 ## Overview
 
 To enable researchers and engineers to extend our framework to other novel large-scale distributed training algorithm
-with less effort, we have decoupled the various components in the training lifecycle. You can implement your own
+with less effort, we have decoupled various components in the training lifecycle. You can implement your own
 parallelism by simply inheriting from the base class.
 
-The main components are
+The main components are:
 
 1. `ProcessGroupInitializer`
 2. `GradientHandler`
 3. `Schedule`
 
 ## Process Group Initializer
 
-Parallelism is often managed by process groups where processes involved in parallel computing are placed in the same
+Parallelism is often managed by process groups where processes involved in the same parallel algorithm are placed in the same
 process group. For different parallel algorithms, different process groups need to be created. ColossalAI provides a
-global context for the user to easily manage their process groups. If you wish to add new process group, you can easily
+global context for users to easily manage their process groups. If you wish to add new process group, you can easily
 define a new class and set it in your configuration file. To define your own way of creating process groups, you can
-follow the steps below to create new distributed initialization.
-
-1. Add your parallel mode in `colossalai.context.parallel_mode.ParallelMode`
+follow the steps below to create a new distributed initialization.
 
+1. Add your parallel mode in `colossalai.context.parallel_mode.ParallelMode`.
 ```python
 class ParallelMode(Enum):
     GLOBAL = 'global'
@@ -34,11 +33,10 @@ follow the steps below to create new distributed initialization.
     NEW_MODE = 'new_mode'  # define your mode here
 ```
 
-2. Create a `ProcessGroupInitializer`. You can refer to examples given in `colossal.context.dist_group_initializer`. The
+2. Create a `ProcessGroupInitializer`. You can refer to examples given in `colossalai.context.dist_group_initializer`. The
 first six arguments are fixed. `ParallelContext` will pass in these arguments for you. If you need to set other
 arguments, you can add it behind like the `arg1, arg2` in the example below. Lastly, register your initializer to the
 registry by adding the decorator `@DIST_GROUP_INITIALIZER.register_module`.
-
 ```python
 # sample initializer class
 @DIST_GROUP_INITIALIZER.register_module
@@ -84,18 +82,16 @@ follow the steps below to create new distributed initialization.
 ## Gradient Handler
 
 Gradient handlers are objects which execute the all-reduce operations on parameters' gradients. As different all-reduce
-strategies may be executed for different kinds of parallelism, the user can
-inherit `colossal.engine.gradient_handler.BaseGradientHandler` to implement their strategies. Currently, the library
+strategies may be executed for different kinds of parallelism, users can
+inherit `colossalai.engine.gradient_handler.BaseGradientHandler` to implement their strategies. Currently, the library
 uses the normal data parallel gradient handler which all-reduces the gradients across data parallel ranks. The data
 parallel gradient handler is added to the engine automatically if data parallel is detected. You can add your own
 gradient handler like below:
 
 ```python
-
 from colossalai.registry import GRADIENT_HANDLER
 from colossalai.engine import BaseGradientHandler
 
-
 @GRADIENT_HANDLER.register_module
 class YourGradientHandler(BaseGradientHandler):
@@ -116,5 +112,5 @@ dist_initializer = [
 
 Schedule entails how to execute a forward and backward pass. Currently, ColossalAI provides pipeline and non-pipeline
 schedules. If you want to modify how the forward and backward passes are executed, you can
-inherit `colossalai.engine.BaseSchedule` and implement your idea. You can add your schedule to the engine before
+inherit `colossalai.engine.BaseSchedule` and implement your idea. You can also add your schedule to the engine before
 training.
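For orientation, here is a minimal sketch of what such a subclass could look like. It is only an illustration: the method name `forward_backward_step` and its arguments are assumptions, so check `colossalai.engine.BaseSchedule` in your installed version for the exact interface.

```python
# Hypothetical sketch of a custom schedule; the method name and signature are
# assumptions -- verify against BaseSchedule in your ColossalAI version.
from colossalai.engine import BaseSchedule


class MyNaiveSchedule(BaseSchedule):

    def forward_backward_step(self, data_iter, model, criterion, optimizer=None, forward_only=False):
        # fetch one batch, run the forward pass and compute the loss,
        # then run the backward pass unless we are only doing inference
        data, label = next(data_iter)
        output = model(data)
        loss = criterion(output, label)
        if not forward_only:
            loss.backward()
        return output, label, loss
```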

docs/add_your_parallel_zh.md

Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
+# Add a new parallelism technique
+
+To make it easier for researchers and engineers to extend our framework to new large-scale distributed training algorithms,
+we have decoupled several components of the training process. You can implement a new parallelism technique by inheriting from the base classes.
+
+The main components are:
+
+1. `ProcessGroupInitializer`
+2. `GradientHandler`
+3. `Schedule`
+
+## Process Group Initializer
+
+Parallelism is usually managed through process groups: processes that belong to the same parallel algorithm are placed in the same process group, and if several different parallelism techniques coexist in the system, several process groups need to be created.
+ColossalAI provides a global context for users to manage their process groups conveniently. If you want to add a new process group, you can define a new class and set it in your configuration file. The
+code blocks below show how to add your new parallelism technique to the system and how to initialize it.
+
+1. Add the new parallel mode in `colossalai.context.parallel_mode.ParallelMode`.
+```python
+class ParallelMode(Enum):
+    GLOBAL = 'global'
+    DATA = 'data'
+    PIPELINE = 'pipe'
+    PIPELINE_PREV = 'pipe_prev'
+    PIPELINE_NEXT = 'pipe_next'
+    ...
+
+    NEW_MODE = 'new_mode'  # define your mode here
+```
+
+2. Create a subclass of `ProcessGroupInitializer`; you can refer to the examples given in `colossalai.context.dist_group_initializer`. The first six arguments are determined by `ParallelContext`.
+If you need to set new arguments, you can replace `arg1` and `arg2` in the example below with your own. Finally, register your initializer in our registry with the
+`@DIST_GROUP_INITIALIZER.register_module` decorator.
+```python
+# sample initializer class
+@DIST_GROUP_INITIALIZER.register_module
+class MyParallelInitializer(ProcessGroupInitializer):
+
+    def __init__(self,
+                 rank: int,
+                 world_size: int,
+                 config: Config,
+                 data_parallel_size: int,
+                 pipeline_parallel_size: int,
+                 tensor_parallel_size: int,
+                 arg1,
+                 arg2):
+        super().__init__(rank, world_size, config)
+        self.arg1 = arg1
+        self.arg2 = arg2
+        # ... your variable init
+
+    def init_parallel_groups(self):
+        # initialize your process groups
+        pass
+```
+
+After that, you can insert your initializer into the current mode-to-initializer mapping `colossalai.constants.INITIALIZER_MAPPING`; you can also modify this file to dynamically change the mapping
+between names and parallel modes.
+
+```python
+colossalai.constants.INITIALIZER_MAPPING['new_mode'] = 'MyParallelInitializer'
+```
+
+3. Set your initializer in the configuration file. If your initializer needs arguments, you can pass them in yourself. The code below lets `ParallelContext` create your initializer and initialize the process groups you need.
+
+```python
+parallel = dict(
+    pipeline=dict(size=1),
+    tensor=dict(size=x, mode='new_mode')  # this is where you enable your new parallel mode
+)
+```
+
+## Gradient Handler
+
+Gradient handlers perform the all-reduce operations on the gradients of model parameters. Since different parallelism techniques may require different all-reduce strategies, users can inherit
+`colossalai.engine.gradient_handler.BaseGradientHandler` to implement their own strategies. Currently, ColossalAI uses the normal data parallel gradient handler, which all-reduces the gradients across all data parallel
+ranks; this handler is created automatically when ColossalAI detects that data parallelism is in use. You can add your own gradient handler with the code shown below:
+
+```python
+from colossalai.registry import GRADIENT_HANDLER
+from colossalai.engine import BaseGradientHandler
+
+@GRADIENT_HANDLER.register_module
+class YourGradientHandler(BaseGradientHandler):
+
+    def handle_gradient(self):
+        do_something()
+
+```
+
+After that, you can specify the gradient handler you want to use in the configuration file.
+
+```python
+dist_initializer = [
+    dict(type='YourGradientHandler'),
+]
+```
+
+## Schedule
+
+A schedule specifies which operations are executed during the forward and backward passes. ColossalAI provides both pipeline and non-pipeline schedules. If you want to modify how the forward and backward passes are executed, you can
+inherit `colossalai.engine.BaseSchedule` and implement the operations you want. You can also add your schedule to our engine before training the model.
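Once such an initializer has run, the new process group is normally reachable through the global context. The snippet below is a usage sketch rather than part of the original document; it assumes the global context accessors `gpc.is_initialized`, `gpc.get_world_size`, `gpc.get_local_rank`, and `gpc.get_group`, which may differ in your ColossalAI version.

```python
# Usage sketch (assumed accessor names -- verify against your ColossalAI version):
# query the process group that was created for the new parallel mode.
from colossalai.core import global_context as gpc
from colossalai.context.parallel_mode import ParallelMode

if gpc.is_initialized(ParallelMode.NEW_MODE):
    world_size = gpc.get_world_size(ParallelMode.NEW_MODE)  # number of ranks in the group
    local_rank = gpc.get_local_rank(ParallelMode.NEW_MODE)  # this process's rank within the group
    group = gpc.get_group(ParallelMode.NEW_MODE)            # torch.distributed group handle
    print(f'new_mode group: rank {local_rank} of {world_size}')
```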

docs/amp.md

Lines changed: 19 additions & 16 deletions
@@ -1,24 +1,24 @@
-# Mixed Precision Training
+# Mixed precision training
 
-In Colossal-AI, we have integrated different implementations of mixed precision training:
+In ColossalAI, we have incorporated different implementations of mixed precision training:
 1. torch.cuda.amp
 2. apex.amp
 3. tensor-parallel amp
 
 The first two rely on the original implementation of [PyTorch](https://pytorch.org/docs/stable/amp.html)
 (version 1.6 and above) and [Nvidia Apex](https://github.com/NVIDIA/apex). However, these two methods are not compatible
-with tensor parallelism. This is because that tensors are split across devices in tensor parallelism, thus, it is needed
-to communicate among different processes to check if inf or nan occurs throughout the whole model weights. For the mixed
-precision training with tensor parallel, we adapted this feature from [Megatron-LM](https://github.com/NVIDIA/Megatron-LM).
+with tensor parallelism. This is because tensors are split across devices in tensor parallelism, thus, it is required
+to communicate among different processes to check if `inf` or `nan` occurs in the whole model weights. For the mixed
+precision training with tensor parallelism, we adapted this feature from [Megatron-LM](https://github.com/NVIDIA/Megatron-LM).
 
-To use mixed precision training, you can easily specify the `fp16` field in the configuration file. Currently, torch and
-apex amp cannot be guaranteed to work with tensor and pipeline parallelism, thus, only the last one is recommended if you
+To use mixed precision training, you can easily specify the `fp16` field in the config file to be True. Currently, PyTorch and
+Apex amp cannot be guaranteed to work with tensor and pipeline parallelism, thus, only the last one is recommended if you
 are using hybrid parallelism.
 
-## Torch AMP
+## PyTorch AMP
 
-PyTorch provides mixed precision training in version 1.6 and above. It provides an easy way to cast data to fp16 format
-while keeping some operations such as reductions in fp32. You can configure the gradient scaler in the configuration.
+PyTorch provides mixed precision training in version 1.6 and above. It provides an easy way to cast data to `fp16` format
+while keeping some operations such as reductions in `fp32`. You can configure the gradient scaler in the config file.
 
 ```python
 from colossalai.engine import AMP_TYPE
@@ -34,13 +34,14 @@ fp16=dict(
 )
 ```
 
-
 ## Apex AMP
 
-For this mode, we rely on the [Apex](https://nvidia.github.io/apex/) implementation for mixed precision training. We supported this plugin because it allows
-for finer control on the granularity of mixed precision. For example, `O2` level (optimization level 2) will keep batch normalization in fp32.
+For this mode, we rely on the [Apex](https://nvidia.github.io/apex/) implementation for mixed precision training. We support
+this plugin because it allows for finer control on the granularity of mixed precision. For example, `O2` level (optimization level 2)
+will keep batch normalization in `fp32`.
+
+The following code block shows a config file for Apex AMP.
 
-The configuration is like below.
 ```python
 from colossalai.engine import AMP_TYPE
@@ -64,8 +65,10 @@ fp16 = dict(
 
 ## Tensor Parallel AMP
 
-We leveraged the Megatron-LM implementation to achieve mixed precision training while maintaining compatibility with
-complex tensor and pipeline parallel.
+We leveraged the Megatron-LM implementation to achieve mixed precision training while maintaining compatibility with complex tensor
+and pipeline parallelism.
+
+The following code block shows a config file for this mode.
 
 ```python
 from colossalai.engine import AMP_TYPE
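As a point of reference, the grad scaler fields in the `AMP_TYPE.TORCH` config above mirror the arguments of `torch.cuda.amp.GradScaler`; the sketch below shows roughly what they correspond to in plain PyTorch (the training-loop lines are illustrative, not ColossalAI code).

```python
# Plain-PyTorch reference for the AMP_TYPE.TORCH settings above:
# the fp16 dict fields map onto torch.cuda.amp.GradScaler arguments.
import torch

scaler = torch.cuda.amp.GradScaler(
    init_scale=2.**16,
    growth_factor=2.0,
    backoff_factor=0.5,
    growth_interval=2000,
    enabled=True,
)

# Inside a training loop one would typically do:
#   with torch.cuda.amp.autocast():
#       loss = criterion(model(data), label)
#   scaler.scale(loss).backward()
#   scaler.step(optimizer)
#   scaler.update()
```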

docs/amp_zh.md

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
+# Mixed precision training
+
+ColossalAI can use the following three different ways of mixed precision training:
+1. torch.cuda.amp
+2. apex.amp
+3. tensor parallel AMP
+
+The first two rely on the native implementation of [PyTorch](https://pytorch.org/docs/stable/amp.html) (version 1.6 and above) and
+[Nvidia Apex](https://github.com/NVIDIA/apex). However, these two methods are not compatible with tensor parallelism, because tensor parallelism splits tensors and stores them on different devices;
+mixed precision training that is compatible with tensor parallelism therefore has to communicate among processes to check whether `inf` or `nan` appears in the model parameters. For this reason, we adopted the
+implementation from [Megatron-LM](https://github.com/NVIDIA/Megatron-LM).
+
+You can simply set the `fp16` field in the config file to True to use mixed precision training. Currently, PyTorch and Apex amp cannot be guaranteed to be compatible with tensor and pipeline parallelism, so we recommend
+the last way of mixed precision training.
+
+## PyTorch AMP
+
+PyTorch provides mixed precision training in version 1.6 and above. It can cast data into the `fp16` format while keeping some operations in `fp32`. You can configure it in the config file.
+
+```python
+from colossalai.engine import AMP_TYPE
+
+fp16=dict(
+    mode=AMP_TYPE.TORCH,
+    # below are default values for grad scaler
+    init_scale=2.**16,
+    growth_factor=2.0,
+    backoff_factor=0.5,
+    growth_interval=2000,
+    enabled=True
+)
+```
+
+## Apex AMP
+
+We use the mixed precision training from [Apex](https://nvidia.github.io/apex/) because this mode provides fine-grained control over mixed precision. For example, the `O2` level (optimization level 2) will keep
+batch normalization in `fp32`. The following code block shows a config file that uses Apex AMP.
+
+```python
+from colossalai.engine import AMP_TYPE
+
+fp16 = dict(
+    mode=AMP_TYPE.APEX,
+    # below are the default values
+    enabled=True,
+    opt_level='O1',
+    cast_model_type=None,
+    patch_torch_functions=None,
+    keep_batchnorm_fp32=None,
+    master_weights=None,
+    loss_scale=None,
+    cast_model_outputs=None,
+    num_losses=1,
+    verbosity=1,
+    min_loss_scale=None,
+    max_loss_scale=16777216.0
+)
+```
+
+## Tensor Parallel AMP
+
+We adapted the mixed precision training implementation from Megatron-LM, which is compatible with tensor parallelism and pipeline parallelism. The following code block shows a config file that uses tensor parallel AMP.
+
+```python
+from colossalai.engine import AMP_TYPE
+
+fp16 = dict(
+    mode=AMP_TYPE.PARALLEL,
+    # below are the default values
+    clip_grad=0,
+    log_num_zeros_in_grad=False,
+    initial_scale=2 ** 32,
+    min_scale=1,
+    growth_factor=2,
+    backoff_factor=0.5,
+    growth_interval=1000,
+    hysteresis=2
+)
+```
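Similarly, the Apex AMP fields shown earlier in this file mirror the keyword arguments of `apex.amp.initialize`. The snippet below is a plain-Apex sketch for comparison; the toy model and optimizer are placeholders, and it requires NVIDIA Apex and a CUDA device.

```python
# Plain-Apex reference for the AMP_TYPE.APEX settings above:
# the fp16 dict fields map onto apex.amp.initialize keyword arguments.
import torch
from apex import amp  # requires NVIDIA Apex to be installed

model = torch.nn.Linear(16, 16).cuda()                     # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)   # placeholder optimizer

model, optimizer = amp.initialize(
    model,
    optimizer,
    opt_level='O1',            # same meaning as opt_level in the config
    keep_batchnorm_fp32=None,
    loss_scale=None,
)
```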

docs/config.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 # Config file
 
-Here is an example config file of training ViT on cifar:
+Here is a config file example showing how to train a ViT model on the CIFAR10 dataset using ColossalAI:
 
 ```python
 # build train_dataset and train_dataloader from this dictionary
