colossalai/inference/pipeline/README.md
37 additions & 38 deletions
@@ -17,7 +17,7 @@
Pipeline Inference is composed of three parts: `PPInferEngine`, `MicroBatchManager` and the `generate` [schedule](https://github.com/hpcaitech/ColossalAI/blob/feature/pipeline-infer/colossalai/pipeline/schedule/generate.py).

1. `PPInferEngine` is the high-level API for users. It is responsible for the following tasks:
-   - Initialize the pipeline inference environment with `PipelineStageManager` and mdoel with `ShardFormer`.
+   - Initialize the pipeline inference environment with `PipelineStageManager` and the model with `ShardFormer`.
   - Run the pipeline inference model.

2. `MicroBatchManager` is a structure that manages micro-batch information. It is responsible for the following tasks:
@@ -31,54 +31,53 @@ Pipeline Inference is composed of three parts: `PPInferEngine`, `MicroBatchManager`

### Example
```python
-from colossalai.pipeline import PPInferEngine
-# Suppose the pipeline size is 2, and use fp16 to do infenrence. Use Llama as an example.
-model = LlamaForCausalLM.from_pretrained('/path/to/model')
-inputs = tokenizer("Hello, my dog is cute", "What a good day", return_tensors="pt")
-engine = PPInferEngine(
-    pp_size=2,
-    dtype='fp16',
-    micro_batch_size=1,
-    new_length=10,
-    model=model,
-    model_policy=LlamaForCausalLMPipelinePolicy())
-
-output = engine.inference([inputs])
+from colossalai.inference import PPInferEngine
+from colossalai.inference.pipeline.policies import LlamaModelInferPolicy
+import colossalai
+from transformers import LlamaForCausalLM, LlamaTokenizer
+
+colossalai.launch_from_torch(config={})
+
+model = LlamaForCausalLM.from_pretrained("/path/to/model")
+input = ["Introduce a landmark in London", "Introduce a landmark in Singapore"]
+data = tokenizer(input, return_tensors='pt')
+output = inferengine.inference(data.to('cuda'))
+print(tokenizer.batch_decode(output))
```
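The added lines above reference a `tokenizer` and an `inferengine` whose initialization falls in lines collapsed out of this diff. Below is a minimal self-contained sketch of the same flow; the tokenizer setup, the pad-token handling, and the `PPInferEngine` constructor arguments (carried over from the removed version of the example) are assumptions, not part of this change.

```python
# Illustrative sketch only: the tokenizer/engine setup is assumed, since the corresponding
# added lines are collapsed out of this diff. Run with torchrun, e.g.
#   torchrun --nproc_per_node=2 pp_infer.py
import colossalai
from colossalai.inference import PPInferEngine
from colossalai.inference.pipeline.policies import LlamaModelInferPolicy
from transformers import LlamaForCausalLM, LlamaTokenizer

colossalai.launch_from_torch(config={})

model = LlamaForCausalLM.from_pretrained("/path/to/model")
tokenizer = LlamaTokenizer.from_pretrained("/path/to/model")
tokenizer.pad_token = tokenizer.eos_token  # assumed: Llama ships without a pad token, needed to batch two prompts

# Constructor arguments mirror the removed version of the example; treat them as assumptions.
inferengine = PPInferEngine(
    pp_size=2,          # two pipeline stages
    dtype='fp16',
    micro_batch_size=1,
    new_length=10,      # number of new tokens to generate
    model=model,
    model_policy=LlamaModelInferPolicy(),
)

prompts = ["Introduce a landmark in London", "Introduce a landmark in Singapore"]
data = tokenizer(prompts, return_tensors='pt', padding=True)
output = inferengine.inference(data.to('cuda'))
print(tokenizer.batch_decode(output))
```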

## Performance

-We conducted multiple benchmark tests to evaluate the performance. We compared the inference `latency` and `throughputs` between `Pipeline Inference` and `hugging face` pipeline. The test environment is 2*A10, 20G.
+We conducted multiple benchmark tests to evaluate the performance. We compared the inference `latency` and `throughput` of `Pipeline Inference` against the Hugging Face pipeline. The test environments are 2 * A10 (20 GB) and 2 * A800 (80 GB).
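As a rough illustration of how such `latency` and `throughput` numbers can be collected, the sketch below times repeated calls to the engine. The warm-up count, iteration count, and the assumption that `inference` returns one token-id sequence per prompt are illustrative and not taken from the project's benchmark scripts.

```python
import time

# Hypothetical timing loop: assumes `inferengine` and `data` are set up as in the
# example above and that `inference` returns one token-id sequence per prompt.
warmup_iters, timed_iters = 2, 10

for _ in range(warmup_iters):              # warm up kernels and the pipeline before timing
    inferengine.inference(data.to('cuda'))

start = time.time()
generated = 0
for _ in range(timed_iters):
    output = inferengine.inference(data.to('cuda'))
    generated += sum(len(seq) for seq in output)
elapsed = time.time() - start

print(f"latency:    {elapsed / timed_iters * 1000:.1f} ms per batch")
print(f"throughput: {generated / elapsed:.1f} tokens per second")
```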