Merged
Changes from all commits (225 commits)
c6cd629
[Inference]ADD Bench Chatglm2 script (#4963)
CjhHa1 Oct 24, 2023
1db6727
[Pipeline inference] Combine kvcache with pipeline inference (#4938)
FoolPlayer Oct 27, 2023
4e4a10c
updated c++17 compiler flags (#4983)
kurisusnowdeng Oct 27, 2023
cf579ff
[Inference] Dynamic Batching Inference, online and offline (#4953)
CjhHa1 Oct 30, 2023
459a88c
[Kernels]Updated Triton kernels into 2.1.0 and adding flash-decoding …
tiandiao123 Oct 30, 2023
abe071b
fix ColossalEval (#4992)
chengeharrison Oct 31, 2023
4f0234f
[doc]Update doc for colossal-inference (#4989)
tiandiao123 Oct 31, 2023
be82b5d
[hotfix] Fix the bug where process groups were not being properly rel…
littsk Oct 31, 2023
c040d70
[hotfix] fix the bug of repeatedly storing param group (#4951)
Oct 31, 2023
335cb10
[doc] add supported feature diagram for hybrid parallel plugin (#4996)
ppt0011 Oct 31, 2023
b6696be
[Pipeline Inference] Merge pp with tp (#4993)
FoolPlayer Nov 1, 2023
8993c8a
[release] update version (#4995)
ver217 Nov 1, 2023
dc003c3
[moe] merge moe into main (#4978)
oahzxl Nov 2, 2023
d99b2c9
[hotfix] fix grad accumulation plus clipping for gemini (#5002)
Nov 2, 2023
1a3315e
[hotfix] Add layer norm gradients all-reduce for sequence parallel (#…
littsk Nov 3, 2023
c36e782
[format] applied code formatting on changed files in pull request 492…
github-actions[bot] Nov 6, 2023
ef4c14a
[Inference] Fix bug in ChatGLM2 Tensor Parallelism (#5014)
CjhHa1 Nov 7, 2023
67f5331
[misc] add code owners (#5024)
ver217 Nov 8, 2023
f71e63b
[moe] support optimizer checkpoint (#5015)
oahzxl Nov 8, 2023
239cd92
Support mtbench (#5025)
chengeharrison Nov 9, 2023
7244412
[moe]: fix ep/tp tests, add hierarchical all2all (#4982)
cwher Nov 9, 2023
a448938
[shardformer] Fix serialization error with Tensor Parallel state savi…
imgaojun Nov 9, 2023
576a2f7
[gemini] gemini support tensor parallelism. (#4942)
flybird11111 Nov 10, 2023
70885d7
[hotfix] Suport extra_kwargs in ShardConfig (#5031)
KKZ20 Nov 10, 2023
43ad0d9
fix wrong EOS token in ColossalChat
Orion-Zheng Nov 14, 2023
28052a7
[Kernels]Update triton kernels into 2.1.0 (#5046)
tiandiao123 Nov 16, 2023
b2ad0d9
[pipeline,shardformer] Fix p2p efficiency in pipeline, allow skipping…
zeyugao Nov 16, 2023
3e02154
[gemini] gemini support extra-dp (#5043)
flybird11111 Nov 16, 2023
97cd0cd
[shardformer] fix llama error when transformers upgraded. (#5055)
flybird11111 Nov 16, 2023
3c08f17
[hotfix]: modify create_ep_hierarchical_group and add test (#5032)
cwher Nov 17, 2023
bc09b95
[exampe] fix llama example' loss error when using gemini plugin (#5060)
flybird11111 Nov 18, 2023
fd6482a
[inference] Refactor inference architecture (#5057)
Xu-Kai Nov 19, 2023
bce9197
[Kernels]added flash-decoidng of triton (#5063)
tiandiao123 Nov 20, 2023
8d56c9c
[misc] remove outdated submodule (#5070)
ver217 Nov 20, 2023
e5ce4c8
[npu] add npu support for gemini and zero (#5067)
ver217 Nov 20, 2023
0c7d8be
[hotfix/hybridengine] fix bug when tp*pp size = 1 (#5069)
FoolPlayer Nov 20, 2023
fb103cf
[inference] update examples and engine (#5073)
Xu-Kai Nov 20, 2023
8921a73
[format] applied code formatting on changed files in pull request 506…
github-actions[bot] Nov 20, 2023
4e3959d
[hotfix/hybridengine] Fix init model with random parameters in benchm…
FoolPlayer Nov 20, 2023
1cd7efc
[inference] refactor examples and fix schedule (#5077)
ver217 Nov 21, 2023
dce05da
fix thrust-transform-reduce error (#5078)
imgaojun Nov 21, 2023
fd3567e
[nfc] fix typo in docs/ (#4972)
digger-yu Nov 21, 2023
0d48230
[nfc] fix typo and author name (#5089)
digger-yu Nov 22, 2023
4ccb9de
[gemini]fix gemini optimzer, saving Shardformer in Gemini got list as…
flybird11111 Nov 22, 2023
75af66c
[Hotfix] Fix model policy matching strategy in ShardFormer (#5064)
KKZ20 Nov 22, 2023
aae4966
[shardformer]fix flash attention, when mask is casual, just don't unp…
flybird11111 Nov 22, 2023
3acbf6d
[npu] add npu support for hybrid plugin and llama (#5090)
oahzxl Nov 22, 2023
e53e729
[Feature] Add document retrieval QA (#5020)
YeAnbang Nov 23, 2023
68fcaa2
remove duplicate import (#5100)
oahzxl Nov 23, 2023
2bdf76f
fix typo change lazy_iniy to lazy_init (#5099)
digger-yu Nov 24, 2023
d5661f0
[nfc] fix typo change directoty to directory (#5111)
digger-yu Nov 27, 2023
7b789f4
[FEATURE] Add Safety Eval Datasets to ColossalEval (#5095)
Orion-Zheng Nov 27, 2023
126cf18
[hotfix] fixed memory usage of shardformer module replacement (#5122)
kurisusnowdeng Nov 28, 2023
7172459
[shardformer]: support gpt-j, falcon, Mistral and add interleaved pip…
cwher Nov 28, 2023
177c79f
[doc] add moe news (#5128)
binmakeswell Nov 28, 2023
2899cfd
[doc] updated paper citation (#5131)
FrankLeeeee Nov 29, 2023
9110406
fix typo change JOSNL TO JSONL etc. (#5116)
digger-yu Nov 29, 2023
d10ee42
[format] applied code formatting on changed files in pull request 508…
github-actions[bot] Nov 29, 2023
9b36640
[format] applied code formatting on changed files in pull request 512…
github-actions[bot] Nov 29, 2023
f6731db
[format] applied code formatting on changed files in pull request 511…
github-actions[bot] Nov 29, 2023
f4e72c9
[accelerator] init the accelerator module (#5129)
FrankLeeeee Nov 30, 2023
d6df19b
[npu] support triangle attention for llama (#5130)
oahzxl Nov 30, 2023
2a2ec49
[plugin]fix 3d checkpoint load when booster boost without optimizer. …
flybird11111 Nov 30, 2023
c7fd9a5
[ColossalQA] refactor server and webui & add new feature (#5138)
MichelleMa8 Nov 30, 2023
368b5e3
[doc] fix colossalqa document (#5146)
MichelleMa8 Dec 1, 2023
3dbbf83
fix (#5158)
flybird11111 Dec 5, 2023
b397104
[Colossal-Llama-2] Add finetuning Colossal-Llama-2 example (#4878)
chengeharrison Dec 7, 2023
21aa5de
[gemini] hotfix NaN loss while using Gemini + tensor_parallel (#5150)
flybird11111 Dec 8, 2023
b07a6f4
[colossalqa] fix pangu api (#5170)
MichelleMa8 Dec 11, 2023
cefdc32
[ColossalEval] Support GSM, Data Leakage Evaluation and Tensor Parall…
chengeharrison Dec 12, 2023
79718fa
[shardformer] llama support DistCrossEntropy (#5176)
flybird11111 Dec 12, 2023
3ff60d1
Fix ColossalEval (#5186)
chengeharrison Dec 15, 2023
681d9b1
[doc] update pytorch version in documents. (#5177)
flybird11111 Dec 15, 2023
af95267
polish readme in application/chat (#5194)
ht-zhou Dec 20, 2023
4fa689f
[pipeline]: fix p2p comm, add metadata cache and support llama interl…
cwher Dec 22, 2023
eae01b6
Improve logic for selecting metrics (#5196)
chengeharrison Dec 22, 2023
64519eb
[doc] Update required third-party library list for testing and torch …
KKZ20 Dec 27, 2023
02d2328
support linear accumulation fusion (#5199)
flybird11111 Dec 29, 2023
3c0d82b
[pipeline]: support arbitrary batch size in forward_only mode (#5201)
cwher Jan 2, 2024
d799a30
[pipeline]: add p2p fallback order and fix interleaved pp deadlock (#…
cwher Jan 3, 2024
7f3400b
[devops] update torch versoin in ci (#5217)
ver217 Jan 3, 2024
365671b
fix-test (#5210)
flybird11111 Jan 3, 2024
451e914
fix flash attn (#5209)
flybird11111 Jan 3, 2024
b0b53a1
[nfc] fix typo colossalai/shardformer/ (#5133)
digger-yu Jan 4, 2024
d992b55
[Colossal-LLaMA-2] Release Colossal-LLaMA-2-13b-base model (#5224)
TongLi3701 Jan 5, 2024
915b465
[doc] Update README.md of Colossal-LLAMA2 (#5233)
Camille7777 Jan 6, 2024
ce65127
[doc] Make leaderboard format more uniform and good-looking (#5231)
zhimin-z Jan 6, 2024
b9b32b1
[doc] add Colossal-LLaMA-2-13B (#5234)
binmakeswell Jan 7, 2024
4fb4a22
[format] applied code formatting on changed files in pull request 523…
github-actions[bot] Jan 7, 2024
7bc6969
[doc] SwiftInfer release (#5236)
binmakeswell Jan 8, 2024
dd2c28a
[npu] use extension for op builder (#5172)
oahzxl Jan 8, 2024
d565df3
[pipeline] A more general _communicate in p2p (#5062)
zeyugao Jan 8, 2024
d202cc2
[npu] change device to accelerator api (#5239)
ver217 Jan 9, 2024
9102d65
[hotfix] removed unused flag (#5242)
FrankLeeeee Jan 9, 2024
41e52c1
[doc] fix typo in Colossal-LLaMA-2/README.md (#5247)
digger-yu Jan 10, 2024
edf94a3
[workflow] fixed build CI (#5240)
FrankLeeeee Jan 10, 2024
d5eeeb1
[ci] fixed booster test (#5251)
FrankLeeeee Jan 11, 2024
2b83418
[ci] fixed ddp test (#5254)
FrankLeeeee Jan 11, 2024
756c400
fix typo in applications/ColossalEval/README.md (#5250)
digger-yu Jan 11, 2024
e830ef9
[ci] fix shardformer tests. (#5255)
flybird11111 Jan 11, 2024
c174c4f
[doc] fix doc typo (#5256)
binmakeswell Jan 11, 2024
ef4f0ee
[hotfix]: add pp sanity check and fix mbs arg (#5268)
cwher Jan 15, 2024
04244aa
[workflow] fixed incomplete bash command (#5272)
FrankLeeeee Jan 16, 2024
d69cd2e
[workflow] fixed oom tests (#5275)
FrankLeeeee Jan 16, 2024
2a0558d
[ci] fix test_hybrid_parallel_plugin_checkpoint_io.py (#5276)
flybird11111 Jan 17, 2024
46e0916
[shardformer] hybridparallelplugin support gradients accumulation. (#…
flybird11111 Jan 17, 2024
5d9a0ae
[hotfix] Fix ShardFormer test execution path when using sequence para…
KKZ20 Jan 17, 2024
1484693
Merge branch 'main' into sync/npu
ver217 Jan 18, 2024
d66e698
Merge pull request #5278 from ver217/sync/npu
FrankLeeeee Jan 18, 2024
32cb744
fix auto loading gpt2 tokenizer (#5279)
MichelleMa8 Jan 18, 2024
6a56967
[doc] add llama2-13B disyplay (#5285)
Desperado-Jia Jan 19, 2024
f7e3f82
fix llama pretrain (#5287)
flybird11111 Jan 19, 2024
d7f8db8
[hotfix] fix 3d plugin test (#5292)
ver217 Jan 22, 2024
ddf879e
fix bug for mefture (#5299)
Desperado-Jia Jan 22, 2024
ec912b1
[NFC] polish applications/Colossal-LLaMA-2/colossal_llama2/tokenizer/…
liwenjuna Jan 25, 2024
bce9499
fix some typo (#5307)
digger-yu Jan 25, 2024
7cfed5f
[feat] refactored extension module (#5298)
FrankLeeeee Jan 25, 2024
73f4dc5
[workflow] updated CI image (#5318)
FrankLeeeee Jan 29, 2024
8823cc4
Merge pull request #5310 from hpcaitech/feature/npu
FrankLeeeee Jan 29, 2024
087d0cb
[accelerator] fixed npu api
FrankLeeeee Jan 29, 2024
a6709af
Merge pull request #5321 from FrankLeeeee/hotfix/accelerator-api
FrankLeeeee Jan 29, 2024
388179f
[tests] fix t5 test. (#5322)
flybird11111 Jan 29, 2024
febed23
[doc] added docs for extensions (#5324)
FrankLeeeee Jan 29, 2024
6a3086a
fix typo under extensions/ (#5330)
digger-yu Jan 30, 2024
71321a0
fix typo change dosen't to doesn't (#5308)
digger-yu Jan 30, 2024
abd8e77
[extension] fixed exception catch (#5342)
FrankLeeeee Jan 31, 2024
c523984
[Chat] fix sft loss nan (#5345)
YeAnbang Feb 1, 2024
ffffc32
[checkpointio] fix gemini and hybrid parallel optim checkpoint (#5347)
ver217 Feb 1, 2024
1c790c0
[fix] remove unnecessary dp_size assert (#5351)
cwher Feb 2, 2024
2dd01e3
[gemini] fix param op hook when output is tuple (#5355)
ver217 Feb 4, 2024
6c0fa7b
[llama] fix dataloader for hybrid parallel (#5358)
ver217 Feb 5, 2024
73f9f23
[llama] update training script (#5360)
ver217 Feb 5, 2024
a4cec17
[llama] add flash attn patch for npu (#5362)
ver217 Feb 5, 2024
44ca61a
[llama] fix neftune & pbar with start_step (#5364)
Camille7777 Feb 5, 2024
a5756a8
[eval] update llama npu eval (#5366)
Camille7777 Feb 6, 2024
eb4f2d9
[llama] polish training script and fix optim ckpt (#5368)
ver217 Feb 6, 2024
c53ddda
[lr-scheduler] fix load state dict and add test (#5369)
ver217 Feb 6, 2024
084c912
[llama] fix memory issue (#5371)
ver217 Feb 6, 2024
7d8e033
[moe] init mixtral impl
oahzxl Dec 14, 2023
c904d2a
[moe] update capacity computing (#5253)
ver217 Jan 11, 2024
da39d21
[moe] support mixtral (#5309)
ver217 Jan 25, 2024
b60be18
[moe] fix mixtral checkpoint io (#5314)
ver217 Jan 27, 2024
956b561
[moe] fix mixtral forward default value (#5329)
ver217 Jan 30, 2024
65e5d6b
[moe] fix mixtral optim checkpoint (#5344)
ver217 Feb 1, 2024
06db94f
[moe] fix tests
ver217 Feb 8, 2024
4c03347
Merge pull request #5377 from hpcaitech/example/llama-npu
FrankLeeeee Feb 8, 2024
efef43b
Merge pull request #5372 from hpcaitech/exp/mixtral
FrankLeeeee Feb 8, 2024
adae123
[release] update version (#5380)
ver217 Feb 8, 2024
7303801
[llama] fix training and inference scripts (#5384)
ver217 Feb 19, 2024
69e3ad0
[doc] Fix typo (#5361)
yixiaoer Feb 19, 2024
705a62a
[doc] updated installation command (#5389)
FrankLeeeee Feb 19, 2024
b833153
[hotfix] fix variable type for top_p (#5313)
CZYCW Feb 19, 2024
5d380a1
[hotfix] Fix wrong import in meta_registry (#5392)
stephankoe Feb 20, 2024
95c21e3
[extension] hotfix jit extension setup (#5402)
ver217 Feb 26, 2024
d882d18
[example] reuse flash attn patch (#5400)
ver217 Feb 27, 2024
bf34c6f
[fsdp] impl save/load shard model/optimizer (#5357)
airlsyn Feb 27, 2024
dcdd8a5
[setup] fixed nightly release (#5388)
FrankLeeeee Feb 27, 2024
0a25e16
[shardformer]gather llama logits (#5398)
flybird11111 Feb 27, 2024
a28c971
update requirements (#5407)
TongLi3701 Feb 28, 2024
2461f37
[workflow] added pypi channel (#5412)
FrankLeeeee Feb 29, 2024
5de940d
[doc] fix blog link
binmakeswell Feb 29, 2024
a1c6cdb
[doc] fix blog link
binmakeswell Feb 29, 2024
4b8312c
fix sft single turn inference example (#5416)
Camille7777 Mar 1, 2024
29695cf
[example]add gpt2 benchmark example script. (#5295)
flybird11111 Mar 4, 2024
822241a
[doc] sora release (#5425)
binmakeswell Mar 5, 2024
070df68
[devops] fix extention building (#5427)
ver217 Mar 5, 2024
e304e4d
[hotfix] fix sd vit import error (#5420)
danyow-cheung Mar 5, 2024
e239cf9
[hotfix] fix typo of openmoe model source (#5403)
Luo-Yihang Mar 5, 2024
70cce5c
[doc] update some translations with README-zh-Hans.md (#5382)
digger-yu Mar 5, 2024
16c96d4
[hotfix] fix typo change _descrption to _description (#5331)
digger-yu Mar 5, 2024
049121d
[hotfix] fix typo change enabel to enable under colossalai/shardforme…
digger-yu Mar 5, 2024
a7ae2b5
[eval-hotfix] set few_shot_data to None when few shot is disabled (#5…
starcatmeow Mar 5, 2024
5e1c93d
[hotfix] fix typo change MoECheckpintIO to MoECheckpointIO (#5335)
digger-yu Mar 5, 2024
c8003d4
[doc] Fix typo s/infered/inferred/ (#5288)
hugo-syn Mar 5, 2024
68f55a7
[hotfix] fix stable diffusion inference bug. (#5289)
Youngon Mar 5, 2024
743e7fa
[colossal-llama2] add stream chat examlple for chat version model (#5…
Camille7777 Mar 7, 2024
8020f42
[release] update version (#5411)
ver217 Mar 7, 2024
da885ed
fix tensor data update for gemini loss caluculation (#5442)
Camille7777 Mar 11, 2024
385e85a
[hotfix] fix typo s/keywrods/keywords etc. (#5429)
digger-yu Mar 12, 2024
f2e8b9e
[devops] fix compatibility (#5444)
ver217 Mar 13, 2024
5e16bf7
[shardformer] fix gathering output when using tensor parallelism (#5431)
flybird11111 Mar 18, 2024
bd998ce
[doc] release Open-Sora 1.0 with model weights (#5468)
binmakeswell Mar 18, 2024
d158fc0
[doc] update open-sora demo (#5479)
binmakeswell Mar 20, 2024
848a574
[example] add grok-1 inference (#5485)
ver217 Mar 21, 2024
6df844b
[release] grok-1 314b inference (#5490)
binmakeswell Mar 22, 2024
5fcd779
[example] update Grok-1 inference (#5495)
yuanheng-zhao Mar 24, 2024
bb0a668
[hotfix] set return_outputs=False in examples and polish code (#5404)
cwher Mar 25, 2024
34e9092
[release] grok-1 inference benchmark (#5500)
binmakeswell Mar 25, 2024
0688d92
[shardformer]Fix lm parallel. (#5480)
flybird11111 Mar 25, 2024
131f32a
[fix] fix grok-1 example typo (#5506)
yuanheng-zhao Mar 26, 2024
a7790a9
[devops] fix example test ci (#5504)
ver217 Mar 26, 2024
cbe34c5
Fix ColoTensorSpec for py11 (#5440)
dementrock Mar 26, 2024
61da3fb
fixed layout converter caching and updated tester
Edenzzzz Mar 26, 2024
18edcd5
Empty-Commit
Edenzzzz Mar 26, 2024
9a3321e
Merge pull request #5515 from Edenzzzz/fix_layout_convert
Edenzzzz Mar 26, 2024
19e1a5c
[shardformer] update colo attention to support custom mask (#5510)
ver217 Mar 27, 2024
e6707a6
[format] applied code formatting on changed files in pull request 551…
github-actions[bot] Mar 27, 2024
00525f7
[shardformer] fix pipeline forward error if custom layer distribution…
insujang Mar 27, 2024
36c4bb2
[Fix] Grok-1 use tokenizer from the same pretrained path (#5532)
yuanheng-zhao Mar 28, 2024
df5e9c5
[ColossalChat] Update RLHF V2 (#5286)
YeAnbang Mar 29, 2024
e614aa3
[shardformer, pipeline] add `gradient_checkpointing_ratio` and hetero…
cwher Apr 1, 2024
7e0ec5a
fix incorrect sharding without zero (#5545)
Edenzzzz Apr 2, 2024
8e412a5
[shardformer] Sequence Parallelism Optimization (#5533)
KKZ20 Apr 3, 2024
15055f9
[hotfix] quick fixes to make legacy tutorials runnable (#5559)
Edenzzzz Apr 7, 2024
a799ca3
[fix] fix typo s/muiti-node /multi-node etc. (#5448)
digger-yu Apr 7, 2024
341263d
[hotfix] fix typo s/get_defualt_parser /get_default_parser (#5548)
digger-yu Apr 7, 2024
641b1ee
[devops] remove post commit ci (#5566)
ver217 Apr 8, 2024
89049b0
[doc] fix ColossalMoE readme (#5599)
Camille7777 Apr 15, 2024
3788fef
[zero] support multiple (partial) backward passes (#5596)
ver217 Apr 16, 2024
a0ad587
[shardformer] refactor embedding resize (#5603)
flybird11111 Apr 18, 2024
d83c633
[hotfix] Fix examples no pad token & auto parallel codegen bug; (#5606)
Edenzzzz Apr 18, 2024
e094933
[shardformer] fix pipeline grad ckpt (#5620)
ver217 Apr 22, 2024
8019d1d
[lora] add lora APIs for booster, support lora for TorchDDP (#4981)
Oct 31, 2023
502e2ca
[LowLevelZero] low level zero support lora (#5153)
flybird11111 Dec 21, 2023
f08fecd
[feature] qlora support
linsj20 Apr 11, 2024
208047c
qlora follow commit
linsj20 Apr 11, 2024
e7a10a4
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 11, 2024
767e558
migrate qutization folder to colossalai/
linsj20 Apr 15, 2024
4be48f4
minor fixes
linsj20 Apr 22, 2024
4a322e4
Merge branch 'feature/lora' of github.com:hpcaitech/ColossalAI into r…
linsj20 Apr 22, 2024
5aac8e4
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 22, 2024
e24f986
gptj sp fix
linsj20 Apr 23, 2024
eee384f
remove redundancies from pre-commit
linsj20 Apr 23, 2024
eeeceb9
minor fixes
linsj20 Apr 23, 2024
ffd55dc
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 23, 2024
4 changes: 1 addition & 3 deletions .compatibility
@@ -1,3 +1 @@
-1.12.0-11.3.0
-1.13.0-11.6.0
-2.0.0-11.7.0
+2.1.0-12.1.0
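
Each line of .compatibility names one torch-CUDA pair that the compatibility workflows below build against, so the test matrix is now trimmed to torch 2.1.0 with CUDA 12.1. A minimal sketch of consuming the file (the field meaning is inferred from the workflows, not stated in the file itself):

    # Sketch: split each "<torch>-<cuda>" line of .compatibility (format inferred from this PR).
    while IFS=- read -r torch_ver cuda_ver; do
        echo "test against torch ${torch_ver} + CUDA ${cuda_ver}"
    done < .compatibility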
12 changes: 6 additions & 6 deletions .cuda_ext.json
@@ -1,16 +1,16 @@
 {
     "build": [
         {
-            "torch_command": "pip install torch==1.12.1+cu102 torchvision==0.13.1+cu102 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu102",
-            "cuda_image": "hpcaitech/cuda-conda:10.2"
+            "torch_command": "pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121",
+            "cuda_image": "hpcaitech/cuda-conda:12.1"
         },
         {
-            "torch_command": "pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113",
-            "cuda_image": "hpcaitech/cuda-conda:11.3"
+            "torch_command": "pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118",
+            "cuda_image": "hpcaitech/cuda-conda:11.8"
         },
         {
-            "torch_command": "pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu116",
-            "cuda_image": "hpcaitech/cuda-conda:11.6"
+            "torch_command": "pip install torch==2.0.0 torchvision==0.15.1 torchaudio==2.0.1",
+            "cuda_image": "hpcaitech/cuda-conda:11.7"
         }
     ]
 }
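
Each matrix entry pairs a torch install command with the hpcaitech/cuda-conda image that runs it; the matrix moves from torch 1.12 on CUDA 10.2/11.3/11.6 to torch 2.0/2.1 on CUDA 11.7/11.8/12.1. A one-liner to list the matrix (jq is an illustrative choice, not something this PR uses):

    # Sketch: enumerate the CUDA build matrix (tooling choice is illustrative only).
    jq -r '.build[] | "\(.cuda_image)  ->  \(.torch_command)"' .cuda_ext.json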
1 change: 1 addition & 0 deletions .github/CODEOWNERS
@@ -0,0 +1 @@
+* @hpcaitech/colossalai-qa
1 change: 1 addition & 0 deletions .github/pull_request_template.md
@@ -3,6 +3,7 @@
 - [ ] I have created an issue for this PR for traceability
 - [ ] The title follows the standard format: `[doc/gemini/tensor/...]: A concise description`
 - [ ] I have added relevant tags if possible for us to better distinguish different PRs
+- [ ] I have installed pre-commit: `pip install pre-commit && pre-commit install`


 ## 🚨 Issue number
142 changes: 17 additions & 125 deletions .github/workflows/build_on_pr.yml
@@ -22,57 +22,6 @@ on:
   delete:

 jobs:
-  prepare_cache:
-    name: Prepare testmon cache
-    if: |
-      github.event_name == 'create' &&
-      github.event.ref_type == 'branch' &&
-      github.event.repository.full_name == 'hpcaitech/ColossalAI'
-    runs-on: [self-hosted, gpu]
-    container:
-      image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
-      options: --rm
-    timeout-minutes: 5
-    defaults:
-      run:
-        shell: bash
-    steps:
-      - name: Copy testmon cache
-        run: | # branch name may contain slash, we need to replace it with space
-          export REF_BRANCH=$(echo ${{ github.event.ref }} | sed "s/\// /")
-          if [ -d /github/home/testmon_cache/${MAIN_BRANCH} ]; then
-            cp -p -r /github/home/testmon_cache/${MAIN_BRANCH} "/github/home/testmon_cache/${REF_BRANCH}"
-          fi
-        env:
-          MAIN_BRANCH: ${{ github.event.master_branch }}
-
-  prepare_cache_for_pr:
-    name: Prepare testmon cache for PR
-    if: |
-      github.event_name == 'pull_request' &&
-      (github.event.action == 'opened' || github.event.action == 'reopened' || (github.event.action == 'edited' && github.event.changes.base != null)) &&
-      github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI'
-    runs-on: [self-hosted, gpu]
-    container:
-      image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
-      options: --rm
-    timeout-minutes: 5
-    defaults:
-      run:
-        shell: bash
-    concurrency:
-      group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}-repare-cache
-      cancel-in-progress: true
-    steps:
-      - name: Copy testmon cache
-        run: | # branch name may contain slash, we need to replace it with space
-          export BASE=$(echo ${{ github.event.pull_request.base.ref }} | sed "s/\// /")
-          if [ -d "/github/home/testmon_cache/${BASE}" ] && [ ! -z "$(ls -A "/github/home/testmon_cache/${BASE}")" ]; then
-            mkdir -p /github/home/testmon_cache/_pull/${PR_NUMBER} && cp -p -r "/github/home/testmon_cache/${BASE}"/.testmondata* /github/home/testmon_cache/_pull/${PR_NUMBER}
-          fi
-        env:
-          PR_NUMBER: ${{ github.event.number }}
-
   detect:
     name: Detect file change
     if: |
@@ -140,8 +89,8 @@ jobs:
     if: needs.detect.outputs.anyLibraryFileChanged == 'true'
     runs-on: [self-hosted, gpu]
     container:
-      image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
-      options: --gpus all --rm -v /data/scratch/cifar-10:/data/scratch/cifar-10 -v /data/scratch/llama-tiny:/data/scratch/llama-tiny
+      image: hpcaitech/pytorch-cuda:2.1.0-12.1.0
+      options: --gpus all --rm -v /dev/shm -v /data/scratch/llama-tiny:/data/scratch/llama-tiny
     timeout-minutes: 60
     defaults:
       run:
@@ -168,12 +117,13 @@ jobs:
           cd TensorNVMe
           conda install cmake
           pip install -r requirements.txt
-          pip install -v .
+          DISABLE_URING=1 pip install -v .

       - name: Store TensorNVMe Cache
         run: |
           cd TensorNVMe
-          cp -p -r ./build /github/home/tensornvme_cache/
+          cp -p -r ./cmake-build /github/home/tensornvme_cache/

       - name: Checkout Colossal-AI
         uses: actions/checkout@v2
@@ -190,39 +140,32 @@ jobs:

       - name: Install Colossal-AI
         run: |
-          CUDA_EXT=1 pip install -v -e .
+          BUILD_EXT=1 pip install -v -e .
           pip install -r requirements/requirements-test.txt

       - name: Store Colossal-AI Cache
         run: |
           # -p flag is required to preserve the file timestamp to avoid ninja rebuild
           cp -p -r /__w/ColossalAI/ColossalAI/build /github/home/cuda_ext_cache/

-      - name: Restore Testmon Cache
-        run: |
-          if [ -d /github/home/testmon_cache/_pull/${PR_NUMBER} ] && [ ! -z "$(ls -A /github/home/testmon_cache/_pull/${PR_NUMBER})" ]; then
-            cp -p -r /github/home/testmon_cache/_pull/${PR_NUMBER}/.testmondata* /__w/ColossalAI/ColossalAI/
-          fi
-        env:
-          PR_NUMBER: ${{ github.event.number }}
-
       - name: Execute Unit Testing
         run: |
-          CURL_CA_BUNDLE="" PYTHONPATH=$PWD pytest -m "not largedist" --testmon --testmon-forceselect --testmon-cov=. --durations=10 tests/
+          CURL_CA_BUNDLE="" PYTHONPATH=$PWD FAST_TEST=1 pytest \
+            -m "not largedist" \
+            --durations=0 \
+            --ignore tests/test_analyzer \
+            --ignore tests/test_auto_parallel \
+            --ignore tests/test_fx \
+            --ignore tests/test_autochunk \
+            --ignore tests/test_gptq \
+            --ignore tests/test_infer_ops \
+            --ignore tests/test_legacy \
+            --ignore tests/test_smoothquant \
+            tests/
         env:
-          DATA: /data/scratch/cifar-10
-          NCCL_SHM_DISABLE: 1
           LD_LIBRARY_PATH: /github/home/.tensornvme/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
-          TESTMON_CORE_PKGS: /__w/ColossalAI/ColossalAI/requirements/requirements.txt,/__w/ColossalAI/ColossalAI/requirements/requirements-test.txt
           LLAMA_PATH: /data/scratch/llama-tiny

-      - name: Store Testmon Cache
-        run: |
-          mkdir -p /github/home/testmon_cache/_pull/${PR_NUMBER}
-          cp -p -r /__w/ColossalAI/ColossalAI/.testmondata* /github/home/testmon_cache/_pull/${PR_NUMBER}/
-        env:
-          PR_NUMBER: ${{ github.event.number }}
-
       - name: Collate artifact
         env:
           PR_NUMBER: ${{ github.event.number }}
@@ -259,54 +202,3 @@ jobs:
         with:
           name: report
           path: report/
-
-  store_cache:
-    name: Store testmon cache for PR
-    if: |
-      github.event_name == 'pull_request' &&
-      github.event.action == 'closed' &&
-      github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI'
-    runs-on: [self-hosted, gpu]
-    container:
-      image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
-      options: --rm
-    timeout-minutes: 5
-    defaults:
-      run:
-        shell: bash
-    steps:
-      - name: Store testmon cache if possible
-        if: github.event.pull_request.merged == true
-        run: | # branch name may contain slash, we need to replace it with space
-          export BASE=$(echo ${{ github.event.pull_request.base.ref }} | sed "s/\// /")
-          if [ -d /github/home/testmon_cache/_pull/${PR_NUMBER} ] && [ ! -z "$(ls -A /github/home/testmon_cache/_pull/${PR_NUMBER})" ]; then
-            cp -p -r /github/home/testmon_cache/_pull/${PR_NUMBER}/.testmondata* "/github/home/testmon_cache/${BASE}/"
-          fi
-        env:
-          PR_NUMBER: ${{ github.event.pull_request.number }}
-
-      - name: Remove testmon cache
-        run: |
-          rm -rf /github/home/testmon_cache/_pull/${PR_NUMBER}
-        env:
-          PR_NUMBER: ${{ github.event.pull_request.number }}
-
-  remove_cache:
-    name: Remove testmon cache
-    if: |
-      github.event_name == 'delete' &&
-      github.event.ref_type == 'branch' &&
-      github.event.repository.full_name == 'hpcaitech/ColossalAI'
-    runs-on: [self-hosted, gpu]
-    container:
-      image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
-      options: --rm
-    timeout-minutes: 5
-    defaults:
-      run:
-        shell: bash
-    steps:
-      - name: Remove testmon cache
-        run: | # branch name may contain slash, we need to replace it with space
-          export BASE=$(echo ${{ github.event.ref }} | sed "s/\// /")
-          rm -rf "/github/home/testmon_cache/${BASE}"
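
Outside CI, the updated build sequence amounts to the following (a sketch: the TensorNVMe clone URL is assumed, and the BUILD_EXT/DISABLE_URING semantics are read off this diff, where BUILD_EXT=1 replaces the old CUDA_EXT=1 flag):

    # Sketch: local equivalent of the CI build steps above (clone URL assumed; needs cmake).
    git clone https://github.com/hpcaitech/TensorNVMe.git
    cd TensorNVMe
    pip install -r requirements.txt
    DISABLE_URING=1 pip install -v .     # build without the io_uring backend, as CI now does
    cd ..
    BUILD_EXT=1 pip install -v -e .      # compile ColossalAI's C++/CUDA kernels at install time
    pip install -r requirements/requirements-test.txt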
26 changes: 15 additions & 11 deletions .github/workflows/build_on_schedule.yml
@@ -10,20 +10,22 @@ jobs:
   build:
     name: Build and Test Colossal-AI
     if: github.repository == 'hpcaitech/ColossalAI'
-    runs-on: [self-hosted, 8-gpu]
+    runs-on: [self-hosted, gpu]
     container:
-      image: hpcaitech/pytorch-cuda:1.12.0-11.3.0
-      options: --gpus all --rm -v /data/scratch/cifar-10:/data/scratch/cifar-10 -v /data/scratch/llama-tiny:/data/scratch/llama-tiny
-    timeout-minutes: 40
+      image: hpcaitech/pytorch-cuda:2.1.0-12.1.0
+      options: --gpus all --rm -v /dev/shm -v /data/scratch/llama-tiny:/data/scratch/llama-tiny
+    timeout-minutes: 90
     steps:
       - name: Check GPU Availability # ensure all GPUs have enough memory
         id: check-avai
         run: |
           avai=true
-          for i in $(seq 0 7);
+          ngpu=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
+          endIndex=$(($ngpu-1))
+          for i in $(seq 0 $endIndex);
           do
             gpu_used=$(nvidia-smi -i $i --query-gpu=memory.used --format=csv,noheader,nounits)
-            [ "$gpu_used" -gt "10000" ] && avai=false
+            [ "$gpu_used" -gt "2000" ] && avai=false
           done

           echo "GPU is available: $avai"
@@ -42,7 +44,7 @@ jobs:
           cd TensorNVMe
           conda install cmake
           pip install -r requirements.txt
-          pip install -v .
+          DISABLE_URING=1 pip install -v .

       - uses: actions/checkout@v2
         if: steps.check-avai.outputs.avai == 'true'
@@ -53,16 +55,18 @@ jobs:
         if: steps.check-avai.outputs.avai == 'true'
         run: |
           [ ! -z "$(ls -A /github/home/cuda_ext_cache/)" ] && cp -r /github/home/cuda_ext_cache/* /__w/ColossalAI/ColossalAI/
-          CUDA_EXT=1 pip install -v -e .
+          BUILD_EXT=1 pip install -v -e .
           cp -r /__w/ColossalAI/ColossalAI/build /github/home/cuda_ext_cache/
           pip install -r requirements/requirements-test.txt

       - name: Unit Testing
         if: steps.check-avai.outputs.avai == 'true'
         run: |
-          PYTHONPATH=$PWD pytest --durations=0 tests
+          PYTHONPATH=$PWD pytest \
+            -m "not largedist" \
+            --durations=0 \
+            tests/
         env:
-          DATA: /data/scratch/cifar-10
           LD_LIBRARY_PATH: /github/home/.tensornvme/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
           LLAMA_PATH: /data/scratch/llama-tiny

@@ -71,7 +75,7 @@ jobs:
         if: ${{ failure() }}
         run: |
           url=$SERVER_URL/$REPO/actions/runs/$RUN_ID
-          msg="Scheduled Build and Test failed on 8 GPUs, please visit $url for details"
+          msg="Scheduled Build and Test failed, please visit $url for details"
           echo $msg
           python .github/workflows/scripts/send_message_to_lark.py -m "$msg" -u $WEBHOOK_URL
         env:
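
The availability probe now counts the GPUs the node actually reports instead of hard-coding eight, and flags the node as busy above 2000 MiB of used memory rather than 10000 MiB. Extracted as a standalone script (a sketch; assumes nvidia-smi is on PATH):

    #!/usr/bin/env bash
    # Sketch: the workflow's dynamic GPU-availability probe as a standalone script.
    avai=true
    ngpu=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
    for i in $(seq 0 $((ngpu - 1))); do
      gpu_used=$(nvidia-smi -i "$i" --query-gpu=memory.used --format=csv,noheader,nounits)
      [ "$gpu_used" -gt 2000 ] && avai=false   # >2000 MiB already in use: treat node as busy
    done
    echo "GPU is available: $avai"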
9 changes: 4 additions & 5 deletions .github/workflows/compatiblity_test_on_dispatch.yml
@@ -50,7 +50,7 @@ jobs:
       matrix: ${{fromJson(needs.matrix_preparation.outputs.matrix)}}
     container:
       image: ${{ matrix.container }}
-      options: --gpus all --rm -v /data/scratch/cifar-10:/data/scratch/cifar-10 -v /data/scratch/llama-tiny:/data/scratch/llama-tiny
+      options: --gpus all --rm -v /dev/shm -v /data/scratch/cifar-10:/data/scratch/cifar-10 -v /data/scratch/llama-tiny:/data/scratch/llama-tiny
     timeout-minutes: 120
     steps:
       - name: Install dependencies
@@ -66,7 +66,7 @@ jobs:
           cd TensorNVMe
           apt update && apt install -y cmake
           pip install -r requirements.txt
-          pip install -v .
+          DISABLE_URING=1 pip install -v .
       - uses: actions/checkout@v2
         with:
           ssh-key: ${{ secrets.SSH_KEY_FOR_CI }}
@@ -83,13 +83,12 @@ jobs:
           fi
       - name: Install Colossal-AI
         run: |
-          CUDA_EXT=1 pip install -v .
+          BUILD_EXT=1 pip install -v .
           pip install -r requirements/requirements-test.txt
       - name: Unit Testing
         run: |
-          PYTHONPATH=$PWD pytest tests
+          PYTHONPATH=$PWD pytest --durations=0 tests
         env:
           DATA: /data/scratch/cifar-10
-          NCCL_SHM_DISABLE: 1
           LD_LIBRARY_PATH: /github/home/.tensornvme/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
           LLAMA_PATH: /data/scratch/llama-tiny
9 changes: 4 additions & 5 deletions .github/workflows/compatiblity_test_on_pr.yml
@@ -41,7 +41,7 @@ jobs:
       matrix: ${{fromJson(needs.matrix_preparation.outputs.matrix)}}
     container:
       image: ${{ matrix.container }}
-      options: --gpus all --rm -v /data/scratch/cifar-10:/data/scratch/cifar-10 -v /data/scratch/llama-tiny:/data/scratch/llama-tiny
+      options: --gpus all --rm -v /dev/shm -v /data/scratch/cifar-10:/data/scratch/cifar-10 -v /data/scratch/llama-tiny:/data/scratch/llama-tiny
     timeout-minutes: 120
     concurrency:
       group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}-run-test-${{ matrix.container }}
@@ -60,7 +60,7 @@ jobs:
           cd TensorNVMe
           apt update && apt install -y cmake
           pip install -r requirements.txt
-          pip install -v .
+          DISABLE_URING=1 pip install -v .
       - uses: actions/checkout@v2
         with:
           ssh-key: ${{ secrets.SSH_KEY_FOR_CI }}
@@ -78,13 +78,12 @@ jobs:

       - name: Install Colossal-AI
         run: |
-          CUDA_EXT=1 pip install -v .
+          BUILD_EXT=1 pip install -v .
           pip install -r requirements/requirements-test.txt
       - name: Unit Testing
         run: |
-          PYTHONPATH=$PWD pytest tests
+          PYTHONPATH=$PWD pytest --durations=0 tests
         env:
           DATA: /data/scratch/cifar-10
-          NCCL_SHM_DISABLE: 1
           LD_LIBRARY_PATH: /github/home/.tensornvme/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
           LLAMA_PATH: /data/scratch/llama-tiny