forked from hpcaitech/ColossalAI
Most #189
Open
jamesthesnake wants to merge 56 commits into main from most
Conversation
* add bench chatglm * fix bug and make utils --------- Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
mchange bug
…ch#4938)
* merge kvcache with pipeline inference and refactor the code structure
* support ppsize > 2
* refactor pipeline code
* do pre-commit
* modify benchmark
* fix bench mark
* polish code
* add docstring and update readme
* refactor the code
* fix some logic bug of ppinfer
* polish readme
* fix typo
* skip infer test
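The kv-cache work in this commit caches past key/value tensors so each decoding step only computes attention for the newest token. The snippet below is a minimal, self-contained sketch of that idea in plain PyTorch; it is not ColossalAI's CaiInferEngine API, and all names and shapes are illustrative.

```python
# Minimal sketch of incremental decoding with a KV cache (illustrative only;
# not the ColossalAI pipeline-inference implementation).
import torch

def attend_with_cache(q, k_new, v_new, cache):
    """Append the newest K/V to the cache and attend over the full history.

    q, k_new, v_new: [batch, heads, 1, head_dim] tensors for the current token.
    cache: dict holding concatenated K/V from previous steps (or empty).
    """
    if "k" in cache:
        k = torch.cat([cache["k"], k_new], dim=2)   # [B, H, T_past + 1, D]
        v = torch.cat([cache["v"], v_new], dim=2)
    else:
        k, v = k_new, v_new
    cache["k"], cache["v"] = k, v                    # reused on the next step

    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5  # [B, H, 1, T]
    return torch.softmax(scores, dim=-1) @ v               # [B, H, 1, D]

# Usage: one cache per attention layer, carried across decoding steps.
cache = {}
for _ in range(4):
    q = torch.randn(1, 8, 1, 64)
    out = attend_with_cache(q, torch.randn(1, 8, 1, 64), torch.randn(1, 8, 1, 64), cache)
```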
…#4953) * [inference] Dynamic Batching for Single and Multiple GPUs (hpcaitech#4831) * finish batch manager * 1 * first * fix * fix dynamic batching * llama infer * finish test * support different lengths generating * del prints * del prints * fix * fix bug --------- Co-authored-by: CjhHa1 <cjh18671720497outlook.com> * [inference] Async dynamic batching (hpcaitech#4894) * finish input and output logic * add generate * test forward * 1 * [inference]Re push async dynamic batching (hpcaitech#4901) * adapt to ray server * finish async * finish test * del test --------- Co-authored-by: yuehuayingxueluo <[email protected]> * Revert "[inference]Re push async dynamic batching (hpcaitech#4901)" (hpcaitech#4905) This reverts commit fbf3c09. * Revert "[inference] Async dynamic batching (hpcaitech#4894)" This reverts commit fced140. * Revert "[inference] Async dynamic batching (hpcaitech#4894)" (hpcaitech#4909) This reverts commit fced140. * Add Ray Distributed Environment Init Scripts * support DynamicBatchManager base function * revert _set_tokenizer version * add driver async generate * add async test * fix bugs in test_ray_dist.py * add get_tokenizer.py * fix code style * fix bugs about No module named 'pydantic' in ci test * fix bugs in ci test * fix bugs in ci test * fix bugs in ci test * [infer]Add Ray Distributed Environment Init Scripts (hpcaitech#4911) * Revert "[inference] Async dynamic batching (hpcaitech#4894)" This reverts commit fced140. * Add Ray Distributed Environment Init Scripts * support DynamicBatchManager base function * revert _set_tokenizer version * add driver async generate * add async test * fix bugs in test_ray_dist.py * add get_tokenizer.py * fix code style * fix bugs about No module named 'pydantic' in ci test * fix bugs in ci test * fix bugs in ci test * fix bugs in ci test * support dynamic batch for bloom model and is_running function * [Inference]Test for new Async engine (hpcaitech#4935) * infer engine * infer engine * test engine * test engine * new manager * change step * add * test * fix * fix * finish test * finish test * finish test * finish test * add license --------- Co-authored-by: yuehuayingxueluo <[email protected]> * add assertion for config (hpcaitech#4947) * [Inference] Finish dynamic batching offline test (hpcaitech#4948) * test * fix test * fix quant * add default * fix * fix some bugs * fix some bugs * fix * fix bug * fix bugs * reset param --------- Co-authored-by: yuehuayingxueluo <[email protected]> Co-authored-by: Cuiqing Li <[email protected]> Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
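The dynamic-batching commits above revolve around a batch manager that keeps admitting waiting requests and retiring finished ones between decoding steps. The sketch below only illustrates that scheduling loop under a token budget; the class and method names are hypothetical and do not mirror the DynamicBatchManager interface added in these commits.

```python
# Hypothetical sketch of a dynamic batching loop: requests join the running
# batch whenever the token budget allows and leave as soon as they finish.
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    prompt_len: int
    max_new_tokens: int
    generated: int = 0

class TinyBatchManager:
    def __init__(self, max_batch_tokens: int = 4096):
        self.waiting: deque = deque()
        self.running: list = []
        self.max_batch_tokens = max_batch_tokens

    def _batch_tokens(self) -> int:
        return sum(r.prompt_len + r.generated for r in self.running)

    def add_request(self, req: Request) -> None:
        self.waiting.append(req)

    def step(self) -> None:
        # Admit waiting requests while the batch stays under the token budget.
        while self.waiting and self._batch_tokens() + self.waiting[0].prompt_len <= self.max_batch_tokens:
            self.running.append(self.waiting.popleft())
        # One decoding step: every running request produces one token
        # (a real engine would run the model forward pass here).
        for r in self.running:
            r.generated += 1
        # Retire finished requests so new ones can take their slots.
        self.running = [r for r in self.running if r.generated < r.max_new_tokens]

mgr = TinyBatchManager()
mgr.add_request(Request(prompt_len=16, max_new_tokens=3))
mgr.add_request(Request(prompt_len=8, max_new_tokens=5))
while mgr.running or mgr.waiting:
    mgr.step()
```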
…for llama token attention (hpcaitech#4965)
* adding flash-decoding
* clean
* adding kernel
* adding flash-decoding
* add integration
* add
* adding kernel
* adding kernel
* adding triton 2.1.0 features for inference
* update bloom triton kernel
* remove useless vllm kernels
* clean codes
* fix
* adding files
* fix readme
* update llama flash-decoding
---------
Co-authored-by: cuiqing.li <[email protected]>
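Flash-decoding speeds up the single-token decode step by splitting the long KV cache into chunks, computing partial attention per chunk, and merging the partials with their softmax statistics. The commit above adds this as a Triton 2.1.0 kernel; the PyTorch snippet below only illustrates the split-and-merge math for one head and one query token, with made-up sizes.

```python
# Illustrative split-KV ("flash-decoding") attention for one head and one
# query token, written in plain PyTorch rather than Triton.
import torch

def flash_decode_reference(q, K, V, chunk=256):
    """q: [d], K/V: [T, d]. Returns softmax(q K^T / sqrt(d)) V computed chunk-wise."""
    d = q.shape[-1]
    acc = torch.zeros_like(q)        # running weighted sum of V rows
    m = torch.tensor(-float("inf"))  # running max of scores (for stability)
    s = torch.tensor(0.0)            # running sum of exp(scores - m)
    for start in range(0, K.shape[0], chunk):
        k, v = K[start:start + chunk], V[start:start + chunk]
        scores = (k @ q) / d ** 0.5              # [chunk]
        m_new = torch.maximum(m, scores.max())
        scale = torch.exp(m - m_new)             # rescale previously merged partials
        p = torch.exp(scores - m_new)            # [chunk]
        acc = acc * scale + p @ v
        s = s * scale + p.sum()
        m = m_new
    return acc / s

q, K, V = torch.randn(64), torch.randn(1000, 64), torch.randn(1000, 64)
ref = torch.softmax((K @ q) / 64 ** 0.5, dim=0) @ V
assert torch.allclose(flash_decode_reference(q, K, V), ref, atol=1e-5)
```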
Co-authored-by: Xu Yuanchen <[email protected]>
* update doc * Update README.md --------- Co-authored-by: cuiqing.li <[email protected]>
…eased. (hpcaitech#4940) * Fix the bug where process groups were not being properly released. * test * Revert "test" This reverts commit 479900c.
* refactor pipeline into new CaiInferEngine
* updata llama modeling forward
* merge tp with pp
* update docstring
* optimize test workflow and example
* fix typo
* add assert and todo
* [release] update version * [hotfix] fix ci
* update moe module * support openmoe
…pcaitech#4926)
* [hotfix] Add layer norm gradients all-reduce for sequence parallel. (hpcaitech#4915)
* Add layer norm gradients all-reduce for sequence parallel.
* skip pipeline inference test
* [hotfix] fixing polices of sequence parallel (hpcaitech#4922)
* Add layer norm gradients all-reduce for sequence parallel.
* fix parameter passing when calling get_autopolicy
---------
Co-authored-by: littsk <[email protected]>
* Hotfix/add grad all reduce for sequence parallel (hpcaitech#4927)
* Add layer norm gradients all-reduce for sequence parallel.
* fix parameter passing when calling get_autopolicy
* fix bug using wrong variables
---------
Co-authored-by: littsk <[email protected]>
* fix policy initialization
* fix bloom and chatglm policices
* polish code of handling layernorm
* fix moe module
* polish code of class initializing
---------
Co-authored-by: Zhongkai Zhao <[email protected]>
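With sequence parallelism each rank sees a different slice of the sequence, so gradients of parameters that are replicated on every rank, such as LayerNorm affine weights, must be summed across the sequence-parallel group before the optimizer step; that is what these hotfixes wire into the shardformer policies. Below is a minimal sketch of the idea using a hypothetical helper, not ColossalAI's policy mechanism.

```python
# Sketch: sum gradients of replicated LayerNorm parameters across a
# sequence-parallel process group after backward (hypothetical helper).
import torch
import torch.distributed as dist

def allreduce_layernorm_grads(model: torch.nn.Module, sp_group) -> None:
    for module in model.modules():
        if isinstance(module, torch.nn.LayerNorm):
            for p in module.parameters():
                if p.grad is not None:
                    dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, group=sp_group)

# Intended usage, after loss.backward() and before optimizer.step():
#   allreduce_layernorm_grads(model, sequence_parallel_group)
```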
hpcaitech#5007) Co-authored-by: github-actions <[email protected]>
* fix bug * fix * fix multiquery * fix multiquery --------- Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
* Refactor MoE Manager setup method
* unshard optim ckpt
* optim io
* update transformer version
* update requirements
* update ckpt
* update ckpt
* update ckpt
* fix engine
* fix engine
Co-authored-by: Xu Yuanchen <[email protected]>
* fix: add warning for EP different behavior
* fix: use shard_data in ep & tp model
* to: add used_capacity
* fix: fix router test
* feat: add create_ep_node_group
* feat: add create_ep_hierarchical_group fn
* feat: add HierarchicalAllToAll
* test: add hierarchical all2all test
* fix: fix test errors
* fix: simplify create_ep_hierarchical_group
* fix: add hierarchical_alltoall arg
* fix: fix environ typo
* revert: revert process mesh order
* to: add todo mark
* fix: skip hierarchical_comm if torch < 1.13.1
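The hierarchical all-to-all work above splits the expert-parallel exchange into an intra-node stage and an inter-node stage so that only node leaders cross the slower interconnect. The sketch below shows how such two-level process groups could be built with torch.distributed; the group layout and function name are assumptions, not the create_ep_hierarchical_group implementation from these commits.

```python
# Sketch: build intra-node groups plus one inter-node group of node leaders,
# the two levels a hierarchical all-to-all would use (illustrative layout).
import torch.distributed as dist

def build_hierarchical_groups(world_size: int, gpus_per_node: int):
    assert world_size % gpus_per_node == 0
    num_nodes = world_size // gpus_per_node
    rank = dist.get_rank()

    # Every rank must call dist.new_group for every group, even groups it
    # does not belong to, so all constructions happen unconditionally.
    intra_group = None
    for node in range(num_nodes):
        ranks = list(range(node * gpus_per_node, (node + 1) * gpus_per_node))
        g = dist.new_group(ranks)
        if rank in ranks:
            intra_group = g

    leader_ranks = [node * gpus_per_node for node in range(num_nodes)]
    inter_group = dist.new_group(leader_ranks)  # only node leaders use it

    return intra_group, (inter_group if rank in leader_ranks else None)
```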
…ng (hpcaitech#5018) * Fix serialization error with Tensor Parallel state saving * Refactor state_dict CPU transfer using tree_map
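Moving every tensor in a sharded state dict to CPU before serialization avoids pickling live CUDA storages; the fix above does this with a tree_map over the nested state. A small sketch of that pattern is below, using PyTorch's pytree utility (internal but widely used); the surrounding save logic is simplified.

```python
# Sketch: copy all tensors in a (possibly nested) state dict to CPU with a
# single tree_map before saving, instead of hand-written recursion.
import torch
from torch.utils._pytree import tree_map  # internal PyTorch utility

def state_dict_to_cpu(state):
    return tree_map(
        lambda x: x.detach().cpu() if isinstance(x, torch.Tensor) else x,
        state,
    )

model = torch.nn.Linear(4, 4)
cpu_state = state_dict_to_cpu(model.state_dict())
# torch.save(cpu_state, "model.pt")
```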
* [colossalai]fix typo * [inference] Add smmoothquant for llama (hpcaitech#4904) * [inference] add int8 rotary embedding kernel for smoothquant (hpcaitech#4843) * [inference] add smoothquant llama attention (hpcaitech#4850) * add smoothquant llama attention * remove uselss code * remove useless code * fix import error * rename file name * [inference] add silu linear fusion for smoothquant llama mlp (hpcaitech#4853) * add silu linear * update skip condition * catch smoothquant cuda lib exception * prcocess exception for tests * [inference] add llama mlp for smoothquant (hpcaitech#4854) * add llama mlp for smoothquant * fix down out scale * remove duplicate lines * add llama mlp check * delete useless code * [inference] add smoothquant llama (hpcaitech#4861) * add smoothquant llama * fix attention accuracy * fix accuracy * add kv cache and save pretrained * refactor example * delete smooth * refactor code * [inference] add smooth function and delete useless code for smoothquant (hpcaitech#4895) * add smooth function and delete useless code * update datasets * remove duplicate import * delete useless file * refactor codes (hpcaitech#4902) * rafactor code * add license * add torch-int and smoothquant license * Update flash_attention_patch.py To be compatible with the new change in the Transformers library, where a new argument 'padding_mask' was added to forward function of attention layer. huggingface/transformers#25598 * [kernel] support pure fp16 for cpu adam and update gemini optim tests (hpcaitech#4921) * [kernel] support pure fp16 for cpu adam (hpcaitech#4896) * [kernel] fix cpu adam kernel for pure fp16 and update tests (hpcaitech#4919) * [kernel] fix cpu adam * [test] update gemini optim test * [format] applied code formatting on changed files in pull request 4908 (hpcaitech#4918) Co-authored-by: github-actions <[email protected]> * [gemini] support gradient accumulation (hpcaitech#4869) * add test * fix no_sync bug in low level zero plugin * fix test * add argument for grad accum * add grad accum in backward hook for gemini * finish implementation, rewrite tests * fix test * skip stuck model in low level zero test * update doc * optimize communication & fix gradient checkpoint * modify doc * cleaning codes * update cpu adam fp16 case * [hotfix] fix torch 2.0 compatibility (hpcaitech#4936) * [hotfix] fix launch * [test] fix test gemini optim * [shardformer] fix vit * [test] add no master test for low level zero plugin (hpcaitech#4934) * [format] applied code formatting on changed files in pull request 4820 (hpcaitech#4886) Co-authored-by: github-actions <[email protected]> * [nfc] fix some typo with colossalai/ docs/ etc. 
(hpcaitech#4920) * [Refactor] Integrated some lightllm kernels into token-attention (hpcaitech#4946) * add some req for inference * clean codes * add codes * add some lightllm deps * clean codes * hello * delete rms files * add some comments * add comments * add doc * add lightllm deps * add lightllm cahtglm2 kernels * add lightllm cahtglm2 kernels * replace rotary embedding with lightllm kernel * add some commnets * add some comments * add some comments * add * replace fwd kernel att1 * fix a arg * add * add * fix token attention * add some comments * clean codes * modify comments * fix readme * fix bug * fix bug --------- Co-authored-by: cuiqing.li <[email protected]> Co-authored-by: CjhHa1 <[email protected]> * [test] merge old components to test to model zoo (hpcaitech#4945) * [test] add custom models in model zoo * [test] update legacy test * [test] update model zoo * [test] update gemini test * [test] remove components to test * [inference] add reference and fix some bugs (hpcaitech#4937) * add reference and fix some bugs * update gptq init --------- Co-authored-by: Xu Kai <[email protected]> * [Inference]ADD Bench Chatglm2 script (hpcaitech#4963) * add bench chatglm * fix bug and make utils --------- Co-authored-by: CjhHa1 <cjh18671720497outlook.com> * [Pipeline inference] Combine kvcache with pipeline inference (hpcaitech#4938) * merge kvcache with pipeline inference and refactor the code structure * support ppsize > 2 * refactor pipeline code * do pre-commit * modify benchmark * fix bench mark * polish code * add docstring and update readme * refactor the code * fix some logic bug of ppinfer * polish readme * fix typo * skip infer test * updated c++17 compiler flags (hpcaitech#4983) * [Inference] Dynamic Batching Inference, online and offline (hpcaitech#4953) * [inference] Dynamic Batching for Single and Multiple GPUs (hpcaitech#4831) * finish batch manager * 1 * first * fix * fix dynamic batching * llama infer * finish test * support different lengths generating * del prints * del prints * fix * fix bug --------- Co-authored-by: CjhHa1 <cjh18671720497outlook.com> * [inference] Async dynamic batching (hpcaitech#4894) * finish input and output logic * add generate * test forward * 1 * [inference]Re push async dynamic batching (hpcaitech#4901) * adapt to ray server * finish async * finish test * del test --------- Co-authored-by: yuehuayingxueluo <[email protected]> * Revert "[inference]Re push async dynamic batching (hpcaitech#4901)" (hpcaitech#4905) This reverts commit fbf3c09. * Revert "[inference] Async dynamic batching (hpcaitech#4894)" This reverts commit fced140. * Revert "[inference] Async dynamic batching (hpcaitech#4894)" (hpcaitech#4909) This reverts commit fced140. * Add Ray Distributed Environment Init Scripts * support DynamicBatchManager base function * revert _set_tokenizer version * add driver async generate * add async test * fix bugs in test_ray_dist.py * add get_tokenizer.py * fix code style * fix bugs about No module named 'pydantic' in ci test * fix bugs in ci test * fix bugs in ci test * fix bugs in ci test * [infer]Add Ray Distributed Environment Init Scripts (hpcaitech#4911) * Revert "[inference] Async dynamic batching (hpcaitech#4894)" This reverts commit fced140. 
* Add Ray Distributed Environment Init Scripts * support DynamicBatchManager base function * revert _set_tokenizer version * add driver async generate * add async test * fix bugs in test_ray_dist.py * add get_tokenizer.py * fix code style * fix bugs about No module named 'pydantic' in ci test * fix bugs in ci test * fix bugs in ci test * fix bugs in ci test * support dynamic batch for bloom model and is_running function * [Inference]Test for new Async engine (hpcaitech#4935) * infer engine * infer engine * test engine * test engine * new manager * change step * add * test * fix * fix * finish test * finish test * finish test * finish test * add license --------- Co-authored-by: yuehuayingxueluo <[email protected]> * add assertion for config (hpcaitech#4947) * [Inference] Finish dynamic batching offline test (hpcaitech#4948) * test * fix test * fix quant * add default * fix * fix some bugs * fix some bugs * fix * fix bug * fix bugs * reset param --------- Co-authored-by: yuehuayingxueluo <[email protected]> Co-authored-by: Cuiqing Li <[email protected]> Co-authored-by: CjhHa1 <cjh18671720497outlook.com> * [Kernels]Updated Triton kernels into 2.1.0 and adding flash-decoding for llama token attention (hpcaitech#4965) * adding flash-decoding * clean * adding kernel * adding flash-decoding * add integration * add * adding kernel * adding kernel * adding triton 2.1.0 features for inference * update bloom triton kernel * remove useless vllm kernels * clean codes * fix * adding files * fix readme * update llama flash-decoding --------- Co-authored-by: cuiqing.li <[email protected]> * fix ColossalEval (hpcaitech#4992) Co-authored-by: Xu Yuanchen <[email protected]> * [doc]Update doc for colossal-inference (hpcaitech#4989) * update doc * Update README.md --------- Co-authored-by: cuiqing.li <[email protected]> * [hotfix] Fix the bug where process groups were not being properly released. (hpcaitech#4940) * Fix the bug where process groups were not being properly released. * test * Revert "test" This reverts commit 479900c. 
* [hotfix] fix the bug of repeatedly storing param group (hpcaitech#4951) * [doc] add supported feature diagram for hybrid parallel plugin (hpcaitech#4996) * [Pipeline Inference] Merge pp with tp (hpcaitech#4993) * refactor pipeline into new CaiInferEngine * updata llama modeling forward * merge tp with pp * update docstring * optimize test workflow and example * fix typo * add assert and todo * [release] update version (hpcaitech#4995) * [release] update version * [hotfix] fix ci * [gemini] gemini support tp [gemini] gemini support tp [gemini] gemini support tp [gemini] gemini support tp [gemini] gemini support tp * fix fix fix * update checkpointIO update checkpointIO update checkpointIO update checkpointIO update checkpointIO update checkpointIO update checkpointIO update checkpointIO update checkpointIO * support fused layernorm support fused layernorm support fused layernorm * update fusedlayernorm update fusedlayernorm update fusedlayernorm * add sequence parallel to gemini add sequence parallel to gemini * fix * fix comments fix comments fix comments * fix * fix t5 * clear cache * fix * activate ci * activate ci * fix * fix * fix * fix * revert * modify tp gather method modify tp gather method modify tp gather method modify tp gather method * fix test --------- Co-authored-by: Xu Kai <[email protected]> Co-authored-by: Zian(Andy) Zheng <[email protected]> Co-authored-by: Hongxin Liu <[email protected]> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions <[email protected]> Co-authored-by: Baizhou Zhang <[email protected]> Co-authored-by: Zhongkai Zhao <[email protected]> Co-authored-by: digger yu <[email protected]> Co-authored-by: Cuiqing Li <[email protected]> Co-authored-by: cuiqing.li <[email protected]> Co-authored-by: CjhHa1 <[email protected]> Co-authored-by: Xu Kai <[email protected]> Co-authored-by: Jianghai <[email protected]> Co-authored-by: Bin Jia <[email protected]> Co-authored-by: アマデウス <[email protected]> Co-authored-by: yuehuayingxueluo <[email protected]> Co-authored-by: Yuanchen <[email protected]> Co-authored-by: Xu Yuanchen <[email protected]> Co-authored-by: littsk <[email protected]> Co-authored-by: ppt0011 <[email protected]>
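The merge entry above folds in, among other things, the smoothquant-for-llama work (hpcaitech#4904 and related commits). SmoothQuant makes activations easier to quantize by dividing each input channel by a per-channel scale and folding that scale into the weights. The snippet below is a small sketch of that smoothing step only, following the SmoothQuant paper; it is not ColossalAI's fused int8 kernels, and the [C_in, C_out] weight layout is an assumption for readability.

```python
# Sketch of the SmoothQuant smoothing step: migrate activation outliers into
# the weights so both become easier to quantize (alpha balances the two sides).
import torch

def smooth_scales(act_absmax, weight, alpha=0.5, eps=1e-5):
    """act_absmax: [C_in] per-channel max |activation| from calibration data.
    weight: [C_in, C_out] for Y = X @ W."""
    w_absmax = weight.abs().amax(dim=1)  # [C_in]
    return act_absmax.clamp(min=eps) ** alpha / w_absmax.clamp(min=eps) ** (1 - alpha)

def apply_smoothing(x, weight, s):
    # (X / s) @ (diag(s) @ W) == X @ W, so the output is unchanged.
    return x / s, weight * s.unsqueeze(1)

x, w = torch.randn(8, 16), torch.randn(16, 32)
s = smooth_scales(x.abs().amax(dim=0), w)
x_s, w_s = apply_smoothing(x, w, s)
assert torch.allclose(x_s @ w_s, x @ w, atol=1e-5)
```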
* [refactor]: replace inference args with extra_kwargs in ShardConfig
* modify shardconfig
* polish code
* fix policy bug in llama
* fix bug in auto policy
* remove setattr in ShardConfig
* update flash-context-attention
* adding kernels
* fix
* reset
* add build script
* add building process
* add llama2 exmaple
* add colossal-llama2 test
* clean
* fall back test setting
* fix test file
* clean
* clean
* clean
---------
Co-authored-by: cuiqing.li <[email protected]>
… loading weight not in weight_map when `strict=False`, fix llama flash attention forward, add flop estimation by megatron in llama benchmark (hpcaitech#5017)
* Use p2p
* Cannot bidirectonal send p2p
* Refactor tensor creation and serialization in P2P communication
* Fix llama forward args in flash attention
* Add flop estimate from megatron
* Support loading weight not in weight_map when strict=False in hybrid_parallel
* Use send_forward_recv_backward, etc in 1f1b
* Use dataclass for metdata
Remove torch.cuda.synchronize() as suggested
* Add comment about the torch.cuda.synchronize for potential error
* Typo
* Update hybrid_parallel_checkpoint_io.py
* Update p2p.py
* Update one_f_one_b.py
* Update p2p.py
---------
Co-authored-by: flybird11111 <[email protected]>
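One of the additions above is a Megatron-style analytic FLOP estimate for the llama benchmark. The usual approximation counts 2 FLOPs per multiply-accumulate in the projection, attention, and logits matmuls and multiplies by roughly 3 for forward plus backward. The sketch below uses the GPT-style formula from the Megatron papers; LLaMA's gated MLP and grouped-query attention change the constants, so this only conveys the flavor of the estimate, not the exact code added here.

```python
# Rough Megatron-style FLOP estimate for a GPT-like decoder (forward + backward).
def transformer_flops(batch, seq, layers, hidden, vocab, ffn_mult=4, include_backward=True):
    per_layer = (
        8 * batch * seq * hidden**2                 # Q, K, V and output projections
        + 4 * batch * seq**2 * hidden               # attention scores and weighted sum
        + 4 * ffn_mult * batch * seq * hidden**2    # two MLP matmuls
    )
    logits = 2 * batch * seq * hidden * vocab
    fwd = layers * per_layer + logits
    return fwd * (3 if include_backward else 1)     # backward is roughly 2x forward

# Example with LLaMA-7B-like shapes (32 layers, hidden 4096, vocab 32000).
print(f"{transformer_flops(1, 2048, 32, 4096, 32000) / 1e12:.1f} TFLOPs per sample")
```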
* support ddp * fix * fix * fix fix * support ddp * fix * fix * fix fix * simplify tests * fix * fix * fix fix fix * fix
…5055) * fix-llama * Update llama.py
…#5032) * feat: modify create_ep_hierarchical_group args * test: add ep tests * fix: remove get_process_group_ranks * fix: fix src_rank
* [inference] support only TP (hpcaitech#4998) * support only tp * enable tp * add support for bloom (hpcaitech#5008) * [refactor] refactor gptq and smoothquant llama (hpcaitech#5012) * refactor gptq and smoothquant llama * fix import error * fix linear import torch-int * fix smoothquant llama import error * fix import accelerate error * fix bug * fix import smooth cuda * fix smoothcuda * [Inference Refactor] Merge chatglm2 with pp and tp (hpcaitech#5023) merge chatglm with pp and tp * [Refactor] remove useless inference code (hpcaitech#5022) * remove useless code * fix quant model * fix test import bug * mv original inference legacy * fix chatglm2 * [Refactor] refactor policy search and quant type controlling in inference (hpcaitech#5035) * [Refactor] refactor policy search and quant type controling in inference * [inference] update readme (hpcaitech#5051) * update readme * update readme * fix architecture * fix table * fix table * [inference] udpate example (hpcaitech#5053) * udpate example * fix run.sh * fix rebase bug * fix some errors * update readme * add some features * update interface * update readme * update benchmark * add requirements-infer --------- Co-authored-by: Bin Jia <[email protected]> Co-authored-by: Zhongkai Zhao <[email protected]>
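The TP-only inference support above shards each linear layer's weight across GPUs so every rank computes a slice of the output features and the slices are gathered afterwards. A bare-bones column-parallel linear in that spirit is sketched below; it is illustrative only, not the shardformer policy code, and assumes torch.distributed is already initialized.

```python
# Sketch of a column-parallel linear layer for inference: each rank holds a
# slice of the output features and an all-gather reassembles the full output.
import torch
import torch.distributed as dist

class ColumnParallelLinear(torch.nn.Module):
    def __init__(self, in_features, out_features, group=None):
        super().__init__()
        self.group = group
        world = dist.get_world_size(group)
        assert out_features % world == 0
        self.local_out = out_features // world
        self.weight = torch.nn.Parameter(torch.randn(self.local_out, in_features) * 0.02)

    def forward(self, x):
        local = torch.nn.functional.linear(x, self.weight)  # [..., local_out]
        chunks = [torch.empty_like(local) for _ in range(dist.get_world_size(self.group))]
        dist.all_gather(chunks, local.contiguous(), group=self.group)
        return torch.cat(chunks, dim=-1)                     # [..., out_features]
```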
* added flash-decoidng of triton based on lightllm kernel
* add req
* clean
* clean
* delete build.sh
---------
Co-authored-by: cuiqing.li <[email protected]>
* [npu] setup device utils (hpcaitech#5047)
* [npu] add npu device support
* [npu] support low level zero
* [test] update npu zero plugin test
* [hotfix] fix import
* [test] recover tests
* [npu] gemini support npu (hpcaitech#5052)
* [npu] refactor device utils
* [gemini] support npu
* [example] llama2+gemini support npu
* [kernel] add arm cpu adam kernel (hpcaitech#5065)
* [kernel] add arm cpu adam
* [optim] update adam optimizer
* [kernel] arm cpu adam remove bf16 support
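The arm cpu adam commit above ports the fused CPU Adam update to ARM. For reference, the per-element math such a kernel implements is the standard Adam rule; a plain PyTorch version of one step is sketched below (the real kernel presumably vectorizes this in C++, which is an assumption on my part).

```python
# Reference Adam update for a single step; a fused CPU kernel performs the
# same math per element, just vectorized natively.
import torch

def adam_step(p, grad, exp_avg, exp_avg_sq, step,
              lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0):
    if weight_decay != 0.0:
        grad = grad + weight_decay * p
    exp_avg.mul_(betas[0]).add_(grad, alpha=1 - betas[0])               # first moment
    exp_avg_sq.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])  # second moment
    bias_c1 = 1 - betas[0] ** step
    bias_c2 = 1 - betas[1] ** step
    denom = (exp_avg_sq / bias_c2).sqrt().add_(eps)
    p.add_(exp_avg / bias_c1 / denom, alpha=-lr)
    return p
```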
* update examples and engine * fix choices * update example
hpcaitech#5072) Co-authored-by: github-actions <[email protected]>
…ark (hpcaitech#5074) * fix init model with random parameters * fix example
* [setup] refactor infer setup * [hotfix] fix infenrece behavior on 1 1 gpu * [exmaple] refactor inference examples
…signment index out of range (hpcaitech#5085)
…#5064) * hotfix/Fix get model policy strategy in ShardFormer * fix bug in auto policy
…ad it (hpcaitech#5084) * fix flash attn * fix fix
* llama 3d * update * fix autocast
* add langchain * add langchain * Add files via upload * add langchain * fix style * fix style: remove extra space * add pytest; modified retriever * add pytest; modified retriever * add tests to build_on_pr.yml * fix build_on_pr.yml * fix build on pr; fix environ vars * seperate unit tests for colossalqa from build from pr * fix container setting; fix environ vars * commented dev code * add incremental update * remove stale code * fix style * change to sha3 224 * fix retriever; fix style; add unit test for document loader * fix ci workflow config * fix ci workflow config * add set cuda visible device script in ci * fix doc string * fix style; update readme; refactored * add force log info * change build on pr, ignore colossalqa * fix docstring, captitalize all initial letters * fix indexing; fix text-splitter * remove debug code, update reference * reset previous commit * update LICENSE update README add key-value mode, fix bugs * add files back * revert force push * remove junk file * add test files * fix retriever bug, add intent classification * change conversation chain design * rewrite prompt and conversation chain * add ui v1 * ui v1 * fix atavar * add header * Refactor the RAG Code and support Pangu * Refactor the ColossalQA chain to Object-Oriented Programming and the UI demo. * resolved conversation. tested scripts under examples. web demo still buggy * fix ci tests * Some modifications to add ChatGPT api * modify llm.py and remove unnecessary files * Delete applications/ColossalQA/examples/ui/test_frontend_input.json * Remove OpenAI api key * add colossalqa * move files * move files * move files * move files * fix style * Add Readme and fix some bugs. * Add something to readme and modify some code * modify a directory name for clarity * remove redundant directory * Correct a type in llm.py * fix AI prefix * fix test_memory.py * fix conversation * fix some erros and typos * Fix a missing import in RAG_ChatBot.py * add colossalcloud LLM wrapper, correct issues in code review --------- Co-authored-by: YeAnbang <[email protected]> Co-authored-by: Orion-Zheng <[email protected]> Co-authored-by: Zian(Andy) Zheng <[email protected]> Co-authored-by: Orion-Zheng <[email protected]>
📌 Checklist before creating the PR
[doc/gemini/tensor/...]: A concise description
🚨 Issue number
📝 What does this PR do?
💥 Checklist before requesting a review
⭐️ Do you enjoy contributing to Colossal-AI?
Tell us more if you don't enjoy contributing to Colossal-AI.