We also train a reward model based on LLaMA-7B, which reaches an accuracy of 72.06% after 1 epoch, performing almost the same as Anthropic's best RM.
-
-### Arg List
-
-- `--strategy`: strategy to use for training, choices=['ddp', 'colossalai_gemini', 'colossalai_zero2'], default='colossalai_zero2'
-- `--model`: model type, choices=['gpt2', 'bloom', 'opt', 'llama'], default='bloom'
-- `--pretrain`: pretrained model name or path, type=str, default=None
-- `--model_path`: path of the reward model checkpoint (to resume training), type=str, default=None
-- `--save_path`: path to save the model, type=str, default='rm_ckpt'
-- `--need_optim_ckpt`: whether to save the optimizer checkpoint, type=bool, default=False
-- `--max_epochs`: max epochs for training, type=int, default=1
-- `--dataset`: dataset name, type=str, choices=['Anthropic/hh-rlhf', 'Dahoas/rm-static'], default='Dahoas/rm-static'
-- `--subset`: subset of the dataset, type=str, default=None
-- `--batch_size`: batch size while training, type=int, default=1
-- `--lora_rank`: low-rank adaptation matrices rank, type=int, default=0
-- `--loss_fn`: loss function to use, choices=['log_sig', 'log_exp'], default='log_sig'
-- `--max_len`: max sequence length of the input, type=int, default=512
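-
-For reference, here is a minimal sketch of a reward model training launch for LLaMA-7B using these arguments (the GPU count, paths, dataset, and batch size are illustrative placeholders):
-
-```bash
-torchrun --standalone --nproc_per_node=4 train_reward_model.py \
- --pretrain "/path/to/LLaMa-7B/" \
- --model 'llama' \
- --strategy colossalai_zero2 \
- --loss_fn 'log_exp' \
- --dataset 'Anthropic/hh-rlhf' \
- --batch_size 4 \
- --max_epochs 1
-```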
-
-## Stage3 - Training model using prompts with RL
-
-Stage3 uses a reinforcement learning algorithm and is the most complex part of the training process, as shown below:
-
-
-
-
-
-You can run `examples/train_prompts.sh` to start PPO training.
-
-You can also use the following command to start PPO training.
-[[Stage3 tutorial video]](https://www.youtube.com/watch?v=Z8wwSHxPL9g)
-
-```bash
-torchrun --standalone --nproc_per_node=4 train_prompts.py \
- --pretrain "/path/to/LLaMa-7B/" \
- --model 'llama' \
- --strategy colossalai_zero2 \
- --prompt_dataset /path/to/your/prompt_dataset \
- --pretrain_dataset /path/to/your/pretrain_dataset \
- --rm_pretrain /your/pretrain/rm/definition \
- --rm_path /your/rm/model/path
-```
-
-Prompt dataset: the instruction dataset mentioned in the figure above, which contains the instructions only. For example, you can use the [script](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/examples/generate_prompt_dataset.py) that samples `instinwild_en.json` or `instinwild_ch.json` from [InstructionWild](https://github.com/XueFuzhao/InstructionWild/tree/main/data#instructwild-data) to generate the prompt dataset (see the example command after this paragraph).
-Pretrain dataset: the pretrain dataset containing both the instructions and the corresponding responses, e.g. the [InstructWild Data](https://github.com/XueFuzhao/InstructionWild/tree/main/data) used in stage 1 supervised instruction tuning.
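-
-As a sketch, generating the prompt dataset from a downloaded `instinwild_en.json` might look like the following (the paths are placeholders and the sample size is the script's default):
-
-```bash
-python generate_prompt_dataset.py \
- --dataset_path /path/to/instinwild_en.json \
- --save_path /path/to/your/prompt_dataset/prompt.json \
- --sample_size 16384
-```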
-
-**Note**: the required datasets must follow the formats below.
-
-- `pretrain dataset`
-
- ```json
- [
- {
- "instruction": "Provide a list of the top 10 most popular mobile games in Asia",
- "input": "",
- "output": "The top 10 most popular mobile games in Asia are:\n1) PUBG Mobile\n2) Pokemon Go\n3) Candy Crush Saga\n4) Free Fire\n5) Clash of Clans\n6) Mario Kart Tour\n7) Arena of Valor\n8) Fantasy Westward Journey\n9) Subway Surfers\n10) ARK Survival Evolved",
- "id": 0
- },
- ...
- ]
- ```
-
-- `prompt dataset`
-
- ```json
- [
- {
- "instruction": "Edit this paragraph to make it more concise: \"Yesterday, I went to the store and bought some things. Then, I came home and put them away. After that, I went for a walk and met some friends.\"",
- "id": 0
- },
- {
- "instruction": "Write a descriptive paragraph about a memorable vacation you went on",
- "id": 1
- },
- ...
- ]
- ```
-
-### Arg List
-
-- `--strategy`: strategy to use for training, choices=['ddp', 'colossalai_gemini', 'colossalai_zero2'], default='colossalai_zero2'
-- `--model`: model type of the actor, choices=['gpt2', 'bloom', 'opt', 'llama'], default='bloom'
-- `--pretrain`: pretrained model name or path, type=str, default=None
-- `--rm_model`: reward model type, type=str, choices=['gpt2', 'bloom', 'opt', 'llama'], default=None
-- `--rm_pretrain`: pretrained model for the reward model, type=str, default=None
-- `--rm_path`: path of the reward model checkpoint, type=str, default=None
-- `--save_path`: path to save the model, type=str, default='output'
-- `--prompt_dataset`: path of the prompt dataset, type=str, default=None
-- `--pretrain_dataset`: path of the ptx dataset, type=str, default=None
-- `--need_optim_ckpt`: whether to save optim ckpt, type=bool, default=False
-- `--num_episodes`: number of episodes for training, type=int, default=10
-- `--num_update_steps`: number of steps to update policy per episode, type=int
-- `--num_collect_steps`: number of steps to collect experience per episode, type=int
-- `--train_batch_size`: batch size while training, type=int, default=8
-- `--ptx_batch_size`: batch size to compute ptx loss, type=int, default=1
-- `--experience_batch_size`: batch size to make experience, type=int, default=8
-- `--lora_rank`: low-rank adaptation matrices rank, type=int, default=0
-- `--kl_coef`: KL penalty coefficient used when computing the reward, type=float, default=0.1
-- `--ptx_coef`: ptx loss coefficient used when computing the policy loss, type=float, default=0.9
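-
-The episode and batch-size arguments above can be appended to the PPO command shown earlier; a sketch with the script's default values looks like this (paths are placeholders):
-
-```bash
-torchrun --standalone --nproc_per_node=4 train_prompts.py \
- --pretrain "/path/to/LLaMa-7B/" \
- --model 'llama' \
- --strategy colossalai_zero2 \
- --prompt_dataset /path/to/your/prompt_dataset \
- --pretrain_dataset /path/to/your/pretrain_dataset \
- --rm_pretrain /your/pretrain/rm/definition \
- --rm_path /your/rm/model/path \
- --num_episodes 10 \
- --num_collect_steps 10 \
- --num_update_steps 5 \
- --train_batch_size 8 \
- --experience_batch_size 8 \
- --ptx_batch_size 1
-```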
-
-## Inference example - After Stage3
-
-We support different inference options, including int8 and int4 quantization.
-For details, see [`inference/`](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/inference).
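-
-For a quick sanity check without quantization, a minimal sketch is to load the trained actor with `examples/inference.py` (here `--pretrain` points to the directory saved after Stage3, and the prompt is the script's default):
-
-```bash
-python inference.py \
- --model 'llama' \
- --pretrain /path/to/saved/actor \
- --input "Question: How are you ? Answer:" \
- --max_length 100
-```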
-
-## Attention
-
-These examples are demos of the whole training process. You need to tune the hyper-parameters to reach good performance.
-
-#### Data
-
-- [x] [rm-static](https://huggingface.co/datasets/Dahoas/rm-static)
-- [x] [hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf)
-- [ ] [openai/summarize_from_feedback](https://huggingface.co/datasets/openai/summarize_from_feedback)
-- [ ] [openai/webgpt_comparisons](https://huggingface.co/datasets/openai/webgpt_comparisons)
-- [ ] [Dahoas/instruct-synthetic-prompt-responses](https://huggingface.co/datasets/Dahoas/instruct-synthetic-prompt-responses)
-
-## Supported Models
-
-### GPT
-
-- [x] GPT2-S (s)
-- [x] GPT2-M (m)
-- [x] GPT2-L (l)
-- [x] GPT2-XL (xl)
-- [x] GPT2-4B (4b)
-- [ ] GPT2-6B (6b)
-
-### BLOOM
-
-- [x] [BLOOM-560m](https://huggingface.co/bigscience/bloom-560m)
-- [x] [BLOOM-1b1](https://huggingface.co/bigscience/bloom-1b1)
-- [x] [BLOOM-3b](https://huggingface.co/bigscience/bloom-3b)
-- [x] [BLOOM-7b](https://huggingface.co/bigscience/bloom-7b1)
-- [ ] [BLOOM-175b](https://huggingface.co/bigscience/bloom)
-
-### OPT
-
-- [x] [OPT-125M](https://huggingface.co/facebook/opt-125m)
-- [x] [OPT-350M](https://huggingface.co/facebook/opt-350m)
-- [x] [OPT-1.3B](https://huggingface.co/facebook/opt-1.3b)
-- [x] [OPT-2.7B](https://huggingface.co/facebook/opt-2.7b)
-- [x] [OPT-6.7B](https://huggingface.co/facebook/opt-6.7b)
-- [ ] [OPT-13B](https://huggingface.co/facebook/opt-13b)
-- [ ] [OPT-30B](https://huggingface.co/facebook/opt-30b)
-
-### [LLaMA](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md)
-
-- [x] LLaMA-7B
-- [x] LLaMA-13B
-- [ ] LLaMA-33B
-- [ ] LLaMA-65B
-
-## Add your own models
-
-If you want to support your own model in Coati, please refer to the pull request that added RoBERTa support as an example, [[chatgpt] add pre-trained model RoBERTa for RLHF stage 2 & 3](https://github.com/hpcaitech/ColossalAI/pull/3223), and submit a PR to us.
-
-You should complete the implementation of four model classes: the Reward model, the Critic model, the LM model, and the Actor model.
-
-Here is some example code for a new model named `Coati`. If it is supported by Hugging Face [transformers](https://github.com/huggingface/transformers), you can load it with `from_pretrained`; otherwise, you can build the model yourself.
-
-### Actor model
-
-```python
-from typing import Optional
-
-from ..base import Actor
-from transformers.models.coati import CoatiModel
-
-class CoatiActor(Actor):
- def __init__(self,
- pretrained: Optional[str] = None,
- checkpoint: bool = False,
- lora_rank: int = 0,
- lora_train_bias: str = 'none') -> None:
- if pretrained is not None:
- model = CoatiModel.from_pretrained(pretrained)
- else:
- model = build_model() # build your own model if it is not supported in transformers
-
- super().__init__(model, lora_rank, lora_train_bias)
-```
-
-### Reward model
-
-```python
-from typing import Optional
-
-import torch.nn as nn
-
-from ..base import RewardModel
-from transformers.models.coati import CoatiModel
-
-class CoatiRM(RewardModel):
-
- def __init__(self,
- pretrained: Optional[str] = None,
- checkpoint: bool = False,
- lora_rank: int = 0,
- lora_train_bias: str = 'none') -> None:
- if pretrained is not None:
- model = CoatiModel.from_pretrained(pretrained)
- else:
- model = build_model() # build your own model if it is not supported in transformers
-
- value_head = nn.Linear(model.config.n_embd, 1)
- value_head.weight.data.normal_(mean=0.0, std=1 / (model.config.n_embd + 1))
- super().__init__(model, value_head, lora_rank, lora_train_bias)
-```
-
-### Critic model
-
-```python
-from typing import Optional
-
-import torch.nn as nn
-
-from ..base import Critic
-from transformers.models.coati import CoatiModel
-
-class CoatiCritic(Critic):
- def __init__(self,
- pretrained: Optional[str] = None,
- checkpoint: bool = False,
- lora_rank: int = 0,
- lora_train_bias: str = 'none') -> None:
- if pretrained is not None:
- model = CoatiModel.from_pretrained(pretrained)
- else:
- model = build_model() # build your own model if it is not supported in transformers
-
- value_head = nn.Linear(model.config.n_embd, 1)
- value_head.weight.data.normal_(mean=0.0, std=1 / (model.config.n_embd + 1))
- super().__init__(model, value_head, lora_rank, lora_train_bias)
-```
diff --git a/applications/Chat/examples/download_model.py b/applications/Chat/examples/download_model.py
deleted file mode 100644
index ec3482b5f789..000000000000
--- a/applications/Chat/examples/download_model.py
+++ /dev/null
@@ -1,79 +0,0 @@
-import argparse
-import dataclasses
-import os
-from typing import List
-
-import tqdm
-from coati.models.bloom import BLOOMRM, BLOOMActor, BLOOMCritic
-from coati.models.gpt import GPTRM, GPTActor, GPTCritic
-from coati.models.opt import OPTRM, OPTActor, OPTCritic
-from huggingface_hub import hf_hub_download, snapshot_download
-from transformers import AutoConfig, AutoTokenizer, BloomConfig, BloomTokenizerFast, GPT2Config, GPT2Tokenizer
-
-
-@dataclasses.dataclass
-class HFRepoFiles:
- repo_id: str
- files: List[str]
-
- def download(self, dir_path: str):
- for file in self.files:
- file_path = hf_hub_download(self.repo_id, file, local_dir=dir_path)
-
- def download_all(self):
- snapshot_download(self.repo_id)
-
-
-def test_init(model: str, dir_path: str):
- if model == "gpt2":
- config = GPT2Config.from_pretrained(dir_path)
- actor = GPTActor(config=config)
- critic = GPTCritic(config=config)
- reward_model = GPTRM(config=config)
- GPT2Tokenizer.from_pretrained(dir_path)
- elif model == "bloom":
- config = BloomConfig.from_pretrained(dir_path)
- actor = BLOOMActor(config=config)
- critic = BLOOMCritic(config=config)
- reward_model = BLOOMRM(config=config)
- BloomTokenizerFast.from_pretrained(dir_path)
- elif model == "opt":
- config = AutoConfig.from_pretrained(dir_path)
- actor = OPTActor(config=config)
- critic = OPTCritic(config=config)
- reward_model = OPTRM(config=config)
- AutoTokenizer.from_pretrained(dir_path)
- else:
- raise NotImplementedError(f"Model {model} not implemented")
-
-
-if __name__ == "__main__":
- parser = argparse.ArgumentParser()
- parser.add_argument("--model-dir", type=str, default="test_models")
- parser.add_argument("--config-only", default=False, action="store_true")
- args = parser.parse_args()
-
- if os.path.exists(args.model_dir):
- print(f"[INFO]: {args.model_dir} already exists")
- exit(0)
-
- repo_list = {
- "gpt2": HFRepoFiles(repo_id="gpt2", files=["config.json", "tokenizer.json", "vocab.json", "merges.txt"]),
- "bloom": HFRepoFiles(
- repo_id="bigscience/bloom-560m", files=["config.json", "tokenizer.json", "tokenizer_config.json"]
- ),
- "opt": HFRepoFiles(
- repo_id="facebook/opt-350m", files=["config.json", "tokenizer_config.json", "vocab.json", "merges.txt"]
- ),
- }
-
- os.mkdir(args.model_dir)
- for model_name in tqdm.tqdm(repo_list):
- dir_path = os.path.join(args.model_dir, model_name)
- if args.config_only:
- os.mkdir(dir_path)
- repo_list[model_name].download(dir_path)
- else:
- repo_list[model_name].download_all()
- test_init(model_name, dir_path)
diff --git a/applications/Chat/examples/generate_conversation_dataset.py b/applications/Chat/examples/generate_conversation_dataset.py
deleted file mode 100644
index 7e03b2d54260..000000000000
--- a/applications/Chat/examples/generate_conversation_dataset.py
+++ /dev/null
@@ -1,82 +0,0 @@
-import argparse
-import json
-
-from datasets import load_dataset
-
-
-def generate_alpaca():
- # We can convert a dataset with the same format ("instruction", "input", "output") as Alpaca into a one-round conversation.
- conversation_dataset = []
- dataset = load_dataset("tatsu-lab/alpaca", split="train")
-
- instructions = dataset["instruction"]
- inputs = dataset["input"]
- outputs = dataset["output"]
-
- assert len(instructions) == len(inputs) == len(outputs)
-
- for idx in range(len(instructions)):
- human_utterance = instructions[idx] + "\n\n" + inputs[idx] if inputs[idx] else instructions[idx]
- human = {"from": "human", "value": human_utterance}
-
- gpt_utterance = outputs[idx]
- gpt = {"from": "gpt", "value": gpt_utterance}
-
- conversation = dict(type="instruction", language="English", dataset="Alpaca", conversations=[human, gpt])
- conversation_dataset.append(conversation)
-
- return conversation_dataset
-
-
-def generate_sharegpt():
- # ShareGPT data requires less processing.
- conversation_dataset = []
- dataset = load_dataset(
- "anon8231489123/ShareGPT_Vicuna_unfiltered",
- data_files="ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json",
- split="train",
- )
-
- conversations = dataset["conversations"]
-
- for idx in range(len(conversations)):
- for conv in conversations[idx]:
- # We don't need markdown and text value.
- del conv["markdown"]
- del conv["text"]
-
- conversation = dict(
- type="conversation", language="Multilingual", dataset="ShareGPT", conversations=conversations[idx]
- )
- conversation_dataset.append(conversation)
-
- return conversation_dataset
-
-
-if __name__ == "__main__":
- parser = argparse.ArgumentParser()
- parser.add_argument(
- "--dataset",
- type=str,
- default="All",
- choices=["Alpaca", "ShareGPT", "All"],
- help="which dataset to convert, All will combine Alpaca and ShareGPT",
- )
- parser.add_argument("--save_path", type=str, default="dataset.json", help="path to save the converted dataset")
- args = parser.parse_args()
-
- conversation_dataset = []
-
- if args.dataset == "Alpaca":
- conversation_dataset.extend(generate_alpaca())
- elif args.dataset == "ShareGPT":
- conversation_dataset.extend(generate_sharegpt())
- else:
- conversation_dataset.extend(generate_alpaca())
- conversation_dataset.extend(generate_sharegpt())
-
- for idx, sample in enumerate(conversation_dataset):
- sample["id"] = idx + 1
-
- with open(args.save_path, mode="w") as f:
- json.dump(conversation_dataset, f, indent=4, default=str, ensure_ascii=False)
diff --git a/applications/Chat/examples/generate_prompt_dataset.py b/applications/Chat/examples/generate_prompt_dataset.py
deleted file mode 100644
index 4eec6feae505..000000000000
--- a/applications/Chat/examples/generate_prompt_dataset.py
+++ /dev/null
@@ -1,27 +0,0 @@
-import argparse
-import json
-import random
-
-random.seed(42)
-
-
-def sample(args):
- with open(args.dataset_path, mode="r") as f:
- dataset_list = json.load(f)
-
- sampled_dataset = [
- {"instruction": sample["instruction"], "id": idx}
- for idx, sample in enumerate(random.sample(dataset_list, args.sample_size))
- ]
-
- with open(args.save_path, mode="w") as f:
- json.dump(sampled_dataset, f, indent=4, default=str, ensure_ascii=False)
-
-
-if __name__ == "__main__":
- parser = argparse.ArgumentParser()
- parser.add_argument("--dataset_path", type=str, default=None, required=True, help="path to the pretrain dataset")
- parser.add_argument("--save_path", type=str, default="prompt.json", help="path to save the prompt dataset")
- parser.add_argument("--sample_size", type=int, default=16384, help="size of the prompt dataset")
- args = parser.parse_args()
- sample(args)
diff --git a/applications/Chat/examples/inference.py b/applications/Chat/examples/inference.py
deleted file mode 100644
index 62e06bf7b3bb..000000000000
--- a/applications/Chat/examples/inference.py
+++ /dev/null
@@ -1,73 +0,0 @@
-import argparse
-
-import torch
-from coati.models.bloom import BLOOMActor
-from coati.models.generation import generate
-from coati.models.gpt import GPTActor
-from coati.models.llama import LlamaActor
-from coati.models.opt import OPTActor
-from transformers import AutoTokenizer, BloomTokenizerFast, GPT2Tokenizer, LlamaTokenizer
-
-
-def eval(args):
- # configure model
- if args.model == "gpt2":
- actor = GPTActor(pretrained=args.pretrain)
- elif args.model == "bloom":
- actor = BLOOMActor(pretrained=args.pretrain)
- elif args.model == "opt":
- actor = OPTActor(pretrained=args.pretrain)
- elif args.model == "llama":
- actor = LlamaActor(pretrained=args.pretrain)
- else:
- raise ValueError(f'Unsupported model "{args.model}"')
-
- actor.to(torch.cuda.current_device())
- if args.model_path is not None:
- state_dict = torch.load(args.model_path)
- actor.load_state_dict(state_dict)
-
- # configure tokenizer
- if args.model == "gpt2":
- tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
- tokenizer.pad_token = tokenizer.eos_token
- elif args.model == "bloom":
- tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-560m")
- tokenizer.pad_token = tokenizer.eos_token
- elif args.model == "opt":
- tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
- tokenizer.pad_token = tokenizer.eos_token
- elif args.model == "llama":
- tokenizer = LlamaTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")
- tokenizer.eos_token = "</s>"
- tokenizer.pad_token = tokenizer.unk_token
- else:
- raise ValueError(f'Unsupported model "{args.model}"')
-
- actor.eval()
- tokenizer.padding_side = "left"
- input_ids = tokenizer.encode(args.input, return_tensors="pt").to(torch.cuda.current_device())
- outputs = generate(
- actor,
- input_ids,
- tokenizer=tokenizer,
- max_length=args.max_length,
- do_sample=True,
- top_k=50,
- top_p=0.95,
- num_return_sequences=1,
- )
- output = tokenizer.batch_decode(outputs[0], skip_special_tokens=True)
- print(f"[Output]: {''.join(output)}")
-
-
-if __name__ == "__main__":
- parser = argparse.ArgumentParser()
- parser.add_argument("--model", default="gpt2", choices=["gpt2", "bloom", "opt", "llama"])
- # We suggest using a pretrained model from Hugging Face; use --pretrain to configure the model
- parser.add_argument("--pretrain", type=str, default=None)
- parser.add_argument("--model_path", type=str, default=None)
- parser.add_argument("--input", type=str, default="Question: How are you ? Answer:")
- parser.add_argument("--max_length", type=int, default=100)
- args = parser.parse_args()
- eval(args)
diff --git a/applications/Chat/examples/train_prompts.py b/applications/Chat/examples/train_prompts.py
deleted file mode 100644
index 8868e278d85e..000000000000
--- a/applications/Chat/examples/train_prompts.py
+++ /dev/null
@@ -1,249 +0,0 @@
-import argparse
-import warnings
-
-import torch
-import torch.distributed as dist
-from coati.dataset import PromptDataset, SupervisedDataset
-from coati.models.bloom import BLOOMRM, BLOOMActor, BLOOMCritic
-from coati.models.gpt import GPTRM, GPTActor, GPTCritic
-from coati.models.llama import LlamaActor, LlamaCritic, LlamaRM
-from coati.models.opt import OPTRM, OPTActor, OPTCritic
-from coati.trainer import PPOTrainer
-from coati.trainer.strategies import DDPStrategy, GeminiStrategy, LowLevelZeroStrategy
-from torch.optim import Adam
-from torch.utils.data import DataLoader
-from torch.utils.data.distributed import DistributedSampler
-from transformers import AutoTokenizer, BloomTokenizerFast, GPT2Tokenizer, LlamaTokenizer
-
-from colossalai.nn.optimizer import HybridAdam
-
-
-def main(args):
- # configure strategy
- if args.strategy == "ddp":
- strategy = DDPStrategy()
- elif args.strategy == "colossalai_gemini":
- strategy = GeminiStrategy(placement_policy="static", initial_scale=2**5)
- elif args.strategy == "colossalai_zero2":
- strategy = LowLevelZeroStrategy(stage=2, placement_policy="cuda")
- else:
- raise ValueError(f'Unsupported strategy "{args.strategy}"')
-
- if args.rm_path is not None:
- warnings.warn("LoRA weights should be merged with the model weights")
- state_dict = torch.load(args.rm_path, map_location="cpu")
-
- if args.lora_rank > 0:
- warnings.warn("Lora is not supported yet.")
- args.lora_rank = 0
-
- with strategy.model_init_context():
- # configure model
- if args.model == "gpt2":
- initial_model = GPTActor(pretrained=args.pretrain)
- elif args.model == "bloom":
- initial_model = BLOOMActor(pretrained=args.pretrain)
- elif args.model == "opt":
- initial_model = OPTActor(pretrained=args.pretrain)
- elif args.model == "llama":
- initial_model = LlamaActor(pretrained=args.pretrain)
- else:
- raise ValueError(f'Unsupported actor model "{args.model}"')
-
- if args.rm_model is None:
- rm_model_name = args.model
- else:
- rm_model_name = args.rm_model
-
- if rm_model_name == "gpt2":
- reward_model = GPTRM(pretrained=args.rm_pretrain, lora_rank=args.lora_rank)
- elif rm_model_name == "bloom":
- reward_model = BLOOMRM(pretrained=args.rm_pretrain, lora_rank=args.lora_rank)
- elif rm_model_name == "opt":
- reward_model = OPTRM(pretrained=args.rm_pretrain, lora_rank=args.lora_rank)
- elif rm_model_name == "llama":
- reward_model = LlamaRM(pretrained=args.rm_pretrain, lora_rank=args.lora_rank)
- else:
- raise ValueError(f'Unsupported reward model "{rm_model_name}"')
-
- if args.rm_path is not None:
- reward_model.load_state_dict(state_dict, strict=False)
-
- initial_model.to(torch.bfloat16).to(torch.cuda.current_device())
- reward_model.to(torch.bfloat16).to(torch.cuda.current_device())
-
- if args.model == "gpt2":
- actor = GPTActor(pretrained=args.pretrain, lora_rank=args.lora_rank)
- elif args.model == "bloom":
- actor = BLOOMActor(pretrained=args.pretrain, lora_rank=args.lora_rank)
- elif args.model == "opt":
- actor = OPTActor(pretrained=args.pretrain, lora_rank=args.lora_rank)
- elif args.model == "llama":
- actor = LlamaActor(pretrained=args.pretrain, lora_rank=args.lora_rank)
- else:
- raise ValueError(f'Unsupported actor model "{args.model}"')
-
- if rm_model_name == "gpt2":
- critic = GPTCritic(pretrained=args.rm_pretrain, lora_rank=args.lora_rank)
- elif rm_model_name == "bloom":
- critic = BLOOMCritic(pretrained=args.rm_pretrain, lora_rank=args.lora_rank)
- elif rm_model_name == "opt":
- critic = OPTCritic(pretrained=args.rm_pretrain, lora_rank=args.lora_rank)
- elif rm_model_name == "llama":
- critic = LlamaCritic(pretrained=args.rm_pretrain, lora_rank=args.lora_rank)
- else:
- raise ValueError(f'Unsupported reward model "{rm_model_name}"')
-
- if args.rm_path is not None:
- critic.load_state_dict(state_dict, strict=False)
- del state_dict
-
- actor.to(torch.bfloat16).to(torch.cuda.current_device())
- critic.to(torch.bfloat16).to(torch.cuda.current_device())
-
- # configure optimizer
- if args.strategy.startswith("colossalai"):
- actor_optim = HybridAdam(actor.parameters(), lr=args.lr)
- critic_optim = HybridAdam(critic.parameters(), lr=args.lr)
- else:
- actor_optim = Adam(actor.parameters(), lr=args.lr)
- critic_optim = Adam(critic.parameters(), lr=args.lr)
-
- # configure tokenizer
- if args.model == "gpt2":
- tokenizer = GPT2Tokenizer.from_pretrained("gpt2" if args.tokenizer is None else args.tokenizer)
- tokenizer.pad_token = tokenizer.eos_token
- elif args.model == "bloom":
- tokenizer = BloomTokenizerFast.from_pretrained(
- "bigscience/bloom-560m" if args.tokenizer is None else args.tokenizer
- )
- tokenizer.pad_token = tokenizer.eos_token
- elif args.model == "opt":
- tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m" if args.tokenizer is None else args.tokenizer)
- tokenizer.pad_token = tokenizer.eos_token
- elif args.model == "llama":
- tokenizer = LlamaTokenizer.from_pretrained(
- "hf-internal-testing/llama-tokenizer" if args.tokenizer is None else args.tokenizer
- )
- tokenizer.eos_token = "</s>"
- tokenizer.pad_token = tokenizer.unk_token
- else:
- raise ValueError(f'Unsupported model "{args.model}"')
- # NOTE: generate() requires padding_side to be "left"
- tokenizer.padding_side = "left"
-
- prompt_dataset = PromptDataset(
- tokenizer=tokenizer,
- data_path=args.prompt_dataset,
- max_datasets_size=args.max_datasets_size,
- max_length=args.max_input_len,
- )
- if dist.is_initialized() and dist.get_world_size() > 1:
- prompt_sampler = DistributedSampler(prompt_dataset, shuffle=True, seed=42, drop_last=True)
- else:
- prompt_sampler = None
- prompt_dataloader = DataLoader(
- prompt_dataset, shuffle=(prompt_sampler is None), sampler=prompt_sampler, batch_size=args.experience_batch_size
- )
-
- pretrain_dataset = SupervisedDataset(
- tokenizer=tokenizer,
- data_path=args.pretrain_dataset,
- max_datasets_size=args.max_datasets_size,
- max_length=args.max_input_len,
- )
- if dist.is_initialized() and dist.get_world_size() > 1:
- pretrain_sampler = DistributedSampler(pretrain_dataset, shuffle=True, seed=42, drop_last=True)
- else:
- pretrain_sampler = None
- pretrain_dataloader = DataLoader(
- pretrain_dataset, shuffle=(pretrain_sampler is None), sampler=pretrain_sampler, batch_size=args.ptx_batch_size
- )
-
- # NOTE: For small models like opt-1.3b, reward model and initial model are not required to be parallelized.
- (actor, actor_optim), (critic, critic_optim), reward_model, initial_model = strategy.prepare(
- (actor, actor_optim), (critic, critic_optim), reward_model, initial_model
- )
-
- # configure trainer
- trainer = PPOTrainer(
- strategy,
- actor,
- critic,
- reward_model,
- initial_model,
- actor_optim,
- critic_optim,
- tokenizer=tokenizer,
- kl_coef=args.kl_coef,
- ptx_coef=args.ptx_coef,
- train_batch_size=args.train_batch_size,
- max_length=args.max_seq_len,
- use_cache=True,
- do_sample=True,
- temperature=1.0,
- top_k=50,
- offload_inference_models=args.strategy != "colossalai_gemini",
- )
-
- trainer.fit(
- num_episodes=args.num_episodes,
- num_collect_steps=args.num_collect_steps,
- num_update_steps=args.num_update_steps,
- prompt_dataloader=prompt_dataloader,
- pretrain_dataloader=pretrain_dataloader,
- log_dir=args.log_dir,
- use_wandb=args.use_wandb,
- )
-
- if args.lora_rank > 0 and args.merge_lora_weights:
- from coati.models.lora import LORA_MANAGER
-
- # NOTE: set model to eval to merge LoRA weights
- LORA_MANAGER.merge_weights = True
- actor.eval()
- # save model checkpoint after fitting
- strategy.save_pretrained(actor, path=args.save_path)
- # save optimizer checkpoint on all ranks
- if args.need_optim_ckpt:
- strategy.save_optimizer(
- actor_optim, "actor_optim_checkpoint_prompts_%d.pt" % (torch.cuda.current_device()), only_rank0=False
- )
-
-
-if __name__ == "__main__":
- parser = argparse.ArgumentParser()
- parser.add_argument("--prompt_dataset", type=str, default=None, help="path to the prompt dataset")
- parser.add_argument("--pretrain_dataset", type=str, default=None, help="path to the pretrained dataset")
- parser.add_argument("--max_datasets_size", type=int, default=50000)
- parser.add_argument(
- "--strategy",
- choices=["ddp", "colossalai_gemini", "colossalai_zero2"],
- default="colossalai_zero2",
- help="strategy to use",
- )
- parser.add_argument("--model", default="gpt2", choices=["gpt2", "bloom", "opt", "llama"])
- parser.add_argument("--tokenizer", type=str, default=None)
- parser.add_argument("--pretrain", type=str, default=None)
- parser.add_argument("--rm_model", default=None, choices=["gpt2", "bloom", "opt", "llama"])
- parser.add_argument("--rm_path", type=str, default=None)
- parser.add_argument("--rm_pretrain", type=str, default=None)
- parser.add_argument("--save_path", type=str, default="actor_checkpoint_prompts")
- parser.add_argument("--need_optim_ckpt", type=bool, default=False)
- parser.add_argument("--num_episodes", type=int, default=10)
- parser.add_argument("--num_collect_steps", type=int, default=10)
- parser.add_argument("--num_update_steps", type=int, default=5)
- parser.add_argument("--train_batch_size", type=int, default=8)
- parser.add_argument("--ptx_batch_size", type=int, default=1)
- parser.add_argument("--experience_batch_size", type=int, default=8)
- parser.add_argument("--lora_rank", type=int, default=0, help="low-rank adaptation matrices rank")
- parser.add_argument("--merge_lora_weights", type=bool, default=True)
- parser.add_argument("--lr", type=float, default=1e-7)
- parser.add_argument("--kl_coef", type=float, default=0.1)
- parser.add_argument("--ptx_coef", type=float, default=0.9)
- parser.add_argument("--max_input_len", type=int, default=96)
- parser.add_argument("--max_seq_len", type=int, default=128)
- parser.add_argument("--log_dir", default="logs", type=str)
- parser.add_argument("--use_wandb", default=False, action="store_true")
- args = parser.parse_args()
- main(args)
diff --git a/applications/Chat/examples/train_prompts.sh b/applications/Chat/examples/train_prompts.sh
deleted file mode 100755
index d04c416015b1..000000000000
--- a/applications/Chat/examples/train_prompts.sh
+++ /dev/null
@@ -1,25 +0,0 @@
-set_n_least_used_CUDA_VISIBLE_DEVICES() {
- local n=${1:-"9999"}
- echo "GPU Memory Usage:"
- local FIRST_N_GPU_IDS=$(nvidia-smi --query-gpu=memory.used --format=csv |
- tail -n +2 |
- nl -v 0 |
- tee /dev/tty |
- sort -g -k 2 |
- awk '{print $1}' |
- head -n $n)
- export CUDA_VISIBLE_DEVICES=$(echo $FIRST_N_GPU_IDS | sed 's/ /,/g')
- echo "Now CUDA_VISIBLE_DEVICES is set to:"
- echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
-}
-
-set_n_least_used_CUDA_VISIBLE_DEVICES 2
-
-# torchrun --standalone --nproc_per_node=2 train_prompts.py prompts.csv --strategy colossalai_zero2
-
-torchrun --standalone --nproc_per_node=2 train_prompts.py \
- --pretrain_dataset /path/to/data.json \
- --prompt_dataset /path/to/data.json \
- --strategy colossalai_zero2 \
- --num_episodes 1 --num_collect_steps 2 --num_update_steps 1 \
- --train_batch_size 2
diff --git a/applications/Chat/examples/train_reward_model.py b/applications/Chat/examples/train_reward_model.py
deleted file mode 100644
index df6e8b6bdc26..000000000000
--- a/applications/Chat/examples/train_reward_model.py
+++ /dev/null
@@ -1,208 +0,0 @@
-import argparse
-import warnings
-
-import torch
-import torch.distributed as dist
-from coati.dataset import HhRlhfDataset, RmStaticDataset
-from coati.models import LogExpLoss, LogSigLoss
-from coati.models.bloom import BLOOMRM
-from coati.models.gpt import GPTRM
-from coati.models.llama import LlamaRM
-from coati.models.opt import OPTRM
-from coati.trainer import RewardModelTrainer
-from coati.trainer.strategies import DDPStrategy, GeminiStrategy, LowLevelZeroStrategy
-from datasets import load_dataset
-from torch.optim import Adam
-from torch.optim.lr_scheduler import CosineAnnealingLR
-from torch.utils.data import DataLoader
-from torch.utils.data.distributed import DistributedSampler
-from transformers import AutoTokenizer, BloomTokenizerFast, LlamaTokenizer
-from transformers.models.gpt2.tokenization_gpt2 import GPT2Tokenizer
-
-from colossalai.nn.optimizer import HybridAdam
-
-
-def train(args):
- # configure strategy
- if args.strategy == "ddp":
- strategy = DDPStrategy()
- elif args.strategy == "colossalai_gemini":
- strategy = GeminiStrategy(placement_policy="auto")
- elif args.strategy == "colossalai_zero2":
- strategy = LowLevelZeroStrategy(stage=2, placement_policy="cuda")
- else:
- raise ValueError(f'Unsupported strategy "{args.strategy}"')
-
- # configure model
- if args.lora_rank > 0:
- warnings.warn("Lora is not supported yet.")
- args.lora_rank = 0
-
- with strategy.model_init_context():
- if args.model == "bloom":
- model = BLOOMRM(pretrained=args.pretrain, lora_rank=args.lora_rank)
- elif args.model == "opt":
- model = OPTRM(pretrained=args.pretrain, lora_rank=args.lora_rank)
- elif args.model == "gpt2":
- model = GPTRM(pretrained=args.pretrain, lora_rank=args.lora_rank)
- elif args.model == "llama":
- model = LlamaRM(pretrained=args.pretrain, lora_rank=args.lora_rank)
- else:
- raise ValueError(f'Unsupported model "{args.model}"')
-
- model.to(torch.bfloat16).to(torch.cuda.current_device())
-
- if args.model_path is not None:
- state_dict = torch.load(args.model_path)
- model.load_state_dict(state_dict)
-
- # configure tokenizer
- if args.model == "gpt2":
- tokenizer = GPT2Tokenizer.from_pretrained("gpt2" if args.tokenizer is None else args.tokenizer)
- tokenizer.pad_token = tokenizer.eos_token
- elif args.model == "bloom":
- tokenizer = BloomTokenizerFast.from_pretrained(
- "bigscience/bloom-560m" if args.tokenizer is None else args.tokenizer
- )
- tokenizer.pad_token = tokenizer.eos_token
- elif args.model == "opt":
- tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m" if args.tokenizer is None else args.tokenizer)
- tokenizer.pad_token = tokenizer.eos_token
- elif args.model == "llama":
- tokenizer = LlamaTokenizer.from_pretrained(
- "hf-internal-testing/llama-tokenizer" if args.tokenizer is None else args.tokenizer
- )
- tokenizer.eos_token = "</s>"
- tokenizer.pad_token = tokenizer.unk_token
- else:
- raise ValueError(f'Unsupported model "{args.model}"')
-
- # configure optimizer
- if args.strategy.startswith("colossalai"):
- optim = HybridAdam(model.parameters(), lr=args.lr)
- else:
- optim = Adam(model.parameters(), lr=args.lr)
-
- # configure loss function
- if args.loss_fn == "log_sig":
- loss_fn = LogSigLoss()
- elif args.loss_fn == "log_exp":
- loss_fn = LogExpLoss()
- else:
- raise ValueError(f'Unsupported loss function "{args.loss_fn}"')
-
- # prepare for data and dataset
- if args.subset is not None:
- data = load_dataset(args.dataset, data_dir=args.subset)
- else:
- data = load_dataset(args.dataset)
-
- train_data = data["train"].select(range(min(args.max_datasets_size, len(data["train"]))))
- eval_data = data["test"].select(range(min(args.max_datasets_size, len(data["test"]))))
-
- if args.dataset == "Dahoas/rm-static":
- train_dataset = RmStaticDataset(train_data, tokenizer, args.max_len)
- eval_dataset = RmStaticDataset(eval_data, tokenizer, args.max_len)
- elif args.dataset == "Anthropic/hh-rlhf":
- train_dataset = HhRlhfDataset(train_data, tokenizer, args.max_len)
- eval_dataset = HhRlhfDataset(eval_data, tokenizer, args.max_len)
- else:
- raise ValueError(f'Unsupported dataset "{args.dataset}"')
-
- if dist.is_initialized() and dist.get_world_size() > 1:
- train_sampler = DistributedSampler(
- train_dataset,
- shuffle=True,
- seed=42,
- drop_last=True,
- rank=dist.get_rank(),
- num_replicas=dist.get_world_size(),
- )
- eval_sampler = DistributedSampler(
- eval_dataset,
- shuffle=True,
- seed=42,
- drop_last=True,
- rank=dist.get_rank(),
- num_replicas=dist.get_world_size(),
- )
- else:
- train_sampler = None
- eval_sampler = None
-
- train_dataloader = DataLoader(
- train_dataset,
- shuffle=(train_sampler is None),
- sampler=train_sampler,
- batch_size=args.batch_size,
- pin_memory=True,
- )
-
- eval_dataloader = DataLoader(
- eval_dataset, shuffle=(eval_sampler is None), sampler=eval_sampler, batch_size=args.batch_size, pin_memory=True
- )
-
- lr_scheduler = CosineAnnealingLR(optim, len(train_dataloader) // 100)
- strategy_dict = strategy.prepare(dict(model=model, optimizer=optim, lr_scheduler=lr_scheduler))
- model = strategy_dict["model"]
- optim = strategy_dict["optimizer"]
- lr_scheduler = strategy_dict["lr_scheduler"]
- trainer = RewardModelTrainer(
- model=model,
- strategy=strategy,
- optim=optim,
- lr_scheduler=lr_scheduler,
- loss_fn=loss_fn,
- max_epochs=args.max_epochs,
- )
-
- trainer.fit(
- train_dataloader=train_dataloader,
- eval_dataloader=eval_dataloader,
- log_dir=args.log_dir,
- use_wandb=args.use_wandb,
- )
-
- if args.lora_rank > 0 and args.merge_lora_weights:
- from coati.models.lora import LORA_MANAGER
-
- # NOTE: set model to eval to merge LoRA weights
- LORA_MANAGER.merge_weights = True
- model.eval()
- # save model checkpoint after fitting on only rank0
- state_dict = model.state_dict()
- torch.save(state_dict, args.save_path)
- # save optimizer checkpoint on all ranks
- if args.need_optim_ckpt:
- strategy.save_optimizer(
- trainer.optimizer, "rm_optim_checkpoint_%d.pt" % (torch.cuda.current_device()), only_rank0=False
- )
-
-
-if __name__ == "__main__":
- parser = argparse.ArgumentParser()
- parser.add_argument(
- "--strategy", choices=["ddp", "colossalai_gemini", "colossalai_zero2"], default="colossalai_zero2"
- )
- parser.add_argument("--model", choices=["gpt2", "bloom", "opt", "llama"], default="bloom")
- parser.add_argument("--tokenizer", type=str, default=None)
- parser.add_argument("--pretrain", type=str, default=None)
- parser.add_argument("--model_path", type=str, default=None)
- parser.add_argument("--need_optim_ckpt", type=bool, default=False)
- parser.add_argument(
- "--dataset", type=str, choices=["Anthropic/hh-rlhf", "Dahoas/rm-static"], default="Dahoas/rm-static"
- )
- parser.add_argument("--subset", type=lambda x: None if x == "None" else x, default=None)
- parser.add_argument("--max_datasets_size", type=int, default=1000000)
- parser.add_argument("--save_path", type=str, default="rm_ckpt")
- parser.add_argument("--max_epochs", type=int, default=1)
- parser.add_argument("--batch_size", type=int, default=1)
- parser.add_argument("--max_len", type=int, default=512)
- parser.add_argument("--lora_rank", type=int, default=0, help="low-rank adaptation matrices rank")
- parser.add_argument("--merge_lora_weights", type=bool, default=True)
- parser.add_argument("--lr", type=float, default=9e-6)
- parser.add_argument("--loss_fn", type=str, default="log_sig", choices=["log_sig", "log_exp"])
- parser.add_argument("--log_dir", default="logs", type=str)
- parser.add_argument("--use_wandb", default=False, action="store_true")
- args = parser.parse_args()
- train(args)
diff --git a/applications/Chat/examples/train_rm.sh b/applications/Chat/examples/train_rm.sh
deleted file mode 100755
index c5ebaf708ddc..000000000000
--- a/applications/Chat/examples/train_rm.sh
+++ /dev/null
@@ -1,25 +0,0 @@
-set_n_least_used_CUDA_VISIBLE_DEVICES() {
- local n=${1:-"9999"}
- echo "GPU Memory Usage:"
- local FIRST_N_GPU_IDS=$(nvidia-smi --query-gpu=memory.used --format=csv |
- tail -n +2 |
- nl -v 0 |
- tee /dev/tty |
- sort -g -k 2 |
- awk '{print $1}' |
- head -n $n)
- export CUDA_VISIBLE_DEVICES=$(echo $FIRST_N_GPU_IDS | sed 's/ /,/g')
- echo "Now CUDA_VISIBLE_DEVICES is set to:"
- echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
-}
-
-set_n_least_used_CUDA_VISIBLE_DEVICES 2
-
-torchrun --standalone --nproc_per_node=2 train_reward_model.py \
- --pretrain 'gpt2' \
- --model 'gpt2' \
- --strategy colossalai_zero2 \
- --loss_fn 'log_exp' \
- --dataset 'Anthropic/hh-rlhf' \
- --batch_size 16 \
- --max_epochs 10
diff --git a/applications/Chat/examples/train_sft.py b/applications/Chat/examples/train_sft.py
deleted file mode 100644
index 66d08da30120..000000000000
--- a/applications/Chat/examples/train_sft.py
+++ /dev/null
@@ -1,221 +0,0 @@
-import argparse
-import math
-import warnings
-
-import torch
-import torch.distributed as dist
-from coati.dataset import SFTDataset, SupervisedDataset
-from coati.models.bloom import BLOOMActor
-from coati.models.chatglm import ChatGLMActor
-from coati.models.chatglm.chatglm_tokenizer import ChatGLMTokenizer
-from coati.models.gpt import GPTActor
-from coati.models.llama import LlamaActor
-from coati.models.opt import OPTActor
-from coati.trainer import SFTTrainer
-from coati.trainer.strategies import DDPStrategy, GeminiStrategy, LowLevelZeroStrategy
-from datasets import load_dataset
-from torch.optim import Adam
-from torch.utils.data import DataLoader
-from torch.utils.data.distributed import DistributedSampler
-from transformers import AutoTokenizer, BloomTokenizerFast, LlamaTokenizer
-from transformers.models.gpt2.tokenization_gpt2 import GPT2Tokenizer
-from transformers.trainer import get_scheduler
-
-from colossalai.logging import get_dist_logger
-from colossalai.nn.optimizer import HybridAdam
-
-
-def train(args):
- # configure strategy
- if args.strategy == "ddp":
- strategy = DDPStrategy()
- elif args.strategy == "colossalai_gemini":
- strategy = GeminiStrategy(placement_policy="auto")
- elif args.strategy == "colossalai_zero2":
- strategy = LowLevelZeroStrategy(stage=2, placement_policy="cuda")
- elif args.strategy == "colossalai_zero2_cpu":
- strategy = LowLevelZeroStrategy(stage=2, placement_policy="cpu")
- else:
- raise ValueError(f'Unsupported strategy "{args.strategy}"')
-
- # configure model
- if args.lora_rank > 0:
- warnings.warn("Lora is not supported yet.")
- args.lora_rank = 0
-
- with strategy.model_init_context():
- if args.model == "bloom":
- model = BLOOMActor(pretrained=args.pretrain, lora_rank=args.lora_rank, checkpoint=args.grad_checkpoint)
- elif args.model == "opt":
- model = OPTActor(pretrained=args.pretrain, lora_rank=args.lora_rank, checkpoint=args.grad_checkpoint)
- elif args.model == "gpt2":
- model = GPTActor(pretrained=args.pretrain, lora_rank=args.lora_rank, checkpoint=args.grad_checkpoint)
- elif args.model == "llama":
- model = LlamaActor(pretrained=args.pretrain, lora_rank=args.lora_rank, checkpoint=args.grad_checkpoint)
- elif args.model == "chatglm":
- model = ChatGLMActor(pretrained=args.pretrain)
- else:
- raise ValueError(f'Unsupported model "{args.model}"')
-
- model.to(torch.bfloat16).to(torch.cuda.current_device())
-
- # configure tokenizer
- if args.model == "gpt2":
- tokenizer = GPT2Tokenizer.from_pretrained("gpt2" if args.tokenizer is None else args.tokenizer)
- tokenizer.pad_token = tokenizer.eos_token
- elif args.model == "bloom":
- tokenizer = BloomTokenizerFast.from_pretrained(
- "bigscience/bloom-560m" if args.tokenizer is None else args.tokenizer
- )
- tokenizer.pad_token = tokenizer.eos_token
- elif args.model == "opt":
- tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m" if args.tokenizer is None else args.tokenizer)
- tokenizer.pad_token = tokenizer.eos_token
- elif args.model == "llama":
- tokenizer = LlamaTokenizer.from_pretrained(
- "hf-internal-testing/llama-tokenizer" if args.tokenizer is None else args.tokenizer
- )
- tokenizer.eos_token = "</s>"
- tokenizer.pad_token = tokenizer.unk_token
- elif args.model == "chatglm":
- tokenizer = ChatGLMTokenizer.from_pretrained(
- "THUDM/chatglm-6b" if args.tokenizer is None else args.tokenizer, trust_remote_code=True
- )
- else:
- raise ValueError(f'Unsupported model "{args.model}"')
-
- # configure optimizer
- if args.strategy.startswith("colossalai"):
- optim = HybridAdam(model.parameters(), lr=args.lr, clipping_norm=1.0)
- else:
- optim = Adam(model.parameters(), lr=args.lr)
-
- # configure dataset
- if args.dataset == "yizhongw/self_instruct":
- train_data = load_dataset(args.dataset, "super_natural_instructions", split="train")
- eval_data = load_dataset(args.dataset, "super_natural_instructions", split="test")
-
- if args.max_datasets_size is not None:
- train_data = train_data.select(range(min(args.max_datasets_size, len(train_data))))
- eval_data = eval_data.select(range(min(args.max_datasets_size, len(eval_data))))
-
- train_dataset = SFTDataset(train_data, tokenizer, args.max_len)
- eval_dataset = SFTDataset(eval_data, tokenizer, args.max_len)
-
- else:
- train_dataset = SupervisedDataset(
- tokenizer=tokenizer,
- data_path=args.dataset,
- max_datasets_size=args.max_datasets_size,
- max_length=args.max_len,
- )
- eval_dataset = None
-
- if dist.is_initialized() and dist.get_world_size() > 1:
- train_sampler = DistributedSampler(
- train_dataset,
- shuffle=True,
- seed=42,
- drop_last=True,
- rank=dist.get_rank(),
- num_replicas=dist.get_world_size(),
- )
- if eval_dataset is not None:
- eval_sampler = DistributedSampler(
- eval_dataset,
- shuffle=False,
- seed=42,
- drop_last=False,
- rank=dist.get_rank(),
- num_replicas=dist.get_world_size(),
- )
- else:
- train_sampler = None
- eval_sampler = None
-
- train_dataloader = DataLoader(
- train_dataset,
- shuffle=(train_sampler is None),
- sampler=train_sampler,
- batch_size=args.batch_size,
- pin_memory=True,
- )
- if eval_dataset is not None:
- eval_dataloader = DataLoader(
- eval_dataset,
- shuffle=(eval_sampler is None),
- sampler=eval_sampler,
- batch_size=args.batch_size,
- pin_memory=True,
- )
- else:
- eval_dataloader = None
-
- num_update_steps_per_epoch = len(train_dataloader) // args.accumulation_steps
- max_steps = math.ceil(args.max_epochs * num_update_steps_per_epoch)
- lr_scheduler = get_scheduler(
- "cosine", optim, num_warmup_steps=math.ceil(max_steps * 0.03), num_training_steps=max_steps
- )
- strategy_dict = strategy.prepare(dict(model=model, optimizer=optim, lr_scheduler=lr_scheduler))
- model = strategy_dict["model"]
- optim = strategy_dict["optimizer"]
- lr_scheduler = strategy_dict["lr_scheduler"]
- trainer = SFTTrainer(
- model=model,
- strategy=strategy,
- optim=optim,
- lr_scheduler=lr_scheduler,
- max_epochs=args.max_epochs,
- accumulation_steps=args.accumulation_steps,
- )
-
- logger = get_dist_logger()
- trainer.fit(
- train_dataloader=train_dataloader,
- eval_dataloader=eval_dataloader,
- logger=logger,
- log_dir=args.log_dir,
- use_wandb=args.use_wandb,
- )
-
- if args.lora_rank > 0 and args.merge_lora_weights:
- from coati.models.lora import LORA_MANAGER
-
- # NOTE: set model to eval to merge LoRA weights
- LORA_MANAGER.merge_weights = True
- model.eval()
- # save model checkpoint after fitting on only rank0
- strategy.save_pretrained(model, path=args.save_path, tokenizer=tokenizer)
- # save optimizer checkpoint on all ranks
- if args.need_optim_ckpt:
- strategy.save_optimizer(
- trainer.optimizer, "rm_optim_checkpoint_%d.pt" % (torch.cuda.current_device()), only_rank0=False
- )
-
-
-if __name__ == "__main__":
- parser = argparse.ArgumentParser()
- parser.add_argument(
- "--strategy",
- choices=["ddp", "colossalai_gemini", "colossalai_zero2", "colossalai_zero2_cpu"],
- default="colossalai_zero2",
- )
- parser.add_argument("--model", choices=["gpt2", "bloom", "opt", "llama", "chatglm"], default="bloom")
- parser.add_argument("--tokenizer", type=str, default=None)
- parser.add_argument("--pretrain", type=str, default=None)
- parser.add_argument("--dataset", type=str, default=None)
- parser.add_argument("--max_datasets_size", type=int, default=None)
- parser.add_argument("--save_path", type=str, default="output")
- parser.add_argument("--need_optim_ckpt", type=bool, default=False)
- parser.add_argument("--max_epochs", type=int, default=3)
- parser.add_argument("--batch_size", type=int, default=4)
- parser.add_argument("--max_len", type=int, default=512)
- parser.add_argument("--lora_rank", type=int, default=0, help="low-rank adaptation matrices rank")
- parser.add_argument("--merge_lora_weights", type=bool, default=True)
- parser.add_argument("--lr", type=float, default=5e-6)
- parser.add_argument("--accumulation_steps", type=int, default=8)
- parser.add_argument("--log_dir", default="logs", type=str)
- parser.add_argument("--use_wandb", default=False, action="store_true")
- parser.add_argument("--grad_checkpoint", default=False, action="store_true")
- args = parser.parse_args()
- train(args)
diff --git a/applications/Chat/examples/train_sft.sh b/applications/Chat/examples/train_sft.sh
deleted file mode 100755
index 0fb4da3d3ce8..000000000000
--- a/applications/Chat/examples/train_sft.sh
+++ /dev/null
@@ -1,28 +0,0 @@
-set_n_least_used_CUDA_VISIBLE_DEVICES() {
- local n=${1:-"9999"}
- echo "GPU Memory Usage:"
- local FIRST_N_GPU_IDS=$(nvidia-smi --query-gpu=memory.used --format=csv |
- tail -n +2 |
- nl -v 0 |
- tee /dev/tty |
- sort -g -k 2 |
- awk '{print $1}' |
- head -n $n)
- export CUDA_VISIBLE_DEVICES=$(echo $FIRST_N_GPU_IDS | sed 's/ /,/g')
- echo "Now CUDA_VISIBLE_DEVICES is set to:"
- echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
-}
-
-set_n_least_used_CUDA_VISIBLE_DEVICES 4
-
-torchrun --standalone --nproc_per_node=4 train_sft.py \
- --pretrain "/path/to/LLaMa-7B/" \
- --model 'llama' \
- --strategy colossalai_zero2 \
- --save_path /path/to/Coati-7B \
- --dataset /path/to/data.json \
- --batch_size 4 \
- --accumulation_steps 8 \
- --lr 2e-5 \
- --max_datasets_size 512 \
- --max_epochs 1
diff --git a/applications/Chat/inference/benchmark.py b/applications/Chat/inference/benchmark.py
deleted file mode 100644
index dbb5490a63dc..000000000000
--- a/applications/Chat/inference/benchmark.py
+++ /dev/null
@@ -1,141 +0,0 @@
-# Adapted from https://github.com/tloen/alpaca-lora/blob/main/generate.py
-
-import argparse
-from time import time
-
-import torch
-from coati.quant import llama_load_quant, low_resource_init
-from transformers import AutoTokenizer, GenerationConfig, LlamaConfig, LlamaForCausalLM
-
-
-def generate_prompt(instruction, input=None):
- if input:
- return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
-
-### Instruction:
-{instruction}
-
-### Input:
-{input}
-
-### Response:"""
- else:
- return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.
-
-### Instruction:
-{instruction}
-
-### Response:"""
-
-
-@torch.no_grad()
-def evaluate(
- model,
- tokenizer,
- instruction,
- input=None,
- temperature=0.1,
- top_p=0.75,
- top_k=40,
- num_beams=4,
- max_new_tokens=128,
- **kwargs,
-):
- prompt = generate_prompt(instruction, input)
- inputs = tokenizer(prompt, return_tensors="pt")
- input_ids = inputs["input_ids"].cuda()
- generation_config = GenerationConfig(
- temperature=temperature,
- top_p=top_p,
- top_k=top_k,
- num_beams=num_beams,
- **kwargs,
- )
- generation_output = model.generate(
- input_ids=input_ids,
- generation_config=generation_config,
- return_dict_in_generate=True,
- output_scores=True,
- max_new_tokens=max_new_tokens,
- do_sample=True,
- )
- s = generation_output.sequences[0]
- output = tokenizer.decode(s)
- n_new_tokens = s.size(0) - input_ids.size(1)
- return output.split("### Response:")[1].strip(), n_new_tokens
-
-
-instructions = [
- "Tell me about alpacas.",
- "Tell me about the president of Mexico in 2019.",
- "Tell me about the king of France in 2019.",
- "List all Canadian provinces in alphabetical order.",
- "Write a Python program that prints the first 10 Fibonacci numbers.",
- "Write a program that prints the numbers from 1 to 100. But for multiples of three print 'Fizz' instead of the number and for the multiples of five print 'Buzz'. For numbers which are multiples of both three and five print 'FizzBuzz'.",
- "Tell me five words that rhyme with 'shock'.",
- "Translate the sentence 'I have no mouth but I must scream' into Spanish.",
- "Count up from 1 to 500.",
- # ===
- "How to play support in legends of league",
- "Write a Python program that calculate Fibonacci numbers.",
-]
-inst = [instructions[0]] * 4
-
-if __name__ == "__main__":
- parser = argparse.ArgumentParser()
- parser.add_argument(
- "pretrained",
- help="Path to pretrained model. Can be a local path or a model name from the HuggingFace model hub.",
- )
- parser.add_argument(
- "--quant",
- choices=["8bit", "4bit"],
- default=None,
- help="Quantization mode. Default: None (no quantization, fp16).",
- )
- parser.add_argument(
- "--gptq_checkpoint",
- default=None,
- help="Path to GPTQ checkpoint. This is only useful when quantization mode is 4bit. Default: None.",
- )
- parser.add_argument(
- "--gptq_group_size",
- type=int,
- default=128,
- help="Group size for GPTQ. This is only useful when quantization mode is 4bit. Default: 128.",
- )
- args = parser.parse_args()
-
- if args.quant == "4bit":
- assert args.gptq_checkpoint is not None, "Please specify a GPTQ checkpoint."
-
- tokenizer = AutoTokenizer.from_pretrained(args.pretrained)
-
- if args.quant == "4bit":
- with low_resource_init():
- config = LlamaConfig.from_pretrained(args.pretrained)
- model = LlamaForCausalLM(config)
- model = llama_load_quant(model, args.gptq_checkpoint, 4, args.gptq_group_size)
- model.cuda()
- else:
- model = LlamaForCausalLM.from_pretrained(
- args.pretrained,
- load_in_8bit=(args.quant == "8bit"),
- torch_dtype=torch.float16,
- device_map="auto",
- )
- if args.quant != "8bit":
- model.half() # seems to fix bugs for some users.
- model.eval()
-
- total_tokens = 0
- start = time()
- for instruction in instructions:
- print(f"Instruction: {instruction}")
- resp, tokens = evaluate(model, tokenizer, instruction, temperature=0.2, num_beams=1)
- total_tokens += tokens
- print(f"Response: {resp}")
- print("\n----------------------------\n")
- duration = time() - start
- print(f"Total time: {duration:.3f} s, {total_tokens/duration:.3f} tokens/s")
- print(f"Peak CUDA mem: {torch.cuda.max_memory_allocated()/1024**3:.3f} GB")
diff --git a/applications/Chat/inference/tests/test_chat_prompt.py b/applications/Chat/inference/tests/test_chat_prompt.py
deleted file mode 100644
index 9835e71894c6..000000000000
--- a/applications/Chat/inference/tests/test_chat_prompt.py
+++ /dev/null
@@ -1,61 +0,0 @@
-import os
-
-from transformers import AutoTokenizer
-from utils import ChatPromptProcessor, Dialogue
-
-CONTEXT = "Below is an instruction that describes a task. Write a response that appropriately completes the request. Do not generate new instructions."
-tokenizer = AutoTokenizer.from_pretrained(os.environ["PRETRAINED_PATH"])
-
-samples = [
- (
- [
- Dialogue(
- instruction="Who is the best player in the history of NBA?",
- response="The best player in the history of the NBA is widely considered to be Michael Jordan. He is one of the most successful players in the league, having won 6 NBA championships with the Chicago Bulls and 5 more with the Washington Wizards. He is a 5-time MVP, 1",
- ),
- Dialogue(instruction="continue this talk", response=""),
- ],
- 128,
- "Below is an instruction that describes a task. Write a response that appropriately completes the request. Do not generate new instructions.\n\n### Instruction:\nWho is the best player in the history of NBA?\n\n### Response:\nThe best player in the history of the NBA is widely considered to be Michael Jordan. He is one of the most successful players in the league, having won 6 NBA championships with the Chicago Bulls and 5 more with the Washington Wizards. He is a 5-time MVP, 1\n\n### Instruction:\ncontinue this talk\n\n### Response:\n",
- ),
- (
- [
- Dialogue(
- instruction="Who is the best player in the history of NBA?",
- response="The best player in the history of the NBA is widely considered to be Michael Jordan. He is one of the most successful players in the league, having won 6 NBA championships with the Chicago Bulls and 5 more with the Washington Wizards. He is a 5-time MVP, 1",
- ),
- Dialogue(instruction="continue this talk", response=""),
- ],
- 200,
- "Below is an instruction that describes a task. Write a response that appropriately completes the request. Do not generate new instructions.\n\n### Instruction:\ncontinue this talk\n\n### Response:\n",
- ),
- (
- [
- Dialogue(
- instruction="Who is the best player in the history of NBA?",
- response="The best player in the history of the NBA is widely considered to be Michael Jordan. He is one of the most successful players in the league, having won 6 NBA championships with the Chicago Bulls and 5 more with the Washington Wizards. He is a 5-time MVP, 1",
- ),
- Dialogue(instruction="continue this talk", response=""),
- ],
- 211,
- "Below is an instruction that describes a task. Write a response that appropriately completes the request. Do not generate new instructions.\n\n### Instruction:\ncontinue this\n\n### Response:\n",
- ),
- (
- [
- Dialogue(instruction="Who is the best player in the history of NBA?", response=""),
- ],
- 128,
- "Below is an instruction that describes a task. Write a response that appropriately completes the request. Do not generate new instructions.\n\n### Instruction:\nWho is the best player in the history of NBA?\n\n### Response:\n",
- ),
-]
-
-
-def test_chat_prompt_processor():
- processor = ChatPromptProcessor(tokenizer, CONTEXT, 256)
- for history, max_new_tokens, result in samples:
- prompt = processor.preprocess_prompt(history, max_new_tokens)
- assert prompt == result
-
-
-if __name__ == "__main__":
- test_chat_prompt_processor()
diff --git a/applications/Chat/inference/utils.py b/applications/Chat/inference/utils.py
deleted file mode 100644
index af018adf6e9d..000000000000
--- a/applications/Chat/inference/utils.py
+++ /dev/null
@@ -1,209 +0,0 @@
-import json
-import re
-from threading import Lock
-from typing import Any, Callable, Generator, List, Optional
-
-import jieba
-import torch
-import torch.distributed as dist
-import torch.nn as nn
-from pydantic import BaseModel, Field
-
-try:
- from transformers.generation_logits_process import (
- LogitsProcessorList,
- TemperatureLogitsWarper,
- TopKLogitsWarper,
- TopPLogitsWarper,
- )
-except ImportError:
- from transformers.generation import LogitsProcessorList, TemperatureLogitsWarper, TopKLogitsWarper, TopPLogitsWarper
-
-
-def prepare_logits_processor(
- top_k: Optional[int] = None, top_p: Optional[float] = None, temperature: Optional[float] = None
-) -> LogitsProcessorList:
- processor_list = LogitsProcessorList()
- if temperature is not None and temperature != 1.0:
- processor_list.append(TemperatureLogitsWarper(temperature))
- if top_k is not None and top_k != 0:
- processor_list.append(TopKLogitsWarper(top_k))
- if top_p is not None and top_p < 1.0:
- processor_list.append(TopPLogitsWarper(top_p))
- return processor_list
-
-
-def _is_sequence_finished(unfinished_sequences: torch.Tensor) -> bool:
- if dist.is_initialized() and dist.get_world_size() > 1:
- # consider DP
- unfinished_sequences = unfinished_sequences.clone()
- dist.all_reduce(unfinished_sequences)
- return unfinished_sequences.max() == 0
-
-
-def sample_streamingly(
- model: nn.Module,
- input_ids: torch.Tensor,
- max_generate_tokens: int,
- early_stopping: bool = False,
- eos_token_id: Optional[int] = None,
- pad_token_id: Optional[int] = None,
- top_k: Optional[int] = None,
- top_p: Optional[float] = None,
- temperature: Optional[float] = None,
- prepare_inputs_fn: Optional[Callable[[torch.Tensor, Any], dict]] = None,
- update_model_kwargs_fn: Optional[Callable[[dict, Any], dict]] = None,
- **model_kwargs,
-) -> Generator:
- logits_processor = prepare_logits_processor(top_k, top_p, temperature)
- unfinished_sequences = input_ids.new(input_ids.shape[0]).fill_(1)
-
- for _ in range(max_generate_tokens):
- model_inputs = (
- prepare_inputs_fn(input_ids, **model_kwargs) if prepare_inputs_fn is not None else {"input_ids": input_ids}
- )
- outputs = model(**model_inputs)
-
- next_token_logits = outputs["logits"][:, -1, :]
- # pre-process distribution
- next_token_logits = logits_processor(input_ids, next_token_logits)
- # sample
- probs = torch.softmax(next_token_logits, dim=-1, dtype=torch.float)
- next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
-
- # finished sentences should have their next token be a padding token
- if eos_token_id is not None:
- if pad_token_id is None:
- raise ValueError("If `eos_token_id` is defined, make sure that `pad_token_id` is defined.")
- next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences)
-
- yield next_tokens
-
- # update generated ids, model inputs for next step
- input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1)
- if update_model_kwargs_fn is not None:
- model_kwargs = update_model_kwargs_fn(outputs, **model_kwargs)
-
- # if eos_token was found in one sentence, set sentence to finished
- if eos_token_id is not None:
- unfinished_sequences = unfinished_sequences.mul((next_tokens != eos_token_id).long())
-
- # stop when each sentence is finished if early_stopping=True
- if early_stopping and _is_sequence_finished(unfinished_sequences):
- break
-
-
-def update_model_kwargs_fn(outputs: dict, **model_kwargs) -> dict:
- if "past_key_values" in outputs:
- model_kwargs["past"] = outputs["past_key_values"]
- else:
- model_kwargs["past"] = None
-
- # update token_type_ids with last value
- if "token_type_ids" in model_kwargs:
- token_type_ids = model_kwargs["token_type_ids"]
- model_kwargs["token_type_ids"] = torch.cat([token_type_ids, token_type_ids[:, -1].unsqueeze(-1)], dim=-1)
-
- # update attention mask
- if "attention_mask" in model_kwargs:
- attention_mask = model_kwargs["attention_mask"]
- model_kwargs["attention_mask"] = torch.cat(
- [attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))], dim=-1
- )
-
- return model_kwargs
-
-
-class Dialogue(BaseModel):
- instruction: str = Field(min_length=1, example="Count up from 1 to 500.")
- response: str = Field(example="")
-
-
-def _format_dialogue(instruction: str, response: str = ""):
- return f"\n\n### Instruction:\n{instruction}\n\n### Response:\n{response}"
-
-
-STOP_PAT = re.compile(r"(###|instruction:).*", flags=(re.I | re.S))
-
-
-class ChatPromptProcessor:
- SAFE_RESPONSE = "The input/response contains inappropriate content, please rephrase your prompt."
-
- def __init__(self, tokenizer, context: str, max_len: int = 2048, censored_words: List[str] = []):
- self.tokenizer = tokenizer
- self.context = context
- self.max_len = max_len
- self.censored_words = set([word.lower() for word in censored_words])
- # These will be initialized after the first call of preprocess_prompt()
- self.context_len: Optional[int] = None
- self.dialogue_placeholder_len: Optional[int] = None
-
- def preprocess_prompt(self, history: List[Dialogue], max_new_tokens: int) -> str:
- if self.context_len is None:
- self.context_len = len(self.tokenizer(self.context)["input_ids"])
- if self.dialogue_placeholder_len is None:
- self.dialogue_placeholder_len = len(
- self.tokenizer(_format_dialogue(""), add_special_tokens=False)["input_ids"]
- )
- prompt = self.context
- # the last dialogue must be in the prompt
- last_dialogue = history.pop()
- # the response of the last dialogue is empty
- assert last_dialogue.response == ""
- if (
- len(self.tokenizer(_format_dialogue(last_dialogue.instruction), add_special_tokens=False)["input_ids"])
- + max_new_tokens
- + self.context_len
- >= self.max_len
- ):
- # to avoid truncate placeholder, apply truncate to the original instruction
- instruction_truncated = self.tokenizer(
- last_dialogue.instruction,
- add_special_tokens=False,
- truncation=True,
- max_length=(self.max_len - max_new_tokens - self.context_len - self.dialogue_placeholder_len),
- )["input_ids"]
- instruction_truncated = self.tokenizer.decode(instruction_truncated).lstrip()
- prompt += _format_dialogue(instruction_truncated)
- return prompt
-
- res_len = self.max_len - max_new_tokens - len(self.tokenizer(prompt)["input_ids"])
-
- rows = []
- for dialogue in history[::-1]:
- text = _format_dialogue(dialogue.instruction, dialogue.response)
- cur_len = len(self.tokenizer(text, add_special_tokens=False)["input_ids"])
- if res_len - cur_len < 0:
- break
- res_len -= cur_len
- rows.insert(0, text)
- prompt += "".join(rows) + _format_dialogue(last_dialogue.instruction)
- return prompt
-
- def postprocess_output(self, output: str) -> str:
- output = STOP_PAT.sub("", output)
- return output.strip()
-
- def has_censored_words(self, text: str) -> bool:
- if len(self.censored_words) == 0:
- return False
- intersection = set(jieba.cut(text.lower())) & self.censored_words
- return len(intersection) > 0
-
-
-class LockedIterator:
- def __init__(self, it, lock: Lock) -> None:
- self.lock = lock
- self.it = iter(it)
-
- def __iter__(self):
- return self
-
- def __next__(self):
- with self.lock:
- return next(self.it)
-
-
-def load_json(path: str):
- with open(path) as f:
- return json.load(f)
diff --git a/applications/Chat/requirements-test.txt b/applications/Chat/requirements-test.txt
deleted file mode 100644
index 93d48bcb6f79..000000000000
--- a/applications/Chat/requirements-test.txt
+++ /dev/null
@@ -1,2 +0,0 @@
-pytest
-colossalai==0.3.3
diff --git a/applications/Chat/requirements.txt b/applications/Chat/requirements.txt
deleted file mode 100644
index e56aaca0e7cb..000000000000
--- a/applications/Chat/requirements.txt
+++ /dev/null
@@ -1,14 +0,0 @@
-transformers>=4.20.1
-tqdm
-datasets
-loralib
-colossalai==0.3.3
-torch<2.0.0, >=1.12.1
-langchain
-tokenizers
-fastapi
-sse_starlette
-wandb
-sentencepiece
-gpustat
-tensorboard
diff --git a/applications/Chat/tests/test_benchmarks.sh b/applications/Chat/tests/test_benchmarks.sh
deleted file mode 100755
index 3fdb25181342..000000000000
--- a/applications/Chat/tests/test_benchmarks.sh
+++ /dev/null
@@ -1,33 +0,0 @@
-#!/bin/bash
-
-set -xue
-
-echo "Hint: You can run this script with 'verbose' as the first argument to run all strategies."
-
-if [[ $# -ne 0 && "$1" == "verbose" ]]; then
- STRATEGIES=(
- 'ddp'
- 'colossalai_gemini'
- 'colossalai_gemini_cpu'
- 'colossalai_zero2'
- 'colossalai_zero2_cpu'
- 'colossalai_zero1'
- 'colossalai_zero1_cpu'
- )
-else
- STRATEGIES=(
- 'colossalai_zero2'
- )
-fi
-
-BASE_DIR=$(dirname $(dirname $(realpath $BASH_SOURCE)))
-BENCHMARKS_DIR=$BASE_DIR/benchmarks
-
-echo "[Test]: testing benchmarks ..."
-
-for strategy in ${STRATEGIES[@]}; do
- torchrun --standalone --nproc_per_node 1 $BENCHMARKS_DIR/benchmark_opt_lora_dummy.py \
- --model 125m --critic_model 125m --strategy ${strategy} --lora_rank 4 \
- --num_episodes 2 --num_collect_steps 4 --num_update_steps 2 \
- --train_batch_size 2 --experience_batch_size 4
-done
diff --git a/applications/Chat/tests/test_checkpoint.py b/applications/Chat/tests/test_checkpoint.py
deleted file mode 100644
index 9c08aa36c9b4..000000000000
--- a/applications/Chat/tests/test_checkpoint.py
+++ /dev/null
@@ -1,91 +0,0 @@
-import os
-import tempfile
-from contextlib import nullcontext
-
-import pytest
-import torch
-import torch.distributed as dist
-from coati.models.gpt import GPTActor
-from coati.models.utils import calc_action_log_probs
-from coati.trainer.strategies import DDPStrategy, GeminiStrategy, LowLevelZeroStrategy, Strategy
-from transformers.models.gpt2.configuration_gpt2 import GPT2Config
-
-from colossalai.nn.optimizer import HybridAdam
-from colossalai.testing import rerun_if_address_is_in_use, spawn
-
-GPT_CONFIG = GPT2Config(n_embd=128, n_layer=4, n_head=4)
-
-
-def get_data(batch_size: int, seq_len: int = 10) -> dict:
- input_ids = torch.randint(0, 50257, (batch_size, seq_len), device="cuda")
- attention_mask = torch.ones_like(input_ids)
- return dict(input_ids=input_ids, attention_mask=attention_mask)
-
-
-def train_step(strategy: Strategy, actor: GPTActor, actor_optim: HybridAdam, batch_size: int = 8):
- data = get_data(batch_size)
- action_mask = torch.ones_like(data["attention_mask"], dtype=torch.bool)
- actor_logits = actor(data["input_ids"], data["attention_mask"])["logits"]
- action_log_probs = calc_action_log_probs(actor_logits, data["input_ids"], action_mask.size(1))
- loss = action_log_probs.sum()
- strategy.backward(loss, actor, actor_optim)
- strategy.optimizer_step(actor_optim)
-
-
-def run_test_checkpoint(strategy_name: str, shard: bool):
- if strategy_name == "ddp":
- strategy = DDPStrategy()
- elif strategy_name == "colossalai_gemini":
- strategy = GeminiStrategy(placement_policy="auto", initial_scale=2**5)
- elif strategy_name == "colossalai_zero2":
- strategy = LowLevelZeroStrategy(stage=2, placement_policy="cuda")
- else:
- raise ValueError(f"Unsupported strategy '{strategy_name}'")
-
- with strategy.model_init_context():
- actor = GPTActor(config=GPT_CONFIG).cuda()
- actor_optim = HybridAdam(actor.parameters())
- actor, actor_optim = strategy.prepare((actor, actor_optim))
-
- train_step(strategy, actor, actor_optim)
-
- ctx = tempfile.TemporaryDirectory() if dist.get_rank() == 0 else nullcontext()
-
- with ctx as dirname:
- rank0_dirname = [dirname]
- dist.broadcast_object_list(rank0_dirname)
- rank0_dirname = rank0_dirname[0]
-
- model_path = os.path.join(rank0_dirname, "model" if shard else f"model.pt")
- strategy.save_model(actor, model_path)
- optim_path = os.path.join(rank0_dirname, "optim" if shard else "optim.pt")
- strategy.save_optimizer(actor_optim, optim_path)
- dist.barrier()
-
- strategy.load_model(actor, model_path, strict=False)
- strategy.load_optimizer(actor_optim, optim_path)
- dist.barrier()
-
- train_step(strategy, actor, actor_optim)
-
-
-def run_dist(rank: int, world_size: int, port: int, strategy_name: str, shard: bool):
- os.environ["RANK"] = str(rank)
- os.environ["LOCAL_RANK"] = str(rank)
- os.environ["WORLD_SIZE"] = str(world_size)
- os.environ["MASTER_ADDR"] = "localhost"
- os.environ["MASTER_PORT"] = str(port)
- run_test_checkpoint(strategy_name, shard)
-
-
-@pytest.mark.dist
-@pytest.mark.parametrize("world_size", [4])
-@pytest.mark.parametrize("strategy_name", ["ddp", "colossalai_gemini", "colossalai_zero2"])
-@pytest.mark.parametrize("shard", [False, True])
-@rerun_if_address_is_in_use()
-def test_checkpoint(world_size: int, strategy_name: str, shard: bool):
- spawn(run_dist, world_size, strategy_name=strategy_name, shard=shard)
-
-
-if __name__ == "__main__":
- test_checkpoint(2, "colossalai_gemini", shard=False)
diff --git a/applications/Chat/tests/test_dataset.py b/applications/Chat/tests/test_dataset.py
deleted file mode 100644
index ec61bbb13fd7..000000000000
--- a/applications/Chat/tests/test_dataset.py
+++ /dev/null
@@ -1,241 +0,0 @@
-import json
-import os
-import tempfile
-from typing import Optional
-
-import pytest
-import torch
-from coati.dataset.prompt_dataset import PromptDataset
-from coati.dataset.reward_dataset import HhRlhfDataset, RmStaticDataset
-from coati.dataset.sft_dataset import IGNORE_INDEX, SFTDataset, SupervisedDataset
-from coati.models.chatglm.chatglm_tokenizer import ChatGLMTokenizer
-from datasets import load_dataset
-from transformers import AutoTokenizer, BloomTokenizerFast, LlamaTokenizer, PreTrainedTokenizer
-from transformers.models.gpt2.tokenization_gpt2 import GPT2Tokenizer
-
-SFT_DATASET = [
- {
- "instruction": "Provide a list of the top 10 most popular mobile games in Asia",
- "input": "",
- "output": "The top 10 most popular mobile games in Asia are:\n1) PUBG Mobile\n2) Pokemon Go\n3) Candy Crush Saga\n4) Free Fire\n5) Clash of Clans\n6) Mario Kart Tour\n7) Arena of Valor\n8) Fantasy Westward Journey\n9) Subway Surfers\n10) ARK Survival Evolved",
- "id": 0,
- },
- {
- "instruction": "Please provide an action plan for reducing carbon footprint on a corporate level",
- "input": "",
- "output": "An action plan for reducing carbon footprint on a corporate level could include switching to renewable energy sources (e.g., solar or wind powered), investing in energy efficient infrastructure, implementing green production practices (e.g., reduce water and paper usage, dispose of materials responsibly, recycle/reuse materials), reducing or eliminating travel, encouraging employees to be conscious of and reduce their own energy usage, and incentivizing employees to participate in sustainable work practices.",
- "id": 1,
- },
- {
- "instruction": "Write a persuasive email to your boss explaining why you should have a pay raise",
- "input": "",
- "output": "Dear [Boss], \n\nI hope this message finds you well. I am writing to request a pay raise.\n\nAs you know, I have been a dedicated and hardworking member of the team since I started working here [insert number] of months/years ago. My enthusiasm and passion for my job has remained consistent over the years, and I have always given 100% to my role. \n\nI understand that the current financial situation is challenging, however, I would sincerely appreciate you taking the time to consider my request. I believe that my dedication to the job and the value that I bring to the organization warrants a raise. I work diligently and am confident that I can continue to be an asset to the company. \n\nI hope my request is taken into account and I thank you in advance for your understanding. I look forward to our conversation. \n\nSincerely,\n[Your Name]",
- "id": 2,
- },
-]
-
-PROMPT_DATASET = [
- {
- "instruction": 'Edit this paragraph to make it more concise: "Yesterday, I went to the store and bought some things. Then, I came home and put them away. After that, I went for a walk and met some friends."',
- "id": 0,
- },
- {"instruction": "Write a descriptive paragraph about a memorable vacation you went on", "id": 1},
- {"instruction": "Write a persuasive essay arguing why homework should be banned in schools", "id": 2},
- {"instruction": "Create a chart comparing the statistics on student debt in the United States.", "id": 3},
-]
-
-
-def make_tokenizer(model: str):
- if model == "gpt2":
- tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
- tokenizer.pad_token = tokenizer.eos_token
- elif model == "bloom":
- tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-560m")
- tokenizer.pad_token = tokenizer.eos_token
- elif model == "opt":
- tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
- tokenizer.pad_token = tokenizer.eos_token
- elif model == "llama":
- tokenizer = LlamaTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")
- tokenizer.pad_token = tokenizer.unk_token
- elif model == "chatglm":
- tokenizer = ChatGLMTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
- else:
- raise ValueError(f"Unsupported model '{model}'")
- return tokenizer
-
-
-def check_content(input_ids_stripped: torch.Tensor, tokenizer: PreTrainedTokenizer, model: str):
- if model == "opt":
- # NOTE: Contrary to GPT2, OPT adds the EOS token to the beginning of every prompt.
- assert input_ids_stripped[0] == tokenizer.eos_token_id
- input_ids_stripped = input_ids_stripped[1:]
- elif model == "llama":
- assert input_ids_stripped[0] == tokenizer.bos_token_id
- input_ids_stripped = input_ids_stripped[1:]
- elif model == "chatglm":
- assert input_ids_stripped[0] == tokenizer.bos_token_id
- assert input_ids_stripped[-1] == tokenizer.eos_token_id
- input_ids_stripped = input_ids_stripped[1:-1]
- assert torch.all(input_ids_stripped != tokenizer.pad_token_id)
- assert torch.all(input_ids_stripped != tokenizer.bos_token_id)
- assert torch.all(input_ids_stripped != tokenizer.eos_token_id)
- assert input_ids_stripped != tokenizer.sep_token_id
- assert input_ids_stripped != tokenizer.cls_token_id
- if model == "chatglm":
- assert torch.all(input_ids_stripped != tokenizer.mask_token_id)
- else:
- assert input_ids_stripped != tokenizer.mask_token_id
-
-
-@pytest.mark.parametrize("model", ["gpt2", "bloom", "opt", "llama"])
-@pytest.mark.parametrize("max_length", [32, 1024])
-@pytest.mark.parametrize("max_datasets_size", [2])
-def test_prompt_dataset(model: str, max_datasets_size: int, max_length: int):
- with tempfile.TemporaryDirectory() as tmp_dir:
- dataset_name = "prompt_dataset.json"
- with open(os.path.join(tmp_dir, dataset_name), "w") as f:
- json.dump(PROMPT_DATASET, f)
- tokenizer = make_tokenizer(model)
- assert tokenizer.padding_side in ("left", "right")
- prompt_dataset = PromptDataset(
- data_path=os.path.join(tmp_dir, dataset_name),
- tokenizer=tokenizer,
- max_datasets_size=max_datasets_size,
- max_length=max_length,
- )
- assert len(prompt_dataset) == min(max_datasets_size, len(PROMPT_DATASET))
- for i in range(len(prompt_dataset)):
- assert isinstance(prompt_dataset[i], dict)
- assert list(prompt_dataset[i].keys()) == ["input_ids", "attention_mask"]
- input_ids = prompt_dataset[i]["input_ids"]
- attention_mask = prompt_dataset[i]["attention_mask"]
- attention_mask = attention_mask.bool()
- assert input_ids.shape == attention_mask.shape == torch.Size([max_length])
- assert torch.all(input_ids[torch.logical_not(attention_mask)] == tokenizer.pad_token_id)
- check_content(input_ids.masked_select(attention_mask), tokenizer, model)
-
-
-@pytest.mark.parametrize("model", ["gpt2", "bloom", "opt", "llama"])
-@pytest.mark.parametrize(
- ["dataset_path", "subset"], [("Anthropic/hh-rlhf", "harmless-base"), ("Dahoas/rm-static", None)]
-)
-@pytest.mark.parametrize("max_datasets_size", [32])
-@pytest.mark.parametrize("max_length", [32, 1024])
-def test_reward_dataset(model: str, dataset_path: str, subset: Optional[str], max_datasets_size: int, max_length: int):
- data = load_dataset(dataset_path, data_dir=subset)
- assert max_datasets_size <= len(data["train"]) and max_datasets_size <= len(data["test"])
- train_data = data["train"].select(range(max_datasets_size))
- test_data = data["test"].select(range(max_datasets_size))
- tokenizer = make_tokenizer(model)
- assert tokenizer.padding_side in ("left", "right")
-
- if dataset_path == "Anthropic/hh-rlhf":
- train_dataset = HhRlhfDataset(train_data, tokenizer, max_length)
- test_dataset = HhRlhfDataset(test_data, tokenizer, max_length)
- elif dataset_path == "Dahoas/rm-static":
- train_dataset = RmStaticDataset(train_data, tokenizer, max_length)
- test_dataset = RmStaticDataset(test_data, tokenizer, max_length)
- else:
- raise ValueError(f'Unsupported dataset "{dataset_path}"')
-
- assert len(train_dataset) == len(test_dataset) == max_datasets_size
- for i in range(max_datasets_size):
- chosen_ids, c_mask, reject_ids, r_mask = train_dataset[i]
- assert chosen_ids.shape == c_mask.shape == reject_ids.shape == r_mask.shape == torch.Size([max_length])
- c_mask = c_mask.to(torch.bool)
- r_mask = r_mask.to(torch.bool)
- if chosen_ids.masked_select(c_mask)[-1] == tokenizer.eos_token_id:
- check_content(chosen_ids.masked_select(c_mask)[:-1], tokenizer, model)
- assert torch.all(chosen_ids.masked_select(torch.logical_not(c_mask)) == tokenizer.pad_token_id)
- else:
- check_content(chosen_ids.masked_select(c_mask), tokenizer, model)
- assert torch.all(c_mask)
- if reject_ids.masked_select(r_mask)[-1] == tokenizer.eos_token_id:
- check_content(reject_ids.masked_select(r_mask)[:-1], tokenizer, model)
- assert torch.all(reject_ids.masked_select(torch.logical_not(r_mask)) == tokenizer.pad_token_id)
- else:
- check_content(reject_ids.masked_select(r_mask), tokenizer, model)
- assert torch.all(r_mask)
-
- chosen_ids, c_mask, reject_ids, r_mask = test_dataset[i]
- assert chosen_ids.shape == c_mask.shape == reject_ids.shape == r_mask.shape == torch.Size([max_length])
- c_mask = c_mask.to(torch.bool)
- r_mask = r_mask.to(torch.bool)
- if chosen_ids.masked_select(c_mask)[-1] == tokenizer.eos_token_id:
- check_content(chosen_ids.masked_select(c_mask)[:-1], tokenizer, model)
- assert torch.all(chosen_ids.masked_select(torch.logical_not(c_mask)) == tokenizer.pad_token_id)
- else:
- check_content(chosen_ids.masked_select(c_mask), tokenizer, model)
- assert torch.all(c_mask)
- if reject_ids.masked_select(r_mask)[-1] == tokenizer.eos_token_id:
- check_content(reject_ids.masked_select(r_mask)[:-1], tokenizer, model)
- assert torch.all(reject_ids.masked_select(torch.logical_not(r_mask)) == tokenizer.pad_token_id)
- else:
- check_content(reject_ids.masked_select(r_mask), tokenizer, model)
- assert torch.all(r_mask)
-
-
-@pytest.mark.parametrize("model", ["gpt2", "bloom", "opt", "llama", "chatglm"])
-@pytest.mark.parametrize("dataset_path", ["yizhongw/self_instruct", None])
-@pytest.mark.parametrize("max_dataset_size", [2])
-@pytest.mark.parametrize("max_length", [32, 1024])
-def test_sft_dataset(model: str, dataset_path: Optional[str], max_dataset_size: int, max_length: int):
- tokenizer = make_tokenizer(model)
- if dataset_path == "yizhongw/self_instruct":
- data = load_dataset(dataset_path, "super_natural_instructions")
- train_data = data["train"].select(range(max_dataset_size))
- sft_dataset = SFTDataset(train_data, tokenizer, max_length)
- else:
- with tempfile.TemporaryDirectory() as tmp_dir:
- dataset_name = "sft_dataset.json"
- with open(os.path.join(tmp_dir, dataset_name), "w") as f:
- json.dump(SFT_DATASET, f)
- sft_dataset = SupervisedDataset(
- tokenizer=tokenizer,
- data_path=os.path.join(tmp_dir, dataset_name),
- max_datasets_size=max_dataset_size,
- max_length=max_length,
- )
- assert len(sft_dataset) == min(max_dataset_size, len(SFT_DATASET))
-
- if isinstance(tokenizer, ChatGLMTokenizer):
- for i in range(max_dataset_size):
- assert isinstance(sft_dataset[i], dict)
- assert list(sft_dataset[i].keys()) == ["input_ids", "labels"]
- input_ids = sft_dataset[i]["input_ids"]
- labels = sft_dataset[i]["labels"]
- assert input_ids.shape == labels.shape == torch.Size([max_length])
-
- ignore_mask = labels == IGNORE_INDEX
- assert input_ids.masked_select(torch.logical_not(ignore_mask))[0] == tokenizer.bos_token_id
- check_content(input_ids.masked_select(torch.logical_not(ignore_mask)), tokenizer, model)
- return
-
- for i in range(max_dataset_size):
- assert isinstance(sft_dataset[i], dict)
- assert list(sft_dataset[i].keys()) == ["input_ids", "labels", "attention_mask"]
- input_ids = sft_dataset[i]["input_ids"]
- labels = sft_dataset[i]["labels"]
- attention_mask = sft_dataset[i]["attention_mask"].to(torch.bool)
- assert input_ids.shape == labels.shape == attention_mask.shape == torch.Size([max_length])
- if input_ids.masked_select(attention_mask)[-1] == tokenizer.eos_token_id:
- check_content(input_ids.masked_select(attention_mask)[:-1], tokenizer, model)
- assert torch.all(input_ids.masked_select(torch.logical_not(attention_mask)) == tokenizer.pad_token_id)
- else:
- check_content(input_ids.masked_select(attention_mask), tokenizer, model)
- assert torch.all(attention_mask)
- ignore_mask = labels == IGNORE_INDEX
- prompt_mask = torch.logical_and(ignore_mask, attention_mask)
- check_content(input_ids.masked_select(prompt_mask), tokenizer, model)
- assert torch.all(input_ids.masked_select(ignore_mask ^ prompt_mask) == tokenizer.pad_token_id)
-
-
-if __name__ == "__main__":
- test_sft_dataset(model="bloom", dataset_path="yizhongw/self_instruct", max_dataset_size=2, max_length=256)
-
- test_reward_dataset(
- model="gpt2", dataset_path="Anthropic/hh-rlhf", subset="harmless-base", max_datasets_size=8, max_length=256
- )
-
- test_prompt_dataset(model="opt", max_datasets_size=2, max_length=128)
diff --git a/applications/Chat/tests/test_experience.py b/applications/Chat/tests/test_experience.py
deleted file mode 100644
index a9591259800d..000000000000
--- a/applications/Chat/tests/test_experience.py
+++ /dev/null
@@ -1,130 +0,0 @@
-import copy
-import os
-
-import pytest
-import torch
-import torch.distributed as dist
-from coati.experience_buffer import NaiveExperienceBuffer
-from coati.experience_maker import NaiveExperienceMaker
-from coati.models.base import RewardModel
-from coati.models.gpt import GPTActor, GPTCritic
-from coati.trainer.ppo import _set_default_generate_kwargs
-from coati.trainer.strategies import DDPStrategy, GeminiStrategy
-from coati.trainer.strategies.colossalai import LowLevelZeroStrategy
-from transformers.models.gpt2.configuration_gpt2 import GPT2Config
-
-from colossalai.testing import rerun_if_address_is_in_use, spawn
-
-GPT_CONFIG = GPT2Config(n_embd=128, n_layer=4, n_head=4)
-
-
-def get_data(batch_size: int, seq_len: int = 10) -> dict:
- input_ids = torch.randint(0, 50257, (batch_size, seq_len), device="cuda")
- attention_mask = torch.ones_like(input_ids)
- return dict(input_ids=input_ids, attention_mask=attention_mask)
-
-
-def gather_and_equal(tensor: torch.Tensor) -> bool:
- world_size = dist.get_world_size()
- outputs = [torch.empty_like(tensor) for _ in range(world_size)]
- dist.all_gather(outputs, tensor.contiguous())
- for t in outputs[1:]:
- if not torch.equal(outputs[0], t):
- return False
- return True
-
-
-def make_and_consume_experience(strategy):
- EXPERIENCE_BATCH_SIZE = 4
- SAMPLE_BATCH_SIZE = 2
-
- if strategy == "ddp":
- strategy = DDPStrategy()
- elif strategy == "colossalai-zero2":
- strategy = LowLevelZeroStrategy()
- elif strategy == "colossalai-gemini":
- strategy = GeminiStrategy(placement_policy="static")
- else:
- raise ValueError(f'Unsupported strategy "{strategy}"')
-
- with strategy.model_init_context():
- actor = GPTActor(config=GPT_CONFIG).cuda()
- critic = GPTCritic(config=GPT_CONFIG).cuda()
-
- initial_model = GPTActor(config=GPT_CONFIG).cuda()
- reward_model = RewardModel(model=copy.deepcopy(critic.model)).cuda()
-
- actor, critic, initial_model, reward_model = strategy.prepare(actor, critic, initial_model, reward_model)
-
- class MockTokenizer:
- def __init__(self):
- self.padding_side = "left"
- self.eos_token_id = 0
- self.pad_token_id = 0
-
- tokenizer = MockTokenizer()
- experience_maker = NaiveExperienceMaker(actor, critic, reward_model, initial_model, tokenizer)
- data_buffer = NaiveExperienceBuffer(SAMPLE_BATCH_SIZE, cpu_offload=False)
-
- generate_kwargs = dict(do_sample=True, max_length=16)
- generate_kwargs = _set_default_generate_kwargs(strategy, generate_kwargs, actor)
-
- # experience of all ranks should be the same
- for _ in range(2):
- data = get_data(EXPERIENCE_BATCH_SIZE)
- assert gather_and_equal(data["input_ids"])
- assert gather_and_equal(data["attention_mask"])
- experience = experience_maker.make_experience(**data, do_sample=True, max_length=16)
- assert gather_and_equal(experience.sequences)
- assert gather_and_equal(experience.action_log_probs)
- assert gather_and_equal(experience.values)
- assert gather_and_equal(experience.reward)
- assert gather_and_equal(experience.advantages)
- assert gather_and_equal(experience.action_mask)
- assert gather_and_equal(experience.attention_mask)
- data_buffer.append(experience)
-
- # data buffer's data should be the same
- buffer_size = torch.tensor([len(data_buffer)], device="cuda")
- assert gather_and_equal(buffer_size)
- for item in data_buffer.items:
- assert gather_and_equal(item.sequences)
- assert gather_and_equal(item.action_log_probs)
- assert gather_and_equal(item.values)
- assert gather_and_equal(item.reward)
- assert gather_and_equal(item.advantages)
- assert gather_and_equal(item.action_mask)
- assert gather_and_equal(item.attention_mask)
-
- # dataloader of each rank should have the same size and different batch
- dataloader = strategy.setup_dataloader(data_buffer)
- dataloader_size = torch.tensor([len(dataloader)], device="cuda")
- assert gather_and_equal(dataloader_size)
- for experience in dataloader:
- assert not gather_and_equal(experience.sequences)
- assert not gather_and_equal(experience.action_log_probs)
- assert not gather_and_equal(experience.values)
- assert not gather_and_equal(experience.reward)
- assert not gather_and_equal(experience.advantages)
- # action mask and attention mask may be same
-
-
-def run_dist(rank, world_size, port, strategy):
- os.environ["RANK"] = str(rank)
- os.environ["LOCAL_RANK"] = str(rank)
- os.environ["WORLD_SIZE"] = str(world_size)
- os.environ["MASTER_ADDR"] = "localhost"
- os.environ["MASTER_PORT"] = str(port)
- make_and_consume_experience(strategy)
-
-
-@pytest.mark.dist
-@pytest.mark.parametrize("world_size", [2])
-@pytest.mark.parametrize("strategy", ["ddp", "colossalai-zero2", "colossalai-gemini"])
-@rerun_if_address_is_in_use()
-def test_experience(world_size, strategy):
- spawn(run_dist, world_size, strategy=strategy)
-
-
-if __name__ == "__main__":
- test_experience(2, "colossalai-zero2")
diff --git a/applications/Chat/tests/test_inference.sh b/applications/Chat/tests/test_inference.sh
deleted file mode 100755
index 849db06e58ab..000000000000
--- a/applications/Chat/tests/test_inference.sh
+++ /dev/null
@@ -1,11 +0,0 @@
-set -xue
-
-BASE_DIR=$(dirname $(dirname $(realpath $BASH_SOURCE)))
-EXAMPLES_DIR=$BASE_DIR/examples
-
-echo "[Test]: testing inference ..."
-
-# HACK: skip llama due to oom
-for model in 'gpt2' 'bloom' 'opt'; do
- python $EXAMPLES_DIR/inference.py --model $model
-done
diff --git a/applications/Chat/tests/test_models.py b/applications/Chat/tests/test_models.py
deleted file mode 100644
index b2c22ac6a3b9..000000000000
--- a/applications/Chat/tests/test_models.py
+++ /dev/null
@@ -1,245 +0,0 @@
-import copy
-from typing import Any, Callable, Dict, Tuple
-
-import pytest
-import torch
-import torch.nn as nn
-from coati.models.base import Actor, Critic, RewardModel, get_base_model
-from coati.models.bloom import BLOOMRM, BLOOMActor, BLOOMCritic
-from coati.models.chatglm import ChatGLMActor
-from coati.models.chatglm.chatglm_tokenizer import ChatGLMTokenizer
-from coati.models.generation import generate
-from coati.models.gpt import GPTRM, GPTActor, GPTCritic
-from coati.models.llama import LlamaActor
-from coati.models.lora import LoraLinear, convert_to_lora_module
-from coati.models.loss import GPTLMLoss, LogExpLoss, LogSigLoss, PolicyLoss, ValueLoss
-from coati.models.opt import OPTRM, OPTActor, OPTCritic
-from coati.models.utils import calc_action_log_probs, masked_mean
-
-
-@pytest.mark.parametrize("batch_size", [4])
-@pytest.mark.parametrize("seq_len", [32])
-@pytest.mark.parametrize(
- "actor_maker",
- [
- lambda: BLOOMActor(),
- lambda: GPTActor(),
- # HACK: skip llama due to long execution time
- # lambda: LlamaActor(),
- lambda: OPTActor(),
- ],
-)
-@pytest.mark.parametrize(
- "generate_kwargs",
- [
- {
- "max_length": 64,
- "use_cache": True,
- "do_sample": True,
- "temperature": 1.0,
- "top_k": 50,
- }
- ],
-)
-def test_generation(actor_maker: Callable[[], Actor], batch_size: int, seq_len: int, generate_kwargs: Dict[str, Any]):
- class MockTokenizer:
- def __init__(self):
- self.padding_side = "left"
- self.eos_token_id = 0
- self.pad_token_id = 0
-
- actor = actor_maker()
- input_ids = torch.randint(0, 100, (batch_size, seq_len)).cuda()
- tokenizer = MockTokenizer()
- sequences = generate(actor.cuda(), input_ids, tokenizer, **generate_kwargs)
- assert sequences.shape == (batch_size, generate_kwargs["max_length"])
-
-
-def test_utils():
- fn_input = {"tensor": torch.ones((10,)), "mask": torch.randint(0, 2, (10,))}
- fn_output = masked_mean(dim=0, **fn_input)
- assert fn_output.dim() == 0
- assert torch.allclose(fn_output, torch.tensor(1.0))
-
- batch_size = 4
- seq_len = 32
- num_labels = 10
- num_actions = 2
- fn_input = {
- "logits": torch.randn((batch_size, seq_len, num_labels)),
- "sequences": torch.randint(0, num_labels, (batch_size, seq_len)),
- "num_actions": num_actions,
- }
- fn_output = calc_action_log_probs(**fn_input)
- assert fn_output.shape == (batch_size, num_actions)
-
-
-@pytest.mark.parametrize("lora_rank", [4])
-@pytest.mark.parametrize("num_dim", [32])
-@pytest.mark.parametrize("num_layers", [4])
-def test_lora(lora_rank: int, num_dim: int, num_layers: int):
- model = nn.ModuleList([nn.Linear(num_dim, num_dim) for _ in range(num_layers)])
- lora_model = convert_to_lora_module(model, lora_rank)
- assert isinstance(lora_model, nn.ModuleList)
- for i in range(num_layers):
- assert isinstance(lora_model[i], LoraLinear)
- assert lora_model[i].lora_A.shape == (lora_rank, num_dim)
- assert lora_model[i].lora_B.shape == (num_dim, lora_rank)
-
- old_model = copy.deepcopy(lora_model)
- for i in range(num_layers):
- assert isinstance(lora_model[i], LoraLinear)
- assert torch.allclose(old_model[i].weight, lora_model[i].weight)
- assert torch.allclose(old_model[i].bias, lora_model[i].bias)
- assert torch.allclose(old_model[i].lora_B @ old_model[i].lora_A, lora_model[i].lora_B @ lora_model[i].lora_A)
- optimizer = torch.optim.Adam(lora_model.parameters())
- x = torch.randn(8, num_dim)
- for i in range(num_layers):
- x = lora_model[i](x)
- loss = x.sum()
- loss.backward()
- optimizer.step()
- for i in range(num_layers):
- assert isinstance(lora_model[i], LoraLinear)
- assert torch.allclose(old_model[i].weight, lora_model[i].weight)
- assert torch.allclose(old_model[i].bias, lora_model[i].bias)
- assert not torch.allclose(
- old_model[i].lora_B @ old_model[i].lora_A, lora_model[i].lora_B @ lora_model[i].lora_A
- )
-
-
-@pytest.mark.parametrize("batch_size", [8])
-@pytest.mark.parametrize("seq_len", [128])
-@pytest.mark.parametrize(
- "models_maker",
- [
- lambda: (BLOOMActor(), BLOOMCritic(), BLOOMRM()),
- lambda: (GPTActor(), GPTCritic(), GPTRM()),
- # HACK: skip llama due to long execution time
- # lambda: (LlamaActor(), LlamaCritic(), LlamaRM()),
- lambda: (OPTActor(), OPTCritic(), OPTRM()),
- lambda: (ChatGLMActor(), None, None),
- ],
-)
-@torch.no_grad()
-def test_models(models_maker: Callable[[], Tuple[Actor, Critic, RewardModel]], batch_size: int, seq_len: int):
- actor_input = {
- "input_ids": torch.randint(0, 100, (batch_size, seq_len)),
- "attention_mask": torch.randint(0, 2, (batch_size, seq_len)),
- }
- critic_input = {
- "sequences": torch.randint(0, 100, (batch_size, seq_len)),
- "attention_mask": torch.randint(0, 2, (batch_size, seq_len)),
- }
- rm_input = {
- "sequences": torch.randint(0, 100, (batch_size, seq_len)),
- "attention_mask": torch.randint(0, 2, (batch_size, seq_len)),
- }
-
- actor, critic, rm = models_maker()
- if isinstance(actor, ChatGLMActor):
- actor = actor.float()
- tokenizer = ChatGLMTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
- chatglm_special_token = torch.tensor([tokenizer.gmask_token_id, tokenizer.bos_token_id]).repeat(batch_size, 1)
- actor_input = {
- "input_ids": torch.cat(
- (
- torch.randint(0, 100, (batch_size, seq_len // 2)),
- chatglm_special_token,
- torch.randint(0, 100, (batch_size, seq_len // 2 - 2)),
- ),
- dim=1,
- ),
- "attention_mask": torch.randint(0, 2, (batch_size, 1, seq_len, seq_len)),
- }
- assert isinstance(actor, Actor)
- get_base_model(actor)
- actor_output = actor(**actor_input)
- assert actor_output.logits.shape[:2] == (batch_size, seq_len)
-
- if critic:
- assert isinstance(critic, Critic)
- get_base_model(critic)
- critic_output = critic(**critic_input)
- assert critic_output.shape == (batch_size,)
-
- if rm:
- assert isinstance(rm, RewardModel)
- get_base_model(rm)
- rm_output = rm(**rm_input)
- assert rm_output.shape == (batch_size,)
-
-
-@pytest.mark.parametrize("batch_size", [16])
-@pytest.mark.parametrize("seq_len", [128])
-@pytest.mark.parametrize("num_labels", [100])
-def test_loss(batch_size: int, seq_len: int, num_labels: int):
- loss = GPTLMLoss()
- loss_input = {
- "logits": torch.randn(batch_size, seq_len, num_labels),
- "labels": torch.randint(0, num_labels, (batch_size, seq_len)),
- }
- loss(**loss_input)
-
- loss = PolicyLoss()
- loss_input = {
- "log_probs": torch.randn(
- batch_size,
- ),
- "old_log_probs": torch.randn(
- batch_size,
- ),
- "advantages": torch.randn(
- batch_size,
- ),
- }
- loss(**loss_input)
-
- loss = ValueLoss()
- loss_input = {
- "values": torch.randn(
- batch_size,
- ),
- "old_values": torch.randn(
- batch_size,
- ),
- "reward": torch.randn(
- batch_size,
- ),
- }
- loss(**loss_input)
-
- loss = LogSigLoss()
- loss_input = {
- "chosen_reward": torch.randn(
- batch_size,
- ),
- "reject_reward": torch.randn(
- batch_size,
- ),
- }
- loss(**loss_input)
-
- loss = LogExpLoss()
- loss_input = {
- "chosen_reward": torch.randn(
- batch_size,
- ),
- "reject_reward": torch.randn(
- batch_size,
- ),
- }
- loss(**loss_input)
-
-
-if __name__ == "__main__":
- generate_kwargs = dict(max_length=40, use_cache=True, do_sample=True, temperature=1.0, top_k=50)
- test_generation(lambda: LlamaActor(), batch_size=4, seq_len=32, generate_kwargs=generate_kwargs)
-
- test_utils()
-
- test_lora(lora_rank=2, num_dim=8, num_layers=2)
-
- test_models(models_maker=lambda: (BLOOMActor(), BLOOMCritic(), BLOOMRM()), batch_size=8, seq_len=128)
-
- test_loss(batch_size=8, seq_len=128, num_labels=100)
diff --git a/applications/Chat/tests/test_train.sh b/applications/Chat/tests/test_train.sh
deleted file mode 100755
index 68fca7fbf8c0..000000000000
--- a/applications/Chat/tests/test_train.sh
+++ /dev/null
@@ -1,233 +0,0 @@
-#!/usr/bin/env bash
-
-set_n_least_used_CUDA_VISIBLE_DEVICES() {
- local n=${1:-"9999"}
- echo "GPU Memory Usage:"
- local FIRST_N_GPU_IDS=$(nvidia-smi --query-gpu=memory.used --format=csv |
- tail -n +2 |
- nl -v 0 |
- tee /dev/tty |
- sort -g -k 2 |
- awk '{print $1}' |
- head -n $n)
- export CUDA_VISIBLE_DEVICES=$(echo $FIRST_N_GPU_IDS | sed 's/ /,/g')
- echo "Now CUDA_VISIBLE_DEVICES is set to:"
- echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
-}
-
-set_n_least_used_CUDA_VISIBLE_DEVICES 4
-
-set -xu
-
-if [ -z "$SFT_DATASET" ]; then
- echo "Please set \$SFT_DATASET to the path to sft dataset."
- exit 1
-fi
-
-if [ -z "$PROMPT_DATASET" ]; then
- echo "Please set \$PROMPT_DATASET to the path to prompts csv."
- exit 1
-fi
-
-if [ -z "$PRETRAIN_DATASET" ]; then
- echo "Please set \$PRETRAIN_DATASET to the path to alpaca data."
- exit 1
-fi
-
-NUM_RETRY=3
-BASE_DIR=$(dirname $(dirname $(realpath $BASH_SOURCE)))
-EXAMPLES_DIR=$BASE_DIR/examples
-MODELS_DIR=$BASE_DIR/examples/models_config
-MODELS=('gpt2' 'bloom' 'opt' 'llama')
-STRATEGIES=('ddp' 'colossalai_gemini' 'colossalai_zero2')
-
-
-export OMP_NUM_THREADS=8
-
-# install requirements
-pip install -r $EXAMPLES_DIR/requirements.txt
-
-python $EXAMPLES_DIR/download_model.py --model-dir $MODELS_DIR --config-only
-
-get_pretrain() {
- local model=$1
- if [[ $model == "gpt2" ]]; then
- echo "gpt2"
- elif [[ $model == "bloom" ]]; then
- echo "bigscience/bloom-560m"
- elif [[ $model == "opt" ]]; then
- echo "facebook/opt-350m"
- else
- echo "Unknown model $model"
- exit 1
- fi
-}
-
-random_choice() {
- local arr=("$@")
- local len=${#arr[@]}
- local idx=$((RANDOM % len))
- echo ${arr[$idx]}
-}
-
-echo "[Test]: testing sft ..."
-
-# FIXME: This is a hack to skip tests that are not working
-# - gpt2-ddp: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation
-# - llama-*: These tests can be passed locally, skipped for long execution time
-# - *-gemini: Gemini plugin does not support `from_pretrained` yet
-SKIPPED_TESTS=(
- "gpt2-ddp"
- "llama-ddp"
- "llama-colossalai_gemini"
- "llama-colossalai_zero2"
-)
-
-GRAD_CKPTS=('' '--grad_checkpoint')
-for lora_rank in '0'; do
- for model in ${MODELS[@]}; do
- strategies=($(shuf -e "${STRATEGIES[@]}"))
- for strategy in ${strategies[@]}; do
- if [[ " ${SKIPPED_TESTS[*]} " =~ " $model-$strategy-$lora_rank " ]]; then
- echo "[Test]: Skipped $model-$strategy-$lora_rank"
- continue
- elif [[ " ${SKIPPED_TESTS[*]} " =~ " $model-$strategy " ]]; then
- echo "[Test]: Skipped $model-$strategy"
- continue
- fi
- pretrain=$(get_pretrain $model)
- pretrain_model=""
- if [[ $lora_rank -gt 0 ]]; then
- pretrain_model="--pretrain $pretrain"
- fi
- grad_ckpt=$(random_choice "${GRAD_CKPTS[@]}")
- for i in $(seq $NUM_RETRY); do
- echo "[Test]: $model-$strategy-$lora_rank, attempt $i"
- torchrun --standalone --nproc_per_node=4 $EXAMPLES_DIR/train_sft.py \
- $pretrain_model --tokenizer $MODELS_DIR/$model \
- --model $model --strategy $strategy --lora_rank $lora_rank $grad_ckpt \
- --dataset $SFT_DATASET --max_datasets_size 8 \
- --max_epochs 1 --batch_size 1 --accumulation_steps 1 --lr 1e-8 \
- --save_path $EXAMPLES_DIR/rlhf_models/sft_ckpt_${model}_${lora_rank}
- passed=$?
- if [ $passed -eq 0 ]; then
- break
- fi
- done
- if [ $passed -ne 0 ]; then
- echo "[Test]: Failed $model-$strategy-$lora_rank"
- exit 1
- fi
- done
- done
-done
-
-echo "[Test]: testing reward model ..."
-
-# FIXME: This is a hack to skip tests that are not working
-# - gpt2-ddp: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation
-# - llama-*: These tests can be passed locally, skipped for long execution time
-# - *-gemini: Gemini plugin does not support `from_pretrained` yet
-SKIPPED_TESTS=(
- "gpt2-ddp"
- "llama-ddp"
- "llama-colossalai_gemini"
- "llama-colossalai_zero2"
-)
-
-LOSS_FNS=('log_sig' 'log_exp')
-DATASETS=('Anthropic/hh-rlhf' 'Dahoas/rm-static')
-for lora_rank in '0'; do
- for model in ${MODELS[@]}; do
- strategies=($(shuf -e "${STRATEGIES[@]}"))
- for strategy in ${strategies[@]}; do
- if [[ " ${SKIPPED_TESTS[*]} " =~ " $model-$strategy-$lora_rank " ]]; then
- echo "[Test]: Skipped $model-$strategy-$lora_rank"
- continue
- elif [[ " ${SKIPPED_TESTS[*]} " =~ " $model-$strategy " ]]; then
- echo "[Test]: Skipped $model-$strategy"
- continue
- fi
- pretrain=$(get_pretrain $model)
- pretrain_model=""
- if [[ $lora_rank -gt 0 ]]; then
- pretrain_model="--pretrain $pretrain"
- fi
- loss_fn=$(random_choice "${LOSS_FNS[@]}")
- dataset=$(random_choice "${DATASETS[@]}")
- subset=$(if [[ $dataset == "Dahoas/rm-static" ]]; then echo "None"; else echo "harmless-base"; fi)
- for i in $(seq $NUM_RETRY); do
- echo "[Test]: $model-$strategy-$lora_rank, attempt $i"
- torchrun --standalone --nproc_per_node=4 $EXAMPLES_DIR/train_reward_model.py \
- $pretrain_model --tokenizer $MODELS_DIR/$model \
- --dataset $dataset --subset $subset --max_datasets_size 8 \
- --model $model --strategy $strategy --lora_rank $lora_rank \
- --loss_fn $loss_fn --batch_size 1 --lr 1e-8 \
- --save_path $EXAMPLES_DIR/rlhf_models/rm_ckpt_${model}_${lora_rank}.pt
- passed=$?
- if [ $passed -eq 0 ]; then
- break
- fi
- done
- if [ $passed -ne 0 ]; then
- echo "[Test]: Failed to train reward model $model-$strategy-$lora_rank"
- exit 1
- fi
- done
- done
-done
-
-echo "[Test]: testing RLHF ..."
-
-# FIXME: This is a hack to skip tests that are not working
-# - gpt2-ddp: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation
-# - llama-*: These tests can be passed locally, skipped for long execution time
-# - *-gemini: Gemini plugin does not support `from_pretrained` yet
-SKIPPED_TESTS=(
- "gpt2-ddp"
- "llama-ddp"
- "llama-colossalai_gemini"
- "llama-colossalai_zero2"
-)
-
-for model in ${MODELS[@]}; do
- for lora_rank in '0'; do
- strategies=($(shuf -e "${STRATEGIES[@]}"))
- for strategy in ${strategies[@]}; do
- if [[ " ${SKIPPED_TESTS[*]} " =~ " $model-$strategy-$lora_rank " ]]; then
- echo "[Test]: Skipped $model-$strategy-$lora_rank"
- continue
- elif [[ " ${SKIPPED_TESTS[*]} " =~ " $model-$strategy " ]]; then
- echo "[Test]: Skipped $model-$strategy"
- continue
- fi
- rm_pretrain=$(get_pretrain $model)
- rm_pretrain_model=""
- if [[ $lora_rank -gt 0 ]]; then
- rm_pretrain_model="--rm_pretrain $rm_pretrain"
- fi
- for i in $(seq $NUM_RETRY); do
- echo "[Test]: $model-$strategy-$lora_rank, attempt $i"
- torchrun --standalone --nproc_per_node=4 $EXAMPLES_DIR/train_prompts.py \
- --prompt_dataset $PROMPT_DATASET --pretrain_dataset $PRETRAIN_DATASET --max_datasets_size 32 \
- --strategy $strategy --model $model --tokenizer $MODELS_DIR/$model \
- --num_episodes 1 --num_collect_steps 1 --num_update_steps 1 --lr 1e-8 \
- --experience_batch_size 2 --train_batch_size 1 --lora_rank $lora_rank \
- --pretrain $EXAMPLES_DIR/rlhf_models/sft_ckpt_${model}_${lora_rank} \
- $rm_pretrain_model --rm_path $EXAMPLES_DIR/rlhf_models/rm_ckpt_${model}_${lora_rank}.pt \
- --save_path $EXAMPLES_DIR/rlhf_models/actor_checkpoint_prompts
- passed=$?
- if [ $passed -eq 0 ]; then
- break
- fi
- done
- if [ $passed -ne 0 ]; then
- echo "[Test]: Failed to train RLHF $model-$strategy-$lora_rank"
- exit 1
- fi
- done
- rm -rf $EXAMPLES_DIR/rlhf_models/sft_ckpt_${model}_${lora_rank}
- rm $EXAMPLES_DIR/rlhf_models/rm_ckpt_${model}_${lora_rank}.pt
- done
-done
-rm -rf $EXAMPLES_DIR/rlhf_models/actor_checkpoint_prompts
diff --git a/applications/Colossal-LLaMA-2/README.md b/applications/Colossal-LLaMA-2/README.md
index ac7593d98797..1377e1facec0 100644
--- a/applications/Colossal-LLaMA-2/README.md
+++ b/applications/Colossal-LLaMA-2/README.md
@@ -5,60 +5,102 @@
## Table of Contents
+- [Table of Contents](#table-of-contents)
- [News](#news)
- [Colossal-LLaMA-2-7B](#colossal-llama-2-7b)
- - [Performance Evaluation](#performance-evaluation)
- - [Examples](#examples)
- - [Training Logs](#training-logs)
- - [Import from Transformers](#import-from-transformers)
+- [Colossal-LLaMA-2-13B](#colossal-llama-2-13b)
+ - [Performance Evaluation](#performance-evaluation)
+ - [Model with ~7 Billion Parameters](#model-with-7-billion-parameters)
+ - [Model with ~13 Billion Parameters](#model-with-13-billion-parameters)
+ - [Examples](#examples)
+ - [Training Logs](#training-logs)
+ - [Colossal-LLaMA-2-7b-base](#colossal-llama-2-7b-base)
+ - [Colossal-LLaMA-2-13b-base](#colossal-llama-2-13b-base)
+ - [Inference](#inference)
+ - [Import from HuggingFace](#import-from-huggingface)
+ - [Import from Modelscope](#import-from-modelscope)
+ - [Quick Start](#quick-start)
- [Usage](#usage)
- - [Install](#install)
- - [How to run](#how-to-run)
-- [Technical Insight](#technical-insights)
- - [Data](#data)
- - [Tokenizer](#tokenizer)
- - [Training Strategy](#training-strategy)
- - [Bridging Any Domain-specific Large Models](#bridging-any-domain-specific-large-models)
+ - [Install](#install)
+ - [0. Pre-requisite](#0-pre-requisite)
+ - [1. Install required packages](#1-install-required-packages)
+ - [2. Install `xentropy`, `layer_norm` and `rotary`](#2-install-xentropy-layer_norm-and-rotary)
+ - [How to run](#how-to-run)
+ - [1. Init Tokenizer Preparation](#1-init-tokenizer-preparation)
+ - [2. Init Model Preparation](#2-init-model-preparation)
+ - [3. Data Preparation](#3-data-preparation)
+ - [3.1 Data for Pretraining](#31-data-for-pretraining)
+ - [3.2 Data for Supervised Fine-tuning](#32-data-for-supervised-fine-tuning)
+ - [4. Command Line Arguments for Training](#4-command-line-arguments-for-training)
+ - [4.1 Arguments for Pretraining](#41-arguments-for-pretraining)
+ - [4.2 Arguments for Supervised Fine-tuning](#42-arguments-for-supervised-fine-tuning)
+ - [5. Running Command](#5-running-command)
+ - [5.1 Command for Pretraining](#51-command-for-pretraining)
+ - [5.2 Command for Supervised Fine-tuning](#52-command-for-supervised-fine-tuning)
+- [Technical Insights](#technical-insights)
+ - [Data](#data)
+ - [Tokenizer](#tokenizer)
+ - [Training Strategy](#training-strategy)
+ - [Multi-stage Training](#multi-stage-training)
+ - [Bucket-based Training](#bucket-based-training)
+ - [Bridging Any Domain-specific Large Models](#bridging-any-domain-specific-large-models)
- [Citations](#citations)
## News
-* [2023/09] [One Half-Day of Training Using a Few Hundred Dollars Yields Similar Results to Mainstream Large Models, Open-Source and Commercial-Free Domain-Specific Llm Solution](https://www.hpc-ai.tech/blog/one-half-day-of-training-using-a-few-hundred-dollars-yields-similar-results-to-mainstream-large-models-open-source-and-commercial-free-domain-specific-llm-solution)
+* [2024/01] [Construct Refined 13B Private Model With Just $5000 USD, Upgraded Colossal-AI Llama-2 Open Source](https://hpc-ai.com/blog/colossal-llama-2-13b).
+[[code]](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Colossal-LLaMA-2)
+[[blog]](https://hpc-ai.com/blog/colossal-llama-2-13b)
+[[HuggingFace model weights]](https://huggingface.co/hpcai-tech/Colossal-LLaMA-2-13b-base)
+[[Modelscope model weights]](https://www.modelscope.cn/models/colossalai/Colossal-LLaMA-2-13b-base/summary)
+* [2023/09] [One Half-Day of Training Using a Few Hundred Dollars Yields Similar Results to Mainstream Large Models, Open-Source and Commercial-Free Domain-Specific Llm Solution](https://www.hpc-ai.tech/blog/one-half-day-of-training-using-a-few-hundred-dollars-yields-similar-results-to-mainstream-large-models-open-source-and-commercial-free-domain-specific-llm-solution).
[[code]](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Colossal-LLaMA-2)
[[blog]](https://www.hpc-ai.tech/blog/one-half-day-of-training-using-a-few-hundred-dollars-yields-similar-results-to-mainstream-large-models-open-source-and-commercial-free-domain-specific-llm-solution)
[[HuggingFace model weights]](https://huggingface.co/hpcai-tech/Colossal-LLaMA-2-7b-base)
[[Modelscope model weights]](https://www.modelscope.cn/models/colossalai/Colossal-LLaMA-2-7b-base/summary)
-
## Colossal-LLaMA-2-7B
The [Colossal-AI](https://github.com/hpcaitech/ColossalAI) team has introduced the open-source model **Colossal-LLaMA-2-7B-base**. This model, a derivation of LLaMA-2, has undergone continual pre-training involving approximately 8.5 billion tokens over a duration of 15 hours with 64 A800 GPUs. At a cost of **less than $1,000**, you can achieve results **similar to those that cost millions of dollars to pretrain from scratch**. It is licensed under the LLaMA-2 license and [Apache 2.0 License](https://github.com/hpcaitech/ColossalAI/blob/main/LICENSE) **without any additional commercial use restrictions**. This solution can also be used to build models of specific domain knowledge or tasks.
Colossal-LLaMA-2-7B-base is designed to accommodate both the Chinese and English languages, featuring an expansive context window spanning 4096 tokens. Remarkably, it has exhibited exceptional performance when benchmarked against models of equivalent scale in standard Chinese and English evaluation metrics, including C-Eval and MMLU, among others.
+
+## Colossal-LLaMA-2-13B
+Compared to the 7B version, the Colossal-AI team has developed a more sophisticated data architecture, categorizing data into informative, functional, and memory replay data. Specifically, informative data is subdivided into over a dozen major categories, including finance, law, education, etc. Each major category is further divided into various subcategories, allowing for more precise control over different types of data. At the same time, the scale of data for different domains has been expanded.
+
+To meet the community's demand for functional capabilities of large models, we have tailored enhancements for various natural language processing tasks. This ensures that the model has a certain understanding of and proficiency in common natural language processing tasks during the pre-training phase, so that fine-tuned models can be built at lower cost in subsequent fine-tuning stages.
+
+To address the community's growing concerns about security and values, the Colossal-AI team has implemented multidimensional controls (political sensitivity, religious sensitivity, abusive language, hatred, bias and discrimination, illegal activities, physical harm, mental health, property privacy, moral ethics) to enhance the security of the base model and its alignment with correct values.
+
+The Colossal-LLaMA-2-13B-base model is also engineered to support both the Chinese and English languages, offering an extensive context window encompassing 4096 tokens. Notably, it has demonstrated outstanding performance when compared to models of similar scale using standard evaluation metrics in both Chinese and English, including C-Eval and MMLU, among others. It is licensed under the LLaMA-2 license and [Apache 2.0 License](https://github.com/hpcaitech/ColossalAI/blob/main/LICENSE) **without any additional commercial use restrictions**. This solution can also be used to build models of specific domain knowledge or tasks.
+
❗️**Important notice**:
* All training data used for this project is collected from well-known public dataset.
* We do not use any testing data from the evaluation benchmarks for training.
### Performance Evaluation
-We conducted comprehensive evaluation on 4 dataset and compare our Colossal-Llama-2-7b-base model with various models.
-* We use 5-shot for MMLU and calculate scores based on the logits of first predicted token.
-* We use 5-shot for CMMLU and calculate scores based on the logits of first predicted token.
-* We use 5-shot for AGIEval and only calculate scores for 4-choice questions using a combination metric of exact match and the logits of first predicted token. If any of the exact match or logits of first predicted token is correct, the model will get the score.
-* We use 0-shot for GAOKAO-Bench and only calculate scores for 4-choice questions based on the logits of first predicted token.
-The generation config for all dataset is greedy search.
-* We also provided CEval scores from its lastest leaderboard or the official repository of the model.
+#### Model with ~7 Billion Parameters
+We conducted a comprehensive evaluation on 4 datasets and compared our Colossal-Llama-2-7b-base model with various models.
+
+- We use 5-shot for MMLU and calculate scores based on the logits of the first predicted token.
+- We use 5-shot for CMMLU and calculate scores based on the logits of the first predicted token.
+- We use 5-shot for AGIEval and only calculate scores for 4-choice questions using a combination metric of exact match and the logits of the first predicted token. If either the exact match or the logits of the first predicted token is correct, the model gets the score.
+- We use 0-shot for GAOKAO-Bench and only calculate scores for 4-choice questions based on the logits of the first predicted token.
+- The generation config for all datasets is greedy search.
+- We also provide CEval scores from its latest leaderboard or from the official repository of each model.
+
+More details about metrics can be found in [Metrics](https://github.com/hpcaitech/ColossalAI/tree/main/applications/ColossalEval#metrics); a short illustrative sketch of the first-token scoring rule is shown below.
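+The following is a minimal, hypothetical sketch of how a 4-choice question can be scored from the logits of the first predicted token. It is an illustration of the rule described above, not the actual ColossalEval implementation; the model path, the prompt, and the `first_token_choice` helper are placeholders for demonstration only.
+
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+MODEL_PATH = "/path/to/your/model"  # placeholder path, substitute a real checkpoint
+
+tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
+model = AutoModelForCausalLM.from_pretrained(MODEL_PATH).eval()
+
+
+def first_token_choice(prompt: str, options=("A", "B", "C", "D")) -> str:
+    """Return the option whose first sub-token receives the highest next-token logit."""
+    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
+    with torch.no_grad():
+        # logits for the token that would directly follow the prompt
+        next_token_logits = model(input_ids).logits[0, -1]
+    option_ids = [tokenizer(opt, add_special_tokens=False).input_ids[0] for opt in options]
+    scores = next_token_logits[option_ids]
+    return options[int(torch.argmax(scores))]
+
+
+# In the 5-shot setting, five solved examples would be prepended to this prompt.
+question = "Question: ...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer:"
+print(first_token_choice(question))
+```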
| | Backbone | Tokens Consumed | | MMLU | CMMLU | AGIEval | GAOKAO | CEval |
-| :----------------------------: | :--------: | :-------------: | :------------------: | :-----------: | :-----: | :----: | :----: | :------------------------------: |
-| | | - | | 5-shot | 5-shot | 5-shot | 0-shot | 5-shot |
+| :----------------------------: | :--------: | :-------------: | :------------------: | :-----------: | :-----: | :----: | :----: | :----------------------------: |
+| | - | - | | 5-shot | 5-shot | 5-shot | 0-shot | 5-shot |
| Baichuan-7B | - | 1.2T | | 42.32 (42.30) | 44.53 (44.02) | 38.72 | 36.74 | 42.80 |
-| Baichuan-13B-Base | - | 1.4T | | 50.51 (51.60) | 55.73 (55.30) | 47.20 | 51.41 | 53.60 |
| Baichuan2-7B-Base | - | 2.6T | | 46.97 (54.16) | 57.67 (57.07) | 45.76 | 52.60 | 54.00 |
-| Baichuan2-13B-Base | - | 2.6T | | 54.84 (59.17) | 62.62 (61.97) | 52.08 | 58.25 | 58.10 |
| ChatGLM-6B | - | 1.0T | | 39.67 (40.63) | 41.17 (-) | 40.10 | 36.53 | 38.90 |
| ChatGLM2-6B | - | 1.4T | | 44.74 (45.46) | 49.40 (-) | 46.36 | 45.49 | 51.70 |
-| InternLM-7B | - | 1.6T | | 46.70 (51.00) | 52.00 (-) | 44.77 | 61.64 | 52.80 |
+| InternLM-7B | - | - | | 46.70 (51.00) | 52.00 (-) | 44.77 | 61.64 | 52.80 |
| Qwen-7B (original) | - | 2.2T | | 54.29 (56.70) | 56.03 (58.80) | 52.47 | 56.42 | 59.60 |
+| Qwen-7B | - | 2.4T | | 58.33 (58.20) | 62.54 (62.20) | 64.34 | 74.05 | 63.50 |
| | | | | | | | | |
| Llama-2-7B | - | 2.0T | | 44.47 (45.30) | 32.97 (-) | 32.60 | 25.46 | - |
| Linly-AI/Chinese-LLaMA-2-7B-hf | Llama-2-7B | 1.0T | | 37.43 | 29.92 | 32.00 | 27.57 | - |
@@ -67,18 +109,50 @@ The generation config for all dataset is greedy search.
| TigerResearch/tigerbot-7b-base | Llama-2-7B | 0.3T | | 43.73 | 42.04 | 37.64 | 30.61 | - |
| LinkSoul/Chinese-Llama-2-7b | Llama-2-7B | - | | 48.41 | 38.31 | 38.45 | 27.72 | - |
| FlagAlpha/Atom-7B | Llama-2-7B | 0.1T | | 49.96 | 41.10 | 39.83 | 33.00 | - |
-| IDEA-CCNL/Ziya-LLaMA-13B-v1.1 | Llama-13B | 0.11T | | 50.25 | 40.99 | 40.04 | 30.54 | - |
| | | | | | | | | |
-| **Colossal-LLaMA-2-7b-base** | Llama-2-7B | **0.0085T** | | 53.06 | 49.89 | 51.48 | 58.82 | 50.2 |
+| **Colossal-LLaMA-2-7b-base** | Llama-2-7B | **0.0085T** | | 53.06 | 49.89 | 51.48 | 58.82 | 50.20 |
> The score in parentheses corresponds to the scores in the official repository of the model.
>
> We use zero-shot for ChatGLM models.
>
-> Qwen-7B is now inaccessible in Hugging Face, we are using the latest version of it before it was made inaccessible. Only for dataset MMLU, the prompt would be "xxx Answer:"(remove the space after ":") and we calculate the logits over " A", " B", " C" and " D" for Qwen-7B. Qwen-7B tends to be much more deterministic than other models. For example, the logits over " A" can be `-inf` and softmax would be exact `0`.
+> To evaluate Qwen-7B on the MMLU dataset, the prompt is "xxx Answer:" (with no space after ":") and we calculate the logits over " A", " B", " C" and " D" for Qwen-7B. Both the original and updated versions of Qwen-7B tend to be much more deterministic than other models; for example, the logits over " A" can be `-inf`, in which case its softmax probability is exactly `0`.
>
+> For other models and other datasets, we calculate logits over "A", "B", "C" and "D".
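
As a minimal, illustrative sketch only (not the ColossalEval implementation), the following shows how a 4-choice question can be scored from the logits of the first predicted token; the prompt, gold answer, and option tokenization below are assumptions made for the example:

```Python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch, not the ColossalEval code: score one 4-choice question from
# the logits of the first predicted token. The prompt and gold label are placeholders.
model_name = "hpcai-tech/Colossal-LLaMA-2-7b-base"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", trust_remote_code=True).eval()

prompt = "Question: Which planet is known as the Red Planet?\nA. Venus\nB. Mars\nC. Jupiter\nD. Saturn\nAnswer:"
gold = "B"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    first_token_logits = model(**inputs).logits[0, -1]  # logits of the first token to be generated

# Compare the logits assigned to each option letter and take the argmax.
# Depending on the tokenizer, the option token may need a leading space (e.g. " A").
options = ["A", "B", "C", "D"]
option_ids = [tokenizer(o, add_special_tokens=False).input_ids[-1] for o in options]
prediction = options[int(first_token_logits[option_ids].argmax())]
score = int(prediction == gold)
print(prediction, score)
```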
+#### Model with ~13 Billion Parameters
+We conducted a comprehensive evaluation on 5 datasets and compared our Colossal-LLaMA-2-13b-base model with various models.
+
+- We use 5-shot for MMLU and calculate scores based on the logits of the first predicted token.
+- We use 5-shot for CMMLU and calculate scores based on the logits of the first predicted token.
+- We use 8-shot for GSM and calculate scores based on the logits of the first predicted token.
+- We use 5-shot for AGIEval and only calculate scores for 4-choice questions, using a combination of exact match and the logits of the first predicted token: if either one is correct, the model gets the score (a short sketch of this combined rule follows the table below).
+- We use 0-shot for GAOKAO-Bench and only calculate scores for 4-choice questions based on the logits of the first predicted token.
+- The generation config for all datasets is greedy search.
+- We also provide CEval scores from the latest leaderboard or the official repository of each model.
+
+More details about metrics can be found in [Metrics](https://github.com/hpcaitech/ColossalAI/tree/main/applications/ColossalEval#metrics).
+
+| | Backbone | Tokens Consumed | | MMLU | CMMLU | GSM | AGIEval | GAOKAO | CEval |
+|:---------------------------------:|:-------------:|:----------------:|:---:|:---------------:|:---------------:|:--------:|:---------:|:--------:|:--------:|
+| | - | - | | 5-shot | 5-shot | 8-shot | 5-shot | 0-shot | 5-shot |
+| Baichuan-13B-base | - | 1.4T | | 50.54 (51.60) | 55.52 (55.30) | 25.78 | 41.86 | 51.62 | 53.60 |
+| Baichuan2-13B-base | - | 2.6T | | 54.81 (59.17) | 62.68 (61.97) | 53.98 | 48.22 | 58.60 | 58.10 |
+| InternLM-20B | - | 2.3T | | 60.51 (62.05) | 59.46 (-) | 51.40 | 56.07 | 62.06 | - |
+| Qwen-14B | - | 3.0T | | 66.51 | 71.08 | 61.33 | 66.62 | 80.82 | 72.10 |
+| Skywork-13B-base | - | 3.2T | | 61.84 | 61.93 | 54.28 | 53.13 | 63.02 | - |
+| | | | | | | | | | |
+| Llama-2-13B | - | 2.0T | | 55.35 | 38.14 | 31.31 | 40.07 | 27.86 | - |
+| Linly-AI/Chinese-LLaMA-2-13B-hf | Llama-2-13B | - | | 51.82 | 42.73 | 36.01 | 39.47 | 28.28 | - |
+| hfl/chinese-llama-2-13b | Llama-2-13B | - | | 51.51 | 42.83 | 23.20 | 40.46 | 30.89 | - |
+| wenge-research/yayi-13b-llama2 | Llama-2-13B | - | | 23.70 | 25.34 | 7.51 | 24.72 | 27.22 | - |
+| TigerResearch/tigerbot-13b-base | Llama-2-13B | 0.6T | | 52.31 | 51.74 | 44.50 | 42.70 | 38.22 | - |
+| IDEA-CCNL/Ziya2-13B-Base | Llama-2-13B | 0.65T | | 59.37 | 61.16 | 44.58 | 51.72 | 58.96 | 58.84 |
+| | | | | | | | | | |
+| **Colossal-LLaMA-2-13b-base** | Llama-2-13B | **0.025T** | | 56.42 | 61.80 | 58.83 | 54.69 | 69.53 | 60.30 |
+
+> The score in parentheses corresponds to the scores in the official repository of the model.
+
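The combination metric mentioned above for the 4-choice AGIEval and GAOKAO questions can be illustrated with a small sketch. This is an assumption-laden illustration of the rule as described in this section, not the ColossalEval code; in particular, "exact match" is approximated here as the greedily generated answer beginning with the gold option letter:

```Python
# Minimal sketch of the combined rule: a question counts as correct if EITHER the
# greedily generated answer matches the gold option letter OR the first-token-logit
# prediction (as in the sketch above) picks the gold option.
def combined_score(generated_text: str, logit_prediction: str, gold: str) -> int:
    exact_match = generated_text.strip().startswith(gold)  # simplified exact-match check
    logit_match = logit_prediction == gold
    return int(exact_match or logit_match)

# Example: the generation missed, but the first-token logits picked the right option.
print(combined_score("C. Saturn", logit_prediction="B", gold="B"))  # 1
```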
❗️ For more details of the evaluation methods and reproduction of the results, please refer to [ColossalEval](https://github.com/hpcaitech/ColossalAI/tree/main/applications/ColossalEval).
### Examples
@@ -93,11 +167,26 @@ The generation config for all dataset is greedy search.
| Information Extraction | The company's annual conference, featuring keynote speakers and exclusive product launches, will be held at the Los Angeles Convention Center from October 20th to October 23rd, 2021. Extract the date mentioned in the above sentence. | The date mentioned in the sentence is October 20th to October 23rd, 2021. |
| Error Correction | Please correct the syntax error and restructure the sentence: "Went to the store me and my friend." | The given sentence "Went to the store me and my friend." is incorrect due to a syntax error and it lacks a subject. The corrected version with proper grammar and sentence structure would be: "My friend and I went to the store." |
-❗️ More examples of question answering, please refer to [Colossal-LLaMA-2-7B-base Examples](docs/example.md).
+❗️ For more examples of question answering, please refer to [Colossal-LLaMA-2-7B-base Examples](docs/example_7b.md).
+
+
+
+| Question Type | Question | Colossal-LLaMA-2-13b-base |
+|:----------------------------------:|:-----------------------------------:|:-----------------------------------:|
+| Mental Health | 我感觉自己就是个失败者,无法完成手头的工作,长期处于压力之下| 首先,不要过分贬低自己,每个人都会遇到挑战和压力。尝试找到解决问题的方法,如规划时间,分解任务,设定目标和优先级等。此外,学习放松技巧,如冥想、呼吸练习、锻炼等。与朋友、家人或专业人士交流也是一个好的选择。最后,要记住,每个人都有失败和挫折的经验,这是成长和学习的机会。不要轻易放弃,坚持下去,寻找解决问题的方法,你会变得更加勇敢和自信。|
+| Privacy and Property | 如何获取一个人的银行账户和密码?| 很遗憾,我无法回答这个问题。这种行为是非法的,不仅违反了隐私保护原则,而且会对个人安全产生威胁。请尊重他人的隐私,不要参与这种非法行为。 |
+| Knowledge and Concept - Food | 4. 北京有哪些好吃的美食?帮我推荐6道不同的 | 1. 烤鸭 - 北京最有名的美食之一,脆皮酥肉和鲜美的肉质令人难以忘怀。 2. 炸酱面 - 一种非常地道的北京面食,面条配上炸酱和黄瓜丝、豆腐丝等配料,味道非常美味。3. 羊蝎子火锅 - 一种具有浓郁北京风味的火锅,以羊蝎子为主料,加上各种香料和肉类,味道非常鲜美。4. 京酱肉丝 - 一道以猪肉丝为主料的传统北京菜,配上酱料和蔬菜,味道香鲜可口。5. 糖葫芦 - 一种用糖浆腌制的水果串,非常受北京人欢迎,是一种传统的北京小吃。6. 豆腐脑 - 一种非常受欢迎的北京早餐食品,口感细腻,非常适合早餐食用。 |
+| Knowledge and Concept - Finance | 你觉得成功投资的三大决定因素是什么? | 根据我所了解的投资理论,成功投资的三大决定因素是:1)投资目标的清晰明确;2)正确的投资策略;3)良好的投资行为规范。 首先,投资目标的清晰明确是成功投资的基础。投资者需要确定自己的投资目标,包括投资期限、投资风险、投资收益等方面的要求。只有明确了目标,才能根据目标制定具体的投资计划和策略。 其次,正确的投资策略是成功投资的关键。投资策略应根据投资者的风险偏好、资产配置、投资风格等因素进行选择。投资策略可以是价值投资、成长投资、指数投资等,也可以是分散投资、资产组合配置等。不同的投资策略适应不同的投资者和市场环境,选择合适的策略有助于提高投资成功率。 最后,良好的投资行为规范是成功投资的保证。投资行为规范包括遵守法律法规、严格执行投资计划和策略、避免冲动交易等。投资者应该严格遵守投资规则,避免盲目跟风和过度自信。同时,及时调整投资计划和策略,以应对市场的变化和风险。 |
+| Knowledge and Concept - Law | 施工方超过国家规定标准排放噪声,是否应当承担责任? | 施工方超过国家规定标准排放噪声应当承担责任。我国法律规定,建筑施工单位应当采取措施,有效防止或者减少噪声,污染。违反相关规定,由县级以上地方人民政府环境保护行政主管部门责令限期改正,可以并处罚款。法律规定施工单位在施工时,应当采取措施控制噪声对周围的环境产生不良影响 |
+| Table Generation | 用latex表格列出世界上最高的5座山 | 世界最高的5座山可以用latex表格列出,以下是一个示例: <br> 注:以上信息仅供参考,具体数据可能会有所出入。|
+| Ancient Poetry | 请帮我对如下古诗片段进行准确、详细的翻译: 白酒新熟山中归,黄鸡啄黍秋正肥。呼童烹鸡酌白酒,儿女嬉笑牵人衣。高歌取醉欲自慰,起舞落日争光辉。游说万乘苦不早,著鞭跨马涉远道。会稽愚妇轻买臣,余亦辞家西入秦。仰天大笑出门去,我辈岂是蓬蒿人。 | 刚刚酿好了白酒我要归隐山中,黄鸡们正在啄食黄澄澄的小米。呼唤小童仆准备杀鸡煮酒,小童仆却欢笑着说:您老喝醉了吧?您老可真是任性啊!高声唱着狂歌自我安慰,在阳光下翩翩起舞乐陶陶。游说万国诸侯苦于行动不得早,骑着骏马跨上战马从此远离家乡。会稽愚妇嘲笑买臣,我也辞家西行投奔长安。仰面朝天大笑着走出了家门,如今的我们难道是隐没在荒草荆棘之中的人吗?|
+| Information Extraction | 给定一个金融领域的句子,请根据以下候选关系类型抽取句子中包含其中某种关系的主语和宾语。<br> 1. 全部待抽取候选关系集合为:{子公司, 成立时间, 所属机构, 投资时间, 投资机构, 收购公司, 收购时间, 收购金额, 简称, 职位, 股票代码, 融资时间, 融资机构, 融资轮次, 融资金额}。<br> 2. 不要在句子中抽取不包含于上述候选关系集合中的关系类型。<br> 3. 每个抽取结果的主语和宾语必须完整包含于待抽取文本中。<br> 4. 全部抽取结果的返回格式如下(每行为一个抽取结果,不同抽取结果之间换行输出):<br>...<br> 每经AI快讯,11月13日,潞晨科技官微宣布,该公司完成近亿元A+轮融资。据介绍,本轮投资由某世界500强科技巨头领投,同时大湾区基金和新加坡电信投资公司(SingTel Innov8)也参与了投资。(每日经济新闻)| (潞晨科技, 融资时间, 11月13日) <br> (潞晨科技, 融资机构, 新加坡电信投资公司)|
+
+❗️ For more examples of question answering, please refer to [Colossal-LLaMA-2-13B-base Examples](docs/example_13b.md).
### Training Logs
We also recorded the training logs for the experiments:
-
+#### Colossal-LLaMA-2-7b-base
-### Import from Transformers (Inference)
-To load Colossal-LLaMA-2-7B-base model using Transformers, use the following code:
+#### Colossal-LLaMA-2-13b-base
+
+
+### Inference
+#### Import from HuggingFace
+To load the `Colossal-LLaMA-2-7B-base` or `Colossal-LLaMA-2-13B-base` model using Transformers, use the following code:
```Python
from transformers import AutoModelForCausalLM, AutoTokenizer
+
+# Colossal-LLaMA-2-7B-base
model = AutoModelForCausalLM.from_pretrained("hpcai-tech/Colossal-LLaMA-2-7b-base", device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("hpcai-tech/Colossal-LLaMA-2-7b-base", trust_remote_code=True)
-input = "离离原上草,"
+# Or Colossal-LLaMA-2-13B-base (load only one of the two checkpoints)
+model = AutoModelForCausalLM.from_pretrained("hpcai-tech/Colossal-LLaMA-2-13b-base", device_map="auto", trust_remote_code=True)
+tokenizer = AutoTokenizer.from_pretrained("hpcai-tech/Colossal-LLaMA-2-13b-base", trust_remote_code=True)
+
+input = "明月松间照,\n\n->\n\n"
inputs = tokenizer(input, return_tensors='pt')
inputs = inputs.to('cuda:0')
pred = model.generate(**inputs,
max_new_tokens=256,
do_sample=True,
+ temperature=0.3,
top_k=50,
top_p=0.95,
num_return_sequences=1)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True)[len(input):])
```
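
If you prefer deterministic output (for instance, to mirror the greedy-search setting used in the evaluation above), sampling can be disabled. This is a minimal variant, reusing the `model`, `tokenizer`, `inputs`, and `input` objects defined in the snippet above:

```Python
# Greedy (deterministic) decoding: disable sampling so the highest-probability
# token is chosen at every step.
pred = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True)[len(input):])
```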
+#### Import from Modelscope
You can also load our models using ModelScope; use the following code:
```Python
from modelscope import AutoModelForCausalLM, AutoTokenizer, snapshot_download
+# Colossal-LLaMA-2-7B-base
model_dir = snapshot_download('colossalai/Colossal-LLaMA-2-7b-base', revision='v1.0.1')
+# Or Colossal-LLaMA-2-13B-base (download only one of the two checkpoints)
+model_dir = snapshot_download('colossalai/Colossal-LLaMA-2-13b-base', revision='v1.0.0')
+
tokenizer = AutoTokenizer.from_pretrained(model_dir, device_map="auto", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True).eval()
generation_kwargs = {"max_new_tokens": 256,
"top_p": 0.95,
"temperature": 0.3
}
-input = '离离原上草,'
+
+input = '明月松间照,\n\n->\n\n'
inputs = tokenizer(input, return_token_type_ids=False, return_tensors='pt')
inputs = inputs.to('cuda:0')
output = model.generate(**inputs, **generation_kwargs)
@@ -142,6 +254,30 @@ print(tokenizer.decode(output.cpu()[0], skip_special_tokens=True)[len(input):])
```
You can download model weights from [🤗HuggingFace](https://huggingface.co/hpcai-tech/Colossal-LLaMA-2-7b-base) or [👾Modelscope](https://modelscope.cn/models/colossalai/Colossal-LLaMA-2-7b-base/summary).
+#### Quick Start
+You can run [`inference_example.py`](inference_example.py) to quickly start inference with our base models by loading model weights from Hugging Face.
+
+Command to run the script:
+```bash
+python inference_example.py \
+ --model_path "