System Info
transformers: 4.35.0
python: 3.9.13
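(For completeness, the versions can be confirmed with a quick snippet like the one below; hardware details such as the GPU model are not in the original report.)

import torch, transformers
print('transformers:', transformers.__version__)
print('torch:', torch.__version__)
print('cuda available:', torch.cuda.is_available())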
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Setup model
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16
)
# load base LLM model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16,
quantization_config=bnb_config, use_flash_attention_2 = True)
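(A quick sanity check, not in the original script: in 4.35 the Flash Attention 2 switch is, as far as I know, recorded on a private config attribute, so something like the following should print True if FA2 was actually enabled; the attribute name may differ in other versions.)

print(getattr(model.config, "_flash_attn_2_enabled", None))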
Run code for different batch sizes
results = []
for n in range(1, 25):
    print(f'Processing {n} examples')
    # `context` is a long prompt string (not shown); the batch is n copies of it
    tokenized_prompt = tokenizer([context] * n, return_tensors="pt").to(model.device)
    length = len(tokenized_prompt['input_ids'][0]) + 1
    print(length)
    t0 = time()
    with torch.no_grad():
        output = model.generate(
            inputs=tokenized_prompt['input_ids'],
            max_new_tokens=400,
            repetition_penalty=1.2,
        )
    t1 = time()
    time_taken = t1 - t0
    mem_usage = memory()  # user-defined helper reporting current GPU memory usage
    new_token_length = len(output[0]) - length
    tokens_per_second = new_token_length * n / time_taken
    time_per_batch = time_taken / n
    print('Time taken = ', time_taken)
    print(f'Tokens/s = {tokens_per_second}')
    gc.collect()
    torch.cuda.empty_cache()
    results.append({'batch_size': n, 'time_taken': time_taken,
                    'tokens_per_second': tokens_per_second,
                    'memory_usage': mem_usage, 'time_per_batch': time_per_batch})
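(For reference, a minimal way to tabulate the collected results, assuming pandas is available; this is not part of the original script.)

import pandas as pd
df = pd.DataFrame(results)
print(df[['batch_size', 'tokens_per_second', 'time_per_batch', 'memory_usage']])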
Expected behavior
Results

Very little speedup or memory improvement as the batch size grows:

[plots of tokens/s, time per batch, and memory usage vs. batch size omitted]

Profiling

[profiler traces omitted]

I would expect better scaling with batch size given these profiling results.
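As an aside on measurement (not in the original report): timing generate with wall-clock time alone can be slightly off because CUDA kernels launch asynchronously, and peak memory is easier to compare across batch sizes than a point-in-time reading. A minimal sketch of one timed run, assuming a CUDA device and the variables defined above:

torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
t0 = time()
with torch.no_grad():
    output = model.generate(inputs=tokenized_prompt['input_ids'], max_new_tokens=400)
torch.cuda.synchronize()
elapsed = time() - t0
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f'{elapsed:.2f}s, peak {peak_gb:.2f} GiB')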