Should I be getting more speedup/memory reduction from FlashAttention2 with Mistral?

### System Info

transformers: 4.35.0
python: 3.9.13

### Who can help?

@SunMarc 
@younesbelkada 
@gant

### Information

- [ ] The official example scripts
- [X] My own modified scripts

### Tasks

- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)

### Reproduction

## Setup model 
```
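import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.1"

# 4-bit NF4 quantization with double quantization; compute runs in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# load base LLM model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    quantization_config=bnb_config,
    use_flash_attention_2=True,  # enables FlashAttention-2
)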
```

## Run code for different batch sizes
```
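import gc
from time import time

# `context` is the prompt string and `memory()` is a helper that reports
# GPU memory usage (neither is shown here)
results = []
for n in range(1, 25):
    print(f'Processing {n} examples')
    # identical prompts, so no padding is needed; move inputs to the model's device
    tokenized_prompt = tokenizer([context] * n, return_tensors="pt").to(model.device)

    prompt_length = tokenized_prompt['input_ids'].shape[1]
    print(prompt_length)

    torch.cuda.synchronize()  # keep pending GPU work out of the timing
    t0 = time()
    with torch.no_grad():
        output = model.generate(
            inputs=tokenized_prompt['input_ids'],
            max_new_tokens=400,
            repetition_penalty=1.2,
        )
    torch.cuda.synchronize()
    t1 = time()

    time_taken = t1 - t0
    mem_usage = memory()
    new_token_length = output.shape[1] - prompt_length  # tokens generated per sequence
    tokens_per_second = new_token_length * n / time_taken
    time_per_batch = time_taken / n  # wall time divided by batch size, i.e. per example

    print('Time taken = ', time_taken)
    print(f'Tokens/s = {tokens_per_second}')

    gc.collect()
    torch.cuda.empty_cache()

    results.append({'batch_size': n, 'time_taken': time_taken,
                    'tokens_per_second': tokens_per_second, 'memory_usage': mem_usage,
                    'time_per_batch': time_per_batch})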
```
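
The `memory()` helper called in the loop is not shown in this report. A minimal stand-in, assuming peak CUDA memory allocated by PyTorch is the quantity of interest (the body below is a hypothetical sketch, not the helper actually used for the plots), might look like:

```python
try:
    import torch
except ImportError:  # keeps the sketch importable on a machine without PyTorch
    torch = None


def memory() -> float:
    """Peak CUDA memory allocated by PyTorch since the last reset, in GiB.

    Hypothetical stand-in for the helper used in the benchmark loop.
    """
    if torch is None or not torch.cuda.is_available():
        return 0.0  # nothing to measure without a GPU
    peak_bytes = torch.cuda.max_memory_allocated()
    # Reset the peak counter so the next batch size is measured independently.
    torch.cuda.reset_peak_memory_stats()
    return peak_bytes / 1024**3
```

Peak allocated memory leaves out what the caching allocator reserves but does not hand out, so figures from `nvidia-smi` would typically be higher.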

### Expected behavior

## Results
Very little speedup/memory improvement:
![flash](https://github.com/huggingface/transformers/assets/131266258/2b722a0b-67d4-4a58-be21-a8eab9cc2f09)
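
For reference, here is a rough sketch of how two `results` lists (one collected with FA2, one without) could be reduced to per-batch-size ratios. The function name `compare_runs` is hypothetical, and the sketch assumes `memory_usage` is a nonzero number in the same unit for both runs:

```python
def compare_runs(baseline, flash):
    """Per-batch-size throughput speedup and memory ratio of `flash` over `baseline`.

    Both arguments are lists of dicts shaped like the entries appended to
    `results` in the benchmark loop.
    """
    base_by_n = {row['batch_size']: row for row in baseline}
    rows = []
    for row in flash:
        base = base_by_n.get(row['batch_size'])
        if base is None:
            continue  # batch size only present in one of the runs
        rows.append({
            'batch_size': row['batch_size'],
            # > 1 means the FA2 run generated tokens faster
            'speedup': row['tokens_per_second'] / base['tokens_per_second'],
            # < 1 means the FA2 run used less memory
            'memory_ratio': row['memory_usage'] / base['memory_usage'],
        })
    return rows
```

Calling `compare_runs(results_without_fa2, results_with_fa2)` (both names are placeholders) gives one row per batch size present in both runs.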

### Profiling
With FA2:
<img width="1255" alt="Screenshot 2023-11-06 at 18 22 46" src="https://github.com/huggingface/transformers/assets/131266258/ab75b997-9225-495f-9e9d-f86162039edc">

Without FA2:
<img width="1248" alt="Screenshot 2023-11-06 at 18 16 58" src="https://github.com/huggingface/transformers/assets/131266258/96a84c7a-2f37-48f7-8682-bf256ab2490a">

I would expect better performance given these profiles.

