
Conversation

@huseinzol05
Contributor

@huseinzol05 huseinzol05 commented May 11, 2024

Static Cache for Whisper

This enables using torch.compile for Whisper generation for faster decoding; example: https://gist.github.com/huseinzol05/9aff34ec1427ee8c92240cb4f3cc0c88

With the compiled static cache, generation reaches 186.26 it/s, while the non-compiled version reaches 150.20 it/s.

Still a work in progress:

  1. The current fork only works with the static cache; it needs to follow the same caching steps as Llama.
  2. There are many conditions that need to be fulfilled first.
  3. It only works on PyTorch 2.4.0.dev20240508+cu121 (not yet released as stable), which is needed for reduce-overhead torch.compile of a custom function.
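
For readers who want the overall shape of the workflow, here is a minimal sketch of static-cache + torch.compile generation for Whisper. It assumes a transformers build in which Whisper's `generate` supports `cache_implementation="static"` (the follow-up work mentioned at the end of this thread); this fork's own API differs and is shown in the gist linked above. Checkpoint name and sizes are illustrative.

```python
# Minimal sketch (not this PR's exact API): Whisper generation with a static
# decoder cache and a compiled forward pass.
import numpy as np
import torch
from transformers import AutoProcessor, WhisperForConditionalGeneration

device = "cuda"
processor = AutoProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-small", torch_dtype=torch.float16
).to(device)

# Pre-allocate decoder k/v buffers so their shapes stay fixed across decoding steps.
model.generation_config.cache_implementation = "static"
model.generation_config.max_new_tokens = 128

# Static shapes are what allow fullgraph / reduce-overhead compilation to pay off.
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

# 30 s of 16 kHz audio (dummy here), converted to log-mel input features.
audio = np.zeros(16000 * 30, dtype=np.float32)
features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
features = features.to(device, dtype=torch.float16)

# The first calls are slow (compilation / graph capture); later calls are fast.
for _ in range(3):
    generated = model.generate(features)
print(processor.batch_decode(generated, skip_special_tokens=True))
```

The first few calls are slow because of graph capture; see the warmup note further down in the thread.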

@mobicham
Contributor

Thank you very much @huseinzol05 for the work.
Here's a version with HQQ 4-bit using the torchao backend. As expected there's a good speed-up with the static cache and fullgraph compilation: https://gist.github.com/mobicham/ecfe09a48efb11e4014386901a5c6cce

GPU: 4090

```
orig - no compile : 48 it/sec
orig + compiled   : 227 it/sec

hqq - no compile  : 42 it/sec
hqq + compile     : 308 it/sec
```
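
For context on the HQQ side: the exact torchao-backend setup is in the gist above; what follows is only a hedged sketch of the transformers-side path for loading Whisper with 4-bit HQQ weights (via `HqqConfig`, which requires the `hqq` package) and then applying the same static-cache + compile recipe. The checkpoint name, nbits, and group_size are illustrative, not the exact configuration from the gist.

```python
# Hedged sketch: load Whisper with 4-bit HQQ weight quantization, then reuse the
# static-cache + torch.compile recipe. Requires the `hqq` package to be installed.
import torch
from transformers import AutoProcessor, HqqConfig, WhisperForConditionalGeneration

quant_config = HqqConfig(nbits=4, group_size=64)   # illustrative settings

processor = AutoProcessor.from_pretrained("openai/whisper-large-v3")
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quant_config,
)

# Same recipe as above: static decoder cache plus fullgraph compilation.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
```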

@kadirnar
Contributor

Will it be merged? @younesbelkada

@amyeroberts
Contributor

cc @sanchit-gandhi

@huseinzol05
Contributor Author

@kadirnar, this PR is not ready to merge, but you can continue to work on it to fulfill points 1, 2, and 3 above. If you want to use it as-is, you have to split the audio into 30-second chunks with overlap and feed them into the encoder-decoder process; feel free to add temperature and top_k sampling as in https://github.com/pytorch-labs/gpt-fast/blob/main/generate.py#L52
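
To make the "split the audio into 30-second chunks with overlap" step concrete, here is a small standalone helper (not part of this PR); the chunk and overlap lengths are the only knobs, and each returned window can be fed through the encoder-decoder separately.

```python
# Hedged helper (not part of this PR): split a 16 kHz waveform into 30 s windows
# with a fixed overlap, so each window can be transcribed independently.
import numpy as np

def chunk_audio(audio: np.ndarray, sr: int = 16000,
                chunk_s: float = 30.0, overlap_s: float = 5.0):
    chunk = int(chunk_s * sr)
    step = chunk - int(overlap_s * sr)
    chunks = []
    for start in range(0, max(len(audio) - 1, 1), step):
        piece = audio[start:start + chunk]
        if len(piece) == 0:
            break
        chunks.append(piece)
        if start + chunk >= len(audio):
            break
    return chunks
```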

@kadirnar
Contributor

kadirnar commented May 16, 2024

Will it work if I run this code? Also, should I make any changes to the gpt-fast library?

https://gist.github.com/huseinzol05/9aff34ec1427ee8c92240cb4f3cc0c88

@huseinzol05
Contributor Author

Yeah, it should work; I use it in production. But don't forget to warm up the static cache multiple times first.
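
To make "warm up the static cache multiple times" concrete: run a few throwaway generations on input shaped like real traffic before timing or serving, so compilation, graph capture, and the pre-allocated cache buffers are all in place. A hedged sketch, reusing the `model` and `processor` from the sketch under the PR description above:

```python
# Hedged warmup sketch: the first few calls trigger compilation and graph capture,
# so run them ahead of time on dummy audio shaped like production traffic.
import numpy as np
import torch

warmup_audio = np.zeros(16000 * 30, dtype=np.float32)       # 30 s of silence
warmup_features = processor(
    warmup_audio, sampling_rate=16000, return_tensors="pt"
).input_features.to("cuda", dtype=torch.float16)

for _ in range(4):                      # a handful of passes is usually enough
    with torch.no_grad():
        model.generate(warmup_features)
torch.cuda.synchronize()                # make sure the warmup work has finished
```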

@kadirnar
Contributor

> Yeah, it should work; I use it in production. But don't forget to warm up the static cache multiple times first.

I ran the notebook file. It gives this error:

```
File /usr/local/lib/python3.10/dist-packages/transformers/cache_utils.py:484, in WhisperStaticCache.__init__(self, config, dtype, device, existing_cache, batch_size)
    482 torch._dynamo.mark_static_address(e_key_cache)
    483 torch._dynamo.mark_static_address(e_value_cache)
--> 484 e_key_cache[:, :, :, :] = existing_cache[k][2].clone()
    485 e_value_cache[:, :, :, :] = existing_cache[k][3].clone()
    486 self.key_cache.append(new_layer_key_cache)

IndexError: tuple index out of range
```

@huseinzol05
Contributor Author

I just reran it and had no issue, super weird. Which line gives you the error?

Code context for the review thread below (from the proposed `WhisperStaticCache`):

```python
self.key_cache[layer_idx].zero_()
self.value_cache[layer_idx].zero_()

class WhisperStaticCache(Cache):
```

Contributor

Thanks for this great first start @huseinzol05! With @gante, we were discussing how the design of the static k/v cache should look for encoder-decoder models, and we distilled the design options down to two possibilities:

  1. Hold a tuple of StaticCache caches, e.g. as proposed here
  2. Add a new Cache class specific to encoder-decoder models, e.g. one with the attributes:
    • key_cache (same as decoder-only self-attn)
    • value_cache (same as decoder-only self-attn)
    • cross_key_cache (new for enc-dec cross-attn)
    • cross_value_cache (new for enc-dec cross-attn)

Option 1 doesn't require any new Cache classes, so it should be easier to maintain! Thus, we were thinking this would be the best design option for Whisper (and other encoder-decoder models in the library, such as BART). Would be curious to hear your opinions here, having had a go at option 2!
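
To make the two layouts easier to picture, here is a schematic sketch of option 2's attribute layout, using the attribute names from the list above; the class name, constructor, and shapes are illustrative only and are not part of this PR or the transformers API.

```python
# Schematic sketch of option 2 above: a single cache object that owns both the
# decoder self-attention buffers and the encoder-decoder cross-attention buffers.
import torch

class EncoderDecoderStaticCacheSketch:
    def __init__(self, num_layers, batch_size, num_heads, self_attn_len,
                 cross_attn_len, head_dim, dtype=torch.float16, device="cpu"):
        def buffers(length):
            return [
                torch.zeros(batch_size, num_heads, length, head_dim,
                            dtype=dtype, device=device)
                for _ in range(num_layers)
            ]
        # Same roles as the decoder-only static cache:
        self.key_cache = buffers(self_attn_len)
        self.value_cache = buffers(self_attn_len)
        # New for encoder-decoder cross-attention (written once from the encoder output):
        self.cross_key_cache = buffers(cross_attn_len)
        self.cross_value_cache = buffers(cross_attn_len)
```

Option 1 would instead keep two ordinary cache objects, e.g. a `(self_attention_cache, cross_attention_cache)` pair inside `past_key_values`, which is why it composes for free with other cache types such as the quantized cache mentioned in the next comment.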

Contributor

@huseinzol05 this is great work!

I'm heavily biased towards option 1, especially now that we are seeing more cache types. For instance, we could easily plug in the quantized cache as the decoder cache with 0 code overhead, if we design Whisper to support a tuple of Cache objects through past_key_values 🤗

Contributor Author

I'm good with anything.

@Jiltseb

Jiltseb commented Jun 3, 2024

We compared the performance of the torch.compile version with the static cache and its HQQ variants (4, 3, 2, and 1.58 bits) on both short-form audio (open_asr_eval) and long-form audio (an internal test benchmark).

Here is the link to the blog post: https://mobiusml.github.io/whisper-static-cache-blog/
Colab Notebook: https://colab.research.google.com/drive/18Zs-oG1Ztco3cfnNexcHDi-Zn9vk2RJ5?usp=sharing

I think the speech community can benefit a lot from this speed-up once integrated into transformers 🤗 !

@mobicham
Contributor

mobicham commented Jun 6, 2024

Any progress on this, folks? Is there a timeline for general static cache support in transformers? We are very excited to see this officially supported!

@github-actions
Contributor

github-actions bot commented Jul 1, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@kadirnar
Contributor

kadirnar commented Jul 7, 2024

Will you merge this pull request? @sanchit-gandhi

@gante
Contributor

gante commented Jul 15, 2024

Closing this PR: Whisper + compilation involved a few sensible design decisions, as shown in the discussion above, so we took charge of adding static caches to Whisper (PR)

Thank you for kickstarting the process and for the discussion 🤗

@gante gante closed this Jul 15, 2024