Efficient Transformers backend support #2858
Conversation
```diff
             )
         else:
-            return CausalLM.fallback(
+            return transformers_causal_lm_class.fallback(
```
In general, I like to remove indirections.
Here, transformers_causal_lm_class is not known by the reader, who has to look up where it's defined, which makes following the flow of code hard.
We know which models support flex attention and which don't. We can hardcode the mapping CausalLM -> TransformersFlashCausalLM (as sketched below).
That removes the need to "guess" and the dependency on the private bit.
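For illustration, a minimal sketch of the hardcoded alternative suggested above; the model-type list and helper name are hypothetical, not TGI's actual dispatch code.

```python
from text_generation_server.models.causal_lm import CausalLM
from text_generation_server.models.transformers_flash_causal_lm import (
    TransformersFlashCausalLM,
)

# Hypothetical allow-list of model types known to work with the new backend.
FLEX_ATTENTION_MODEL_TYPES = {"llama", "mistral", "gemma"}


def causal_lm_class_for(model_type: str):
    # Explicit mapping: no need to probe private transformers attributes.
    if model_type in FLEX_ATTENTION_MODEL_TYPES:
        return TransformersFlashCausalLM
    return CausalLM
```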
IMO the dynamic behavior is simpler, as we will roll out support for more and more models in transformers.
But it can obviously be changed if this is a blocker on your side 😁
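For contrast, a rough sketch of the dynamic check being defended here, assuming the transformers model class exposes a private capability flag (the "private bit" mentioned above); the attribute and helper names are assumptions, not the PR's code.

```python
import transformers

from text_generation_server.models.causal_lm import CausalLM
from text_generation_server.models.transformers_flash_causal_lm import (
    TransformersFlashCausalLM,
)


def resolve_causal_lm_class(config):
    # Look up the transformers class named in the config and check its
    # (assumed) flex-attention flag; anything else falls back to CausalLM.
    model_class = getattr(transformers, config.architectures[0], None)
    if model_class is not None and getattr(model_class, "_supports_flex_attn", False):
        return TransformersFlashCausalLM
    return CausalLM
```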
```python
    softmax_scale: Optional[float] = None,
    sliding_window: Optional[int] = None,
    softcap: Optional[float] = None,
    **kwargs,
```
No kwargs.
They are needed here to easily "absorb" whatever is passed internally in Transformers and not used in TGI's attention. I made the arguments we do use explicit, though.
Yes, then we can mark kwargs as _kwargs (just to say explicitly that there might be arguments we do not use).
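A small sketch of the resulting signature pattern: the arguments visible in the diff stay explicit, and everything else transformers passes internally is absorbed by a clearly named catch-all. The leading parameters and the function name are assumptions.

```python
from typing import Optional

import torch


def tgi_flash_attention_forward(
    module: torch.nn.Module,  # assumed: the attention module transformers passes in
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    softmax_scale: Optional[float] = None,
    sliding_window: Optional[int] = None,
    softcap: Optional[float] = None,
    **_kwargs,  # absorbs transformers-internal arguments that TGI's attention does not use
):
    # ... call TGI's paged/flash attention kernel here ...
    ...
```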
server/text_generation_server/models/transformers_flash_causal_lm.py: two outdated review threads (resolved)
```python
        prefill_cache_indices,
        lm_head_indices,
    ):
        hidden_states = self.model.model.forward(
```
Is that consistent enough? I thought some models define self.transformer instead of self.model.
As of now, yes: all models supported by our refactors are consistent with that naming. However, I agree it is quite an ugly workaround, and I'll open a PR ASAP in Transformers to allow logit slicing with a Tensor (for now we only support int slicing with num_logits_to_keep).
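A minimal sketch of the current workaround, under the naming visible in the diff (self.model.model and self.model.lm_head); the surrounding arguments are illustrative.

```python
def _model_forward_sketch(self, input_ids, position_ids, lm_head_indices, **model_kwargs):
    # Run the inner transformer directly to get hidden states, bypassing
    # num_logits_to_keep (which currently only accepts an int, not a tensor).
    hidden_states = self.model.model.forward(
        input_ids=input_ids,
        position_ids=position_ids,
        **model_kwargs,
    )[0]
    if lm_head_indices is not None:
        # Keep only the positions whose logits are actually needed
        # (e.g. the last token of each prefill sequence).
        hidden_states = hidden_states[lm_head_indices]
    logits = self.model.lm_head(hidden_states)
    return logits
```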
Solved with huggingface/transformers#35757 in Transformers for a cleaner and more robust interface.
If it's consistent now, then by all means let's use that. We don't care about legacy transformers versions :)
```python
    def forward(
        self, batch: FlashCausalLMBatch, adapter_data: AdapterBatchData
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
        # NOTE: adapter_data: not supported
```
We need a hard fail at least for every config that would otherwise be silently ignored: speculation and adapter data (the checks might already exist).
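A minimal sketch of the kind of hard fail being requested; the attribute names used for the checks (adapter_data.data, self.speculator) are assumptions about TGI internals, and the real checks may already exist elsewhere.

```python
def forward(self, batch, adapter_data):
    # Fail loudly instead of silently ignoring unsupported features.
    if adapter_data is not None and getattr(adapter_data, "data", None):
        raise NotImplementedError(
            "LoRA adapters are not supported by the transformers backend yet"
        )
    if getattr(self, "speculator", None) is not None:
        raise NotImplementedError(
            "Speculative decoding is not supported by the transformers backend yet"
        )
    # ... normal forward pass ...
```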
```python
        return logits

    def forward(
```
It seems everything below is a copy of what was there before.
What are the key differences, if any?
The calls to self.model.forward() are replaced by self._model_forward().
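Schematically, the indirection looks like the sketch below; class bodies and arguments are simplified placeholders, not the PR's exact code.

```python
class FlashCausalLM:
    def forward(self, batch, adapter_data):
        # ... build input_ids, position_ids, lm_head_indices, etc. from the batch ...
        return self._model_forward(batch)

    def _model_forward(self, batch):
        # Native TGI flash models: call the model directly.
        return self.model.forward(batch)


class TransformersFlashCausalLM(FlashCausalLM):
    def _model_forward(self, batch):
        # Transformers-backed path: run the inner model and apply lm_head manually,
        # as in the earlier sketch; the rest of forward() stays a copy.
        ...
```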
What does this PR do?
Initial draft to support transformers as a (more efficient) backend in TGI. huggingface/transformers#35235 added support for a bunch of models already, and more will come progressively. However, I do need some guidance on how to best support multi-GPU setups 🤗
cc @OlivierDehaene @Narsil