
Commit d2cec09

joaocmd, Rocketknight1, and amyeroberts authored
Add TF swiftformer (#23342)
* Duplicate swiftformer
* Convert SwiftFormerPatchEmbedding
* Convert SwiftFormerEmbeddings
* Convert TFSwiftFormerMlp
* Convert TFSwiftFormerConvEncoder
* Convert TFSwiftFormerLocalRepresentation
* convert TFSwiftFormerEncoderBlock
* Convert SwiftFormerStage
* Convert SwiftFormerEncoder
* Add TFSWiftFormerPreTrainedModel
* Convert SwiftFormerForImageClassification
* Add kwargs and start drop path
* Fix syntax
* Change Model class name
* Add TFSwiftFormer to __init__
* Duplicate test_modeling_swiftformer
* First test conversions
* Change require_torch to require_tf
* Add exports to swiftformer __init__
* Add TFSwiftFormerModel wrapper
* Fix __init__ and run black
* Remove docstring from MainLayer, fix padding
* Use keras.layers.Activation on keras.Sequential
* Fix swiftformer exports
* Fix activation layer from config
* Remove post_inits
* Use tf.keras.layers.ZeroPadding2D
* Convert torch normalize
* Change tf test input shape
* Fix softmax and reduce_sum
* Convert expand_dims and repeat
* Add missing reshape and tranpose
* Simplify TFSwiftFormerEncoderBlock.call
* Fix mismatch in patch embeddings
* Fix expected output shape to match channels last
* Fix swiftformer typo
* Disable test_onnx
* Fix TFSwiftFormerForImageClassification call
* Add unpack inputs
* Convert flatten(2).mean(-1)
* Change vision dummy inputs (to be reviewed)
* Change test_forward_signature to use .call
* Fix @unpack_inputs
* Set return_tensors="tf" and rename class
* Rename wrongly named patch_embeddings layer
* Add serving_output and change dummy_input shape
* Make dimensions BCHW and transpose inside embedding layer
* Change SwiftFormerEncoderBlock
* Fix ruff problems
* Add image size to swiftformer config
* Change tranpose to MainLayer and use -1 for reshape
* Remove serving_outputs and dummy_inputs
* Remove test_initialization test from tf model
* Make Sequential component a separate layer
* Fix layers' names
* Tranpose encoder outputs
* Fix tests and check if hidden states is not None
* Fix TFSwiftFormerForImageClassification
* Run make fixup
* Run make fix-copies
* Update modeling_tf_auto
* Update docs
* Fix modeling auto mapping
* Update modelint_tf_swiftformer docs
* Fill image_size doc and type
* Add reduction=None to loss computation
* Update docs
* make style
* Debug: Delete the tip to see if that changes anything
* Re-add tip
* Remove add_code_sample_docstrings
* Remove unused import
* Get the debug to actually tell us the problem it has with the docs
* Try a substitution to match the PyTorch file?
* Add swiftformer to ignore list
* Add build() methods
* Update copyright year
  Co-authored-by: amyeroberts <[email protected]>
* Remove FIXME comment
* Remove from_pt
* Update copyright year
  Co-authored-by: amyeroberts <[email protected]>
* Rename one-letter variables
* Remove FIXMEs related to momentum
* Remove old TODO comment
* Remove outstanding FIXME comments
* Get dropout rate from config
* Add specific dropout config for MLP
* Add convencoder dropout to config
* Pass config to SwiftFormerDropPath layer
* Fix drop_path variable name and add Adapted from comment
* Run ruff
* Removed copied from comment
* Run fix copies
* Change drop_path to identity to match pt
* Cleanup build() methods and move to new keras imports
* Update docs/source/en/model_doc/swiftformer.md
  Co-authored-by: Matt <[email protected]>
* Raise error if drop_path_rate > 0.0
* Apply suggestions from code review
  Replace (self.dim), with self.dim,
  Co-authored-by: Matt <[email protected]>
* Remove drop_path function
* Add training to TFSwiftFormerEncoder
* Set self.built = True last
  Co-authored-by: amyeroberts <[email protected]>
* Should have been added to previous commit
  Co-authored-by: amyeroberts <[email protected]>
* Apply suggestions from code review
  Co-authored-by: amyeroberts <[email protected]>
* Change default_feature_extractor to default_image_processor
  Co-authored-by: amyeroberts <[email protected]>
* Import Keras from modeling_tf_utils
* Remove relative import
* Run ruff --fix
* Move import keras to tf_available
* Add copied from comment to test_forward_signature
* Reduce batch size and num_labels
* Extract loss logic to hf_compute_loss
* Run ruff format

---------

Co-authored-by: Matt <[email protected]>
Co-authored-by: amyeroberts <[email protected]>
Co-authored-by: Matt <[email protected]>
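Several of the bullets above describe translating PyTorch tensor idioms into TensorFlow ones, for example "Convert flatten(2).mean(-1)" for the final spatial pooling and "Fix expected output shape to match channels last". A minimal sketch of that particular equivalence, using illustrative shapes rather than real model activations:

import numpy as np
import tensorflow as tf
import torch

# Illustrative activations: PyTorch uses channels-first, the TF port channels-last.
x_pt = torch.rand(2, 220, 7, 7)               # (batch, channels, height, width)
pooled_pt = x_pt.flatten(2).mean(-1)          # -> (batch, channels)

x_tf = tf.transpose(tf.constant(x_pt.numpy()), perm=(0, 2, 3, 1))  # -> (batch, height, width, channels)
pooled_tf = tf.reduce_mean(x_tf, axis=(1, 2))                       # -> (batch, channels)

print(np.allclose(pooled_pt.numpy(), pooled_tf.numpy(), atol=1e-6))  # True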
1 parent 21c912e commit d2cec09

File tree

11 files changed (+1244, -20 lines)


docs/source/en/index.md

Lines changed: 1 addition & 1 deletion
@@ -275,7 +275,7 @@ Flax), PyTorch, and/or TensorFlow.
 | [StableLm](model_doc/stablelm) | ✅ | ❌ | ❌ |
 | [Starcoder2](model_doc/starcoder2) | ✅ | ❌ | ❌ |
 | [SuperPoint](model_doc/superpoint) | ✅ | ❌ | ❌ |
-| [SwiftFormer](model_doc/swiftformer) | ✅ | ❌ | ❌ |
+| [SwiftFormer](model_doc/swiftformer) | ✅ | ✅ | ❌ |
 | [Swin Transformer](model_doc/swin) | ✅ | ✅ | ❌ |
 | [Swin Transformer V2](model_doc/swinv2) | ✅ | ❌ | ❌ |
 | [Swin2SR](model_doc/swin2sr) | ✅ | ❌ | ❌ |

docs/source/en/model_doc/swiftformer.md

Lines changed: 11 additions & 1 deletion
@@ -26,7 +26,7 @@ The abstract from the paper is the following:
 
 *Self-attention has become a defacto choice for capturing global context in various vision applications. However, its quadratic computational complexity with respect to image resolution limits its use in real-time applications, especially for deployment on resource-constrained mobile devices. Although hybrid approaches have been proposed to combine the advantages of convolutions and self-attention for a better speed-accuracy trade-off, the expensive matrix multiplication operations in self-attention remain a bottleneck. In this work, we introduce a novel efficient additive attention mechanism that effectively replaces the quadratic matrix multiplication operations with linear element-wise multiplications. Our design shows that the key-value interaction can be replaced with a linear layer without sacrificing any accuracy. Unlike previous state-of-the-art methods, our efficient formulation of self-attention enables its usage at all stages of the network. Using our proposed efficient additive attention, we build a series of models called "SwiftFormer" which achieves state-of-the-art performance in terms of both accuracy and mobile inference speed. Our small variant achieves 78.5% top-1 ImageNet-1K accuracy with only 0.8 ms latency on iPhone 14, which is more accurate and 2x faster compared to MobileViT-v2.*
 
-This model was contributed by [shehan97](https://huggingface.co/shehan97).
+This model was contributed by [shehan97](https://huggingface.co/shehan97). The TensorFlow version was contributed by [joaocmd](https://huggingface.co/joaocmd).
 The original code can be found [here](https://github.com/Amshaker/SwiftFormer).
 
 ## SwiftFormerConfig
@@ -42,3 +42,13 @@ The original code can be found [here](https://github.com/Amshaker/SwiftFormer).
 
 [[autodoc]] SwiftFormerForImageClassification
     - forward
+
+## TFSwiftFormerModel
+
+[[autodoc]] TFSwiftFormerModel
+    - call
+
+## TFSwiftFormerForImageClassification
+
+[[autodoc]] TFSwiftFormerForImageClassification
+    - call
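With the two TF classes documented above, the model page implies the usual TensorFlow image-classification workflow. A hedged usage sketch; the checkpoint id "MBZUAI/swiftformer-xs" and the from_pt fallback are assumptions, not something stated in this diff:

import requests
import tensorflow as tf
from PIL import Image
from transformers import AutoImageProcessor, TFSwiftFormerForImageClassification

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

# Checkpoint id is assumed; pass from_pt=True if only PyTorch weights exist on the Hub.
processor = AutoImageProcessor.from_pretrained("MBZUAI/swiftformer-xs")
model = TFSwiftFormerForImageClassification.from_pretrained("MBZUAI/swiftformer-xs")

inputs = processor(images=image, return_tensors="tf")
logits = model(**inputs).logits
print(model.config.id2label[int(tf.math.argmax(logits, axis=-1)[0])])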

src/transformers/__init__.py

Lines changed: 14 additions & 0 deletions
@@ -4517,6 +4517,14 @@
             "TFSpeech2TextPreTrainedModel",
         ]
     )
+    _import_structure["models.swiftformer"].extend(
+        [
+            "TF_SWIFTFORMER_PRETRAINED_MODEL_ARCHIVE_LIST",
+            "TFSwiftFormerForImageClassification",
+            "TFSwiftFormerModel",
+            "TFSwiftFormerPreTrainedModel",
+        ]
+    )
     _import_structure["models.swin"].extend(
         [
             "TF_SWIN_PRETRAINED_MODEL_ARCHIVE_LIST",
@@ -8901,6 +8909,12 @@
         TFSpeech2TextModel,
         TFSpeech2TextPreTrainedModel,
     )
+    from .models.swiftformer import (
+        TF_SWIFTFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
+        TFSwiftFormerForImageClassification,
+        TFSwiftFormerModel,
+        TFSwiftFormerPreTrainedModel,
+    )
     from .models.swin import (
         TF_SWIN_PRETRAINED_MODEL_ARCHIVE_LIST,
         TFSwinForImageClassification,
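Once these export entries exist, the TF classes are importable from the package root whenever TensorFlow is installed. A small sanity check, illustrative only:

from transformers import is_tf_available

if is_tf_available():
    # These names are only exported when TensorFlow is installed.
    from transformers import TFSwiftFormerForImageClassification, TFSwiftFormerModel

    print(TFSwiftFormerModel.__module__)  # transformers.models.swiftformer.modeling_tf_swiftformer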

src/transformers/models/auto/modeling_tf_auto.py

Lines changed: 2 additions & 0 deletions
@@ -81,6 +81,7 @@
         ("sam", "TFSamModel"),
         ("segformer", "TFSegformerModel"),
         ("speech_to_text", "TFSpeech2TextModel"),
+        ("swiftformer", "TFSwiftFormerModel"),
         ("swin", "TFSwinModel"),
         ("t5", "TFT5Model"),
         ("tapas", "TFTapasModel"),
@@ -213,6 +214,7 @@
         ("regnet", "TFRegNetForImageClassification"),
         ("resnet", "TFResNetForImageClassification"),
         ("segformer", "TFSegformerForImageClassification"),
+        ("swiftformer", "TFSwiftFormerForImageClassification"),
         ("swin", "TFSwinForImageClassification"),
         ("vit", "TFViTForImageClassification"),
     ]
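Adding the model to these mappings also makes it reachable through the TF auto classes. A hedged example; the checkpoint id and the from_pt flag are assumptions about how the weights are hosted, not facts from this diff:

from transformers import TFAutoModelForImageClassification

# from_pt=True converts PyTorch weights on the fly if no TF weights are published.
model = TFAutoModelForImageClassification.from_pretrained("MBZUAI/swiftformer-xs", from_pt=True)
print(type(model).__name__)  # TFSwiftFormerForImageClassification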

src/transformers/models/swiftformer/__init__.py

Lines changed: 26 additions & 0 deletions
@@ -16,6 +16,7 @@
 from ...utils import (
     OptionalDependencyNotAvailable,
     _LazyModule,
+    is_tf_available,
     is_torch_available,
 )
 
@@ -41,6 +42,19 @@
         "SwiftFormerPreTrainedModel",
     ]
 
+try:
+    if not is_tf_available():
+        raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+    pass
+else:
+    _import_structure["modeling_tf_swiftformer"] = [
+        "TF_SWIFTFORMER_PRETRAINED_MODEL_ARCHIVE_LIST",
+        "TFSwiftFormerForImageClassification",
+        "TFSwiftFormerModel",
+        "TFSwiftFormerPreTrainedModel",
+    ]
+
 if TYPE_CHECKING:
     from .configuration_swiftformer import (
         SWIFTFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP,
@@ -60,6 +74,18 @@
             SwiftFormerModel,
             SwiftFormerPreTrainedModel,
         )
+    try:
+        if not is_tf_available():
+            raise OptionalDependencyNotAvailable()
+    except OptionalDependencyNotAvailable:
+        pass
+    else:
+        from .modeling_tf_swiftformer import (
+            TF_SWIFTFORMER_PRETRAINED_MODEL_ARCHIVE_LIST,
+            TFSwiftFormerForImageClassification,
+            TFSwiftFormerModel,
+            TFSwiftFormerPreTrainedModel,
+        )
 
 else:
     import sys
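The guard added here follows the library's optional-dependency pattern: TF symbols are only advertised when the backend imports cleanly, and otherwise they are simply left out of the lazy module. A standalone sketch of the same idea, using a hypothetical is_tf_available helper rather than the transformers one:

import importlib.util

def is_tf_available() -> bool:
    # Hypothetical stand-in for transformers.utils.is_tf_available.
    return importlib.util.find_spec("tensorflow") is not None

_import_structure = {"modeling_swiftformer": ["SwiftFormerModel"]}

if is_tf_available():
    # Only expose the TF symbols when the backend can actually be imported.
    _import_structure["modeling_tf_swiftformer"] = ["TFSwiftFormerModel"]

print(sorted(_import_structure))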

src/transformers/models/swiftformer/configuration_swiftformer.py

Lines changed: 12 additions & 0 deletions
@@ -42,6 +42,8 @@ class SwiftFormerConfig(PretrainedConfig):
 
 
     Args:
+        image_size (`int`, *optional*, defaults to 224):
+            The size (resolution) of each image
         num_channels (`int`, *optional*, defaults to 3):
            The number of input channels
         depths (`List[int]`, *optional*, defaults to `[3, 3, 6, 4]`):
@@ -62,6 +64,10 @@ class SwiftFormerConfig(PretrainedConfig):
             Padding in downsampling layers.
         drop_path_rate (`float`, *optional*, defaults to 0.0):
             Rate at which to increase dropout probability in DropPath.
+        drop_mlp_rate (`float`, *optional*, defaults to 0.0):
+            Dropout rate for the MLP component of SwiftFormer.
+        drop_conv_encoder_rate (`float`, *optional*, defaults to 0.0):
+            Dropout rate for the ConvEncoder component of SwiftFormer.
         use_layer_scale (`bool`, *optional*, defaults to `True`):
             Whether to scale outputs from token mixers.
         layer_scale_init_value (`float`, *optional*, defaults to 1e-05):
@@ -89,6 +95,7 @@ class SwiftFormerConfig(PretrainedConfig):
 
     def __init__(
         self,
+        image_size=224,
         num_channels=3,
         depths=[3, 3, 6, 4],
         embed_dims=[48, 56, 112, 220],
@@ -99,12 +106,15 @@ def __init__(
         down_stride=2,
         down_pad=1,
         drop_path_rate=0.0,
+        drop_mlp_rate=0.0,
+        drop_conv_encoder_rate=0.0,
         use_layer_scale=True,
         layer_scale_init_value=1e-5,
         batch_norm_eps=1e-5,
         **kwargs,
     ):
         super().__init__(**kwargs)
+        self.image_size = image_size
         self.num_channels = num_channels
         self.depths = depths
         self.embed_dims = embed_dims
@@ -115,6 +125,8 @@ def __init__(
         self.down_stride = down_stride
         self.down_pad = down_pad
         self.drop_path_rate = drop_path_rate
+        self.drop_mlp_rate = drop_mlp_rate
+        self.drop_conv_encoder_rate = drop_conv_encoder_rate
         self.use_layer_scale = use_layer_scale
         self.layer_scale_init_value = layer_scale_init_value
         self.batch_norm_eps = batch_norm_eps
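The new configuration arguments can be exercised directly when building a model from scratch. A brief sketch with illustrative values (the dropout rates below are arbitrary, not recommended settings):

from transformers import SwiftFormerConfig, TFSwiftFormerModel

config = SwiftFormerConfig(
    image_size=224,              # new: input resolution
    drop_conv_encoder_rate=0.1,  # new: dropout inside the ConvEncoder blocks
    drop_mlp_rate=0.1,           # new: dropout inside the MLP blocks
)
model = TFSwiftFormerModel(config)  # randomly initialized; requires TensorFlow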

src/transformers/models/swiftformer/modeling_swiftformer.py

Lines changed: 9 additions & 18 deletions
@@ -103,13 +103,12 @@ def drop_path(input: torch.Tensor, drop_prob: float = 0.0, training: bool = False) -> torch.Tensor:
     return output
 
 
-# Copied from transformers.models.beit.modeling_beit.BeitDropPath with Beit->Swiftformer
 class SwiftFormerDropPath(nn.Module):
     """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks)."""
 
-    def __init__(self, drop_prob: Optional[float] = None) -> None:
+    def __init__(self, config: SwiftFormerConfig) -> None:
         super().__init__()
-        self.drop_prob = drop_prob
+        self.drop_prob = config.drop_path_rate
 
     def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
         return drop_path(hidden_states, self.drop_prob, self.training)
@@ -169,7 +168,7 @@ def __init__(self, config: SwiftFormerConfig, dim: int):
         self.point_wise_conv1 = nn.Conv2d(dim, hidden_dim, kernel_size=1)
         self.act = nn.GELU()
         self.point_wise_conv2 = nn.Conv2d(hidden_dim, dim, kernel_size=1)
-        self.drop_path = nn.Identity()
+        self.drop_path = nn.Dropout(p=config.drop_conv_encoder_rate)
         self.layer_scale = nn.Parameter(torch.ones(dim).unsqueeze(-1).unsqueeze(-1), requires_grad=True)
 
     def forward(self, x):
@@ -200,7 +199,7 @@ def __init__(self, config: SwiftFormerConfig, in_features: int):
         act_layer = ACT2CLS[config.hidden_act]
         self.act = act_layer()
         self.fc2 = nn.Conv2d(hidden_features, in_features, 1)
-        self.drop = nn.Dropout(p=0.0)
+        self.drop = nn.Dropout(p=config.drop_mlp_rate)
 
     def forward(self, x):
         x = self.norm1(x)
@@ -302,7 +301,7 @@ def __init__(self, config: SwiftFormerConfig, dim: int, drop_path: float = 0.0)
         self.local_representation = SwiftFormerLocalRepresentation(config, dim=dim)
         self.attn = SwiftFormerEfficientAdditiveAttention(config, dim=dim)
         self.linear = SwiftFormerMlp(config, in_features=dim)
-        self.drop_path = SwiftFormerDropPath(drop_path) if drop_path > 0.0 else nn.Identity()
+        self.drop_path = SwiftFormerDropPath(config) if drop_path > 0.0 else nn.Identity()
         self.use_layer_scale = use_layer_scale
         if use_layer_scale:
             self.layer_scale_1 = nn.Parameter(
@@ -315,21 +314,13 @@ def __init__(self, config: SwiftFormerConfig, dim: int, drop_path: float = 0.0)
     def forward(self, x):
         x = self.local_representation(x)
         batch_size, channels, height, width = x.shape
+        res = self.attn(x.permute(0, 2, 3, 1).reshape(batch_size, height * width, channels))
+        res = res.reshape(batch_size, height, width, channels).permute(0, 3, 1, 2)
         if self.use_layer_scale:
-            x = x + self.drop_path(
-                self.layer_scale_1
-                * self.attn(x.permute(0, 2, 3, 1).reshape(batch_size, height * width, channels))
-                .reshape(batch_size, height, width, channels)
-                .permute(0, 3, 1, 2)
-            )
+            x = x + self.drop_path(self.layer_scale_1 * res)
             x = x + self.drop_path(self.layer_scale_2 * self.linear(x))
-
         else:
-            x = x + self.drop_path(
-                self.attn(x.permute(0, 2, 3, 1).reshape(batch_size, height * width, channels))
-                .reshape(batch_size, height, width, channels)
-                .permute(0, 3, 1, 2)
-            )
+            x = x + self.drop_path(res)
             x = x + self.drop_path(self.linear(x))
         return x
 
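These PyTorch-side changes route the drop-path probability through the config and compute the attention residual once as res. For reference, stochastic depth as performed by the drop_path helper referenced above amounts to the following sketch (a paraphrase, not the exact library code):

import torch

def drop_path_sketch(x: torch.Tensor, drop_prob: float, training: bool) -> torch.Tensor:
    # Stochastic depth: randomly drop the whole residual branch per sample,
    # rescaling the survivors so the expected value is unchanged.
    if drop_prob == 0.0 or not training:
        return x
    keep_prob = 1.0 - drop_prob
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)  # broadcast over all non-batch dims
    mask = torch.floor(keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device))
    return x / keep_prob * mask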
