diff --git a/MODEL_ZOO.md b/MODEL_ZOO.md
index 32e9e2d02..780abd43a 100644
--- a/MODEL_ZOO.md
+++ b/MODEL_ZOO.md
@@ -25,6 +25,12 @@ We provided original pretrained models from Caffe2 on heavy models (testing Caff
 | X3D | M | - | 16 x 5 | 75.1 | 76.2 | 3.8 | 4.73 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/x3d_models/x3d_m.pyth) | Kinetics/X3D_M |
 | X3D | L | - | 16 x 5 | 76.9 | 77.5 | 6.2 | 18.37 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/x3d_models/x3d_l.pyth) | Kinetics/X3D_L |
 
+## VTN model (details in projects/vtn)
+
+|  architecture | backbone |  pretrain |  frame length x sample rate | top1 |  top5  |  model | config |
+| :-------------: | :-------------: | :-------------: | :-------------: | :-------------: | :-------------: | ------------- | ------------- |
+| VTN | ViT-B | ImageNet-21K | - | 77.72 | 93.24 | [`link`](https://researchpublic.blob.core.windows.net/vtn/VTN_VIT_B_KINETICS.pyth) | Kinetics/VIT_B_VTN |
+
 ## AVA
 
 | architecture | depth | Pretrain Model |  frame length x sample rate  | MAP | AVA version | model |
@@ -67,4 +73,4 @@ We also release the imagenet pretrained model if finetuning from ImageNet is pre
 
 | architecture | depth |  Top1 |  Top5  |  model  |
 | ------------- | ------------- | ------------- | ------------- | ------------- |
-| ResNet | R50 | 23.6 | 6.8 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/kinetics400/R50_IN1K.pyth) |
+| ResNet | R50 | 23.6 | 6.8 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/kinetics400/R50_IN1K.pyth) |
\ No newline at end of file
diff --git a/README.md b/README.md
index 07a7221e8..eb8c56dc1 100644
--- a/README.md
+++ b/README.md
@@ -6,6 +6,7 @@ PySlowFast is an open source video understanding codebase from FAIR that provide
 - [Non-local Neural Networks](https://arxiv.org/abs/1711.07971)
 - [A Multigrid Method for Efficiently Training Video Models](https://arxiv.org/abs/1912.00998)
 - [X3D: Progressive Network Expansion for Efficient Video Recognition](https://arxiv.org/abs/2004.04730)
+- [Video Transformer Network](https://arxiv.org/abs/2102.00719)
 
 
   
@@ -21,8 +22,10 @@ The goal of PySlowFast is to provide a high-performance, light-weight pytorch co
 - I3D
 - Non-local Network
 - X3D
+- VTN
 
 ## Updates
+ - We now support [VTN Model](https://arxiv.org/abs/2102.00719). See [`projects/vtn`](./projects/vtn/README.md) for more information.
  - We now support [X3D Models](https://arxiv.org/abs/2004.04730). See [`projects/x3d`](./projects/x3d/README.md) for more information.
  - We now support [Multigrid Training](https://arxiv.org/abs/1912.00998) for efficiently training video models. See [`projects/multigrid`](./projects/multigrid/README.md) for more information.
  - PySlowFast is released in conjunction with our [ICCV 2019 Tutorial](https://alexander-kirillov.github.io/tutorials/visual-recognition-iccv19/).
diff --git a/configs/Kinetics/VIT_B_VTN.yaml b/configs/Kinetics/VIT_B_VTN.yaml
new file mode 100644
index 000000000..ecd33b878
--- /dev/null
+++ b/configs/Kinetics/VIT_B_VTN.yaml
@@ -0,0 +1,60 @@
+TRAIN:
+  ENABLE: True
+  DATASET: kinetics
+  BATCH_SIZE: 16
+  EVAL_PERIOD: 1
+  CHECKPOINT_PERIOD: 1
+  AUTO_RESUME: True
+  EVAL_FULL_VIDEO: True
+  EVAL_NUM_FRAMES: 250
+DATA:
+  NUM_FRAMES: 16
+  SAMPLING_RATE: 8
+  TARGET_FPS: 25
+  TRAIN_JITTER_SCALES: [256, 320]
+  TRAIN_CROP_SIZE: 224
+  TEST_CROP_SIZE: 224
+  INPUT_CHANNEL_NUM: [3]
+SOLVER:
+  BASE_LR: 0.001
+  LR_POLICY: steps_with_relative_lrs
+  STEPS: [0, 13, 24]
+  LRS: [1, 0.1, 0.01]
+  MAX_EPOCH: 25
+  MOMENTUM: 0.9
+  OPTIMIZING_METHOD: sgd
+MODEL:
+  NUM_CLASSES: 400
+  ARCH: VIT
+  MODEL_NAME: VTN
+  LOSS_FUNC: cross_entropy
+  DROPOUT_RATE: 0.5
+VTN:
+  PRETRAINED: True
+  MLP_DIM: 768
+  DROP_PATH_RATE: 0.0
+  DROP_RATE: 0.0
+  HIDDEN_DIM: 768
+  MAX_POSITION_EMBEDDINGS: 288
+  NUM_ATTENTION_HEADS: 12
+  NUM_HIDDEN_LAYERS: 3
+  ATTENTION_MODE: 'sliding_chunks'
+  PAD_TOKEN_ID: -1
+  ATTENTION_WINDOW: [18, 18, 18]
+  INTERMEDIATE_SIZE: 3072
+  ATTENTION_PROBS_DROPOUT_PROB: 0.1
+  HIDDEN_DROPOUT_PROB: 0.1
+TEST:
+  ENABLE: True
+  DATASET: kinetics
+  BATCH_SIZE: 16
+  NUM_ENSEMBLE_VIEWS: 1
+  NUM_SPATIAL_CROPS: 1
+DATA_LOADER:
+  NUM_WORKERS: 8
+  PIN_MEMORY: True
+NUM_GPUS: 4
+NUM_SHARDS: 1
+RNG_SEED: 0
+OUTPUT_DIR: .
+LOG_MODEL_INFO: False
\ No newline at end of file
diff --git a/projects/vtn/README.md b/projects/vtn/README.md
new file mode 100644
index 000000000..fe4a9f0ac
--- /dev/null
+++ b/projects/vtn/README.md
@@ -0,0 +1,70 @@
+# Video Transformer Network
+Daniel Neimark, Omri Bar, Maya Zohar, Dotan Asselmann [[Paper](https://arxiv.org/abs/2102.00719)]
+
+
+  

+  

+
+
+
+
+## Installation
+```
+pip install timm
+pip install transformers[torch]
+```
+
+## Getting started
+To use VTN models please refer to the configs under `configs/Kinetics`, or see 
+the [MODEL_ZOO.md](https://github.com/facebookresearch/SlowFast/blob/master/MODEL_ZOO.md) 
+for pre-trained models*.
+
+To train ViT-B-VTN on your dataset (see [paper](https://arxiv.org/abs/2102.00719) for details): 
+```
+python tools/run_net.py \
+  --cfg configs/Kinetics/VIT_B_VTN.yaml \
+  DATA.PATH_TO_DATA_DIR path_to_your_dataset \
+```
+
+To test the trained ViT-B-VTN on Kinetics-400 dataset:
+```
+python tools/run_net.py \
+  --cfg configs/Kinetics/VIT_B_VTN.yaml \
+  DATA.PATH_TO_DATA_DIR path_to_kinetics_dataset \
+  TRAIN.ENABLE False \
+  TEST.CHECKPOINT_FILE_PATH path_to_model \
+  TEST.CHECKPOINT_TYPE pytorch
+```
+
+\* VTN models in [MODEL_ZOO.md](https://github.com/facebookresearch/SlowFast/blob/master/MODEL_ZOO.md) produce slightly
+different results than those reported in the paper due to differences between the PySlowFast code base and the
+original code used to train the models (mainly around data and video loading).
+
+## Citing VTN
+If you find VTN useful for your research, please consider citing the paper using the following BibTeX entry.
+```BibTeX
+@article{neimark2021video,
+  title={Video Transformer Network},
+  author={Neimark, Daniel and Bar, Omri and Zohar, Maya and Asselmann, Dotan},
+  journal={arXiv preprint arXiv:2102.00719},
+  year={2021}
+}
+```
+
+
+## Additional Qualitative Results
+
+
+    

+    Label: Tai chi. Prediction: Tai chi.
+    
+    Label: Chopping wood. Prediction: Chopping wood.
+    
+    Label: Archery. Prediction: Archery.
+    
+    Label: Throwing discus. Prediction: Flying kite.
+    
+    Label: Surfing water. Prediction: Parasailing.
+
+
+
diff --git a/projects/vtn/fig/a.png b/projects/vtn/fig/a.png
new file mode 100644
index 000000000..967106b0d
Binary files /dev/null and b/projects/vtn/fig/a.png differ
diff --git a/projects/vtn/fig/arch.png b/projects/vtn/fig/arch.png
new file mode 100644
index 000000000..b09f0fa9e
Binary files /dev/null and b/projects/vtn/fig/arch.png differ
diff --git a/projects/vtn/fig/b.png b/projects/vtn/fig/b.png
new file mode 100644
index 000000000..f29d29f5a
Binary files /dev/null and b/projects/vtn/fig/b.png differ
diff --git a/projects/vtn/fig/c.png b/projects/vtn/fig/c.png
new file mode 100644
index 000000000..e43a35319
Binary files /dev/null and b/projects/vtn/fig/c.png differ
diff --git a/projects/vtn/fig/d.png b/projects/vtn/fig/d.png
new file mode 100644
index 000000000..42ca2cde9
Binary files /dev/null and b/projects/vtn/fig/d.png differ
diff --git a/projects/vtn/fig/e.png b/projects/vtn/fig/e.png
new file mode 100644
index 000000000..c0e246fd3
Binary files /dev/null and b/projects/vtn/fig/e.png differ
diff --git a/projects/vtn/fig/vtn_demo.gif b/projects/vtn/fig/vtn_demo.gif
new file mode 100644
index 000000000..e271ce553
Binary files /dev/null and b/projects/vtn/fig/vtn_demo.gif differ
diff --git a/slowfast/config/defaults.py b/slowfast/config/defaults.py
index 718801a92..e51c5a809 100644
--- a/slowfast/config/defaults.py
+++ b/slowfast/config/defaults.py
@@ -75,6 +75,12 @@
 # If set, clear all layer names according to the pattern provided.
 _C.TRAIN.CHECKPOINT_CLEAR_NAME_PATTERN = ()  # ("backbone.",)
 
+# If True, will use all video's frames during evaluation
+_C.TRAIN.EVAL_FULL_VIDEO = False
+
+# In case "EVAL_FULL_VIDEO" is True, this will set the number of frames to use for the full video (250 in VTN)
+_C.TRAIN.EVAL_NUM_FRAMES = None
+
 # ---------------------------------------------------------------------------- #
 # Testing options
 # ---------------------------------------------------------------------------- #
@@ -254,6 +260,53 @@
 # pathway.
 _C.SLOWFAST.FUSION_KERNEL_SZ = 5
 
+# -----------------------------------------------------------------------------
+# VTN options
+# -----------------------------------------------------------------------------
+_C.VTN = CfgNode()
+
+# ViT: if True, will load pretrained weights for the backbone.
+_C.VTN.PRETRAINED = True
+
+# ViT: stochastic depth decay rule.
+_C.VTN.DROP_PATH_RATE = 0.0
+
+# ViT: dropout ratio.
+_C.VTN.DROP_RATE = 0.0
+
+# Longformer: the size of the embedding, this is the input size of the MLP head,
+# and should match the ViT output dimension.
+_C.VTN.HIDDEN_DIM = 768
+
+# Longformer: the maximum sequence length that this model might ever be used with.
+_C.VTN.MAX_POSITION_EMBEDDINGS = 288
+
+# Longformer: number of attention heads for each attention layer in the Transformer encoder.
+_C.VTN.NUM_ATTENTION_HEADS = 12
+
+# Longformer: number of hidden layers in the Transformer encoder.
+_C.VTN.NUM_HIDDEN_LAYERS = 3
+
+# Longformer: Type of self-attention: LF use 'sliding_chunks' to process with a sliding window
+_C.VTN.ATTENTION_MODE = 'sliding_chunks'
+
+# Longformer: The value used to pad input_ids.
+_C.VTN.PAD_TOKEN_ID = -1
+
+# Longformer: Size of an attention window around each token.
+_C.VTN.ATTENTION_WINDOW = [18, 18, 18]
+
+# Longformer: Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
+_C.VTN.INTERMEDIATE_SIZE = 3072
+
+# Longformer: The dropout ratio for the attention probabilities.
+_C.VTN.ATTENTION_PROBS_DROPOUT_PROB = 0.1
+
+# Longformer: The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
+_C.VTN.HIDDEN_DROPOUT_PROB = 0.1
+
+# MLP Head: the dimension of the MLP head hidden layer.
+_C.VTN.MLP_DIM = 768
 
 # -----------------------------------------------------------------------------
 # Data options
diff --git a/slowfast/datasets/decoder.py b/slowfast/datasets/decoder.py
index efd582859..af733efa3 100644
--- a/slowfast/datasets/decoder.py
+++ b/slowfast/datasets/decoder.py
@@ -25,7 +25,7 @@ def temporal_sampling(frames, start_idx, end_idx, num_samples):
     index = torch.linspace(start_idx, end_idx, num_samples)
     index = torch.clamp(index, 0, frames.shape[0] - 1).long()
     frames = torch.index_select(frames, 0, index)
-    return frames
+    return frames, index
 
 
 def get_start_end_idx(video_size, clip_size, clip_idx, num_clips):
@@ -212,7 +212,7 @@ def torchvision_decode(
 
 
 def pyav_decode(
-    container, sampling_rate, num_frames, clip_idx, num_clips=10, target_fps=30
+    container, sampling_rate, num_frames, clip_idx, num_clips=10, target_fps=30, force_all_video=False
 ):
     """
     Convert the video from its original fps to the target_fps. If the video
@@ -233,6 +233,7 @@ def pyav_decode(
             given video.
         target_fps (int): the input video may has different fps, convert it to
             the target video fps before frame sampling.
+        force_all_video (bool): fetch all video's frames
     Returns:
         frames (tensor): decoded frames from the video. Return None if the no
             video stream was found.
@@ -246,7 +247,7 @@ def pyav_decode(
     frames_length = container.streams.video[0].frames
     duration = container.streams.video[0].duration
 
-    if duration is None:
+    if duration is None or force_all_video:
         # If failed to fetch the decoding information, decode the entire video.
         decode_all_video = True
         video_start_pts, video_end_pts = 0, math.inf
@@ -290,6 +291,7 @@ def decode(
     target_fps=30,
     backend="pyav",
     max_spatial_scale=0,
+    force_all_video=False,
 ):
     """
     Decode the video and perform temporal sampling.
@@ -313,6 +315,7 @@ def decode(
         max_spatial_scale (int): keep the aspect ratio and resize the frame so
             that shorter edge size is max_spatial_scale. Only used in
             `torchvision` backend.
+        force_all_video (bool): fetch all video's frames - only supported with pyav backend
     Returns:
         frames (tensor): decoded frames from the video.
     """
@@ -327,6 +330,7 @@ def decode(
                 clip_idx,
                 num_clips,
                 target_fps,
+                force_all_video,
             )
         elif backend == "torchvision":
             frames, fps, decode_all_video = torchvision_decode(
@@ -346,11 +350,11 @@ def decode(
             )
     except Exception as e:
         print("Failed to decode by {} with exception: {}".format(backend, e))
-        return None
+        return None, None
 
     # Return None if the frames was not decoded successfully.
     if frames is None or frames.size(0) == 0:
-        return None
+        return None, None
 
     clip_sz = sampling_rate * num_frames / target_fps * fps
     start_idx, end_idx = get_start_end_idx(
@@ -359,6 +363,11 @@ def decode(
         clip_idx if decode_all_video else 0,
         num_clips if decode_all_video else 1,
     )
+
+    if force_all_video:
+        # To avoid duplicate the last frame for videos smaller then 250 frames
+        end_idx = min(float(frames.shape[0]), end_idx)
+
     # Perform temporal sampling from the decoded video.
-    frames = temporal_sampling(frames, start_idx, end_idx, num_frames)
-    return frames
+    frames, frames_index = temporal_sampling(frames, start_idx, end_idx, num_frames)
+    return frames, frames_index
diff --git a/slowfast/datasets/kinetics.py b/slowfast/datasets/kinetics.py
index 28036573d..f402e4c52 100644
--- a/slowfast/datasets/kinetics.py
+++ b/slowfast/datasets/kinetics.py
@@ -70,6 +70,17 @@ def __init__(self, cfg, mode, num_retries=10):
                 cfg.TEST.NUM_ENSEMBLE_VIEWS * cfg.TEST.NUM_SPATIAL_CROPS
             )
 
+        if self.mode in ["val", "test"] and cfg.TRAIN.EVAL_FULL_VIDEO:
+            # supporting full video evaluation
+            self.force_all_video = True
+            self.num_frames = self.cfg.TRAIN.EVAL_NUM_FRAMES
+            self.sampling_rate = 1
+            self._num_clips = 1
+        else:
+            self.force_all_video = False
+            self.num_frames = self.cfg.DATA.NUM_FRAMES
+            self.sampling_rate = self.cfg.DATA.SAMPLING_RATE
+
         logger.info("Constructing Kinetics {}...".format(mode))
         self._construct_loader()
 
@@ -158,6 +169,16 @@ def __getitem__(self, index):
                         / self.cfg.MULTIGRID.DEFAULT_S
                     )
                 )
+            if self.mode in ["val"] and self.cfg.TRAIN.EVAL_FULL_VIDEO:
+                # supporting full video evaluation:
+                # spatial_sample_index=1 to take only the center
+                # The testing is deterministic and no jitter should be performed.
+                # min_scale, max_scale, and crop_size are expect to be the same.
+                # temporal_sample_index = -1  # this can be random - in the end we take [0,inf]
+                spatial_sample_index = 1
+                min_scale = self.cfg.DATA.TRAIN_JITTER_SCALES[0]
+                max_scale = self.cfg.DATA.TRAIN_JITTER_SCALES[0]
+                crop_size = self.cfg.DATA.TEST_CROP_SIZE
         elif self.mode in ["test"]:
             temporal_sample_index = (
                 self._spatial_temporal_idx[index]
@@ -189,7 +210,7 @@ def __getitem__(self, index):
             )
         sampling_rate = utils.get_random_sampling_rate(
             self.cfg.MULTIGRID.LONG_CYCLE_SAMPLING_RATE,
-            self.cfg.DATA.SAMPLING_RATE,
+            self.sampling_rate,
         )
         # Try to decode and sample a clip from a video. If the video can not be
         # decoded, repeatly find a random video replacement that can be decoded.
@@ -220,16 +241,17 @@ def __getitem__(self, index):
                 continue
 
             # Decode video. Meta info is used to perform selective decoding.
-            frames = decoder.decode(
+            frames, frames_index = decoder.decode(
                 video_container,
                 sampling_rate,
-                self.cfg.DATA.NUM_FRAMES,
+                self.num_frames,
                 temporal_sample_index,
                 self.cfg.TEST.NUM_ENSEMBLE_VIEWS,
                 video_meta=self._video_meta[index],
                 target_fps=self.cfg.DATA.TARGET_FPS,
                 backend=self.cfg.DATA.DECODING_BACKEND,
                 max_spatial_scale=min_scale,
+                force_all_video=self.force_all_video
             )
 
             # If decoding failed (wrong format, video is too short, and etc),
@@ -263,7 +285,7 @@ def __getitem__(self, index):
             )
 
             label = self._labels[index]
-            frames = utils.pack_pathway_output(self.cfg, frames)
+            frames = utils.pack_pathway_output(self.cfg, frames, frames_index)
             return frames, label, index, {}
         else:
             raise RuntimeError(
diff --git a/slowfast/datasets/ssv2.py b/slowfast/datasets/ssv2.py
index 5e4c2a6aa..d6d5fe33d 100644
--- a/slowfast/datasets/ssv2.py
+++ b/slowfast/datasets/ssv2.py
@@ -265,6 +265,7 @@ def __getitem__(self, index):
             inverse_uniform_sampling=self.cfg.DATA.INV_UNIFORM_SAMPLE,
         )
         frames = utils.pack_pathway_output(self.cfg, frames)
+
         return frames, label, index, {}
 
     def __len__(self):
diff --git a/slowfast/datasets/utils.py b/slowfast/datasets/utils.py
index 08a4de1a6..98ecb2b4b 100644
--- a/slowfast/datasets/utils.py
+++ b/slowfast/datasets/utils.py
@@ -70,7 +70,7 @@ def get_sequence(center_idx, half_len, sample_rate, num_frames):
     return seq
 
 
-def pack_pathway_output(cfg, frames):
+def pack_pathway_output(cfg, frames, frames_index=None):
     """
     Prepare output as a list of tensors. Each tensor corresponding to a
     unique pathway.
@@ -83,7 +83,9 @@ def pack_pathway_output(cfg, frames):
     """
     if cfg.DATA.REVERSE_INPUT_CHANNEL:
         frames = frames[[2, 1, 0], :, :, :]
-    if cfg.MODEL.ARCH in cfg.MODEL.SINGLE_PATHWAY_ARCH:
+    if cfg.MODEL.MODEL_NAME == "VTN":
+        frame_list = [frames, frames_index]
+    elif cfg.MODEL.ARCH in cfg.MODEL.SINGLE_PATHWAY_ARCH:
         frame_list = [frames]
     elif cfg.MODEL.ARCH in cfg.MODEL.MULTI_PATHWAY_ARCH:
         fast_pathway = frames
@@ -151,8 +153,8 @@ def spatial_sampling(
             frames, _ = transform.horizontal_flip(0.5, frames)
     else:
         # The testing is deterministic and no jitter should be performed.
-        # min_scale, max_scale, and crop_size are expect to be the same.
-        assert len({min_scale, max_scale, crop_size}) == 1
+        # min_scale and max_scale are expect to be the same.
+        assert min_scale == max_scale
         frames, _ = transform.random_short_side_scale_jitter(
             frames, min_scale, max_scale
         )
diff --git a/slowfast/models/video_model_builder.py b/slowfast/models/video_model_builder.py
index 85a4ed1a9..355c48a66 100644
--- a/slowfast/models/video_model_builder.py
+++ b/slowfast/models/video_model_builder.py
@@ -6,11 +6,12 @@
 import math
 import torch
 import torch.nn as nn
+from timm.models.vision_transformer import vit_base_patch16_224
 
 import slowfast.utils.weight_init_helper as init_helper
 from slowfast.models.batchnorm_helper import get_norm
 
-from . import head_helper, resnet_helper, stem_helper
+from . import head_helper, resnet_helper, stem_helper, vtn_helper
 from .build import MODEL_REGISTRY
 
 # Number of blocks for different stages given the model depth.
@@ -758,3 +759,112 @@ def forward(self, x, bboxes=None):
         for module in self.children():
             x = module(x)
         return x
+
+
+@MODEL_REGISTRY.register()
+class VTN(nn.Module):
+    """
+    VTN model builder. It uses ViT-Base as the backbone.
+
+    Daniel Neimark, Omri Bar, Maya Zohar and Dotan Asselmann.
+    "Video Transformer Network."
+    https://arxiv.org/abs/2102.00719
+    """
+
+    def __init__(self, cfg):
+        """
+        The `__init__` method of any subclass should also contain these
+            arguments.
+        Args:
+            cfg (CfgNode): model building configs, details are in the
+                comments of the config file.
+        """
+        super(VTN, self).__init__()
+        self._construct_network(cfg)
+
+    def _construct_network(self, cfg):
+        """
+        Builds a VTN model, with a given backbone architecture.
+        Args:
+            cfg (CfgNode): model building configs, details are in the
+                comments of the config file.
+        """
+        if cfg.MODEL.ARCH == "VIT":
+            self.backbone = vit_base_patch16_224(pretrained=cfg.VTN.PRETRAINED,
+                                                 num_classes=0,
+                                                 drop_path_rate=cfg.VTN.DROP_PATH_RATE,
+                                                 drop_rate=cfg.VTN.DROP_RATE)
+        else:
+            raise NotImplementedError(f"not supporting {cfg.MODEL.ARCH}")
+
+        embed_dim = self.backbone.embed_dim
+        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim))
+
+        self.temporal_encoder = vtn_helper.VTNLongformerModel(
+            embed_dim=embed_dim,
+            max_position_embeddings=cfg.VTN.MAX_POSITION_EMBEDDINGS,
+            num_attention_heads=cfg.VTN.NUM_ATTENTION_HEADS,
+            num_hidden_layers=cfg.VTN.NUM_HIDDEN_LAYERS,
+            attention_mode=cfg.VTN.ATTENTION_MODE,
+            pad_token_id=cfg.VTN.PAD_TOKEN_ID,
+            attention_window=cfg.VTN.ATTENTION_WINDOW,
+            intermediate_size=cfg.VTN.INTERMEDIATE_SIZE,
+            attention_probs_dropout_prob=cfg.VTN.ATTENTION_PROBS_DROPOUT_PROB,
+            hidden_dropout_prob=cfg.VTN.HIDDEN_DROPOUT_PROB)
+
+        self.mlp_head = nn.Sequential(
+            nn.LayerNorm(cfg.VTN.HIDDEN_DIM),
+            nn.Linear(cfg.VTN.HIDDEN_DIM, cfg.VTN.MLP_DIM),
+            nn.GELU(),
+            nn.Dropout(cfg.MODEL.DROPOUT_RATE),
+            nn.Linear(cfg.VTN.MLP_DIM, cfg.MODEL.NUM_CLASSES)
+        )
+
+    def forward(self, x, bboxes=None):
+
+        x, position_ids = x
+
+        # spatial backbone
+        B, C, F, H, W = x.shape
+        x = x.permute(0, 2, 1, 3, 4)
+        x = x.reshape(B * F, C, H, W)
+        x = self.backbone(x)
+        x = x.reshape(B, F, -1)
+
+        # temporal encoder (Longformer)
+        B, D, E = x.shape
+        attention_mask = torch.ones((B, D), dtype=torch.long, device=x.device)
+        cls_tokens = self.cls_token.expand(B, -1, -1)  # stole cls_tokens impl from Phil Wang, thanks
+        x = torch.cat((cls_tokens, x), dim=1)
+        cls_atten = torch.ones(1).expand(B, -1).to(x.device)
+        attention_mask = torch.cat((attention_mask, cls_atten), dim=1)
+        attention_mask[:, 0] = 2
+        x, attention_mask, position_ids = vtn_helper.pad_to_window_size_local(
+            x,
+            attention_mask,
+            position_ids,
+            self.temporal_encoder.config.attention_window[0],
+            self.temporal_encoder.config.pad_token_id)
+        token_type_ids = torch.zeros(x.size()[:-1], dtype=torch.long, device=x.device)
+        token_type_ids[:, 0] = 1
+
+        # position_ids
+        position_ids = position_ids.long()
+        mask = attention_mask.ne(0).int()
+        max_position_embeddings = self.temporal_encoder.config.max_position_embeddings
+        position_ids = position_ids % (max_position_embeddings - 2)
+        position_ids[:, 0] = max_position_embeddings - 2
+        position_ids[mask == 0] = max_position_embeddings - 1
+
+        x = self.temporal_encoder(input_ids=None,
+                                  attention_mask=attention_mask,
+                                  token_type_ids=token_type_ids,
+                                  position_ids=position_ids,
+                                  inputs_embeds=x,
+                                  output_attentions=None,
+                                  output_hidden_states=None,
+                                  return_dict=None)
+        # MLP head
+        x = x["last_hidden_state"]
+        x = self.mlp_head(x[:, 0])
+        return x
diff --git a/slowfast/models/vtn_helper.py b/slowfast/models/vtn_helper.py
new file mode 100644
index 000000000..0c7b5f831
--- /dev/null
+++ b/slowfast/models/vtn_helper.py
@@ -0,0 +1,55 @@
+import torch
+from transformers import LongformerModel, LongformerConfig
+import torch.nn.functional as F
+
+
+class VTNLongformerModel(LongformerModel):
+
+    def __init__(self,
+                 embed_dim=768,
+                 max_position_embeddings=2 * 60 * 60,
+                 num_attention_heads=12,
+                 num_hidden_layers=3,
+                 attention_mode='sliding_chunks',
+                 pad_token_id=-1,
+                 attention_window=None,
+                 intermediate_size=3072,
+                 attention_probs_dropout_prob=0.1,
+                 hidden_dropout_prob=0.1):
+
+        self.config = LongformerConfig()
+        self.config.attention_mode = attention_mode
+        self.config.intermediate_size = intermediate_size
+        self.config.attention_probs_dropout_prob = attention_probs_dropout_prob
+        self.config.hidden_dropout_prob = hidden_dropout_prob
+        self.config.attention_dilation = [1, ] * num_hidden_layers
+        self.config.attention_window = [256, ] * num_hidden_layers if attention_window is None else attention_window
+        self.config.num_hidden_layers = num_hidden_layers
+        self.config.num_attention_heads = num_attention_heads
+        self.config.pad_token_id = pad_token_id
+        self.config.max_position_embeddings = max_position_embeddings
+        self.config.hidden_size = embed_dim
+        super(VTNLongformerModel, self).__init__(self.config, add_pooling_layer=False)
+        self.embeddings.word_embeddings = None  # to avoid distributed error of unused parameters
+
+
+def pad_to_window_size_local(input_ids: torch.Tensor, attention_mask: torch.Tensor, position_ids: torch.Tensor,
+                             one_sided_window_size: int, pad_token_id: int):
+    '''A helper function to pad tokens and mask to work with the sliding_chunks implementation of Longformer self-attention.
+    Based on _pad_to_window_size from https://github.com/huggingface/transformers:
+    https://github.com/huggingface/transformers/blob/71bdc076dd4ba2f3264283d4bc8617755206dccd/src/transformers/models/longformer/modeling_longformer.py#L1516
+    Input:
+        input_ids = torch.Tensor(bsz x seqlen): ids of wordpieces
+        attention_mask = torch.Tensor(bsz x seqlen): attention mask
+        one_sided_window_size = int: window size on one side of each token
+        pad_token_id = int: tokenizer.pad_token_id
+    Returns
+        (input_ids, attention_mask) padded to length divisible by 2 * one_sided_window_size
+    '''
+    w = 2 * one_sided_window_size
+    seqlen = input_ids.size(1)
+    padding_len = (w - seqlen % w) % w
+    input_ids = F.pad(input_ids.permute(0, 2, 1), (0, padding_len), value=pad_token_id).permute(0, 2, 1)
+    attention_mask = F.pad(attention_mask, (0, padding_len), value=False)  # no attention on the padding tokens
+    position_ids = F.pad(position_ids, (1, padding_len), value=False)  # no attention on the padding tokens
+    return input_ids, attention_mask, position_ids
diff --git a/slowfast/utils/misc.py b/slowfast/utils/misc.py
index 5684ef83f..1db4758ae 100644
--- a/slowfast/utils/misc.py
+++ b/slowfast/utils/misc.py
@@ -103,7 +103,13 @@ def _get_model_analysis_input(cfg, use_train_input):
             cfg.DATA.TEST_CROP_SIZE,
             cfg.DATA.TEST_CROP_SIZE,
         )
-    model_inputs = pack_pathway_output(cfg, input_tensors)
+
+    if cfg.MODEL.MODEL_NAME == "VTN":
+        frames_index = torch.arange(input_tensors.shape[1])
+    else:
+        frames_index = None
+
+    model_inputs = pack_pathway_output(cfg, input_tensors, frames_index)
     for i in range(len(model_inputs)):
         model_inputs[i] = model_inputs[i].unsqueeze(0)
         if cfg.NUM_GPUS: