Conversation

jananisriram
Contributor

Summary:
Taking inspiration from D77053488, Triton's persistent matmul tutorial, and Triton's block scaled matmul tutorial, this PR writes and benchmarks a persistent + TMA Triton kernel for FP8 workloads on Blackwell that enables warp specialization and flattening.

Note the following limitations (illustrated in the sketch below):

  • (K, N) (the second GEMM operand): both K and N must be multiples of 16.
  • num_warps >= 4: TMA instructions expect a group of at least 128 threads. One warp is 32 threads, so each thread block needs at least 4 warps for correct functionality.

Differential Revision: D81470285
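
To make the constraints above concrete, here is a minimal host-side sketch of the input checks and output allocation a caller of such a kernel might perform. The function name `check_fp8_gemm_inputs` and the per-tensor fp16 output choice are illustrative assumptions, not the API added in this diff.

```python
import torch


def check_fp8_gemm_inputs(a: torch.Tensor, b: torch.Tensor, num_warps: int = 4) -> torch.Tensor:
    """Validate FP8 GEMM inputs against the limitations above and allocate the output.

    Illustrative sketch only, not the API introduced in this diff.
    """
    # Inputs are FP8: torch.float8_e4m3fn on the host, tl.float8e4nv inside Triton.
    assert a.dtype == torch.float8_e4m3fn and b.dtype == torch.float8_e4m3fn
    M, K = a.shape
    Kb, N = b.shape
    assert K == Kb, "inner dimensions must match"
    # The second operand's dimensions (K, N) must both be multiples of 16.
    assert K % 16 == 0 and N % 16 == 0, "(K, N) must be multiples of 16"
    # TMA expects a group of at least 128 threads, i.e. 4 warps of 32 threads.
    assert num_warps >= 4, "TMA requires num_warps >= 4"
    # Accumulation happens in fp32 inside the kernel; the per-tensor-scaled
    # convention returns fp16 (per-row scaling would return bf16 instead).
    return torch.empty((M, N), device=a.device, dtype=torch.float16)
```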

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D81470285

facebook-github-bot pushed a commit that referenced this pull request Sep 16, 2025
@facebook-github-bot
Contributor

@jananisriram has exported this pull request. If you are a Meta employee, you can view the originating diff in D81470285.

facebook-github-bot pushed a commit that referenced this pull request Sep 16, 2025
facebook-github-bot pushed a commit that referenced this pull request Sep 16, 2025
jananisriram added a commit that referenced this pull request Sep 16, 2025
jananisriram added a commit that referenced this pull request Sep 16, 2025
jananisriram added a commit that referenced this pull request Sep 16, 2025
jananisriram added a commit that referenced this pull request Sep 16, 2025
jananisriram added a commit that referenced this pull request Sep 16, 2025
facebook-github-bot pushed a commit that referenced this pull request Sep 17, 2025
facebook-github-bot pushed a commit that referenced this pull request Sep 17, 2025
facebook-github-bot pushed a commit that referenced this pull request Sep 17, 2025
jananisriram added a commit that referenced this pull request Sep 17, 2025
jananisriram added a commit that referenced this pull request Sep 17, 2025
jananisriram added a commit that referenced this pull request Sep 18, 2025
facebook-github-bot pushed a commit that referenced this pull request Sep 18, 2025
facebook-github-bot pushed a commit that referenced this pull request Sep 18, 2025
Summary:

Taking inspiration from D77053488, Triton's [persistent matmul](https://triton-lang.org/main/getting-started/tutorials/09-persistent-matmul.html#) tutorial, and Triton's [block scaled matmul](https://triton-lang.org/main/getting-started/tutorials/10-block-scaled-matmul.html) tutorial, this diff writes and benchmarks a persistent + TMA Triton kernel for FP8 workloads on Blackwell that enables warp specialization and flattening.

Recall the following conventions for FP8 workloads:
- Inputs: `torch.float8_e4m3fn` (`tl.float8e4nv`)
- Output: `torch.float16` (per-tensor scaling, `tl.float16`) or `torch.bfloat16` (per-row scaling, `tl.bfloat16`)
- Accumulation: `torch.float32` (`tl.float32`)

Note the following limitations:
- `(K, N)` (the second GEMM operand): both K and N must be multiples of 16.
- `num_warps >= 4`: TMA instructions expect a group of at least 128 threads. One warp is 32 threads, so each thread block needs at least 4 warps for correct functionality.

The current kernel is autotuned on only one config; this will be changed in a future diff. A simplified persistent-loop sketch follows this commit message.

Reviewed By: njriasan

Differential Revision: D81470285
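
As referenced above, here is a minimal, self-contained Triton sketch of the persistent-scheduling pattern described in this summary. It uses plain pointer loads rather than TMA descriptors and omits warp specialization, flattening, scaling, and boundary masking (it assumes M, N, and K are multiples of the block sizes); the kernel and wrapper names are illustrative, not the ones introduced in this diff.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def persistent_fp8_matmul_kernel(
    a_ptr, b_ptr, c_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
    NUM_SMS: tl.constexpr,
):
    # Persistent scheduling: launch roughly one program per SM and let each
    # program walk over many output tiles, instead of one program per tile.
    start_pid = tl.program_id(0)
    num_pid_m = tl.cdiv(M, BLOCK_M)
    num_pid_n = tl.cdiv(N, BLOCK_N)
    num_tiles = num_pid_m * num_pid_n

    for tile_id in range(start_pid, num_tiles, NUM_SMS):
        pid_m = tile_id // num_pid_n
        pid_n = tile_id % num_pid_n

        offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
        offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
        offs_k = tl.arange(0, BLOCK_K)

        a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
        b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn

        # FP8 (tl.float8e4nv) inputs accumulate in fp32.
        acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
        for _ in range(0, tl.cdiv(K, BLOCK_K)):
            a = tl.load(a_ptrs)
            b = tl.load(b_ptrs)
            acc = tl.dot(a, b, acc)
            a_ptrs += BLOCK_K * stride_ak
            b_ptrs += BLOCK_K * stride_bk

        # Per-tensor convention: store the result as fp16.
        c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
        tl.store(c_ptrs, acc.to(tl.float16))


def persistent_fp8_matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # a, b: torch.float8_e4m3fn tensors whose K and N are multiples of 16.
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float16)
    NUM_SMS = torch.cuda.get_device_properties(a.device).multi_processor_count
    # Persistent launch: at most one program per SM.
    grid = (min(NUM_SMS, triton.cdiv(M, 128) * triton.cdiv(N, 128)),)
    persistent_fp8_matmul_kernel[grid](
        a, b, c, M, N, K,
        a.stride(0), a.stride(1),
        b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
        BLOCK_M=128, BLOCK_N=128, BLOCK_K=64,
        NUM_SMS=NUM_SMS,
        num_warps=4,  # TMA / warp specialization needs at least 4 warps (128 threads)
    )
    return c
```

The actual kernel in this diff additionally drives its loads and stores through TMA descriptors and enables warp specialization and flattening; those pieces are deliberately left out here to keep the persistent scheduling pattern visible.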
@jananisriram
Contributor Author

@pytorchbot merge


pytorch-bot bot commented Sep 18, 2025

Mergebot is not configured for this repository. Please use the merge button provided by GitHub.

facebook-github-bot merged commit 434ca4c into main Sep 18, 2025
8 checks passed