
Conversation

@yongwww (Member) commented Feb 28, 2024

The topi.cuda.inclusive_scan currently performs an exclusive_scan and then adds the input data back in. To eliminate the overhead of this extra addition, we should probably have an implementation designed specifically for inclusive_scan. As of now, the inclusive_scan implementation is limited to exclusive_scan_ir; developing an inclusive_scan_ir will be needed for performance in the future.

In this pull request, we address this inefficiency for the thrust path by calling thrust's inclusive scan directly.
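
For context, a minimal NumPy illustration (not TVM code) of the identity the current CUDA path relies on: an inclusive scan equals an exclusive scan with the input added back element-wise, and that extra add is exactly what this change avoids on the thrust path.

    import numpy as np

    x = np.array([3, 1, 4, 1, 5], dtype=np.float32)

    inclusive = np.cumsum(x)                                 # [3, 4, 8, 9, 14]
    exclusive = np.concatenate(([0.0], np.cumsum(x)[:-1]))   # [0, 3, 4, 8, 9]

    # inclusive scan == exclusive scan + input, the extra add described above
    assert np.array_equal(inclusive, exclusive + x)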

Co-authored-by: Wuwei Lin [email protected]

@tqchen (Member) commented Feb 28, 2024

@tvm-bot rerun

1 similar comment
@yongwww (Member, Author) commented Feb 28, 2024

@tvm-bot rerun

Diff excerpt under review:

    data, output_dtype, exclusive=True, return_reduction=return_reduction, binop=binop
    )
    if _can_use_scan_thrust(binop):
        return scan_thrust(data, output_dtype, return_reduction=return_reduction, binop=binop)
Review comment from a Member:

let's pass exclusive=True explicitly here
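
A sketch of what the suggested change would look like, assuming scan_thrust keeps an exclusive keyword defaulting to True (the keyword is already written out for the IR path in the excerpt above); this is an illustration, not the merged diff:

    if _can_use_scan_thrust(binop):
        # pass exclusive explicitly even though it matches the default, for readability
        return scan_thrust(
            data, output_dtype, exclusive=True, return_reduction=return_reduction, binop=binop
        )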

@yongwww (Member, Author) commented Feb 28, 2024

With this example:

    # imports needed to run this snippet standalone
    import tvm
    from tvm import relax
    from tvm.script import ir as I
    from tvm.script import relax as R

    @I.ir_module
    class Mod:
        @R.function
        def foo(x: R.Tensor((20, 32000), "float32")):
            with R.dataflow():
                gv = R.cumsum(x, axis=1)
                R.output(gv)
            return gv

    target = tvm.target.Target("cuda -libs=thrust", host="llvm")
    dev = tvm.cuda(0)
    ex = relax.build(Mod, target)
    vm = relax.VirtualMachine(ex, device=dev)
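
Continuing the example, a quick sanity check of the compiled result against NumPy (a hedged sketch; the input shape follows the R.Tensor signature above):

    import numpy as np

    x_np = np.random.rand(20, 32000).astype("float32")
    out = vm["foo"](tvm.nd.array(x_np, dev))
    # float32 accumulation over 32000 elements, so use a loose tolerance
    np.testing.assert_allclose(out.numpy(), np.cumsum(x_np, axis=1), rtol=1e-3)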

The 'cuda_gpu_kern_sum' stats report:

w/ this change

    Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)  Name
    92.8      18,099,352       1,101      16,439.0   16,384.0   14,944    18,336    433.8        void cub::CUB_200200_750_NS::DeviceScanByKeyKernel<cub::CUB_200200_750_NS::DeviceScanByKeyPolicy<th…
     7.2       1,409,748       1,101       1,280.4    1,280.0    1,151     1,568    120.1        void cub::CUB_200200_750_NS::DeviceScanByKeyInitKernel<cub::CUB_200200_750_NS::ReduceByKeyScanTileS…

w/o this change

    Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)  Name
    51.6      22,603,888       1,101      20,530.3   20,545.0   19,744    21,120    293.0        cumsum_kernel
    44.3      19,442,725       1,101      17,659.2   17,665.0   16,641    18,752    294.6        void cub::CUB_200200_750_NS::DeviceScanByKeyKernel<cub::CUB_200200_750_NS::DeviceScanByKeyPolicy<th…
     4.1       1,800,850       1,101       1,635.6    1,632.0    1,440     1,792     35.9        void cub::CUB_200200_750_NS::DeviceScanByKeyInitKernel<cub::CUB_200200_750_NS::ReduceByKeyScanTileS…

The execution perf numbers I got on an NVIDIA GeForce RTX 3070:

R.cumsum(x, axis=1)

| Description | Performance |
| --- | --- |
| With this change (w/ thrust) | 0.018 ms |
| Without this change (w/ thrust) | 0.039 ms |
| Without thrust | 0.460 ms |

R.cumsum(x, axis=0)

| Description | Performance |
| --- | --- |
| With this change (w/ thrust) | 0.073 ms |
| Without this change (w/ thrust) | 0.074 ms |
| Without thrust | 44.018 ms |
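
A rough sketch of how timings like these can be collected, assuming relax.VirtualMachine exposes a time_evaluator method (true in recent TVM builds, but treat that as an assumption here); it reuses vm, dev, and the function name "foo" from the example above:

    import numpy as np

    # random float32 input matching the R.Tensor((20, 32000), "float32") signature
    x_nd = tvm.nd.array(np.random.rand(20, 32000).astype("float32"), dev)

    # time_evaluator returns a callable that runs the function repeatedly
    # and reports averaged results
    ftimer = vm.time_evaluator("foo", dev, number=10, repeat=3)
    print(f"R.cumsum mean time: {ftimer(x_nd).mean * 1e3:.3f} ms")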

@MasterJH5574 (Contributor) left a comment

Thank you @yongwww!!

@yongwww (Member, Author) commented Feb 29, 2024

The failed test tests/python/relax/test_op_unpack.py::test_tensor_dtype_lanes[bfloat] has been failing in the last several merged commits; it should be fixed by #16649 and is unrelated to this PR. I can rebase for a clean run if necessary. The CPU job failed on a flaky test; it was green in https://ci.tlcpack.ai/blue/organizations/jenkins/tvm-cpu/detail/PR-16652/3/pipeline/

@yongwww (Member, Author) commented Feb 29, 2024

@tvm-bot rerun

@yongwww (Member, Author) commented Mar 1, 2024

@tvm-bot rerun

1 similar comment
@MasterJH5574 (Contributor) commented

@tvm-bot rerun

tqchen merged commit 3b25588 into apache:main on Mar 3, 2024
yongwww deleted the topi_cumsum_update branch on March 4, 2024
Lunderberg pushed a commit to Lunderberg/tvm that referenced this pull request Mar 12, 2024
thaisacs pushed a commit to thaisacs/tvm that referenced this pull request Apr 3, 2024