
Conversation

@tqchen (Member) commented Aug 19, 2025

This PR updates the autodlpack path to automatically update the env stream to be consistent with the torch stream context.

This change helps make FFI functions compatible with stream-based execution.
Specifically, TVMFFIEnvSetStream is called to set the stream from the torch CUDA context so the callee can query it via TVMFFIEnvGetCurrentStream. The previous stream is restored after the function call ends.
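
For illustration, a minimal sketch of the set/restore pattern in Python. The `set_stream` binding and its returning of the previously set stream are assumptions for the sketch; the PR itself works through the C API:

```python
import torch

# set_stream stands in for a Python-level binding over TVMFFIEnvSetStream;
# it is assumed here (not confirmed by the PR) to return the previously
# set stream so the caller can restore it afterwards.
def call_with_torch_stream(ffi_func, set_stream, *args):
    device = torch.cuda.current_device()
    # Raw cudaStream_t of torch's current stream on this device.
    torch_stream = torch.cuda.current_stream(device).cuda_stream
    prev = set_stream(device, torch_stream)
    try:
        # The callee can now see torch's stream via TVMFFIEnvGetCurrentStream.
        return ffi_func(*args)
    finally:
        # Recover the previous env stream once the call ends.
        set_stream(device, prev)
```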

We leverage torch cpp_extension load_inline to create an efficient stream query function so it won't slow down the call. The first load may take extra time to build the JIT module; subsequent loads are fast once the torch JIT module is cached.
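
A minimal sketch of what such a JIT-built query could look like (the module name and function name below are illustrative, not the PR's actual code):

```python
import torch
from torch.utils.cpp_extension import load_inline

# C++ source compiled once by torch's JIT builder; the built module is
# cached on disk, so only the first load pays the compilation cost.
cpp_source = r"""
#include <cstdint>
#include <c10/cuda/CUDAStream.h>

// Return the raw cudaStream_t of torch's current stream as an integer.
int64_t get_current_cuda_stream(int64_t device_id) {
  return reinterpret_cast<int64_t>(
      c10::cuda::getCurrentCUDAStream(device_id).stream());
}
"""

_mod = load_inline(
    name="stream_query_sketch",  # illustrative name
    cpp_sources=cpp_source,
    functions=["get_current_cuda_stream"],
    with_cuda=True,
)

# Usage: fetch the current stream pointer for the active device.
stream_ptr = _mod.get_current_cuda_stream(torch.cuda.current_device())
```

This avoids the Python-level torch.cuda.current_stream(...) round trip, which the benchmark below shows is roughly 9x slower than the cpp-extension query.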

@tqchen (Member, Author) commented Aug 19, 2025

FFI overhead benchmark on an AMD Ryzen 9 7950X:

-----------------------------
Benchmark f(x, y, z) overhead
-----------------------------
numpy.add                                2.0837783813476562e-07 sec/call
torch.add[cpu]                           5.690574645996094e-07 sec/call
torch.add[cuda]                          2.2510528564453123e-06 sec/call
tvm.ffi.nop                              2.9222965240478516e-07 sec/call
tvm.ffi.nop+from_dlpack(torch)           3.5573482513427735e-06 sec/call
tvm.ffi.nop+from_dlpack(numpy)           1.001763343811035e-06 sec/call
tvm.ffi.nop+from_dlpack(tvm)             1.0982036590576173e-06 sec/call
tvm.ffi.nop+from_dlpack(torch.utils)     2.9434442520141603e-06 sec/call
tvm.ffi.nop.autodlpack(torch[cpu])       3.265666961669922e-06 sec/call
tvm.ffi.nop.autodlpack(torch[cuda])      3.4897327423095704e-06 sec/call
tvm.ffi.nop.autodlpack(torch[cuda][stream]) 3.4964323043823244e-06 sec/call
tvm.ffi.nop.autodlpack(numpy)            1.4113664627075195e-06 sec/call
-------------------------------
Benchmark x.__dlpack__ overhead
-------------------------------
torch.utils.dlpack.to_dlpack             3.6129951477050783e-07 sec/call
torch.__dlpack__                         8.010625839233399e-07 sec/call
numpy.__dlpack__                         6.115436553955078e-08 sec/call
tvm.__dlpack__                           9.13858413696289e-08 sec/call
---------------------------------------------------
Benchmark x.__dlpack__(max_version=(1,1)) overhead
---------------------------------------------------
torch.__dlpack__(max_version=(1,1))      Tensor.__dlpack__() got an unexpected keyword argument 'max_version'
numpy.__dlpack__(max_version=(1,1))      7.741451263427734e-08 sec/call
tvm.__dlpack__(max_version=(1,1))        1.41143798828125e-07 sec/call
---------------------------------------------------
Benchmark torch.get_cuda_stream[default stream]
---------------------------------------------------
torch.cuda.current_stream[cpp-extension] 9.298324584960938e-08 sec/call
torch.cuda.current_stream[python]        8.587837219238281e-07 sec/call
---------------------------------------------------
Benchmark torch.get_cuda_stream[non-default stream]
---------------------------------------------------
torch.cuda.current_stream[cpp-extension] 9.508132934570312e-08 sec/call
torch.cuda.current_stream[python]        8.99958610534668e-07 sec/call
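
The sec/call figures above are per-call averages over many iterations; a minimal sketch of how such a number could be measured (the helper name is illustrative):

```python
import timeit

def sec_per_call(fn, *args, number=100000):
    # Average wall-clock seconds per call, as reported above.
    return timeit.timeit(lambda: fn(*args), number=number) / number
```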

@yongwww merged commit 216e9e9 into apache:main on Aug 20, 2025 (13 checks passed).
tqchen added three commits to tqchen/tvm that referenced this pull request Sep 13, 2025; each carries the same commit message as the PR description above.