[Unity][BYOC] Add cuBLAS backend #14291
Conversation
Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot
```python
    return arg_idx

extract_func = tvm.get_global_func("relax.contrib.extract_arg_idx")
arg_indices = extract_func(pattern_name, f)
return {k: int(v) for k, v in arg_indices.items()}
```
cc @yelite this has been ported to cpp
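For context, here is a small self-contained sketch of the pattern in the snippet above: the Python wrapper looks up a registered global function and converts the boxed integers in the returned map into a plain Python dict. The registry, the `IntImm` box, and the sample indices below are stand-ins (not the real TVM runtime); they only illustrate the `{k: int(v) ...}` conversion at the FFI boundary.

```python
class IntImm:
    """Stand-in for a boxed integer returned across the FFI boundary."""
    def __init__(self, value):
        self.value = value

    def __int__(self):
        return self.value


def extract_arg_idx_cpp(pattern_name, func):
    # Pretend the C++ side mapped argument names to positional indices.
    return {"lhs": IntImm(0), "rhs": IntImm(1), "bias": IntImm(2)}


# Stand-in for tvm.get_global_func and the PackedFunc registry.
_fake_registry = {"relax.contrib.extract_arg_idx": extract_arg_idx_cpp}


def get_global_func(name):
    return _fake_registry[name]


def extract_arg_indices(pattern_name, func):
    extract_func = get_global_func("relax.contrib.extract_arg_idx")
    arg_indices = extract_func(pattern_name, func)
    # Boxed integers must be unboxed before use on the Python side.
    return {k: int(v) for k, v in arg_indices.items()}
```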
Need to wait for rebase against […]

@masahi please rebase as the other PR is merged

any updates?

waiting for the next rebase
```cpp
auto C_data = static_cast<char*>(C->data) + C->byte_offset;

CHECK_CUBLAS_ERROR(cublasLtMatmul(hdl, op_desc, alpha, B_data, A_desc, A_data, B_desc, beta,
                                  C_data, C_desc, C_data, C_desc, nullptr, nullptr, 0, nullptr));
```
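A note on the operand order in the call above: cuBLAS is column-major while TVM tensors are row-major, which is presumably why `B_data` is passed in the first operand slot. A row-major buffer read column-major is the transpose, so `C = A @ B` is computed as `C^T = B^T @ A^T` with no data movement. A NumPy sketch of that layout trick (shapes and names here are illustrative, not from the PR):

```python
import numpy as np

m, k, n = 2, 3, 4
A = np.arange(m * k, dtype=np.float32).reshape(m, k)   # row-major A (m, k)
B = np.arange(k * n, dtype=np.float32).reshape(k, n)   # row-major B (k, n)

# A column-major library reading our row-major buffers sees the transposes:
A_seen = A.ravel().reshape(k, m, order="F")            # equals A.T
B_seen = B.ravel().reshape(n, k, order="F")            # equals B.T

# Swapping the operands yields C^T in column-major form, whose memory
# layout is exactly row-major C:
C_seen = B_seen @ A_seen                               # equals (A @ B).T
C = C_seen.ravel(order="F").reshape(m, n)              # row-major read-back
```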
The cuBLAS API has a default workspace pool, but it seems cublasLt always requires the workspace to be set explicitly. Does passing nullptr here impact performance? We may want to allocate a default workspace and store it as a thread-local in CublasThreadEntry.
Yeah, there are some performance knobs that might be worth exploring, in terms of memory management and algorithm selection (see also https://docs.nvidia.com/cuda/cublas/#heuristics-cache). I haven't tested any of them; those are good items for future work if cuBLAS BYOC gets more traction.
Just realized that the […] In particular, the GPU image is using an outdated one: https://github.com/apache/tvm/blob/unity/ci/jenkins/unity_jenkinsfile.groovy#L34. That's why I'm still getting a build error after I've updated the CUDA version on […]. Shouldn't […]?
This PR adds support for cuBLAS offloading in Relax via BYOC. In particular, we are targeting the cuBLASLt API, which has a limited but useful set of epilogue operations (bias / relu / gelu).

Compared to the CUTLASS BYOC, the introduction of cuBLAS BYOC is motivated by dynamic-shape support. For dynamic shapes, we cannot tune CUTLASS kernels, so we either end up choosing a kernel that works for any shape (`align1`) at build time or developing some runtime heuristics. I realized that cuBLAS doesn't differentiate static / dynamic shapes and already has tons of heuristics that are likely better than anything we can come up with. So I believe cuBLAS is a better default solution for dynamic shapes.

cc @vinx13 @yelite @mbaret
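The epilogue fusion mentioned above can be illustrated numerically. This is a sketch of what a fused matmul + bias + relu epilogue computes in a single kernel launch instead of three; the reference below is plain NumPy and only models the math, not the cuBLASLt API itself (matrix values and the function name are made up for illustration).

```python
import numpy as np

def matmul_bias_relu_reference(A, B, bias):
    """Reference semantics of a fused GEMM with bias + relu epilogue."""
    C = A @ B                    # the GEMM itself
    C = C + bias                 # bias epilogue, broadcast over rows
    return np.maximum(C, 0.0)    # relu epilogue

A = np.array([[1.0, -2.0], [3.0, 4.0]], dtype=np.float32)
B = np.array([[1.0, 0.0], [0.0, 1.0]], dtype=np.float32)
bias = np.array([0.5, -10.0], dtype=np.float32)

out = matmul_bias_relu_reference(A, B, bias)
# out == [[1.5, 0.0], [3.5, 0.0]]
```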