Conversation

@masahi (Member) commented Mar 14, 2023

This PR adds support for cuBLAS offloading in Relax via BYOC. In particular, we are targeting the cuBLASLt API, which has a limited but useful set of epilogue operations (bias / relu / gelu).
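For reference, a minimal sketch of how an epilogue is attached through the cuBLASLt C API (not the PR's exact code; the `op_desc` matmul descriptor, the device pointer `bias_ptr`, and the `CHECK_CUBLAS_ERROR` macro are assumed to be defined elsewhere):

```cpp
// Sketch only: fuse bias + relu into the matmul via the cuBLASLt epilogue
// attribute, assuming a cublasLtMatmulDesc_t `op_desc` created elsewhere and
// a device pointer `bias_ptr` holding the bias vector.
cublasLtEpilogue_t epilogue = CUBLASLT_EPILOGUE_RELU_BIAS;
CHECK_CUBLAS_ERROR(cublasLtMatmulDescSetAttribute(
    op_desc, CUBLASLT_MATMUL_DESC_EPILOGUE, &epilogue, sizeof(epilogue)));
CHECK_CUBLAS_ERROR(cublasLtMatmulDescSetAttribute(
    op_desc, CUBLASLT_MATMUL_DESC_BIAS_POINTER, &bias_ptr, sizeof(bias_ptr)));
// For gelu, CUBLASLT_EPILOGUE_GELU_BIAS would be used instead.
```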

Compared to the CUTLASS BYOC, the introduction of the cuBLAS BYOC is motivated by dynamic shape support: for dynamic shapes we cannot tune CUTLASS kernels, so at build time we either end up choosing a kernel that works for any shape (align1) or developing some runtime heuristics. I realized that cuBLAS doesn't differentiate between static and dynamic shapes, and it already ships with tons of heuristics that are likely better than anything we can come up with. So I believe cuBLAS is a better default solution for dynamic shapes.

cc @vinx13 @yelite @mbaret

@tvm-bot (Collaborator) commented Mar 14, 2023

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot

@github-actions github-actions bot requested review from mbaret and vinx13 March 14, 2023 07:54
```python
    return arg_idx

extract_func = tvm.get_global_func("relax.contrib.extract_arg_idx")
arg_indices = extract_func(pattern_name, f)
return {k: int(v) for k, v in arg_indices.items()}
```
@masahi (Member Author) commented on the diff above:

cc @yelite this has been ported to cpp

@masahi (Member Author) commented Mar 22, 2023

Need to wait for a rebase against main to get #14363 into the unity branch.

@vinx13 (Member) commented Mar 27, 2023

@masahi please rebase as the other PR is merged

@Hzfengsy (Member)

any updates?

@masahi (Member Author) commented Mar 30, 2023

waiting for the next rebase

```cpp
auto C_data = static_cast<char*>(C->data) + C->byte_offset;

CHECK_CUBLAS_ERROR(cublasLtMatmul(hdl, op_desc, alpha, B_data, A_desc, A_data, B_desc, beta,
                                  C_data, C_desc, C_data, C_desc, nullptr, nullptr, 0, nullptr));
```
@vinx13 (Member) commented Mar 30, 2023

The cuBLAS API has a default workspace pool, but it seems cublasLt always requires the workspace to be set explicitly. Does passing nullptr here impact performance? We may want to allocate a default workspace and store it as a thread-local in CublasThreadEntry.

@masahi (Member Author)

Yeah, there are some performance knobs that might be worth exploring, in terms of memory management and algorithm selection (see also https://docs.nvidia.com/cuda/cublas/#heuristics-cache). I haven't tested any of them; those are good items for future work if the cuBLAS BYOC gets more traction.
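The explicit workspace and heuristic-based algorithm selection discussed here could be wired up roughly as follows. This is a hedged sketch, not the PR's code: `hdl`, `op_desc`, the layout descriptors, the data pointers, and the `CHECK_CUBLAS_ERROR` / `CHECK_CUDA_ERROR` macros are assumed to exist, and the 32 MiB workspace size is an arbitrary choice.

```cpp
// Sketch: give cublasLtMatmul an explicit workspace and let the cuBLASLt
// heuristic pick an algorithm, instead of passing nullptr / 0 as above.
size_t workspace_size = 32 * 1024 * 1024;  // arbitrary; could live in CublasThreadEntry
void* workspace = nullptr;
CHECK_CUDA_ERROR(cudaMalloc(&workspace, workspace_size));

cublasLtMatmulPreference_t pref;
CHECK_CUBLAS_ERROR(cublasLtMatmulPreferenceCreate(&pref));
CHECK_CUBLAS_ERROR(cublasLtMatmulPreferenceSetAttribute(
    pref, CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES, &workspace_size, sizeof(workspace_size)));

// Ask cuBLASLt for its best algorithm under the workspace constraint.
cublasLtMatmulHeuristicResult_t heuristic;
int returned = 0;
CHECK_CUBLAS_ERROR(cublasLtMatmulAlgoGetHeuristic(hdl, op_desc, A_desc, B_desc, C_desc, C_desc,
                                                  pref, /*requestedAlgoCount=*/1, &heuristic,
                                                  &returned));

// Same argument order as the snippet above, but with the chosen algorithm
// and the explicit workspace instead of nullptr / 0.
CHECK_CUBLAS_ERROR(cublasLtMatmul(hdl, op_desc, alpha, B_data, A_desc, A_data, B_desc, beta,
                                  C_data, C_desc, C_data, C_desc, &heuristic.algo, workspace,
                                  workspace_size, nullptr));
```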

@masahi (Member Author) commented Apr 3, 2023

Just realized that the unity branch specifies the CI setup in a different (and old?) way: https://github.com/apache/tvm/blob/unity/ci/jenkins/unity_jenkinsfile.groovy.

In particular, the GPU image is an outdated one: https://github.com/apache/tvm/blob/unity/ci/jenkins/unity_jenkinsfile.groovy#L34. That's why I'm still getting a build error even after updating the CUDA version on main in #14363.

Shouldn't unity be using the same set of CI image tags as main? @tqchen @driazati (UPDATE: just updated the GPU image tag in this PR for now)

@masahi masahi merged commit e54e04d into apache:unity Apr 4, 2023

6 participants