[Relay] [TOPI] {relay,topi}.nn.sparse_dense for block-sparse matrix multiplication
#3566
Conversation
Force-pushed from 3406392 to 3cca2ee.
This is still a work in progress since it doesn't fully support all block sizes/shapes, but I'm putting it up in the meantime.
Thanks for the awesome work @ajtulloch ! I'm interested in how we'll handle schedule optimizations for these new operators. Due to the dynamism of the problem (value-dependent data/index/indexptr dense array shapes), how will we set scheduling knobs in autoTVM tophub? I believe this will require a redesign of the autoTVM infrastructure to support dynamic shapes.
@ZihengJiang @Yulun-Yao have been working on a similar prototype; can you please comment on/review this PR?
LGTM; thanks for the great work. You mentioned the operator not working on certain shapes and block sizes; can we perhaps have a few FIXMEs in there to address those limitations, or perhaps catch the invalid shapes and output a meaningful runtime error?
@tmoreau89 the nice thing about this use case (at least a constraint) is that the sparsity pattern is fixed a priori (e.g. it's the output of a standard model sparsification procedure, or we just specify the pattern when initializing the model, as in e.g. sparse transformers). So given that, all shapes are static – and indeed in our use case we also experimented with 'inlining' the sparsity structure (i.e. specifying the sparsity pattern as a …
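To make the shape point concrete, here is a minimal scipy/numpy-only sketch (illustrative, not this PR's API): once the block-sparsity pattern is fixed a priori, the BSR buffers have shapes that depend only on that pattern, so they can be treated as static shapes.

import numpy as np
import scipy.sparse as sp

# Sketch only: sizes and block shape are arbitrary. Fix a block-level mask up front,
# then the BSR buffers (data, indices, indptr) have compile-time-constant shapes.
np.random.seed(0)
N, K, bs_r, bs_c = 64, 64, 16, 1
block_mask = (np.random.rand(N // bs_r, K // bs_c) < 0.1)           # keep ~10% of blocks
W = np.random.randn(N, K).astype("float32")
W *= np.kron(block_mask, np.ones((bs_r, bs_c))).astype("float32")   # zero out whole blocks
W_bsr = sp.bsr_matrix(W, blocksize=(bs_r, bs_c))

# These shapes depend only on the (fixed) pattern, not on runtime values.
print(W_bsr.data.shape, W_bsr.indices.shape, W_bsr.indptr.shape)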
@ajtulloch that makes sense, thanks. So if we wanted to extend TOPI with optimized schedules for this new operator, it would be on an ad-hoc basis, since the generated code and the best schedule would be value-dependent, right?
@tmoreau89 the x86/sparse.py schedule works well for M=1, block-size = 1x{n * SIMD_WIDTH} over a range of N, K – we (@sf-wind and I) did some detailed exploration for these with AutoTVM and the manual schedule performs pretty well. I haven't really looked much at perf in other regimes.
[tested locally by calling relay.nn.sparse_dense with different configs]
Would it be better to have one extra dimension for batching to match other relay.nn operators?
src/relay/op/nn/sparse.cc (Outdated)
namespace tvm {
namespace relay {

// relay.nn.dense
This comment should read relay.nn.sparse_dense.
@Yulun-Yao I don't quite understand what you mean.
To be consistent with that interface, …
I see. I was thinking about nn.batch_matmul and forgot nn.dense. Thanks for clarifying that! I am new to the project and the codebase, so I might have overlooked some parts. Sorry about the confusion.
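For reference, a small numpy-only sketch of the convention being discussed (assuming the standard relay.nn.dense semantics, where the leading data dimension already serves as the batch dimension; shapes only, not the PR's call signature):

import numpy as np

# nn.dense-style shapes: data (M, K) x weight (N, K) -> output (M, N).
# The batch dimension is simply M, so no separate batch axis is needed;
# sparse_dense keeps the same layout, with the weight held in BSR form instead.
M, K, N = 4, 16, 8
X = np.random.randn(M, K).astype("float32")
W = np.random.randn(N, K).astype("float32")
Y = X @ W.T
assert Y.shape == (M, N)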
topi/python/topi/nn/sparse.py (Outdated)
row_elems = row_end - row_start
elem_idx = tvm.reduce_axis((0, row_elems), name="elem_idx")
elem = row_start + elem_idx
a_val = weight_data[elem].astype("float32")
Why does the data type of weight_data here have to be float32?
@liangfu it's a remnant of when this code supported fp16/bfloat16/float32 weights. What do you think we should support? Should it follow the out_dtype convention?
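For context, a hypothetical sketch (not part of this PR) of how the cast above could follow an out_dtype parameter instead of hard-coding float32; the names mirror the snippet above but are illustrative only, and it assumes the pre-0.7 TVM API where reduce_axis/sum live at the top level, as in that snippet.

import tvm

# Hypothetical: thread an out_dtype argument through the row reduction instead of
# hard-coding "float32". weight_data, weight_indices, x, and i stand in for the same
# buffers/index used by the compute above.
def _row_dot(row_start, row_end, weight_data, weight_indices, x, i, out_dtype="float32"):
    row_elems = row_end - row_start
    elem_idx = tvm.reduce_axis((0, row_elems), name="elem_idx")
    elem = row_start + elem_idx
    a_val = weight_data[elem].astype(out_dtype)           # cast weights to the requested dtype
    x_val = x[i, weight_indices[elem]].astype(out_dtype)  # cast activations likewise
    return tvm.sum(a_val * x_val, axis=elem_idx)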
@ajtulloch the PR looks good to me; I think it would be nice to merge it in even if it's not 100% complete, so others working on sparse operator support can build on top of your work. If you remove the [WIP] flag, I will approve it.
Force-pushed from 3cca2ee to 28eadd9.
@tmoreau89 @Yulun-Yao thanks for the reviews; I've updated the PR with your comments.
GPU failure seems unrelated:
I would just push a new commit to re-trigger the CI |
Force-pushed from 28eadd9 to ad9f8d9.
Success.
Thanks @ajtulloch @liangfu @tmoreau89 @Yulun-Yao, this PR is now merged.
Motivation
We observe multiple groups across a range of domains (ASR, NMT, LM, etc), internally and externally, interested in replacing standard dense layers with block-sparse matrix multiplication layers. The motivations are generally: higher performance (due to reductions in FLOPs and memory bandwidth/cache footprint) and enabling larger models (e.g. fitting more layers in a given memory budget).
Some public work along these lines:
* https://openai.com/blog/block-sparse-gpu-kernels/
* https://openai.com/blog/sparse-transformer/
* https://arxiv.org/abs/1802.08435
* https://arxiv.org/abs/1711.02782
Various groups have been able to successfully train models with reasonable levels of sparsity (90%+) with marginal accuracy changes, which suggests substantial speedups are possible (as this implies a >10x reduction in FLOPs).
It is fairly straightforward to realize these theoretical speedups, see e.g. TVM benchmarks for Intel CPUs in https://gist.github.com/ajtulloch/e65f90487bceb8848128e8db582fe902, and CUDA results in https://github.com/openai/blocksparse, etc.
Existing libraries/Prior Art
* https://github.com/openai/blocksparse (CUDA)
* https://software.intel.com/en-us/mkl-developer-reference-c-mkl-bsrmm (MKL BSRMM)
* https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.bsr_matrix.html (SciPy BSR representation)
PR details
This is extracted from a patch we've been using internally. There are various possible extensions (int8/fp16/bf16, CUDA/other GPU architectures), but this is a reasonable starting point. It needs more thorough unit test coverage, however.
We follow the conventions established by scipy.sparse.bsr_matrix and other libraries; see the unit tests for details.
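As a quick illustration of that convention (scipy/numpy only; the actual relay/topi call signature is in the PR and its unit tests), assuming sparse_dense produces the same Y = X · Wᵀ as nn.dense with W held in BSR form:

import numpy as np
import scipy.sparse as sp

# scipy.sparse.bsr_matrix convention: data has shape (num_blocks, bs_r, bs_c),
# indices holds block-column ids, and indptr delimits block-rows.
M, N, K, bs_r, bs_c = 1, 64, 128, 8, 1
X = np.random.randn(M, K).astype("float32")
W = np.random.randn(N, K).astype("float32")
W *= np.kron((np.random.rand(N // bs_r, K // bs_c) < 0.1), np.ones((bs_r, bs_c))).astype("float32")
W_bsr = sp.bsr_matrix(W, blocksize=(bs_r, bs_c))

# Dense reference for the expected output; the operator is expected to reproduce this
# from (X, W_bsr.data, W_bsr.indices, W_bsr.indptr).
Y_ref = X @ W.T
np.testing.assert_allclose(X @ W_bsr.toarray().T, Y_ref, rtol=1e-5)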
For folks interested in experimenting with scheduling/AutoTVM etc, https://gist.github.com/ajtulloch/e65f90487bceb8848128e8db582fe902 is a useful starting point.