
Conversation

@ajtulloch
Contributor

Motivation

We observe multiple groups across a range of domains (ASR, NMT, LM, etc.), internally and externally, interested in replacing standard dense layers with block-sparse matrix multiplication layers. The motivations are generally higher performance (due to a reduction in FLOPs and in memory bandwidth/cache footprint) and enabling larger models (e.g. fitting more layers into a given memory budget).

Some public work along these lines:

* https://openai.com/blog/block-sparse-gpu-kernels/
* https://openai.com/blog/sparse-transformer/
* https://arxiv.org/abs/1802.08435
* https://arxiv.org/abs/1711.02782

Various groups have been able to train models at high levels of sparsity (90%+) with only marginal accuracy changes, which suggests substantial speedups are possible (as 90% sparsity implies a >10x reduction in FLOPs).

It is fairly straightforward to realize these theoretical speedups; see e.g. the TVM benchmarks for Intel CPUs in https://gist.github.com/ajtulloch/e65f90487bceb8848128e8db582fe902 and the CUDA results in https://github.com/openai/blocksparse.

Existing libraries/Prior Art

* https://github.com/openai/blocksparse (CUDA)
* https://software.intel.com/en-us/mkl-developer-reference-c-mkl-bsrmm (MKL BSRMM)
* https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.bsr_matrix.html (SciPy BSR representation)

PR details

This is extracted from a patch we've been using internally. Various extensions are possible (int8/fp16/bf16, CUDA/other GPU architectures), but this is a reasonable starting point. It still needs more thorough unit test coverage, however.

We follow the conventions established by scipy.sparse.bsr_matrix and other libraries; see the unit tests for details.
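For readers unfamiliar with the BSR layout, the following is a minimal sketch (illustrative only, not code from this PR; the sizes, block size, and variable names are arbitrary) of how a dense weight matrix maps onto the scipy.sparse.bsr_matrix representation referred to above, and of the shape convention shared with relay.nn.dense:

```python
import numpy as np
import scipy.sparse as sp

# Illustrative sizes only: X is (M, K), W is (N, K), as in relay.nn.dense.
M, N, K = 8, 64, 64
bs_r, bs_c = 16, 1  # hypothetical block size; must evenly divide (N, K)

X = np.random.randn(M, K).astype("float32")
W = np.random.randn(N, K).astype("float32")
W[np.random.rand(N, K) < 0.9] = 0.0  # ~90% sparsity, purely for illustration

# BSR stores W as three arrays: data (num_blocks, bs_r, bs_c), indices, indptr.
W_bsr = sp.bsr_matrix(W, blocksize=(bs_r, bs_c))
print(W_bsr.data.shape, W_bsr.indices.shape, W_bsr.indptr.shape)

# dense(X, W) computes X @ W.T; the BSR form should give the same (M, N) result.
Y_ref = X @ W.T
Y_bsr = W_bsr.dot(X.T).T
np.testing.assert_allclose(Y_ref, Y_bsr, rtol=1e-4, atol=1e-5)
```

The data/indices/indptr triple printed here is, per the convention above, the form in which the sparse weight is handed to the new operator; the unit tests in the PR show the exact argument order.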

For folks interested in experimenting with scheduling/AutoTVM etc, https://gist.github.com/ajtulloch/e65f90487bceb8848128e8db582fe902 is a useful starting point.

@ajtulloch
Contributor Author

cc @tqchen, @Laurawly, @yzhliu

@ajtulloch force-pushed the sparse-dense-relay-topi branch 2 times, most recently from 3406392 to 3cca2ee on July 17, 2019
@ajtulloch
Contributor Author

This is still a work in progress since it doesn't yet fully support all block sizes/shapes, but I'm putting it up in the meantime.

@ajtulloch changed the title from "[Relay] [TOPI] {relay,topi}.nn.sparse_dense for block-sparse matrix multiplication" to "[WIP] [Relay] [TOPI] {relay,topi}.nn.sparse_dense for block-sparse matrix multiplication" on Jul 17, 2019
@tmoreau89
Contributor

Thanks for the awesome work @ajtulloch! I'm interested in how we'll handle schedule optimizations for these new operators. Given the dynamism of the problem (the value-dependent shapes of the data/indices/indptr arrays), how will we set scheduling knobs in AutoTVM TopHub? I believe this will require a redesign of the AutoTVM infrastructure to support dynamic shapes.

@tmoreau89
Contributor

@ZihengJiang @Yulun-Yao have been working on a similar prototype; can you please comment on/review this PR?

Contributor

@tmoreau89 left a comment


LGTM; thanks for the great work. You mentioned the operator not working for certain shapes and block sizes; can we perhaps add a few FIXMEs to address those limitations, or catch the invalid shapes and raise a meaningful runtime error?

@ajtulloch
Contributor Author

@tmoreau89 the nice thing about this use case (or at least a constraint of it) is that the sparsity pattern is fixed a priori (e.g. it is the output of a standard model sparsification procedure, or we specify the pattern when initializing the model, as in sparse transformers). Given that, all shapes are static. In our use case we also experimented with 'inlining' the sparsity structure (i.e. specifying the sparsity pattern as a const_matrix and unrolling the outer loops, so there were zero data-dependent branches at runtime); this helps a bit for smaller MMs, but hurts for larger MMs (probably icache bound).
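As a toy illustration of the 'inlining' idea described above (this is not the PR's const_matrix approach, just a sketch of the concept in plain Python with made-up sizes): once the sparsity pattern is known ahead of time, the per-row dot products can be emitted with the pattern baked into the generated code, so nothing reads an index array or branches on data at runtime.

```python
import numpy as np

np.random.seed(0)
N, K = 4, 8
W = np.random.randn(N, K)
W[np.random.rand(N, K) < 0.75] = 0.0  # a fixed, known-ahead-of-time pattern

# Generate a matvec with the sparsity structure unrolled into the source.
lines = ["def sparse_matvec(x):", "    y = [0.0] * %d" % N]
for n in range(N):
    terms = ["%r * x[%d]" % (float(W[n, k]), k) for k in range(K) if W[n, k] != 0.0]
    if terms:
        lines.append("    y[%d] = %s" % (n, " + ".join(terms)))
lines.append("    return y")
exec("\n".join(lines))  # defines sparse_matvec with the pattern inlined

x = np.random.randn(K)
assert np.allclose(sparse_matvec(x), W @ x)
```

The trade-off described in the comment (a win for small matrix multiplies, a loss for large ones) is then essentially generated-code size versus the cost of reading the index arrays.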

@tmoreau89
Contributor

@ajtulloch that makes sense, thanks. So if we wanted to extend TOPI with optimized schedules for this new operator, it would be on an ad hoc basis, since the generated code and the best schedule would be value-dependent, right?

@ajtulloch
Contributor Author

ajtulloch commented Jul 18, 2019

@tmoreau89 the x86/sparse.py schedule works well for M=1, block size = 1x{n * SIMD_WIDTH} over a range of N and K; we (@sf-wind and I) did some detailed exploration of these configurations with AutoTVM and the manual schedule performs pretty well. I haven't looked much at performance in other regimes.
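For context on what is being scheduled, below is a rough sketch (not the PR's topi code or its x86/sparse.py schedule; the shapes, nnz, and names are illustrative) of a CSR-style sparse_dense compute written against the TVM expression API of that era; the code excerpt quoted further down this thread comes from a compute of this form. The tuned schedule discussed above additionally splits and vectorizes over the block columns, which is omitted here.

```python
import tvm

# Illustrative sizes; with a fixed sparsity pattern these are all static constants.
M, K, N = 1, 1024, 1024
nnz = 8192  # hypothetical number of stored weight elements

data = tvm.placeholder((M, K), name="data", dtype="float32")
weight_data = tvm.placeholder((nnz,), name="weight_data", dtype="float32")
weight_indices = tvm.placeholder((nnz,), name="weight_indices", dtype="int32")
weight_indptr = tvm.placeholder((N + 1,), name="weight_indptr", dtype="int32")

def _compute(i, row):
    # Reduce over the stored elements of weight row `row` (CSR-style).
    row_start = weight_indptr[row]
    row_end = weight_indptr[row + 1]
    row_elems = row_end - row_start
    elem_idx = tvm.reduce_axis((0, row_elems), name="elem_idx")
    elem = row_start + elem_idx
    a_val = weight_data[elem].astype("float32")
    x_val = data[i, weight_indices[elem]]
    return tvm.sum(a_val * x_val, axis=elem_idx)

out = tvm.compute((M, N), _compute, name="sparse_dense")
s = tvm.create_schedule(out.op)
print(tvm.lower(s, [data, weight_data, weight_indices, weight_indptr, out],
                simple_mode=True))
```

(These are the tvm.* aliases current at the time of this PR; newer TVM releases expose the same functions under tvm.te.)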

Contributor

@yy665 left a comment


[tested locally by calling relay.nn.sparse_dense with different configs]
Would it be better to have one extra dimension for batching to match other relay.nn operators?

namespace tvm {
namespace relay {

// relay.nn.dense
Contributor


relay.nn.sparse_dense

@ajtulloch
Contributor Author

@Yulun-Yao I don't quite understand what you mean.

relay.nn.dense takes a 2D tensor X of size (M, K), a 2D tensor W of size (N, K), and returns a 2D tensor of size (M, N) right?

To be consistent with that interface, relay.nn.sparse_dense takes a 2D tensor X of size (M, K), a sparse 2D tensor W of size (N, K), and returns a 2D tensor of size (M, N).

@yy665
Contributor

yy665 commented Jul 18, 2019

@Yulun-Yao I don't quite understand what you mean.

relay.nn.dense takes a 2D tensor X of size (M, K), a 2D tensor W of size (N, K), and returns a 2D tensor of size (M, N) right?

To be consistent with that interface, relay.nn.sparse_dense takes a 2D tensor X of size (M, K), a sparse 2D tensor W of size (N, K), and returns a 2D tensor of size (M, N).

I see, I was thinking of nn.batch_matmul and forgot about nn.dense. Thanks for clarifying! I am new to the project and the codebase, so I might have overlooked some parts. Sorry for the confusion.

row_elems = row_end - row_start
elem_idx = tvm.reduce_axis((0, row_elems), name="elem_idx")
elem = row_start + elem_idx
a_val = weight_data[elem].astype("float32")
Member


Why does the data type of weight_data here have to be float32?

Contributor Author


@liangfu it's a remnant of when this code supported fp16/bfloat16/float32 weights. What do you think we should support? Should it follow the out_dtype convention?

@tmoreau89
Contributor

@ajtulloch the PR looks good to me; I think it would be nice to merge it in even if it's not 100% complete so others working on sparse operator support can build on top of your work. If you remove the [WIP] flag, I will approve it.

@ajtulloch force-pushed the sparse-dense-relay-topi branch from 3cca2ee to 28eadd9 on July 23, 2019
@ajtulloch changed the title from "[WIP] [Relay] [TOPI] {relay,topi}.nn.sparse_dense for block-sparse matrix multiplication" to "[Relay] [TOPI] {relay,topi}.nn.sparse_dense for block-sparse matrix multiplication" on Jul 23, 2019
@ajtulloch
Contributor Author

@tmoreau89 @Yulun-Yao thanks for the reviews, I've updated the PR with your comments.

@ajtulloch
Contributor Author

GPU failure seems unrelated:

```
test_op_level6.test_argsort ... ./tests/scripts/task_python_integration.sh: line 41: 12373 Bus error               (core dumped) TVM_FFI=ctypes python3 -m nose -v tests/python/relay

script returned exit code 135
```

@tmoreau89
Contributor

I would just push a new commit to re-trigger the CI

@ajtulloch force-pushed the sparse-dense-relay-topi branch from 28eadd9 to ad9f8d9 on July 23, 2019
@ajtulloch
Contributor Author

Success.

@tqchen merged commit d6dcd6c into apache:master on Jul 23, 2019
@tqchen
Member

tqchen commented Jul 23, 2019

Thanks @ajtulloch @liangfu @tmoreau89 @Yulun-Yao, this PR is now merged.

wweic pushed a commit to wweic/tvm that referenced this pull request Aug 9, 2019
wweic pushed a commit to neo-ai/tvm that referenced this pull request Sep 6, 2019