
Conversation

@ajtulloch
Contributor

Motivation

We observe multiple groups across a range of domains (ASR, NMT, LM, etc.), internally and externally, interested in replacing standard dense layers with block-sparse matrix multiplication layers. The motivations are generally higher performance (due to a reduction in FLOPs and in memory bandwidth/cache footprint) and enabling larger models (e.g. fitting more layers into a given memory budget).

Some public work along these lines:

* https://openai.com/blog/block-sparse-gpu-kernels/
* https://openai.com/blog/sparse-transformer/
* https://arxiv.org/abs/1802.08435
* https://arxiv.org/abs/1711.02782

Various groups have been able to train models at high levels of sparsity (90%+) with only marginal accuracy changes, which suggests substantial speedups are possible (as 90% sparsity implies a >10x reduction in FLOPs).

It is fairly straightforward to realize these theoretical speedups; see e.g. the TVM benchmarks for Intel CPUs in https://gist.github.com/ajtulloch/e65f90487bceb8848128e8db582fe902 and the CUDA results in https://github.com/openai/blocksparse.

Existing libraries/Prior Art

* https://github.com/openai/blocksparse (CUDA)
* https://software.intel.com/en-us/mkl-developer-reference-c-mkl-bsrmm (MKL BSRMM)
* https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.bsr_matrix.html (SciPy BSR representation)

PR details

This is extracted from a patch we've been using internally. Various extensions are possible (int8/fp16/bf16, CUDA/other GPU architectures), but this is a reasonable starting point. It still needs more thorough unit test coverage, however.

We follow the conventions established by scipy.sparse.bsr_matrix and other libraries; see the unit tests for details.
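For readers unfamiliar with the BSR layout, the following is a minimal sketch (illustrative only, not code from this PR; the sizes, block size, and variable names are arbitrary) of how a dense weight matrix maps onto the scipy.sparse.bsr_matrix representation referred to above, and of the shape convention shared with relay.nn.dense:

```python
import numpy as np
import scipy.sparse as sp

# Illustrative sizes only: X is (M, K), W is (N, K), as in relay.nn.dense.
M, N, K = 8, 64, 64
bs_r, bs_c = 16, 1  # hypothetical block size; must evenly divide (N, K)

X = np.random.randn(M, K).astype("float32")
W = np.random.randn(N, K).astype("float32")
W[np.random.rand(N, K) < 0.9] = 0.0  # ~90% sparsity, purely for illustration

# BSR stores W as three arrays: data (num_blocks, bs_r, bs_c), indices, indptr.
W_bsr = sp.bsr_matrix(W, blocksize=(bs_r, bs_c))
print(W_bsr.data.shape, W_bsr.indices.shape, W_bsr.indptr.shape)

# dense(X, W) computes X @ W.T; the BSR form should give the same (M, N) result.
Y_ref = X @ W.T
Y_bsr = W_bsr.dot(X.T).T
np.testing.assert_allclose(Y_ref, Y_bsr, rtol=1e-4, atol=1e-5)
```

The data/indices/indptr triple printed here is, per the convention above, the form in which the sparse weight is handed to the new operator; the unit tests in the PR show the exact argument order.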

For folks interested in experimenting with scheduling/AutoTVM etc, https://gist.github.com/ajtulloch/e65f90487bceb8848128e8db582fe902 is a useful starting point.

@ajtulloch
Contributor Author

cc @tqchen, @Laurawly, @yzhliu

@ajtulloch force-pushed the sparse-dense-relay-topi branch 2 times, most recently from 3406392 to 3cca2ee on July 17, 2019
@ajtulloch
Contributor Author

This is still a work in progress since it doesn't yet fully support all block sizes/shapes, but I'm putting it up in the meantime.

@ajtulloch changed the title from "[Relay] [TOPI] {relay,topi}.nn.sparse_dense for block-sparse matrix multiplication" to "[WIP] [Relay] [TOPI] {relay,topi}.nn.sparse_dense for block-sparse matrix multiplication" on Jul 17, 2019
@tmoreau89
Contributor

Thanks for the awesome work @ajtulloch! I'm interested in how we'll handle schedule optimizations for these new operators. Given the dynamism of the problem (the value-dependent shapes of the data/indices/indptr arrays), how will we set scheduling knobs in AutoTVM TopHub? I believe this will require a redesign of the AutoTVM infrastructure to support dynamic shapes.

@tmoreau89
Contributor

@ZihengJiang @Yulun-Yao have been working on a similar prototype; can you please comment on/review this PR?

Contributor

@tmoreau89 left a comment


LGTM; thanks for the great work. You mentioned the operator not working for certain shapes and block sizes; can we perhaps add a few FIXMEs to address those limitations, or catch the invalid shapes and raise a meaningful runtime error?

@ajtulloch
Contributor Author

@tmoreau89 the nice thing about this use case (or at least a constraint of it) is that the sparsity pattern is fixed a priori (e.g. it is the output of a standard model sparsification procedure, or we specify the pattern when initializing the model, as in sparse transformers). Given that, all shapes are static. In our use case we also experimented with 'inlining' the sparsity structure (i.e. specifying the sparsity pattern as a const_matrix and unrolling the outer loops, so there were zero data-dependent branches at runtime); this helps a bit for smaller MMs, but hurts for larger MMs (probably icache bound).
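As a toy illustration of the 'inlining' idea described above (this is not the PR's const_matrix approach, just a sketch of the concept in plain Python with made-up sizes): once the sparsity pattern is known ahead of time, the per-row dot products can be emitted with the pattern baked into the generated code, so nothing reads an index array or branches on data at runtime.

```python
import numpy as np

np.random.seed(0)
N, K = 4, 8
W = np.random.randn(N, K)
W[np.random.rand(N, K) < 0.75] = 0.0  # a fixed, known-ahead-of-time pattern

# Generate a matvec with the sparsity structure unrolled into the source.
lines = ["def sparse_matvec(x):", "    y = [0.0] * %d" % N]
for n in range(N):
    terms = ["%r * x[%d]" % (float(W[n, k]), k) for k in range(K) if W[n, k] != 0.0]
    if terms:
        lines.append("    y[%d] = %s" % (n, " + ".join(terms)))
lines.append("    return y")
exec("\n".join(lines))  # defines sparse_matvec with the pattern inlined

x = np.random.randn(K)
assert np.allclose(sparse_matvec(x), W @ x)
```

The trade-off described in the comment (a win for small matrix multiplies, a loss for large ones) is then essentially generated-code size versus the cost of reading the index arrays.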

@tmoreau89
Contributor

@ajtulloch that makes sense, thanks. So if we wanted to extend TOPI with optimized schedules for this new operator, it would be on an ad hoc basis, since the generated code and the best schedule would be value-dependent, right?

@ajtulloch
Contributor Author

ajtulloch commented Jul 18, 2019

@tmoreau89 the x86/sparse.py schedule works well for M=1, block size = 1x{n * SIMD_WIDTH} over a range of N and K; we (@sf-wind and I) did some detailed exploration of these configurations with AutoTVM and the manual schedule performs pretty well. I haven't looked much at performance in other regimes.
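For context on what is being scheduled, below is a rough sketch (not the PR's topi code or its x86/sparse.py schedule; the shapes, nnz, and names are illustrative) of a CSR-style sparse_dense compute written against the TVM expression API of that era; the code excerpt quoted further down this thread comes from a compute of this form. The tuned schedule discussed above additionally splits and vectorizes over the block columns, which is omitted here.

```python
import tvm

# Illustrative sizes; with a fixed sparsity pattern these are all static constants.
M, K, N = 1, 1024, 1024
nnz = 8192  # hypothetical number of stored weight elements

data = tvm.placeholder((M, K), name="data", dtype="float32")
weight_data = tvm.placeholder((nnz,), name="weight_data", dtype="float32")
weight_indices = tvm.placeholder((nnz,), name="weight_indices", dtype="int32")
weight_indptr = tvm.placeholder((N + 1,), name="weight_indptr", dtype="int32")

def _compute(i, row):
    # Reduce over the stored elements of weight row `row` (CSR-style).
    row_start = weight_indptr[row]
    row_end = weight_indptr[row + 1]
    row_elems = row_end - row_start
    elem_idx = tvm.reduce_axis((0, row_elems), name="elem_idx")
    elem = row_start + elem_idx
    a_val = weight_data[elem].astype("float32")
    x_val = data[i, weight_indices[elem]]
    return tvm.sum(a_val * x_val, axis=elem_idx)

out = tvm.compute((M, N), _compute, name="sparse_dense")
s = tvm.create_schedule(out.op)
print(tvm.lower(s, [data, weight_data, weight_indices, weight_indptr, out],
                simple_mode=True))
```

(These are the tvm.* aliases current at the time of this PR; newer TVM releases expose the same functions under tvm.te.)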

Contributor

@yy665 left a comment


[tested locally by calling relay.nn.sparse_dense with different configs]
Would it be better to have one extra dimension for batching to match other relay.nn operators?

namespace tvm {
namespace relay {

// relay.nn.dense
Contributor


relay.nn.sparse_dense

@ajtulloch
Contributor Author

@Yulun-Yao I don't quite understand what you mean.

relay.nn.dense takes a 2D tensor X of size (M, K), a 2D tensor W of size (N, K), and returns a 2D tensor of size (M, N) right?

To be consistent with that interface, relay.nn.sparse_dense takes a 2D tensor X of size (M, K), a sparse 2D tensor W of size (N, K), and returns a 2D tensor of size (M, N).

@yy665
Contributor

yy665 commented Jul 18, 2019

@Yulun-Yao I don't quite understand what you mean.

relay.nn.dense takes a 2D tensor X of size (M, K), a 2D tensor W of size (N, K), and returns a 2D tensor of size (M, N) right?

To be consistent with that interface, relay.nn.sparse_dense takes a 2D tensor X of size (M, K), a sparse 2D tensor W of size (N, K), and returns a 2D tensor of size (M, N).

I see, I was thinking of nn.batch_matmul and forgot about nn.dense. Thanks for clarifying! I am new to the project and the codebase, so I might have overlooked some parts. Sorry for the confusion.

row_elems = row_end - row_start
elem_idx = tvm.reduce_axis((0, row_elems), name="elem_idx")
elem = row_start + elem_idx
a_val = weight_data[elem].astype("float32")
Member


Why does the data type of weight_data here have to be float32?

Contributor Author


@liangfu it's a remnant of when this code supported fp16/bfloat16/float32 weights. What do you think we should support? Should it follow the out_dtype convention?

@tmoreau89
Contributor

@ajtulloch the PR looks good to me; I think it would be nice to merge it in even if it's not 100% complete so others working on sparse operator support can build on top of your work. If you remove the [WIP] flag, I will approve it.

@ajtulloch force-pushed the sparse-dense-relay-topi branch from 3cca2ee to 28eadd9 on July 23, 2019
@ajtulloch changed the title from "[WIP] [Relay] [TOPI] {relay,topi}.nn.sparse_dense for block-sparse matrix multiplication" to "[Relay] [TOPI] {relay,topi}.nn.sparse_dense for block-sparse matrix multiplication" on Jul 23, 2019
@ajtulloch
Contributor Author

@tmoreau89 @Yulun-Yao thanks for the reviews, I've updated the PR with your comments.

@ajtulloch
Contributor Author

GPU failure seems unrelated:

```
test_op_level6.test_argsort ... ./tests/scripts/task_python_integration.sh: line 41: 12373 Bus error               (core dumped) TVM_FFI=ctypes python3 -m nose -v tests/python/relay

script returned exit code 135
```

@tmoreau89
Contributor

I would just push a new commit to re-trigger the CI

@ajtulloch force-pushed the sparse-dense-relay-topi branch from 28eadd9 to ad9f8d9 on July 23, 2019
@ajtulloch
Contributor Author

Success.

@tqchen merged commit d6dcd6c into apache:master on Jul 23, 2019
@tqchen
Member

tqchen commented Jul 23, 2019

Thanks @ajtulloch @liangfu @tmoreau89 @Yulun-Yao, this PR is now merged.

wweic pushed a commit to wweic/tvm that referenced this pull request Aug 9, 2019
wweic pushed a commit to neo-ai/tvm that referenced this pull request Sep 6, 2019