[VM] Per-input, data dependence specification for shape func #7210
Conversation
I like this, I think it makes a lot of sense, but I'll defer mainly to @zhiics and @icemelon9 since they mainly implemented the infrastructure for heterogeneous shape functions.

cc @icemelon9 @zhiics Any thoughts on this issue? I think this is important to discuss.
zhiics left a comment
I am okay with the change but I am not sure if there is a better solution. @icemelon9 can you take a look?
@icemelon9 it should be ready
icemelon left a comment
LGTM
Thanks @masahi @zhiics @mbrookhart
…7210) * made TShapeDataDependant array * add stub * dyn strided slice working * reshape also working * remove log * works on maskrcnn * lint fix * fix cpp test * remove stale pop back * add more doc * dependant -> dependent * remove redundant check * remove data_dependent_
Motivation
Currently, shape functions are executed on CPU, even if the model is running on GPU. Each shape function is declared with a `data_dependent` flag, specifying whether the shape function needs the input tensors themselves or only the shapes of the input tensors to compute the output shape:

tvm/python/tvm/relay/op/op.py
Lines 359 to 368 in 9e766d9
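To make this concrete, here is a minimal sketch of how a shape function is typically registered with that flag. The op name `my.identity_like` is hypothetical and the body only follows the usual hybrid-script pattern, so treat this as illustrative rather than verbatim:

```python
from tvm.relay.op import op as _reg
from tvm.te.hybrid import script


@script
def _identity_shape_func(data_shape):
    # data_shape is a 1-D int64 tensor holding the input's shape.
    ndim = data_shape.shape[0]
    out = output_tensor((ndim,), "int64")
    for i in const_range(ndim):
        out[i] = data_shape[i]
    return out


def identity_shape_func(attrs, inputs, _):
    # With data_dependent=False, the VM passes only shape tensors here.
    return [_identity_shape_func(inputs[0])]


# data_dependent=False: the shape func never needs the tensor's contents,
# so no device-to-host copy has to be inserted for this op.
_reg.register_shape_func("my.identity_like", False, identity_shape_func)
```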
When an op's `data_dependent` is true, the VM will do a device-to-host memcpy of the entire tensor before running that op's shape func on CPU. In particular, since `dyn.strided_slice` has `data_dependent` set to True, the VM does a device-to-host memcpy of the to-be-sliced tensor for every `dyn.strided_slice` invocation, which can be highly expensive if the tensor is big.

tvm/python/tvm/relay/op/dyn/_transform.py
Line 195 in 9e766d9
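For illustration, the registration referenced above boils down to something like the following fragment, where `_reg` and `strided_slice_shape_func` follow the same pattern as the sketch earlier (approximate, not verbatim from the file):

```python
# Approximate fragment: the single True flag marks *every* input of
# dyn.strided_slice as data dependent, even though only the values of
# begin/end/strides are needed and data contributes nothing but its shape.
_reg.register_shape_func("dyn.strided_slice", True, strided_slice_shape_func)
```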
In fact, one of the bottlenecks of running PyTorch MaskRCNN on GPU is this repeated device-to-host memcpy, as shown in the profiler output below. Most of it comes from the `dyn.strided_slice` shape func. `CUDA memcpy HtoD` is also very slow, but that copy is necessary to send large parameters to the GPU once.

But if we think about it, these expensive copies are completely useless: the shape func of `dyn.strided_slice` only needs the data shape. It is only because the other arguments `begin`, `end`, and `strides` require their tensor values that `dyn.strided_slice` has to declare its `data_dependent` flag as True. This issue can be resolved if we let each op declare its data dependence per input.

Proposed solution
Luckily, the decision to insert a device-to-host memcpy before the shape func is already made per input, as shown below:

tvm/python/tvm/relay/transform/memory_alloc.py
Lines 207 to 221 in 4c4888b
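In simplified pseudocode, the per-input handling there looks roughly like the following; names such as `input_states` and `cpu_ctx`, and the state encoding, are approximations of the linked source rather than a verbatim copy:

```python
# Approximate sketch of ManifestAlloc's per-input handling before a shape func
# call; see the linked lines for the real implementation.
for arg, state in zip(new_args, input_states):
    if int(state) == 2:
        # Shape-only input: pass shape_of(arg); the tensor itself stays put.
        shape_func_ins.append(self.shape_of(arg))
    elif int(state) == 1:
        # Data-dependent input: the values must be readable on the CPU, so a
        # device-to-host copy is inserted when the tensor lives elsewhere.
        if self.get_context(arg).device_type != cpu_ctx.device_type:
            arg = self.device_copy(arg, self.get_context(arg), cpu_ctx)
        shape_func_ins.append(arg)
```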
So all we need to do is have a way to specify per-input data dependence for each op, and pass this information to the `ManifestAlloc` pass above. This PR enables such a specification.
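A minimal sketch of the per-input declaration, assuming `register_shape_func` now also accepts a list of booleans, one entry per input, in place of the single flag (the exact values shown for `dyn.strided_slice` are illustrative):

```python
# Assumed post-PR usage (illustrative): data dependence is declared per input.
# For dyn.strided_slice the inputs are [data, begin, end, strides]; only the
# last three need their values, so the big data tensor is never copied to host.
_reg.register_shape_func(
    "dyn.strided_slice",
    [False, True, True, True],  # per-input data dependence
    strided_slice_shape_func,
)
```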
I've also updated `compile_engine.cc` accordingly to send the per-input data dependence information to the `ManifestAlloc` pass. The change I made is the minimum necessary to achieve my goal, so it can be improved. With this PR, I was able to remove all the expensive device-to-host memcpy, and it cuts GPU MaskRCNN runtime by 14 milliseconds. More importantly, the purpose of this PR is to make people aware of this problem and decide on the best solution.

please review @icemelon9 @zhiics @kevinthesun @mbrookhart