Description
Target:
- Avoid the verification step in the original tensorize, which we found hard to make work for some patterns, e.g. distinguishing absolute and relative indexing.
- Achieve better encapsulation. Vendors can build external packages for their hardware, and users do not need to know the details while writing their TVM declarations.
- Propose a way to represent a block in a tensor that can be scheduled as a unit.
Motivation
The current tensorize method uses TensorIntrin as a contract. During lowering, it compares the body of the intrinsic's declaration with the body of the original declaration. Only after verifying that the two are identical is the body of the original declaration replaced by the intrinsic.
However, we found that it is not easy to guarantee/express this identity in some situations, especially those involving the value of an index. The problem is related to the implementation of tensorize: normally, tensorize replaces the indices in the intrinsic with the indices in the original declaration, so that the lowered code can be matched exactly. For example, if we want to tensorize an addition intrinsic:
# intrinsic declaration
for k in 0...16:
    A[k] = B[k] + C[k]
# original declaration
for i in 0...32:
    for j in 0...16:
        A[i*16+j] = B[i*16+j] + C[i*16+j]
We replace k in the intrinsic declaration with i*16+j, then compare whether the resulting IR is identical. However, if the intrinsic requires the actual value of an index (e.g. an absolute offset), things change: the verification will fail unless we can express absolute indexing and relative indexing clearly.
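The matching step above can be sketched as a toy model in plain Python (textual substitution on strings, not TVM's actual IR-level comparison):

```python
# Toy model of tensorize's verification: substitute the intrinsic's
# index variable with the outer index expression, then compare bodies.
def substitute(body, var, expr):
    # naive textual substitution; real tensorize does this on the IR
    return body.replace(var, "(" + expr + ")")

intrin_body = "A[k] = B[k] + C[k]"
orig_body   = "A[(i*16+j)] = B[(i*16+j)] + C[(i*16+j)]"

# After substitution the two bodies match, so tensorize accepts the pattern.
# When the intrinsic depends on the *value* of k (absolute indexing),
# this structural match is no longer sufficient.
matches = substitute(intrin_body, "k", "i*16+j") == orig_body
```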
Design
Instead of proposing a way to express absolute indexing, we would like to come up with generalizing tvm.compute to the region of tensor. That means we can view a tensor intrinsic as an operation between regions, just like + as an operation between elements.
- region: following numpy, we use a slice to express a region of a tensor: `A[i, 0:16]`
- tensor_op: unlike an `extern` operation, we want to keep the ability to schedule the part outside the tensorized region. This means there are two kinds of axes: 1. axes inside the tensorized region, which cannot be scheduled; 2. axes outside the tensorized region, which can be scheduled through `split`, `reorder`, etc.
With this generalization, we can use a tensor intrinsic in the declaration directly, so verification is no longer needed. This approach also provides better encapsulation: users do not need to know the details of the tensor intrinsic.
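As a rough analogy (plain numpy, not TVM), an intrinsic under this model acts on regions the way `+` acts on elements: the outer axis remains an ordinary, schedulable loop, while each iteration hands whole regions to the intrinsic.

```python
import numpy as np

# A hypothetical "vadd" intrinsic: adds two 1-D regions of length 16.
def vadd(x, y):
    return x + y

A = np.arange(32 * 16, dtype=np.float32).reshape(32, 16)
B = np.ones((32, 16), dtype=np.float32)

# The outer axis i is an ordinary loop (can still be scheduled);
# each iteration applies the intrinsic to the regions A[i, 0:16], B[i, 0:16].
C = np.empty_like(A)
for i in range(32):
    C[i, 0:16] = vadd(A[i, 0:16], B[i, 0:16])
```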
Interface
def tensor_op(out_dims,
in_dims,
finputs,
tensor_intrin,
reduce_axis=[],
name='tensor_op',
tag=""):
"""Construct new tensors with intrinsic.
Parameters
----------
out_dims: tuple
The dimensions out of the tensorized region, which can be scheduled through `reorder`, `split`.
in_dims: tuple
The dimensions inside of the tensorized region, which cannot be manipulated.
finputs: lambda function of out_dims -> list of TensorSlice
Specifies involved regions of input tensors.
tensor_intrin : TensorIntrin
The tensor intrinsic used for computation.
reduce_axis : list of IterVar, optional
The iteration variables for reduction.
name: str, optional
The name hint of the tensor
tag: str, optional
Additional tag information about the compute.
"""
Demo:
# tensor intrinsic (assumes dtype is defined elsewhere, e.g. dtype = 'float32')
def intrin_vadd(n):
x = tvm.placeholder((n,), dtype=dtype, name='vx')
y = tvm.placeholder((n,), dtype=dtype, name='vy')
z = tvm.compute(x.shape, lambda i: x[i] + y[i], name='z')
def intrin_func(ins, outs):
ib = tvm.ir_builder.create()
ib.emit(tvm.call_extern(outs[0].dtype, 'vadd',
                        ins[0].access_ptr("r"),
                        ins[1].access_ptr('r'),
                        outs[0].access_ptr('wr'),
                        8, 1, 1, 1, 8, 8, 8))
return ib.get()
return tvm.decl_tensor_intrin(z.op, intrin_func)
# tensorize way
def tensorize_vadd():
A = tvm.placeholder((m/factor, factor), name='A')
B = tvm.placeholder((m/factor, factor), name='B')
C = tvm.compute(A.shape, lambda *i: A(*i) + B(*i), name='C')
s = tvm.create_schedule(C.op)
xo, xi = C.op.axis
vadd = intrin_vadd(factor)
s[C].tensorize(xi, vadd)
print(tvm.lower(s, [A, B, C], simple_mode=True))
# tensor_op way
def tensor_op_vadd():
A = tvm.placeholder((m/factor, factor), name="A", dtype=dtype)
B = tvm.placeholder((m/factor, factor), name="B", dtype=dtype)
intrin = intrin_vadd(factor)
C = tvm.tensor_op([m/factor,], [factor,],
lambda i: [A[i, 0:factor], B[i, 0:factor]],
intrin, name='C')
s = tvm.create_schedule(C.op)
print(tvm.lower(s, [A, B, C], simple_mode=True))

link to the PR: #1476