Description
Target:
- Avoid the verification step in the original tensorize, which we found hard to make work for some patterns, e.g. distinguishing absolute and relative indexing.
- Achieve better encapsulation. Vendors can build external packages for their hardware, and users do not need to know the details while writing their TVM declarations.
- Propose a way to represent a block in a tensor that can be scheduled as a unit.
Motivation
The current tensorize method uses TensorIntrin as a contract. During lowering, it compares the body of the intrinsic's declaration with the body of the original declaration. Only after verifying that the two are identical is the body of the original declaration replaced by the intrinsic.
However, we found that it is not easy to guarantee/express this identity in some situations, especially those involving the value of an index. The problem is related to the implementation of tensorize: normally, tensorize replaces the indices in the intrinsic with the indices in the original declaration, so that the lowered code can be matched exactly. For example, if we want to tensorize an addition intrinsic:
# intrinsic declaration
for k in 0...16:
    A[k] = B[k] + C[k]
# original declaration
for i in 0...32:
    for j in 0...16:
        A[i*16+j] = B[i*16+j] + C[i*16+j]
We replace k in the intrinsic declaration with i*16+j, then compare whether the resulting IR is identical. However, if the intrinsic requires the actual value of an index (e.g. an absolute offset), things change: the verification will fail unless we can express absolute indexing and relative indexing clearly.
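The matching step above can be sketched as a toy model in plain Python (textual substitution on strings, not TVM's actual IR-level comparison):

```python
# Toy model of tensorize's verification: substitute the intrinsic's
# index variable with the outer index expression, then compare bodies.
def substitute(body, var, expr):
    # naive textual substitution; real tensorize does this on the IR
    return body.replace(var, "(" + expr + ")")

intrin_body = "A[k] = B[k] + C[k]"
orig_body   = "A[(i*16+j)] = B[(i*16+j)] + C[(i*16+j)]"

# After substitution the two bodies match, so tensorize accepts the pattern.
# When the intrinsic depends on the *value* of k (absolute indexing),
# this structural match is no longer sufficient.
matches = substitute(intrin_body, "k", "i*16+j") == orig_body
```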
Design
Instead of proposing a way to express absolute indexing, we would like to come up with generalizing tvm.compute to the region of tensor. That means we can view a tensor intrinsic as an operation between regions, just like + as an operation between elements.
- region: following numpy, we use a slice to express a region of a tensor: `A[i, 0:16]`
- tensor_op: unlike an `extern` operation, we want to keep the ability to schedule the part outside the tensorized region. This means there are two kinds of axes: 1. axes inside the tensorized region, which cannot be scheduled; 2. axes outside the tensorized region, which can be scheduled through `split`, `reorder`, etc.
With this generalization, we can use a tensor intrinsic in the declaration directly, so verification is no longer needed. This approach also provides better encapsulation: users do not need to know the details of the tensor intrinsic.
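As a rough analogy (plain numpy, not TVM), an intrinsic under this model acts on regions the way `+` acts on elements: the outer axis remains an ordinary, schedulable loop, while each iteration hands whole regions to the intrinsic.

```python
import numpy as np

# A hypothetical "vadd" intrinsic: adds two 1-D regions of length 16.
def vadd(x, y):
    return x + y

A = np.arange(32 * 16, dtype=np.float32).reshape(32, 16)
B = np.ones((32, 16), dtype=np.float32)

# The outer axis i is an ordinary loop (can still be scheduled);
# each iteration applies the intrinsic to the regions A[i, 0:16], B[i, 0:16].
C = np.empty_like(A)
for i in range(32):
    C[i, 0:16] = vadd(A[i, 0:16], B[i, 0:16])
```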
Interface
def tensor_op(out_dims,
in_dims,
finputs,
tensor_intrin,
reduce_axis=[],
name='tensor_op',
tag=""):
"""Construct new tensors with intrinsic.
Parameters
----------
out_dims: tuple
The dimensions out of the tensorized region, which can be scheduled through `reorder`, `split`.
in_dims: tuple
The dimensions inside of the tensorized region, which cannot be manipulated.
finputs: lambda function of out_dims -> list of TensorSlice
Specifies involved regions of input tensors.
tensor_intrin : TensorIntrin
The tensor intrinsic used for computation.
reduce_axis : list of IterVar, optional
The iteration variables for reduction.
name: str, optional
The name hint of the tensor
tag: str, optional
Additional tag information about the compute.
"""
Demo:
# tensor intrinsic (assumes dtype is defined elsewhere, e.g. dtype = 'float32')
def intrin_vadd(n):
x = tvm.placeholder((n,), dtype=dtype, name='vx')
y = tvm.placeholder((n,), dtype=dtype, name='vy')
z = tvm.compute(x.shape, lambda i: x[i] + y[i], name='z')
def intrin_func(ins, outs):
ib = tvm.ir_builder.create()
ib.emit(tvm.call_extern(outs[0].dtype, 'vadd',
                        ins[0].access_ptr("r"),
                        ins[1].access_ptr('r'),
                        outs[0].access_ptr('wr'),
                        8, 1, 1, 1, 8, 8, 8))
return ib.get()
return tvm.decl_tensor_intrin(z.op, intrin_func)
# tensorize way
def tensorize_vadd():
A = tvm.placeholder((m/factor, factor), name='A')
B = tvm.placeholder((m/factor, factor), name='B')
C = tvm.compute(A.shape, lambda *i: A(*i) + B(*i), name='C')
s = tvm.create_schedule(C.op)
xo, xi = C.op.axis
vadd = intrin_vadd(factor)
s[C].tensorize(xi, vadd)
print(tvm.lower(s, [A, B, C], simple_mode=True))
# tensor_op way
def tensor_op_vadd():
A = tvm.placeholder((m/factor, factor), name="A", dtype=dtype)
B = tvm.placeholder((m/factor, factor), name="B", dtype=dtype)
intrin = intrin_vadd(factor)
C = tvm.tensor_op([m/factor,], [factor,],
lambda i: [A[i, 0:factor], B[i, 0:factor]],
intrin, name='C')
s = tvm.create_schedule(C.op)
print(tvm.lower(s, [A, B, C], simple_mode=True))

link to the PR: #1476