
[RFC] Relay Dynamic Runtime #2810

@jroesch

Putting the VM in TVM: The Relay Virtual Machine

We have previously introduced Relay, a new program representation which can represent and optimize a greater breadth of machine learning programs. Unfortunately, by supporting a more expressive set of programs we introduced several new execution challenges. So far we have not provided a production-ready solution for executing full Relay programs, i.e. those containing control flow, abstraction (both data and functional), and dynamism.

We introduced a “debug” interpreter which performs naive AST traversal to execute the program. This approach is conceptually simple but requires traversal of the program for each evaluation. The program is stored as a tree, which makes heavy use of indirection and leads to inefficient execution.
There are also still open dynamism issues, such as dynamic scheduling and allocation, fully dynamic tensor shapes, and control flow. The interpreter has simple solutions for these, but none of them is compelling or optimized.

The second execution mechanism is the existing graph runtime. To target Relay programs at it, we translate a small subset of them to the old graph format and execute them on the runtime. This provides a solid execution experience, but only for a very limited subset of Relay programs.

Finally, we have developed an experimental ahead-of-time compiler, which transforms a Relay program into a shared library that implements it. This final approach provides compelling performance, but is hard to extend with new approaches to handling dynamic programs.

This RFC proposes a new virtual machine for Relay programs. The virtual machine is designed to strike a balance between performance and flexibility when deploying and executing Relay programs, without giving up the benefits of TVM.

Virtual machine (VM) design is a well-studied area in programming languages and systems, and there have been various virtual machine designs for both full-fledged and embedded programming languages. Previous language VM designs have been heavily tailored to the execution profile of traditional programs. Traditional programs manipulate small scalar values and consist of a large number of low-level instructions. The sheer quantity of instructions requires instruction execution and dispatch to be extremely efficient.

In the context of machine learning we manipulate primarily tensor values, using a (relatively) low number of high-level instructions. An ML program's cost centers are expensive operator invocations, such as GEMM or convolution, over large inputs. Due to the execution profile exhibited by ML programs, the micro-optimizations present in scalar VMs are dramatically less important: a model's runtime is dominated by executing expensive operators on large inputs.

TVM has provided strong support for vision models, but we want to grow to support a wider variety of models. The graph runtime is able to exploit the fully static nature of its input graphs to perform aggressive optimizations such as fully static allocation and optimal memory reuse. When we introduce models which make use of control flow, recursion, dynamic shapes, and dynamic allocation, we must change how execution works.

The rest of this design document focuses on explaining a VM that addresses these challenges and explores the design decisions that remain.

Proposed Design

I have been experimenting with different designs and discussing how
to solve this problem with members of the community for the past few months.

Our belief is that the most important design aspects will be optimizing for cheap “allocation” of objects (by trying to avoid real allocation), reuse of static fragments, and the ability to handle dynamic shapes (i.e. jagged tensors).

Instruction Set

The critical design choices for a VM are its instruction set and the representation of instructions.
The current representation of the instructions is a tagged union containing the op-code and the data payload. Important design decisions are the level of abstraction of the instructions (RISC vs. CISC) and how they take their data (fixed-width vs. variable-length encoding). The current version is closer to CISC, with complex instructions like AllocTensor, and is variable-length due to the inclusion of the shape as part of the instruction. The current instruction set is very high level and corresponds roughly to high-level operations in Relay.
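To make this concrete, below is a hedged sketch of what such a tagged-union encoding might look like; the op-code and field names here are illustrative, not the actual definitions from vm.h.

enum class Opcode {
  Push, Pop, Ret, InvokePacked, AllocTensor, AllocDatatype, AllocClosure,
  GetField, If, Goto, Invoke, InvokeClosure, LoadConst
};

struct Instruction {
  Opcode op;  // the tag
  union {     // the per-op data payload
    struct { size_t stack_index; } push;
    struct { size_t pop_count; } pop;
    struct { size_t packed_index; size_t arity; size_t output_size; } invoke_packed;
    struct { size_t true_branch; size_t false_branch; } if_branch;
    struct { size_t pc_offset; } goto_offset;
    // AllocTensor additionally carries its shape, which is what makes the
    // encoding variable-length rather than fixed-width.
  };
};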

Push

Arguments:

  • size_t stack_index

Reads the value at base pointer + stack_index and pushes it onto the stack.

Pop

Arguments:

  • size_t pop_count

Pops pop_count entries off the top of the stack.

Ret

Arguments:
None

Returns from the current function call, popping off the last value of
the frame stack and restoring the VM state that was recorded at the
last call site. The last value on the stack is interpreted as the
return value.

InvokePacked

Arguments:

  • size_t packed_index
  • size_t arity
  • size_t output_size

Invoke the packed function denoted by packed_index. The arity
and output size are used to inform the VM how many inputs and
outputs to expect.
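As a hedged illustration of the calling convention this assumes (a sketch, not the actual VM code): TVM packed operator functions use destination-passing style, so the outputs are pre-allocated tensors passed as trailing arguments. For example, with arity = 3 and output_size = 1:

#include <tvm/runtime/ndarray.h>
#include <tvm/runtime/packed_func.h>

// Hypothetical helper: pass two inputs and one pre-allocated output to a
// packed operator; the operator writes its result into `out`.
void CallPackedSketch(const tvm::runtime::PackedFunc& func,
                      const tvm::runtime::NDArray& in0,
                      const tvm::runtime::NDArray& in1,
                      const tvm::runtime::NDArray& out) {
  func(in0, in1, out);
}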

AllocTensor

Arguments:

  • const std::vector<int64_t>& shape
  • DLDataType dtype

Allocate a tensor value of the appropriate shape and dtype.

AllocDatatype

Arguments:

  • size_t tag
  • size_t num_fields

Allocate a data type with the tag tag using the
top num_fields entries on the stack as its fields.

AllocClosure

Arguments:

  • size_t func_index
  • size_t num_freevar

Allocate a closure with the VMFunction at func_index as
its code, and the num_freevar entries on the stack
as its free variables.

GetField

Arguments:

  • size_t object_offset
  • size_t field_index

Reads the object at stack position object_offset and pushes its field at
field_index onto the stack.

If

Arguments:

  • size_t true_branch
  • size_t false_branch

Check whether the top element on the stack is true or false. If true,
jump relative by true_branch; otherwise, jump relative by false_branch.

Goto

Arguments:

  • size_t pc_offset

Relative unconditional jump by pc_offset.
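As a hedged illustration of the relative-jump semantics of If and Goto (the offsets are made up for the example), a Relay expression if %cond { A } else { B } might lower to:

 0: If true_branch=1, false_branch=3   ; jump to pc 1 or pc 3
 1: <code for A>
 2: Goto pc_offset=2                   ; skip over the else branch to pc 4
 3: <code for B>
 4: <continuation>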

Invoke

Arguments:

  • size_t func_index

Invoke the function at func_index, consuming the number of arguments given by the
VMFunction's arity field, and place the return value on top of the stack.

InvokeClosure

Arguments:
None

Expects the top value on the stack to be a closure. Invokes the closure, consuming the
number of arguments declared in its VMFunction, and places the return value on the stack.

LoadConst

Arguments:

  • size_t const_index

Load the constant at const_index from the constant pool.

Object Representation

We use a simple object representation based on shared pointers and tagging.
There is a huge space of object representations to trade off here, but we
believe micro-optimizing this code will have little to no effect on end-to-end performance.

struct VMObjectCell {
  VMObjectTag tag;
  ...
};

struct VMObject {
  std::shared_ptr<VMObjectCell> ptr;
  ...
};

See vm.h for more details.

Currently we support three kinds of objects: tensors, datatypes, and closures.

VMObject VMTensor(const tvm::runtime::NDArray& data);
VMObject VMDatatype(size_t tag, const std::vector<VMObject>& fields);
VMObject VMClosure(size_t func_index, std::vector<VMObject> free_vars);
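As a hedged usage sketch of these constructors (the shape, dtype, tag, and function index are made up for the example, and NDArray::Empty is assumed from <tvm/runtime/ndarray.h>):

#include <tvm/runtime/ndarray.h>

tvm::runtime::NDArray data =
    tvm::runtime::NDArray::Empty({2, 2}, {kDLFloat, 32, 1}, {kDLCPU, 0});
VMObject tensor = VMTensor(data);                 // a tensor object
VMObject pair = VMDatatype(0, {tensor, tensor});  // e.g. a pair with tag 0
VMObject clo = VMClosure(3, {pair});              // closure over function 3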

Stack and State

The Relay VM consists of two important stacks: the value stack, which acts as the normal call
stack as used in C/C++/Java/etc., and the frame stack, which contains the information needed to
resume the previous call.

A stack machine is straightforward to implement. Register-based virtual machines are more
efficient in the scalar world, but we believe that in the tensor world the
complexity-for-performance tradeoff is not worth it.

For each call we keep track of the Relay function being executed, a pointer into its bytecode, an offset into that bytecode known as the program counter, and an offset into the value stack, known as the base pointer, which tells us where the current stack frame begins.

struct VirtualMachine {
    ...
    std::vector<VMFrame> frames;
    std::vector<VMObject> stack;
    ... 
    // Current function.
    size_t func_index;
    // Pointer into the current function's instructions.
    const Instruction* code;
    // Current program counter relative to the code pointer.
    size_t pc;
    // The current base pointer.
    size_t bp;
    ... 
};
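The frame stack entries record exactly this state; a hedged sketch (the field names are illustrative, see vm.h for the real definition):

struct VMFrame {
  size_t func_index;        // the caller's function
  const Instruction* code;  // the caller's bytecode
  size_t pc;                // where to resume in the caller
  size_t bp;                // the caller's base pointer into the value stack
};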

Dispatch Loop

A very critical piece of a VM is the dispatch loop. This usually dominates the execution time of a virtual machine, but experimentally we have found the performance of the loop not to be of much importance. We have implemented a simple switch-based dispatch loop which dispatches on the instruction op-code.

This loop is implemented by VirtualMachine::Run().
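A hedged sketch of its shape (assuming the illustrative Instruction layout above; PopFrame is a hypothetical helper that restores func_index, code, pc, and bp from the top frame):

void VirtualMachine::Run() {
  while (true) {
    const Instruction& instr = code[pc];
    switch (instr.op) {
      case Opcode::Goto:
        pc += instr.goto_offset.pc_offset;  // relative unconditional jump
        break;
      case Opcode::Ret:
        if (frames.empty()) return;  // outermost return: halt
        PopFrame();
        break;
      // ... one case per op-code; most execute their effect on the value
      // stack and then advance pc by one ...
      default:
        pc++;
        break;
    }
  }
}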

It is my belief that this code is not as important to end-to-end performance as allocation
and memory reuse.

VM Compiler

An important part of this infrastructure is the compiler from Relay's full IR into a sequence of bytecode.
The VM compiler transforms a tvm::relay::Module into a tvm::relay::vm::VirtualMachine. The virtual
machine contains a set of compiled functions, each a tvm::relay::vm::Function, which holds metadata about the function as well as its compiled bytecode. For the full definition of the data structures, see vm.h.
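A hedged sketch of what such a compiled function might carry (the field names are illustrative; the real definition is in vm.h):

struct VMFunction {
  std::string name;                       // metadata: the source function's name
  size_t params;                          // arity, consumed by Invoke
  std::vector<Instruction> instructions;  // the compiled bytecode
};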

Optimizations

There are quite a few optimizations required by the VM compiler.

We have implemented them in the old pass style, but plan to port them to
the new pass manager (#2546) before merging.

  • A-Normal Form (a small example follows this list)
  • Lambda Lift (see src/relay/vm/lambda_lift.cc)
  • Inline Primitives (see src/relay/vm/inline_primitives.cc)
  • Inliner (see src/relay/pass/inliner.cc)
  • Tail Call Optimization (see ...)
  • Constant Pool Layout (see ...)
  • ADT Tag Allocation (see ...)
  • Liveness Analysis (see ...)
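For example, conversion to A-Normal Form names every intermediate value, so an expression like add(multiply(%x, %y), %z) becomes (Relay-style pseudocode):

let %t1 = multiply(%x, %y);
let %t2 = add(%t1, %z);
%t2

This makes evaluation order explicit and gives the compiler a name for every intermediate value that must be allocated.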

Serialization

A final and yet-to-be-implemented part of the VM design is serialization. The accompanying PR will introduce the bytecode, its serialization, and VM-level serialization. The idea is that a VM can be efficiently stored to disk and resumed at a later time. This would also allow us to efficiently schedule many models onto a single machine in order to obtain good utilization.

Unresolved Questions

How do we handle dynamic shapes?

I have another prototype extension to Relay which adds initial support for compiling and executing programs containing fully dynamic shapes. I will post an RFC and prototype PR on this subject soon.

How can we modify the VM to support JIT compilation of certain code paths?

In the code-generation space there are still many tradeoffs to be analyzed, and the VM is designed
to be very flexible so that we can modify it for future experiments.

How do we support heterogeneous execution?

Heterogeneous execution should work out of the box, assuming we have annotated the appropriate device copies.
In order to do this properly we need to run the device annotation and copying passes. We foresee nothing too complex in this work.
