
[RFC] Relay Dynamic Runtime #2810

@jroesch

Putting the VM in TVM: The Relay Virtual Machine

We have previously introduced Relay, a new program representation which can represent and optimize a greater breadth of machine learning programs. Unfortunately, by supporting a more expressive set of programs we introduced several new execution challenges. So far we have not provided a production-ready solution for executing full Relay programs, i.e. those containing control flow, abstraction (both data and functional), and dynamism.

We introduced a “debug” interpreter which performs naive AST traversal to execute the program. This approach is conceptually simple but requires traversal of the program for each evaluation. The program is stored as a tree, which makes heavy use of indirection and leads to inefficient execution.
There are also still open dynamism issues, such as dynamic scheduling and allocation, fully dynamic tensor shapes, and control flow. The interpreter has simple solutions for these, but none of them is compelling or optimized.

The second execution mechanism is the existing graph runtime. To target Relay programs at it, we translate a small subset of them to the old graph format and execute them on the runtime. This provides a solid execution experience, but only for a very limited subset of Relay programs.

Finally, we have developed an experimental ahead-of-time compiler, which transforms a Relay program into a shared library that implements it. This final approach provides compelling performance, but is hard to extend with new approaches to handling dynamic programs.

This RFC proposes a new virtual machine for Relay programs. The virtual machine is designed to strike a balance between performance and flexibility when deploying and executing Relay programs, without giving up the benefits of TVM.

Virtual machine (VM) design is a well-studied area in programming languages and systems, and there have been various virtual machine designs for both full-fledged and embedded programming languages. Previous language VM designs have been heavily tailored to the execution profile of traditional programs. Traditional programs manipulate small scalar values and consist of a large number of low-level instructions. The sheer quantity of instructions requires instruction execution and dispatch to be extremely efficient.

In the context of machine learning we manipulate primarily tensor values, using a (relatively) low number of high-level instructions. An ML program's cost centers are expensive operator invocations, such as GEMM or convolution, over large inputs. Due to the execution profile exhibited by ML programs, the micro-optimizations present in scalar VMs are dramatically less important: a model's runtime is dominated by executing expensive operators on large inputs.

TVM has provided strong support for vision models, but we want to grow to support a wider variety of models. The graph runtime is able to exploit the fully static nature of its input graphs to perform aggressive optimizations such as fully static allocation and optimal memory reuse. When we introduce models which make use of control flow, recursion, dynamic shapes, and dynamic allocation, we must change how execution works.

The rest of this design document focuses on explaining a VM that addresses these challenges and explores the design decisions that remain.

Proposed Design

I have been experimenting with different designs and discussing how
to solve this problem with members of the community for the past few months.

Our belief is that the most important design aspects will be optimizing for cheap “allocation” of objects (by trying to avoid real allocation), reuse of static fragments, and the ability to handle dynamic shapes (i.e. jagged tensors).

Instruction Set

The critical design choices for a VM are its instruction set and the representation of instructions.
The current representation of the instructions is a tagged union containing the op-code and the data payload. Important design decisions are the level of abstraction of the instructions (RISC vs. CISC) and how they take their data (fixed-width vs. variable-length encoding). The current version is closer to CISC, with complex instructions like AllocTensor, and is variable-length due to the inclusion of the shape as part of the instruction. The current instruction set is very high level and corresponds roughly to high-level operations in Relay.
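To make this concrete, below is a hedged sketch of what such a tagged-union encoding might look like; the op-code and field names here are illustrative, not the actual definitions from vm.h.

enum class Opcode {
  Push, Pop, Ret, InvokePacked, AllocTensor, AllocDatatype, AllocClosure,
  GetField, If, Goto, Invoke, InvokeClosure, LoadConst
};

struct Instruction {
  Opcode op;  // the tag
  union {     // the per-op data payload
    struct { size_t stack_index; } push;
    struct { size_t pop_count; } pop;
    struct { size_t packed_index; size_t arity; size_t output_size; } invoke_packed;
    struct { size_t true_branch; size_t false_branch; } if_branch;
    struct { size_t pc_offset; } goto_offset;
    // AllocTensor additionally carries its shape, which is what makes the
    // encoding variable-length rather than fixed-width.
  };
};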

Push

Arguments:

  • size_t stack_index

Reads the value at base pointer + stack_index and pushes it onto the stack.

Pop

Arguments:

  • size_t pop_count

Pops pop_count entries off the top of the stack.

Ret

Arguments:
None

Returns from the current function call, popping off the last value of
the frame stack and restoring the VM state that was recorded at the
last call site. The last value on the stack is interpreted as the
return value.

InvokePacked

Arguments:

  • size_t packed_index
  • size_t arity
  • size_t output_size

Invoke the packed function denoted by packed_index. The arity
and output size are used to inform the VM how many inputs and
outputs to expect.
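As a hedged illustration of the calling convention this assumes (a sketch, not the actual VM code): TVM packed operator functions use destination-passing style, so the outputs are pre-allocated tensors passed as trailing arguments. For example, with arity = 3 and output_size = 1:

#include <tvm/runtime/ndarray.h>
#include <tvm/runtime/packed_func.h>

// Hypothetical helper: pass two inputs and one pre-allocated output to a
// packed operator; the operator writes its result into `out`.
void CallPackedSketch(const tvm::runtime::PackedFunc& func,
                      const tvm::runtime::NDArray& in0,
                      const tvm::runtime::NDArray& in1,
                      const tvm::runtime::NDArray& out) {
  func(in0, in1, out);
}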

AllocTensor

Arguments:

  • const std::vector<int64_t>& shape
  • DLDataType dtype

Allocate a tensor value of the appropriate shape and dtype.

AllocDatatype

Arguments:

  • size_t tag
  • size_t num_fields

Allocate a data type with the tag tag using the
top num_fields entries on the stack as its fields.

AllocClosure

Arguments:

  • size_t func_index
  • size_t num_freevar

Allocate a closure with the VMFunction at func_index as
its code, and the num_freevar entries on the stack
as its free variables.

GetField

Arguments:

  • size_t object_offset
  • size_t field_index

Reads the object at stack position object_offset and pushes its field at
field_index onto the stack.

If

Arguments:

  • size_t true_branch
  • size_t false_branch

Check whether the top element on the stack is true or false. If true,
jump relative by true_branch; otherwise, jump relative by false_branch.

Goto

Arguments:

  • size_t pc_offset

Relative unconditional jump by pc_offset.
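As a hedged illustration of the relative-jump semantics of If and Goto (the offsets are made up for the example), a Relay expression if %cond { A } else { B } might lower to:

 0: If true_branch=1, false_branch=3   ; jump to pc 1 or pc 3
 1: <code for A>
 2: Goto pc_offset=2                   ; skip over the else branch to pc 4
 3: <code for B>
 4: <continuation>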

Invoke

Arguments:

  • size_t func_index

Invoke the function at func_index, consuming the number of arguments given by the
VMFunction's arity field, and place the return value on top of the stack.

InvokeClosure

Arguments:
None

Expects the top value on the stack to be a closure. Invokes the closure, consuming the
number of arguments declared in its VMFunction, and places the return value on the stack.

LoadConst

Arguments:

  • size_t const_index

Load the constant at const_index from the constant pool.

Object Representation

We use a simple object representation based on shared pointers and tagging.
There is a huge space of object representations to trade off here, but we
believe micro-optimizing this code will have little to no effect on end-to-end performance.

struct VMObjectCell {
  VMObjectTag tag;
  ...
};

struct VMObject {
  std::shared_ptr<VMObjectCell> ptr;
  ...
};

See vm.h for more details.

Currently we support three kinds of objects: tensors, datatypes, and closures.

VMObject VMTensor(const tvm::runtime::NDArray& data);
VMObject VMDatatype(size_t tag, const std::vector<VMObject>& fields);
VMObject VMClosure(size_t func_index, std::vector<VMObject> free_vars);
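As a hedged usage sketch of these constructors (the shape, dtype, tag, and function index are made up for the example, and NDArray::Empty is assumed from <tvm/runtime/ndarray.h>):

#include <tvm/runtime/ndarray.h>

tvm::runtime::NDArray data =
    tvm::runtime::NDArray::Empty({2, 2}, {kDLFloat, 32, 1}, {kDLCPU, 0});
VMObject tensor = VMTensor(data);                 // a tensor object
VMObject pair = VMDatatype(0, {tensor, tensor});  // e.g. a pair with tag 0
VMObject clo = VMClosure(3, {pair});              // closure over function 3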

Stack and State

The Relay VM consists of two important stacks: the value stack, which acts as the normal call
stack as used in C/C++/Java/etc., and the frame stack, which contains the information needed to
resume the previous call.

A stack machine is straightforward to implement. Register-based virtual machines are more
efficient in the scalar world, but we believe that in the tensor world the
complexity-for-performance tradeoff is not worth it.

For each call we keep track of the Relay function being executed, a pointer into its bytecode, an offset into that bytecode known as the program counter, and an offset into the value stack, known as the base pointer, which tells us where the current stack frame begins.

struct VirtualMachine {
    ...
    std::vector<VMFrame> frames;
    std::vector<VMObject> stack;
    ... 
    // Current function.
    size_t func_index;
    // Pointer into the current function's instructions.
    const Instruction* code;
    // Current program counter relative to the code pointer.
    size_t pc;
    // The current base pointer.
    size_t bp;
    ... 
};
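The frame stack entries record exactly this state; a hedged sketch (the field names are illustrative, see vm.h for the real definition):

struct VMFrame {
  size_t func_index;        // the caller's function
  const Instruction* code;  // the caller's bytecode
  size_t pc;                // where to resume in the caller
  size_t bp;                // the caller's base pointer into the value stack
};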

Dispatch Loop

A very critical piece of a VM is the dispatch loop. This usually dominates the execution time of a virtual machine, but experimentally we have found the performance of the loop not to be of much importance. We have implemented a simple switch-based dispatch loop which dispatches on the instruction op-code.

This loop is implemented by VirtualMachine::Run().
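A hedged sketch of its shape (assuming the illustrative Instruction layout above; PopFrame is a hypothetical helper that restores func_index, code, pc, and bp from the top frame):

void VirtualMachine::Run() {
  while (true) {
    const Instruction& instr = code[pc];
    switch (instr.op) {
      case Opcode::Goto:
        pc += instr.goto_offset.pc_offset;  // relative unconditional jump
        break;
      case Opcode::Ret:
        if (frames.empty()) return;  // outermost return: halt
        PopFrame();
        break;
      // ... one case per op-code; most execute their effect on the value
      // stack and then advance pc by one ...
      default:
        pc++;
        break;
    }
  }
}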

It is my belief that this code is not as important to end-to-end performance as allocation
and memory reuse.

VM Compiler

An important part of this infrastructure is the compiler from Relay's full IR into a sequence of bytecode.
The VM compiler transforms a tvm::relay::Module into a tvm::relay::vm::VirtualMachine. The virtual
machine contains a set of compiled functions, each a tvm::relay::vm::Function, which holds metadata about the function as well as its compiled bytecode. For the full definition of the data structures, see vm.h.
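A hedged sketch of what such a compiled function might carry (the field names are illustrative; the real definition is in vm.h):

struct VMFunction {
  std::string name;                       // metadata: the source function's name
  size_t params;                          // arity, consumed by Invoke
  std::vector<Instruction> instructions;  // the compiled bytecode
};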

Optimizations

There are quite a few optimizations required by the VM compiler.

We have implemented them in the old pass style, but plan to port them to
the new pass manager (#2546) before merging.

  • A-Normal Form (a small example follows this list)
  • Lambda Lift (see src/relay/vm/lambda_lift.cc)
  • Inline Primitives (see src/relay/vm/inline_primitives.cc)
  • Inliner (see src/relay/pass/inliner.cc)
  • Tail Call Optimization (see ...)
  • Constant Pool Layout (see ...)
  • ADT Tag Allocation (see ...)
  • Liveness Analysis (see ...)
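For example, conversion to A-Normal Form names every intermediate value, so an expression like add(multiply(%x, %y), %z) becomes (Relay-style pseudocode):

let %t1 = multiply(%x, %y);
let %t2 = add(%t1, %z);
%t2

This makes evaluation order explicit and gives the compiler a name for every intermediate value that must be allocated.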

Serialization

A final and yet-to-be-implemented part of the VM design is serialization. The accompanying PR will introduce the bytecode, its serialization, and VM-level serialization. The idea is that a VM can be efficiently stored to disk and resumed at a later time. This would also allow us to efficiently schedule many models onto a single machine in order to obtain good utilization.

Unresolved Questions

How do we handle dynamic shapes?

I have another prototype extension to Relay which adds initial support for compiling and executing programs containing fully dynamic shapes. I will post an RFC and prototype PR on this subject soon.

How can we modify the VM to support JIT compilation of certain code paths?

In the code-generation space there are still many tradeoffs to be analyzed, and the VM is designed
to be very flexible so that we can modify it for future experiments.

How do we support heterogeneous execution?

Heterogeneous execution should work out of the box, assuming we have annotated the appropriate device copies.
In order to do this properly we need to run the device annotation and copying passes. We foresee nothing too complex in this work.
