[RFC][μTVM] Bringing TVM to Bare-Metal Devices #2563

@mutinifni

There has been a proliferation of resource-constrained and embedded devices that do not have operating systems or a mature software stack. This trend is likely to continue with the shift towards hardware specialization and the growing interest in open-source hardware, for which software support takes time to develop. Running ML and DL applications on such devices will lead to opportunities for faster and broader impact. This entails the following challenges:

  • Bare-metal devices usually do not have on-device memory management.
  • They typically do not support LLVM and it may be hard to develop custom IR passes for them.
  • They are hard to debug because of a rigid programming and cross-compilation interface.
  • Some of these devices may not have RPC or network support.
  • They are hard to optimize for and compare against, because efficient operator libraries typically do not exist.

[Figure: the TVM stack, with the layers extended by μTVM highlighted]

This RFC proposes a way to support TVM on such bare-metal devices by extending the stack at the layers highlighted in the figure above. I have already pushed C codegen support, and have developed and tested an initial μTVM implementation by emulating a device with an allocated region of memory on the host machine. Next, @weberlo and I will continue this work to support devices that expose a JTAG debugging interface (including RISC-V backends), and develop an optimization framework based on AutoTVM.

Proposed Design

We envision supporting bare-metal devices in two steps. The first step is to separate the control and data planes: the control plane runs on the host and drives execution on the board, which lets us reuse TVM's existing interfaces for AutoTVM optimizations and easy Python-based device programming. The second step is to create a minimal TVM runtime that runs independently on the board without requiring a host driver, which will make μTVM suitable for deployment. This RFC focuses on the first step.

Because LLVM is not as ubiquitous as standard C, we started out by building a gcc / g++ code generation backend for TVM, which has already been upstreamed (#2161). However, μTVM’s design is independent of the code generator backend used, as long as it is supported for the target device.

Overview

μTVM will contain the following components:

  • Python frontend: Set of modules that let us program bare-metal devices using TVM’s Python DSL.
  • Code generator: TVM module that can generate executable code for the target device. Will typically be one of C, C++ or LLVM.
  • LowLevelDeviceAPI: A minimal read/write/execute interface that any bare-metal device must expose in order to work with μTVM.
  • MicroDeviceAPI: Child class of TVM’s DeviceAPI that performs on-device memory management and provides helper functions to copy memory to and from device by using a LowLevelDeviceAPI backend.
  • MicroModule: Child class of TVM’s ModuleNode which provides functionality to load a compiled library onto the device and execute its functions, also by using a LowLevelDeviceAPI backend.

[Figure: interaction between μTVM's components when driving a RISC-V board through OpenOCD]

The figure above shows how μTVM's components interact with each other when using an OpenOCD-based connection to a RISC-V board. The board is connected to the OpenOCD server through a JTAG debugger. The μTVM runtime attaches to the OpenOCD server over a Tcl socket connection, which lets the runtime read, write, and execute from on-device memory. To actually program the board, the C file produced by the code generator is compiled into an object file using a vendor-specific GCC (in this case riscv-gcc). μTVM reads this binary and remaps it to on-device addresses by dumping the relevant program sections and re-linking them with the ld linker. The ELF sections of this remapped binary are then written to the board along with device-side driver code (an "init stub" similar to a bootloader). On-device function calls are then made by pointing the init stub at the correct function pointer and passing the appropriate arguments.
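
To make the compile-and-remap step above concrete, the following is a rough sketch of the host-side binary preparation, assuming a RISC-V GNU toolchain on the PATH; the tool names, flags, section addresses, and the PrepareDeviceBinary helper are illustrative rather than μTVM's exact invocation.

#include <cstdlib>
#include <string>

// Cross-compile the generated C source, re-link it against the device's memory
// map, and dump each section as a raw blob that can later be written to the
// device through the low-level read/write interface.
void PrepareDeviceBinary(const std::string& c_src) {
  // Compile the generated C file into a relocatable object.
  std::system(("riscv64-unknown-elf-gcc -c " + c_src + " -o utvm_lib.o").c_str());
  // Re-link so that every section lands at a known on-device address
  // (placeholder addresses shown here).
  std::system("riscv64-unknown-elf-ld utvm_lib.o -o utvm_lib.elf "
              "-Ttext=0x10000 -Tdata=0x20000 -Tbss=0x30000");
  // Extract the raw contents of the relocated sections.
  std::system("riscv64-unknown-elf-objcopy -O binary -j .text utvm_lib.elf text.bin");
  std::system("riscv64-unknown-elf-objcopy -O binary -j .data utvm_lib.elf data.bin");
}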

Frontend

The Python frontend for μTVM will look the same as for any other device interface. For example, the code segment below performs a vector add on the bare-metal device.

        ...
        # Load the cross-compiled library through the OpenOCD low-level device.
        m = tvm.module.load("test.obj", "openocd")
        fadd = m['fadd']
        # Create a micro-device context and allocate input/output tensors on it.
        ctx = tvm.micro_dev(0)
        n = 1024
        a = tvm.nd.array(np.random.uniform(size=n).astype(A.dtype), ctx)
        b = tvm.nd.array(np.random.uniform(size=n).astype(B.dtype), ctx)
        c = tvm.nd.array(np.zeros(n, dtype=C.dtype), ctx)
        # Invoke the vector add on the device; results are written into c.
        fadd(a, b, c)
        ...

Device-Side Driver Code

In order to invoke functions on the board, we implement a device-side driver (similar to an init stub or a bootloader) whose function body simply invokes a function pointer with type-erased arguments. These arguments are set by the μTVM runtime whenever it needs to invoke a function on the board. The driver code is uploaded when the target device is first initialized by μTVM and is never erased. For each function invocation, the runtime rewrites the driver's function pointer with the address of the target function, copies the function arguments to a dedicated memory region on the device, and then executes the function by calling the driver code.
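
A minimal sketch of such an init stub is shown below. The packed-function signature, global names, and memory layout are assumptions for illustration; the real driver would match whatever calling convention the C code generator emits.

#include <cstdint>

// Hypothetical signature of a generated device function: type-erased packed
// arguments, their type codes, and the argument count.
typedef int32_t (*UTVMFunc)(void* args, int* type_codes, int num_args);

// These globals live at fixed, linker-assigned addresses, so the host runtime
// can patch them via Write() before each invocation.
UTVMFunc utvm_func   = nullptr;  // target function, rewritten by the host
void*    utvm_args   = nullptr;  // on-device arguments section
int*     utvm_tcodes = nullptr;  // argument type codes
int      utvm_nargs  = 0;        // number of arguments
int32_t  utvm_ret    = 0;        // return code, read back by the host

// The init stub itself: the host starts execution here, and the stub simply
// forwards the type-erased arguments to the target function.
extern "C" void UTVMMain() {
  utvm_ret = utvm_func(utvm_args, utvm_tcodes, utvm_nargs);
}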

Backend

The μTVM backend has multiple components, as described below.

LowLevelDeviceAPI

First, to enable TVM to read from and write to on-device memory, and to start execution on the device, we introduce a new interface, low_level_device_api.h. This is the only interface a micro device must implement to be supported by TVM. For example, the implementation could be an emulated region of memory on the host (HostLowLevelDeviceAPI) or an interface exposed through on-host JTAG controller software (OpenOCDLowLevelDeviceAPI).

The code below shows the LowLevelDeviceAPI interface.

class LowLevelDeviceAPI {
 public:
  virtual ~LowLevelDeviceAPI() {}

  // Write num_bytes from the host buffer buf to device memory at offset.
  virtual void Write(TVMContext ctx, void* offset, uint8_t* buf, size_t num_bytes) = 0;

  // Read num_bytes from device memory at offset into the host buffer buf.
  virtual void Read(TVMContext ctx, void* offset, uint8_t* buf, size_t num_bytes) = 0;

  // Begin execution at offset, with packed arguments and a slot for the return value.
  virtual void Execute(TVMContext ctx, TVMArgs args, TVMRetValue *rv, void* offset) = 0;

  // Reset the device to a known state.
  virtual void Reset(TVMContext ctx) = 0;
  ...
};
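
As a concrete (if simplified) example, a host-emulated implementation of this interface might look like the sketch below, assuming the "device" is just an executable region of host memory; the class name and the mmap-based setup are illustrative.

#include <cstdint>
#include <cstring>
#include <sys/mman.h>

class HostLowLevelDeviceAPI : public LowLevelDeviceAPI {
 public:
  explicit HostLowLevelDeviceAPI(size_t size) : size_(size) {
    // Executable so that loaded .text sections can later be run in place.
    base_ = static_cast<uint8_t*>(
        mmap(nullptr, size, PROT_READ | PROT_WRITE | PROT_EXEC,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
  }

  void Write(TVMContext ctx, void* offset, uint8_t* buf, size_t num_bytes) final {
    std::memcpy(base_ + reinterpret_cast<uintptr_t>(offset), buf, num_bytes);
  }

  void Read(TVMContext ctx, void* offset, uint8_t* buf, size_t num_bytes) final {
    std::memcpy(buf, base_ + reinterpret_cast<uintptr_t>(offset), num_bytes);
  }

  // Execute() and Reset() are omitted here; see the host-emulation discussion below.

 private:
  uint8_t* base_;
  size_t size_;
};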

In order to communicate with a wide variety of target devices, we plan to use OpenOCD, a debugging tool for bare-metal devices. OpenOCD lets us program devices that support JTAG hardware debuggers by providing an interface to upload binaries, read and write memory, etc. This is done by sending OpenOCD instructions over a Tcl socket connection. TVM will therefore implement the above interface in OpenOCDLowLevelDeviceAPI in order to communicate with OpenOCD. The only peculiarity of this setup is that OpenOCD must be running (separately from TVM) and connected to the target device, and this must be set up by the user. One open question is how to configure the connection between TVM and OpenOCD (i.e., figure out which port the OpenOCD process is listening on) without resorting to hardcoding. Our current approach is for the user to pass the port as a parameter when initializing the device in the Python frontend, e.g., tvm.micro_dev(dev_id = 0, openocd_port = 6666).
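
For instance, the Write method of OpenOCDLowLevelDeviceAPI might be layered on OpenOCD's Tcl RPC server roughly as sketched below. The TclSocket helper is hypothetical; OpenOCD's Tcl server terminates commands and replies with a 0x1a byte, and a real implementation would batch transfers rather than writing one word at a time.

#include <algorithm>
#include <cstdint>
#include <cstring>
#include <sstream>
#include <string>

// Writes num_bytes of buf to device memory starting at dev_addr, one 32-bit
// word per `mww` command. Word-granular writes keep the sketch simple.
void OpenOCDWrite(TclSocket* sock, uintptr_t dev_addr,
                  const uint8_t* buf, size_t num_bytes) {
  for (size_t i = 0; i < num_bytes; i += 4) {
    uint32_t word = 0;
    std::memcpy(&word, buf + i, std::min<size_t>(4, num_bytes - i));
    std::ostringstream cmd;
    cmd << "mww 0x" << std::hex << (dev_addr + i) << " 0x" << word;
    sock->SendCommand(cmd.str() + '\x1a');  // hypothetical socket wrapper
    sock->ReadReply();                      // reply is also 0x1a-terminated
  }
}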

MicroDeviceAPI and MicroModule

Next, we extend TVM's DeviceAPI and ModuleNode classes to support bare-metal devices. The MicroDeviceAPI class adds functions that perform memory management for the target low-level device and allow data to be copied to and from it. The MicroModule class adds support for loading and managing ELF sections in device memory, finding the correct function pointers, obtaining symbol addresses, etc. Together, these allow TVM to support any new bare-metal device backend. In the next few paragraphs, we describe how, using the LowLevelDeviceAPI, we implement memory management, library loading, and function invocation.

Because our target bare-metal devices may have tight memory constraints, we manage device memory on the host. Doing so avoids the on-device space overhead of memory management bookkeeping structures and functions. For example, if we want to allocate and populate a DLTensor, we find a suitable memory region on the device (by calling into our on-host memory manager), then use the Write method to populate that region. The tensor on the host only stores metadata (e.g., shape, device data pointer) and queries for the on-device data whenever the frontend needs it.
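
A minimal sketch of such an on-host memory manager is shown below, assuming a simple bump allocator over a fixed device memory region; the class name and the lack of freeing or compaction are simplifications.

#include <cstdint>
#include <stdexcept>

class DeviceMemoryManager {
 public:
  DeviceMemoryManager(uintptr_t start, size_t size)
      : cur_(start), end_(start + size) {}

  // Reserves num_bytes of device memory and returns its device address. The
  // host only records where data will live; the device keeps no bookkeeping.
  uintptr_t Allocate(size_t num_bytes, size_t align = 8) {
    uintptr_t addr = (cur_ + align - 1) & ~(align - 1);
    if (addr + num_bytes > end_) throw std::runtime_error("out of device memory");
    cur_ = addr + num_bytes;
    return addr;
  }

 private:
  uintptr_t cur_;  // next free device address
  uintptr_t end_;  // end of the managed device region
};

Populating a DLTensor then amounts to an Allocate followed by a Write to the returned device address.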

To load library functions onto the device, we dump the ELF sections of the compiled binary and copy them into memory regions on the board. To remove a library, we simply overwrite these regions with zeros. This allows us to support multiple libraries simultaneously.
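
Sketched below is how loading and unloading might look on top of the LowLevelDeviceAPI, assuming each section has already been dumped to a blob and assigned the device address it was re-linked against; the Section struct and function names are illustrative.

#include <cstdint>
#include <vector>

struct Section {
  uintptr_t dev_addr;          // device address the section was linked against
  std::vector<uint8_t> data;   // raw section contents dumped from the binary
};

// Copy each section to its device address.
void LoadSections(LowLevelDeviceAPI* dev, TVMContext ctx,
                  const std::vector<Section>& sections) {
  for (const Section& s : sections) {
    dev->Write(ctx, reinterpret_cast<void*>(s.dev_addr),
               const_cast<uint8_t*>(s.data.data()), s.data.size());
  }
}

// "Unloading" a library just zero-fills the same regions.
void UnloadSections(LowLevelDeviceAPI* dev, TVMContext ctx,
                    const std::vector<Section>& sections) {
  for (const Section& s : sections) {
    std::vector<uint8_t> zeros(s.data.size(), 0);
    dev->Write(ctx, reinterpret_cast<void*>(s.dev_addr),
               zeros.data(), zeros.size());
  }
}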

To invoke a function, as described above, we write the target function's address to the pointer used by the device driver and copy the function arguments to a dedicated section in the device's address space. Once this data has been written, we use the Execute method to call the target function through the device-side driver, which reads the arguments from the arguments section, executes the function body, then writes any results back to the arguments section (because arguments are passed by memory reference). Whenever output data is read in the frontend, the runtime simply copies the relevant arguments back from the target device. Return values from functions are not supported; they are also unnecessary, since data can be transferred through the by-reference arguments.
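
Putting the pieces together, a single invocation from the host might look roughly like the sketch below, under the same assumptions as the init-stub example above; the symbol addresses are placeholders that would really come from the linked driver binary, and host and device pointer widths are assumed to match.

#include <cstdint>

void InvokeOnDevice(LowLevelDeviceAPI* dev, TVMContext ctx, TVMArgs args,
                    TVMRetValue* rv, uintptr_t target_func,
                    const uint8_t* packed_args, size_t args_size) {
  const uintptr_t kArgsSection = 0x20000;  // dedicated arguments region
  const uintptr_t kFuncPtrAddr = 0x10010;  // address of the stub's utvm_func
  const uintptr_t kStubAddr    = 0x10000;  // entry point of UTVMMain

  // 1. Copy the type-erased arguments into the device's arguments section.
  dev->Write(ctx, reinterpret_cast<void*>(kArgsSection),
             const_cast<uint8_t*>(packed_args), args_size);
  // 2. Patch the driver's function pointer with the target function's address.
  dev->Write(ctx, reinterpret_cast<void*>(kFuncPtrAddr),
             reinterpret_cast<uint8_t*>(&target_func), sizeof(target_func));
  // 3. Start execution at the init stub; outputs land back in the arguments
  //    section and are later retrieved with Read().
  dev->Execute(ctx, args, rv, reinterpret_cast<void*>(kStubAddr));
}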

Bare-metal devices are normally very difficult to debug and will crash without providing informative error messages. Because of the difficulty of this workflow, we have already implemented a HostMicroDevice for a smoother debugging experience and faster iteration on experimental features. The HostMicroDevice emulates a bare-metal device by allocating an executable memory region on the host and treating it as the target device. All data copying to and from this region is done the same way it would be for a real target device. The only difference lies in how functions are invoked: on the host, a function is simply called by casting the appropriate memory region to a function pointer and calling it as usual. The next step is to develop an OpenOCD-compatible MicroDeviceAPI and MicroModule.
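
On the host-emulated device, the Execute step reduces to a function-pointer cast, roughly as sketched below; the packed-function signature matches the hypothetical one used in the init-stub example.

#include <cstdint>

// Call a function that was loaded at func_offset within the emulated device
// memory region starting at base.
void HostExecute(uint8_t* base, uintptr_t func_offset,
                 void* args, int* type_codes, int num_args) {
  typedef int32_t (*PackedCFunc)(void*, int*, int);
  PackedCFunc f = reinterpret_cast<PackedCFunc>(base + func_offset);
  f(args, type_codes, num_args);
}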

Design Benefits

Our proposed plan will enable the following benefits for TVM:

  • Bare-metal backend support that depends only on read, write, and execute commands.
  • Remote driving of execution on the target device by separating out the control plane and running it on the host, leading to greater flexibility in device management.
  • Easier debugging with a Python-based programming interface.
  • RISC-V support.
  • JTAG-based devices supported by OpenOCD potentially also become TVM backends.
  • Reuse of TVM's existing infrastructure for optimizations.

Roadmap

  • add a gcc / g++ code generation backend ([BACKEND][CODEGEN] C codegen with tests #2161)
  • test idea on emulated host memory region (PR in a week)
  • implement a low-level OpenOCD device (1 week)
  • test on RISC-V Spike simulator (1 week)
  • test on actual hardware backends (1 week)
  • test AutoTVM-based model optimizations (2 weeks)

Comments welcome!
