Skip to content

Intermediate representation

Jiří Fatka edited this page Jan 2, 2019 · 3 revisions

(WIP) Intermediate representation of source code.

Introduction

The intermediate representation (IR) is necessary to support language extensibility because it brings new AST nodes and without IR interpreter and compiler must be extended to support that. Using IR allows to translate AST nodes into well defined representation which is understod by interpreter or compiler.

The IR can be designed for stack or register based processors. The main difference is the stack based using stack for computation so for adding two values it must move two values on top of stack and then call add operation and result is stored on stack top. The register based are using a set of registers for computation where values are moved to registers and then operation uses specified registers to perform an operation.

The IR is based on LLVM IR and uses SSA form. There is no registers or stack just values which can be later allocated to registers or stack (by interpreter or compiler). Full SSA form is not used because expressing something in SSA form can be hard. Simplifications can be done using alloc and store instructions where store doesn't create new value but modifies a value at allocated address. Thank to this there is no need for Phi instruction (which is harder to understand).

Types

There are two kinds of types: fundamental and aggregate.

Fundamental

Fundamental types are integer, floating point and pointer types and aggregate are other types like structs. Currently only supported integer types are: int1, int8, int16, int32, int64 and float types: float32 and float64.

Pointers

Pointer types are designed for storing other value address, some place in system memory. Always points to value with specified type and is recognized by star * suffix (e.g. int8*). Can be chained to more levels like int8***. The 0 address aka null pointer is considered as invalid address and should not be used for storing and loading value from it.

Structures

Structures are aggregates of fundamental, pointers or other aggregate types.

Blocks

Blocks are basic building primitives of functions. All defined functions must have at least one block. Function body transfer control flow between blocks so all blocks must end with control flow instruction like branch or return.

Functions

Functions are basic primitives which operates on program data. Without functions there are nothing to do.

Values

Values represents program data. Each value have defined type and both value and type is immutable.

Memory

The IR doesn't offer any memory allocation support except local variable allocation via alloc. Can be implemented by calling system functions.

Instructions

A list of supported instructions. For readability it's written in human readable form which is not same representation used for storing IR code. In most cases the instructions have following scheme:

[ <result> = ] <name> [ <attributes...> ] [ <operands...> ]

The result value <result> is optional, some instruction doesn't yield a value. The <name> is instruction name and is required. The attributes <attributes...> is a list of space separated values which specify the instruction, e.g. operands type. The operands <operands...> are comma separated values that specify instruction operands.

%1 = load int32 %0, 3

This represents instruction of name load with attribute int32 (operands type) and its operands are %0 and 3.

Function

<name> = function <type>(<types...>) { ... }
<name> = function void(<types...>) { ... }

Define a function with return type <type> or void with parameters <types...> and name <name>. Multiple functions with same name can be defined but it must differ by <types...>.

Alloc

<result> = alloc <type> [, <number = 1>]

Allocate memory for local variable defined by size of <type> multiplied by <number> (default to 1 if not present) and return it's address. The allocated memory is automatically released after scope exit. Do not use for allocation of big objects since in most cases memory is allocated on stack with limited size. It's guaranteed the result value is not null address.

Store

store <type> <addr>, <value> [, <index = 0> ]

Store a value <value> of type <type> at given address <addr>. As mentioned before, values are immutable, but this function doesn't alter values (in IR point of view) it just store value at some address in memory. Store to null address or accessing index outside of allocation range is undefined behaviour.

Load

<result> = load <type> <addr> [, <index = 0> ]

Loads value of <type> from address <addr> to <result> variable. If <addr> is memory of array, the <index> can be specified and it's used as offset to address as <index> * size of <type>. Load to null address or accessing index outside of allocation range is undefined behaviour.

Add

<result> = add <type> <value1>, <value2>

Perform addition of the two operands <value1> and <value2>. Only supported types <type> are integer and floating point types.

Sub

<result> = sub <type> <value1>, <value2>

Perform subtraction of the two operands <value1> and <value2>. Only supported types <type> are integer and floating point types.

Mul

<result> = mul <type> <value1>, <value2>

Perform multiplication of the two operands <value1> and <value2>. Only supported types <type> are integer and floating point types.

Div

<result> = div <type> <value1>, <value2>

Perform division of the two operands <value1> and <value2>. Only supported types <type> are integer and floating point types.

Rem

<result> = rem <type> <value1>, <value2>

Perform division of the two operands <value1> and <value2> and return remainder. Only supported types <type> are integer and floating point types.

Cmp

<result> = cmp <cond> <ty> <value1>, <value2>

Perform comparision <cond> of the two operands <value1> and <value2>. Only supported types <type> are integer and floating point types.

Supported <cond>:

  • eq - Are operands equal?
  • ne - Are operands not equal?
  • gt - Is first operand greater than second operand?
  • ge - Is first operand greater or equal to second operand?
  • lt - Is first operand less than second operand?
  • le - Is first operand less or equal to second operand?

And

<result> = and <type> <value1>, <value2>

Perform bitwise and operation with values <value1> and <value2> of type <type>.

Or

<result> = or <type> <value1>, <value2>

Perform bitwise or operation with values <value1> and <value2> of type <type>.

Xor

<result> = xor <type> <value1>, <value2>

Perform bitwise xor (exclusive or) operation with values <value1> and <value2> of type <type>.

Branch

branch <cond>, label <iftrue>, label <iffalse>
branch <dest>

Perform branch or jump to specified label. The conditional branch chooses destination label by condition value so if <cond> is true the control flow jumps to <iftrue> label, otherwise to <iffalse>. The unconditional jump (the second form) is just a simple jump where the control flow jumps to <dest> label.

Call

<result> = call <type>(<types...>) <name> (<args...>)
call void(<types...>) <name> (<args...>)

Transfer control flow to called function. There are two types of call: with and without return value. The only difference is in return type specification and returned value. The call instruction is specified by return type type (or void) and parameter types <types...>. This call signature is used to select proper version of the function which name can be overloaded. Calling function which doesn't exists (name or signature mismatch) the behaviour is implementation defined but recommended behaviour is an error.

Return

return <type> <value>
return void

Return control flow to the caller with return value. The return void can be used only in cases when called function has no return type (i.e. void).

Examples

This example perform sum of 1 and 3, converts result to character string (int8*) by calling to_string function and prints it using print function.

@main = function void() {
    %1 = alloc int32
    %2 = alloc int32
    store int32 %1, 1
    store int32 %2, 3
    %3 = add int32 %1, %2
    %4 = call int8*(int32) @to_string %3
    call void(int8*) @print(%4)
    return void
}

Example of branching.

@fn = function int32(int32, int32) {
    // %0 = int32
    // %1 = int32

    // @L0:
    %2 = alloc int32
    %3 = cmp eq %0, %1
    branch %3, @L1, @L2

    // @L1:
    %4 = add int32 %0, %1
    %5 = mul int32 2, %4
    store int32 %2, %5
    branch @L3

    // @L2:
    %6 = sub int32 %0, %1
    store int32 %2, %6
    branch @L3

    // @L3:
    %7 = load int32 %2
    return int32 %7
}

Snippets

A list of snippets representing common structures in source code.

Declaration and assignment

@snippet = function void() {
    // int %1;
    %1 = alloc int32
    // double %2;
    %2 = alloc float64

    // %1 = 30;
    store int32 %1, 30

    // %2 = 0.31;
    store float64, %2, 0.31
}

Expression

@snippet = function void() {
    // int %1;
    %1 = alloc int32
    // int %2;
    %2 = alloc int32

    // %1 = 30;
    store int32 %1, 30

    // %2 = 15;
    store int32, %2, 15

    // %3 = %1 + %2;
    %3 = add int32 %1, %2

    // %4 = %1 - %2;
    %4 = sub int32 %1, %2

    // %5 = %1 * %2
    %5 = mul int32 %1, %2

    // %6 = %1 / %2
    %6 = div int32 %1, %2

    return void
}

Condition

Conditions are just conditional branch instructions with uses int1 value with can be obtained by using cmp instruction.

@snippet = function void() {
    // int %1;
    %1 = alloc int32
    // int %2;
    %2 = alloc int32

    // %1 = 10;
    store int32 %1, 10

    // %2 = 30;
    store int32 %2, 30

    // bool %3 = %1 <= %2;
    %3 = cmp le int32 %1, %2

    // if (%3)
    branch %3, @L1, @L2

    // @L1:
    branch @L3

    // @L2:
    branch @L3

    // @L3:
    return void
}

Loop

@snippet = function void() {
    // int %1;
    %1 = alloc int32
    // %1 = 0
    store int32 %1, 0
    branch @L1

    // @L1:
    while (%1 < 60)
    %2 = load int32 %1
    %3 = cmp lt int32 %2, 60
    branch %3, @L1, @L2

    // @L2
    // %1 = %1 + 1;
    %4 = load int32 %1
    %5 = add int32 %4, 1
    store int32 %1, %5
    branch @L1

    // @L3:
    return void
}

Function

@factorial = function int32(int32) {
    // %0 = arg 0

    // return value
    %1 = alloc int32
    // %2 = %0 < 1;
    %2 = cmp lt int32 %0, 1
    // if (%2)
    branch %2, @L1, @L2

    // @L1:
    // %1 = 1
    store int32 %1, 1
    branch @L3

    // @L2:
    %3 = sub int32 %0, 1
    // factorial(%0 - 1)
    %4 = call int32(int32) @factorial(%3)
    // %0 * %4
    %5 = mul int32 %0, %4
    store int32 %1, %5
    branch @L3

    // @L3:
    %6 = load int32 %1
    return %6
}

@main = function int32() {
    // return factorial(3);
    %1 = call int32(int32) @factorial(3)
    return int32 %1;
}

Structures

TODO

Binary representation

This section describes how the IR is stored in binary form which produces smaller files (than text representation) and are much simpler to load.

Types

Almost all instructions uses a type as attribute it needs to be encoded.

Type Encoding Size
int1 0x01 1 byte
int8 0x02 1 byte
int16 0x03 1 byte
int32 0x04 1 byte
int64 0x05 1 byte
float32 0x06 1 byte
float64 0x07 1 byte
pointer 0xE0 + <type> 1+N bytes
struct 0xF0 + <index> 1+2 bytes

The fundamental types takes only one byte to encode but pointer type and struct type can take more. The pointer takes one byte + other type (even recursive). The struct is one byte and then 2 bytes for index to structure table (maximum 65535 structures).

Structures

In order to save space the structures must be declared in separate part and they are then referenced by index number. The structures are stored in table which begins with two bytes of table size (max 65535 structures) followed by structure descriptions.

TODO

Instructions

Each instruction is encoded by few bytes where the first byte defines instruction type or it's alternative. In following table all instructions are described how are encoded. The placeholder <type> represents type encoding (see Types), <result> represents index of result value (2 bytes), <value> represents index of argument value (2 bytes), <constant> represents constant value encoded directly into instruction and it's size depends on instruction type (4 bytes for int32, ...), <label> is label index in current function, <types...> are a list of types which begins with one byte for list size (max. 256 types) followed by types and <name> is placeholder for string which begins with one byte for string length (256 max. length).

Instruction Encoding Size
alloc 0x00 + <type> + <result> 1+N+2 bytes
alloc 0x01 + <type> + <index> + <result> 1+N+4+2 bytes
store 0x10 + <type> + <address> + <value> 1+N+2+2 bytes
store 0x11 + <type> + <address> + <constant> 1+N+2+M bytes
store 0x12 + <type> + <address> + <value> + <index> 1+N+2+2+4 bytes
store 0x13 + <type> + <address> + <constant> + <index> 1+N+2+M+4 bytes
load 0x20 + <type> + <address> + <result> 1+N+2+2 bytes
load 0x21 + <type> + <address> + <result> + <index> 1+N+2+2+4 bytes
add 0x30 + <type> + <value1> + <value2> + <result> 1+N+2+2+2 bytes
add 0x31 + <type> + <value1> + <constant> + <result> 1+N+2+M+2 bytes
sub 0x40 + <type> + <value1> + <value2> + <result> 1+N+2+2+2 bytes
sub 0x41 + <type> + <value1> + <constant> + <result> 1+N+2+M+2 bytes
sub 0x42 + <type> + <constant> + <value2> + <result> 1+N+M+2+2 bytes
mul 0x50 + <type> + <value1> + <value2> + <result> 1+N+2+2+2 bytes
mul 0x51 + <type> + <value1> + <constant> + <result> 1+N+2+M+2 bytes
div 0x60 + <type> + <value1> + <value2> + <result> 1+N+2+2+2 bytes
div 0x61 + <type> + <value1> + <constant> + <result> 1+N+2+M+2 bytes
div 0x62 + <type> + <constant> + <value2> + <result> 1+N+M+2+2 bytes
rem 0x70 + <type> + <value1> + <value2> + <result> 1+N+2+2+2 bytes
rem 0x71 + <type> + <value1> + <constant> + <result> 1+N+2+M+2 bytes
rem 0x72 + <type> + <constant> + <value2> + <result> 1+N+M+2+2 bytes
cmp 0x80 + <op> + <type> + <value1> + <value2> + <result> 1+1+N+2+2+2 bytes
cmp 0x81 + <op> + <type> + <value1> + <constant> + <result> 1+1+N+2+M+2 bytes
and 0x90 + <type> + <value1> + <value2> + <result> 1+N+2+2+2 bytes
and 0x91 + <type> + <value1> + <constant> + <result> 1+N+2+M+2 bytes
and 0x92 + <type> + <constant> + <value2> + <result> 1+N+M+2+2 bytes
or 0xA0 + <type> + <value1> + <value2> + <result> 1+N+2+2+2 bytes
or 0xA1 + <type> + <value1> + <constant> + <result> 1+N+2+M+2 bytes
or 0xA2 + <type> + <constant> + <value2> + <result> 1+N+M+2+2 bytes
xor 0xB0 + <type> + <value1> + <value2> + <result> 1+N+2+2+2 bytes
xor 0xB1 + <type> + <value1> + <constant> + <result> 1+N+2+M+2 bytes
xor 0xB2 + <type> + <constant> + <value2> + <result> 1+N+M+2+2 bytes
branch 0xC0 + <label> 1+2 bytes
branch 0xC1 + <value> + <label1> + <label2> 1+2+2+2 bytes
call 0xD0 + <types...> + <name> 1+N+M bytes
call 0xD1 + <type> + <types...> + <name> + <result> 1+N+M+K+2 bytes
return 0xE0 1 bytes
return 0xE1 + <type> + <value> 1+N+2 bytes

Blocks

The main building block of the function are blocks which contain instructions.

Type Size Description
instructions 2+N bytes list of instructions

Functions

The functions are containers for blocks and instructions. Each function starts with it's name (might not be unique in the module), followed by return type (0x00 for void) and list of argument types. The last part of the function is a list blocks.

Type Size Description
string 2+N bytes 16-bit size + characters
type N bytes type encoding, 0x00 for void
types 2+N bytes list of types for arguments
blocks 2+N bytes list of blocks

Module

One binary file represent a single module. It's the main structure with is used for serialize and deserialize. It's guarded by binary guard in form of 4 bytes of string SHRD followed by 2 bytes encoding module format version: 1. byte for major and 2. byte for minor version. The next part is for structures definitions. Each structure used in the code must be defined before so it can be referenced in the code by index. The last part is a list of functions defined in the module. Module can contain multiple functions with same name but must have different argument types (function overloading).

Type Size Description
guard 4 bytes SHRD
version 2 bytes 2 bytes for major and minor version
structures 2+N bytes list of structures
functions 2+N bytes list of functions
Clone this wiki locally