Intermediate representation

(WIP) Intermediate representation of source code.

Introduction

The intermediate representation (IR) is necessary to support language extensibility because it brings new AST nodes and without IR interpreter and compiler must be extended to support that. Using IR allows to translate AST nodes into well defined representation which is understod by interpreter or compiler.

The IR can be designed for stack or register based processors. The main difference is the stack based using stack for computation so for adding two values it must move two values on top of stack and then call add operation and result is stored on stack top. The register based are using a set of registers for computation where values are moved to registers and then operation uses specified registers to perform an operation.

The IR is based on LLVM IR and uses SSA form. There is no registers or stack just values which can be later allocated to registers or stack (by interpreter or compiler). Full SSA form is not used because expressing something in SSA form can be hard. Simplifications can be done using alloc and store instructions where store doesn't create new value but modifies a value at allocated address. Thank to this there is no need for Phi instruction (which is harder to understand).

Types

There are two kinds of types: fundamental and aggregate.

Fundamental

Fundamental types are integer, floating point and pointer types and aggregate are other types like structs. Currently only supported integer types are: int1, int8, int16, int32, int64 and float types: float32 and float64.

Pointers

Pointer types are designed for storing other value address, some place in system memory. Always points to value with specified type and is recognized by star * suffix (e.g. int8*). Can be chained to more levels like int8***. The 0 address aka null pointer is considered as invalid address and should not be used for storing and loading value from it.

Structures

Structures are aggregates of fundamental, pointers or other aggregate types.

Blocks

Blocks are basic building primitives of functions. All defined functions must have at least one block. Function body transfer control flow between blocks so all blocks must end with control flow instruction like branch or return.

Functions

Functions are basic primitives which operates on program data. Without functions there are nothing to do.

Values

Values represents program data. Each value have defined type and both value and type is immutable.

Memory

The IR doesn't offer any memory allocation support except local variable allocation via alloc. Can be implemented by calling system functions.

Instructions

A list of supported instructions. For readability it's written in human readable form which is not same representation used for storing IR code. In most cases the instructions have following scheme:

[ <result> = ] <name> [ <attributes...> ] [ <operands...> ]

The result value <result> is optional, some instruction doesn't yield a value. The <name> is instruction name and is required. The attributes <attributes...> is a list of space separated values which specify the instruction, e.g. operands type. The operands <operands...> are comma separated values that specify instruction operands.

%1 = load int32 %0, 3

This represents instruction of name load with attribute int32 (operands type) and its operands are %0 and 3.

Function

<name> = function <type>(<types...>) { ... }
<name> = function void(<types...>) { ... }

Define a function with return type <type> or void with parameters <types...> and name <name>. Multiple functions with same name can be defined but it must differ by <types...>.

Alloc

<result> = alloc <type> [, <number = 1>]

Allocate memory for local variable defined by size of <type> multiplied by <number> (default to 1 if not present) and return it's address. The allocated memory is automatically released after scope exit. Do not use for allocation of big objects since in most cases memory is allocated on stack with limited size. It's guaranteed the result value is not null address.

Store

store <type> <addr>, <value> [, <index = 0> ]

Store a value <value> of type <type> at given address <addr>. As mentioned before, values are immutable, but this function doesn't alter values (in IR point of view) it just store value at some address in memory. Store to null address or accessing index outside of allocation range is undefined behaviour.

Load

<result> = load <type> <addr> [, <index = 0> ]

Loads value of <type> from address <addr> to <result> variable. If <addr> is memory of array, the <index> can be specified and it's used as offset to address as <index> * size of <type>. Load to null address or accessing index outside of allocation range is undefined behaviour.

Add

<result> = add <type> <value1>, <value2>

Perform addition of the two operands <value1> and <value2>. Only supported types <type> are integer and floating point types.

Sub

<result> = sub <type> <value1>, <value2>

Perform subtraction of the two operands <value1> and <value2>. Only supported types <type> are integer and floating point types.

Mul

<result> = mul <type> <value1>, <value2>

Perform multiplication of the two operands <value1> and <value2>. Only supported types <type> are integer and floating point types.

Div

<result> = div <type> <value1>, <value2>

Perform division of the two operands <value1> and <value2>. Only supported types <type> are integer and floating point types.

Rem

<result> = rem <type> <value1>, <value2>

Perform division of the two operands <value1> and <value2> and return remainder. Only supported types <type> are integer and floating point types.

Cmp

<result> = cmp <cond> <ty> <value1>, <value2>

Perform comparision <cond> of the two operands <value1> and <value2>. Only supported types <type> are integer and floating point types.

Supported <cond>:

eq - Are operands equal?
ne - Are operands not equal?
gt - Is first operand greater than second operand?
ge - Is first operand greater or equal to second operand?
lt - Is first operand less than second operand?
le - Is first operand less or equal to second operand?

And

<result> = and <type> <value1>, <value2>

Perform bitwise and operation with values <value1> and <value2> of type <type>.

Or

<result> = or <type> <value1>, <value2>

Perform bitwise or operation with values <value1> and <value2> of type <type>.

Xor

<result> = xor <type> <value1>, <value2>

Perform bitwise xor (exclusive or) operation with values <value1> and <value2> of type <type>.

Branch

branch <cond>, label <iftrue>, label <iffalse>
branch <dest>

Perform branch or jump to specified label. The conditional branch chooses destination label by condition value so if <cond> is true the control flow jumps to <iftrue> label, otherwise to <iffalse>. The unconditional jump (the second form) is just a simple jump where the control flow jumps to <dest> label.

Call

<result> = call <type>(<types...>) <name> (<args...>)
call void(<types...>) <name> (<args...>)

Transfer control flow to called function. There are two types of call: with and without return value. The only difference is in return type specification and returned value. The call instruction is specified by return type type (or void) and parameter types <types...>. This call signature is used to select proper version of the function which name can be overloaded. Calling function which doesn't exists (name or signature mismatch) the behaviour is implementation defined but recommended behaviour is an error.

Return

return <type> <value>
return void

Return control flow to the caller with return value. The return void can be used only in cases when called function has no return type (i.e. void).

Examples

This example perform sum of 1 and 3, converts result to character string (int8*) by calling to_string function and prints it using print function.

@main = function void() {
    %1 = alloc int32
    %2 = alloc int32
    store int32 %1, 1
    store int32 %2, 3
    %3 = add int32 %1, %2
    %4 = call int8*(int32) @to_string %3
    call void(int8*) @print(%4)
    return void
}

Example of branching.

@fn = function int32(int32, int32) {
    // %0 = int32
    // %1 = int32

    // @L0:
    %2 = alloc int32
    %3 = cmp eq %0, %1
    branch %3, @L1, @L2

    // @L1:
    %4 = add int32 %0, %1
    %5 = mul int32 2, %4
    store int32 %2, %5
    branch @L3

    // @L2:
    %6 = sub int32 %0, %1
    store int32 %2, %6
    branch @L3

    // @L3:
    %7 = load int32 %2
    return int32 %7
}

Snippets

A list of snippets representing common structures in source code.

Declaration and assignment

@snippet = function void() {
    // int %1;
    %1 = alloc int32
    // double %2;
    %2 = alloc float64

    // %1 = 30;
    store int32 %1, 30

    // %2 = 0.31;
    store float64, %2, 0.31
}

Expression

@snippet = function void() {
    // int %1;
    %1 = alloc int32
    // int %2;
    %2 = alloc int32

    // %1 = 30;
    store int32 %1, 30

    // %2 = 15;
    store int32, %2, 15

    // %3 = %1 + %2;
    %3 = add int32 %1, %2

    // %4 = %1 - %2;
    %4 = sub int32 %1, %2

    // %5 = %1 * %2
    %5 = mul int32 %1, %2

    // %6 = %1 / %2
    %6 = div int32 %1, %2

    return void
}

Condition

Conditions are just conditional branch instructions with uses int1 value with can be obtained by using cmp instruction.

@snippet = function void() {
    // int %1;
    %1 = alloc int32
    // int %2;
    %2 = alloc int32

    // %1 = 10;
    store int32 %1, 10

    // %2 = 30;
    store int32 %2, 30

    // bool %3 = %1 <= %2;
    %3 = cmp le int32 %1, %2

    // if (%3)
    branch %3, @L1, @L2

    // @L1:
    branch @L3

    // @L2:
    branch @L3

    // @L3:
    return void
}

Loop

@snippet = function void() {
    // int %1;
    %1 = alloc int32
    // %1 = 0
    store int32 %1, 0
    branch @L1

    // @L1:
    while (%1 < 60)
    %2 = load int32 %1
    %3 = cmp lt int32 %2, 60
    branch %3, @L1, @L2

    // @L2
    // %1 = %1 + 1;
    %4 = load int32 %1
    %5 = add int32 %4, 1
    store int32 %1, %5
    branch @L1

    // @L3:
    return void
}

Function

@factorial = function int32(int32) {
    // %0 = arg 0

    // return value
    %1 = alloc int32
    // %2 = %0 < 1;
    %2 = cmp lt int32 %0, 1
    // if (%2)
    branch %2, @L1, @L2

    // @L1:
    // %1 = 1
    store int32 %1, 1
    branch @L3

    // @L2:
    %3 = sub int32 %0, 1
    // factorial(%0 - 1)
    %4 = call int32(int32) @factorial(%3)
    // %0 * %4
    %5 = mul int32 %0, %4
    store int32 %1, %5
    branch @L3

    // @L3:
    %6 = load int32 %1
    return %6
}

@main = function int32() {
    // return factorial(3);
    %1 = call int32(int32) @factorial(3)
    return int32 %1;
}

Structures

TODO

Binary representation

This section describes how the IR is stored in binary form which produces smaller files (than text representation) and are much simpler to load.

Types

Almost all instructions uses a type as attribute it needs to be encoded.

Type	Encoding	Size
`int1`	`0x01`	1 byte
`int8`	`0x02`	1 byte
`int16`	`0x03`	1 byte
`int32`	`0x04`	1 byte
`int64`	`0x05`	1 byte
`float32`	`0x06`	1 byte
`float64`	`0x07`	1 byte
pointer	`0xE0` + `<type>`	1+N bytes
`struct`	`0xF0` + `<index>`	1+2 bytes

The fundamental types takes only one byte to encode but pointer type and struct type can take more. The pointer takes one byte + other type (even recursive). The struct is one byte and then 2 bytes for index to structure table (maximum 65535 structures).

Structures

In order to save space the structures must be declared in separate part and they are then referenced by index number. The structures are stored in table which begins with two bytes of table size (max 65535 structures) followed by structure descriptions.

TODO

Instructions

Each instruction is encoded by few bytes where the first byte defines instruction type or it's alternative. In following table all instructions are described how are encoded. The placeholder <type> represents type encoding (see Types), <result> represents index of result value (2 bytes), <value> represents index of argument value (2 bytes), <constant> represents constant value encoded directly into instruction and it's size depends on instruction type (4 bytes for int32, ...), <label> is label index in current function, <types...> are a list of types which begins with one byte for list size (max. 256 types) followed by types and <name> is placeholder for string which begins with one byte for string length (256 max. length).

Instruction	Encoding	Size
`alloc`	`0x00` + `<type>` + `<result>`	1+N+2 bytes
`alloc`	`0x01` + `<type>` + `<index>` + `<result>`	1+N+4+2 bytes
`store`	`0x10` + `<type>` + `<address>` + `<value>`	1+N+2+2 bytes
`store`	`0x11` + `<type>` + `<address>` + `<constant>`	1+N+2+M bytes
`store`	`0x12` + `<type>` + `<address>` + `<value>` + `<index>`	1+N+2+2+4 bytes
`store`	`0x13` + `<type>` + `<address>` + `<constant>` + `<index>`	1+N+2+M+4 bytes
`load`	`0x20` + `<type>` + `<address>` + `<result>`	1+N+2+2 bytes
`load`	`0x21` + `<type>` + `<address>` + `<result>` + `<index>`	1+N+2+2+4 bytes
`add`	`0x30` + `<type>` + `<value1>` + `<value2>` + `<result>`	1+N+2+2+2 bytes
`add`	`0x31` + `<type>` + `<value1>` + `<constant>` + `<result>`	1+N+2+M+2 bytes
`sub`	`0x40` + `<type>` + `<value1>` + `<value2>` + `<result>`	1+N+2+2+2 bytes
`sub`	`0x41` + `<type>` + `<value1>` + `<constant>` + `<result>`	1+N+2+M+2 bytes
`sub`	`0x42` + `<type>` + `<constant>` + `<value2>` + `<result>`	1+N+M+2+2 bytes
`mul`	`0x50` + `<type>` + `<value1>` + `<value2>` + `<result>`	1+N+2+2+2 bytes
`mul`	`0x51` + `<type>` + `<value1>` + `<constant>` + `<result>`	1+N+2+M+2 bytes
`div`	`0x60` + `<type>` + `<value1>` + `<value2>` + `<result>`	1+N+2+2+2 bytes
`div`	`0x61` + `<type>` + `<value1>` + `<constant>` + `<result>`	1+N+2+M+2 bytes
`div`	`0x62` + `<type>` + `<constant>` + `<value2>` + `<result>`	1+N+M+2+2 bytes
`rem`	`0x70` + `<type>` + `<value1>` + `<value2>` + `<result>`	1+N+2+2+2 bytes
`rem`	`0x71` + `<type>` + `<value1>` + `<constant>` + `<result>`	1+N+2+M+2 bytes
`rem`	`0x72` + `<type>` + `<constant>` + `<value2>` + `<result>`	1+N+M+2+2 bytes
`cmp`	`0x80` + `<op>` + `<type>` + `<value1>` + `<value2>` + `<result>`	1+1+N+2+2+2 bytes
`cmp`	`0x81` + `<op>` + `<type>` + `<value1>` + `<constant>` + `<result>`	1+1+N+2+M+2 bytes
`and`	`0x90` + `<type>` + `<value1>` + `<value2>` + `<result>`	1+N+2+2+2 bytes
`and`	`0x91` + `<type>` + `<value1>` + `<constant>` + `<result>`	1+N+2+M+2 bytes
`and`	`0x92` + `<type>` + `<constant>` + `<value2>` + `<result>`	1+N+M+2+2 bytes
`or`	`0xA0` + `<type>` + `<value1>` + `<value2>` + `<result>`	1+N+2+2+2 bytes
`or`	`0xA1` + `<type>` + `<value1>` + `<constant>` + `<result>`	1+N+2+M+2 bytes
`or`	`0xA2` + `<type>` + `<constant>` + `<value2>` + `<result>`	1+N+M+2+2 bytes
`xor`	`0xB0` + `<type>` + `<value1>` + `<value2>` + `<result>`	1+N+2+2+2 bytes
`xor`	`0xB1` + `<type>` + `<value1>` + `<constant>` + `<result>`	1+N+2+M+2 bytes
`xor`	`0xB2` + `<type>` + `<constant>` + `<value2>` + `<result>`	1+N+M+2+2 bytes
`branch`	`0xC0` + `<label>`	1+2 bytes
`branch`	`0xC1` + `<value>` + `<label1>` + `<label2>`	1+2+2+2 bytes
`call`	`0xD0` + `<types...>` + `<name>`	1+N+M bytes
`call`	`0xD1` + `<type>` + `<types...>` + `<name>` + `<result>`	1+N+M+K+2 bytes
`return`	`0xE0`	1 bytes
`return`	`0xE1` + `<type>` + `<value>`	1+N+2 bytes

Blocks

The main building block of the function are blocks which contain instructions.

Type	Size	Description
instructions	2+N bytes	list of instructions

Functions

The functions are containers for blocks and instructions. Each function starts with it's name (might not be unique in the module), followed by return type (0x00 for void) and list of argument types. The last part of the function is a list blocks.

Type	Size	Description
string	2+N bytes	16-bit size + characters
type	N bytes	type encoding, `0x00` for `void`
types	2+N bytes	list of types for arguments
blocks	2+N bytes	list of blocks

Module

One binary file represent a single module. It's the main structure with is used for serialize and deserialize. It's guarded by binary guard in form of 4 bytes of string SHRD followed by 2 bytes encoding module format version: 1. byte for major and 2. byte for minor version. The next part is for structures definitions. Each structure used in the code must be defined before so it can be referenced in the code by index. The last part is a list of functions defined in the module. Module can contain multiple functions with same name but must have different argument types (function overloading).

Type	Size	Description
guard	4 bytes	`SHRD`
version	2 bytes	2 bytes for major and minor version
structures	2+N bytes	list of structures
functions	2+N bytes	list of functions

Intermediate representation

Introduction

Types

Fundamental

Pointers

Structures

Blocks

Functions

Values

Memory

Instructions

Function

Alloc

Store

Load

Add

Sub

Mul

Div

Rem

Cmp

And

Or

Xor

Branch

Call

Return

Examples

Snippets

Declaration and assignment

Expression

Condition

Loop

Function

Structures

Binary representation

Types

Structures

Instructions

Blocks

Functions

Module

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally