-
Notifications
You must be signed in to change notification settings - Fork 0
Intermediate representation
(WIP) Intermediate representation of source code.
The intermediate representation (IR) is necessary to support language extensibility because it brings new AST nodes and without IR interpreter and compiler must be extended to support that. Using IR allows to translate AST nodes into well defined representation which is understod by interpreter or compiler.
The IR can be designed for stack or register based processors. The main difference is the stack based using stack for computation so for adding two values it must move two values on top of stack and then call add operation and result is stored on stack top. The register based are using a set of registers for computation where values are moved to registers and then operation uses specified registers to perform an operation.
The IR is based on LLVM IR and uses SSA form. There is no registers or stack just values which can be later allocated to registers or stack (by interpreter or compiler). Full SSA form is not used because expressing something in SSA form can be hard. Simplifications can be done using alloc
and store
instructions where store
doesn't create new value but modifies a value at allocated address. Thank to this there is no need for Phi instruction (which is harder to understand).
There are two kinds of types: fundamental and aggregate.
Fundamental types are integer, floating point and pointer types and aggregate are other types like structs. Currently only supported integer types are: int1
, int8
, int16
, int32
, int64
and float types: float32
and float64
.
Pointer types are designed for storing other value address, some place in system memory. Always points to value with specified type and is recognized by star *
suffix (e.g. int8*
). Can be chained to more levels like int8***
. The 0
address aka null pointer is considered as invalid address and should not be used for storing and loading value from it.
Structures are aggregates of fundamental, pointers or other aggregate types.
Blocks are basic building primitives of functions. All defined functions must have at least one block. Function body transfer control flow between blocks so all blocks must end with control flow instruction like branch
or return
.
Functions are basic primitives which operates on program data. Without functions there are nothing to do.
Values represents program data. Each value have defined type and both value and type is immutable.
The IR doesn't offer any memory allocation support except local variable allocation via alloc
. Can be implemented by calling system functions.
A list of supported instructions. For readability it's written in human readable form which is not same representation used for storing IR code. In most cases the instructions have following scheme:
[ <result> = ] <name> [ <attributes...> ] [ <operands...> ]
The result value <result>
is optional, some instruction doesn't yield a value. The <name>
is instruction name and is required. The attributes <attributes...>
is a list of space separated values which specify the instruction, e.g. operands type. The operands <operands...>
are comma separated values that specify instruction operands.
%1 = load int32 %0, 3
This represents instruction of name load
with attribute int32
(operands type) and its operands are %0
and 3
.
<name> = function <type>(<types...>) { ... }
<name> = function void(<types...>) { ... }
Define a function with return type <type>
or void
with parameters <types...>
and name <name>
. Multiple functions with same name can be defined but it must differ by <types...>
.
<result> = alloc <type> [, <number = 1>]
Allocate memory for local variable defined by size of <type>
multiplied by <number>
(default to 1
if not present) and return it's address. The allocated memory is automatically released after scope exit. Do not use for allocation of big objects since in most cases memory is allocated on stack with limited size. It's guaranteed the result value is not null address.
store <type> <addr>, <value> [, <index = 0> ]
Store a value <value>
of type <type>
at given address <addr>
. As mentioned before, values are immutable, but this function doesn't alter values (in IR point of view) it just store value at some address in memory. Store to null address or accessing index outside of allocation range is undefined behaviour.
<result> = load <type> <addr> [, <index = 0> ]
Loads value of <type>
from address <addr>
to <result>
variable. If <addr>
is memory of array, the <index>
can be specified and it's used as offset to address as <index> * size of <type>
. Load to null address or accessing index outside of allocation range is undefined behaviour.
<result> = add <type> <value1>, <value2>
Perform addition of the two operands <value1>
and <value2>
. Only supported types <type>
are integer and floating point types.
<result> = sub <type> <value1>, <value2>
Perform subtraction of the two operands <value1>
and <value2>
. Only supported types <type>
are integer and floating point types.
<result> = mul <type> <value1>, <value2>
Perform multiplication of the two operands <value1>
and <value2>
. Only supported types <type>
are integer and floating point types.
<result> = div <type> <value1>, <value2>
Perform division of the two operands <value1>
and <value2>
. Only supported types <type>
are integer and floating point types.
<result> = rem <type> <value1>, <value2>
Perform division of the two operands <value1>
and <value2>
and return remainder. Only supported types <type>
are integer and floating point types.
<result> = cmp <cond> <ty> <value1>, <value2>
Perform comparision <cond>
of the two operands <value1>
and <value2>
. Only supported types <type>
are integer and floating point types.
Supported <cond>
:
-
eq
- Are operands equal? -
ne
- Are operands not equal? -
gt
- Is first operand greater than second operand? -
ge
- Is first operand greater or equal to second operand? -
lt
- Is first operand less than second operand? -
le
- Is first operand less or equal to second operand?
<result> = and <type> <value1>, <value2>
Perform bitwise and operation with values <value1>
and <value2>
of type <type>
.
<result> = or <type> <value1>, <value2>
Perform bitwise or operation with values <value1>
and <value2>
of type <type>
.
<result> = xor <type> <value1>, <value2>
Perform bitwise xor (exclusive or) operation with values <value1>
and <value2>
of type <type>
.
branch <cond>, label <iftrue>, label <iffalse>
branch <dest>
Perform branch or jump to specified label. The conditional branch chooses destination label by condition value so if <cond>
is true
the control flow jumps to <iftrue>
label, otherwise to <iffalse>
. The unconditional jump (the second form) is just a simple jump where the control flow jumps to <dest>
label.
<result> = call <type>(<types...>) <name> (<args...>)
call void(<types...>) <name> (<args...>)
Transfer control flow to called function. There are two types of call: with and without return value. The only difference is in return type specification and returned value. The call instruction is specified by return type type
(or void
) and parameter types <types...>
. This call signature is used to select proper version of the function which name can be overloaded. Calling function which doesn't exists (name or signature mismatch) the behaviour is implementation defined but recommended behaviour is an error.
return <type> <value>
return void
Return control flow to the caller with return value. The return void
can be used only in cases when called function has no return type (i.e. void
).
This example perform sum of 1
and 3
, converts result to character string (int8*
) by calling to_string
function and prints it using print
function.
@main = function void() {
%1 = alloc int32
%2 = alloc int32
store int32 %1, 1
store int32 %2, 3
%3 = add int32 %1, %2
%4 = call int8*(int32) @to_string %3
call void(int8*) @print(%4)
return void
}
Example of branching.
@fn = function int32(int32, int32) {
// %0 = int32
// %1 = int32
// @L0:
%2 = alloc int32
%3 = cmp eq %0, %1
branch %3, @L1, @L2
// @L1:
%4 = add int32 %0, %1
%5 = mul int32 2, %4
store int32 %2, %5
branch @L3
// @L2:
%6 = sub int32 %0, %1
store int32 %2, %6
branch @L3
// @L3:
%7 = load int32 %2
return int32 %7
}
A list of snippets representing common structures in source code.
@snippet = function void() {
// int %1;
%1 = alloc int32
// double %2;
%2 = alloc float64
// %1 = 30;
store int32 %1, 30
// %2 = 0.31;
store float64, %2, 0.31
}
@snippet = function void() {
// int %1;
%1 = alloc int32
// int %2;
%2 = alloc int32
// %1 = 30;
store int32 %1, 30
// %2 = 15;
store int32, %2, 15
// %3 = %1 + %2;
%3 = add int32 %1, %2
// %4 = %1 - %2;
%4 = sub int32 %1, %2
// %5 = %1 * %2
%5 = mul int32 %1, %2
// %6 = %1 / %2
%6 = div int32 %1, %2
return void
}
Conditions are just conditional branch
instructions with uses int1
value with can be obtained by using cmp
instruction.
@snippet = function void() {
// int %1;
%1 = alloc int32
// int %2;
%2 = alloc int32
// %1 = 10;
store int32 %1, 10
// %2 = 30;
store int32 %2, 30
// bool %3 = %1 <= %2;
%3 = cmp le int32 %1, %2
// if (%3)
branch %3, @L1, @L2
// @L1:
branch @L3
// @L2:
branch @L3
// @L3:
return void
}
@snippet = function void() {
// int %1;
%1 = alloc int32
// %1 = 0
store int32 %1, 0
branch @L1
// @L1:
while (%1 < 60)
%2 = load int32 %1
%3 = cmp lt int32 %2, 60
branch %3, @L1, @L2
// @L2
// %1 = %1 + 1;
%4 = load int32 %1
%5 = add int32 %4, 1
store int32 %1, %5
branch @L1
// @L3:
return void
}
@factorial = function int32(int32) {
// %0 = arg 0
// return value
%1 = alloc int32
// %2 = %0 < 1;
%2 = cmp lt int32 %0, 1
// if (%2)
branch %2, @L1, @L2
// @L1:
// %1 = 1
store int32 %1, 1
branch @L3
// @L2:
%3 = sub int32 %0, 1
// factorial(%0 - 1)
%4 = call int32(int32) @factorial(%3)
// %0 * %4
%5 = mul int32 %0, %4
store int32 %1, %5
branch @L3
// @L3:
%6 = load int32 %1
return %6
}
@main = function int32() {
// return factorial(3);
%1 = call int32(int32) @factorial(3)
return int32 %1;
}
TODO
This section describes how the IR is stored in binary form which produces smaller files (than text representation) and are much simpler to load.
Almost all instructions uses a type as attribute it needs to be encoded.
Type | Encoding | Size |
---|---|---|
int1 |
0x01 |
1 byte |
int8 |
0x02 |
1 byte |
int16 |
0x03 |
1 byte |
int32 |
0x04 |
1 byte |
int64 |
0x05 |
1 byte |
float32 |
0x06 |
1 byte |
float64 |
0x07 |
1 byte |
pointer |
0xE0 + <type>
|
1+N bytes |
struct |
0xF0 + <index>
|
1+2 bytes |
The fundamental types takes only one byte to encode but pointer type and struct type can take more. The pointer takes one byte + other type (even recursive). The struct is one byte and then 2 bytes for index to structure table (maximum 65535 structures).
In order to save space the structures must be declared in separate part and they are then referenced by index number. The structures are stored in table which begins with two bytes of table size (max 65535 structures) followed by structure descriptions.
TODO
Each instruction is encoded by few bytes where the first byte defines instruction type or it's alternative. In following table all instructions are described how are encoded. The placeholder <type>
represents type encoding (see Types), <result>
represents index of result value (2 bytes), <value>
represents index of argument value (2 bytes), <constant>
represents constant value encoded directly into instruction and it's size depends on instruction type (4 bytes for int32
, ...), <label>
is label index in current function, <types...>
are a list of types which begins with one byte for list size (max. 256 types) followed by types and <name>
is placeholder for string which begins with one byte for string length (256 max. length).
Instruction | Encoding | Size |
---|---|---|
alloc |
0x00 + <type> + <result>
|
1+N+2 bytes |
alloc |
0x01 + <type> + <index> + <result>
|
1+N+4+2 bytes |
store |
0x10 + <type> + <address> + <value>
|
1+N+2+2 bytes |
store |
0x11 + <type> + <address> + <constant>
|
1+N+2+M bytes |
store |
0x12 + <type> + <address> + <value> + <index>
|
1+N+2+2+4 bytes |
store |
0x13 + <type> + <address> + <constant> + <index>
|
1+N+2+M+4 bytes |
load |
0x20 + <type> + <address> + <result>
|
1+N+2+2 bytes |
load |
0x21 + <type> + <address> + <result> + <index>
|
1+N+2+2+4 bytes |
add |
0x30 + <type> + <value1> + <value2> + <result>
|
1+N+2+2+2 bytes |
add |
0x31 + <type> + <value1> + <constant> + <result>
|
1+N+2+M+2 bytes |
sub |
0x40 + <type> + <value1> + <value2> + <result>
|
1+N+2+2+2 bytes |
sub |
0x41 + <type> + <value1> + <constant> + <result>
|
1+N+2+M+2 bytes |
sub |
0x42 + <type> + <constant> + <value2> + <result>
|
1+N+M+2+2 bytes |
mul |
0x50 + <type> + <value1> + <value2> + <result>
|
1+N+2+2+2 bytes |
mul |
0x51 + <type> + <value1> + <constant> + <result>
|
1+N+2+M+2 bytes |
div |
0x60 + <type> + <value1> + <value2> + <result>
|
1+N+2+2+2 bytes |
div |
0x61 + <type> + <value1> + <constant> + <result>
|
1+N+2+M+2 bytes |
div |
0x62 + <type> + <constant> + <value2> + <result>
|
1+N+M+2+2 bytes |
rem |
0x70 + <type> + <value1> + <value2> + <result>
|
1+N+2+2+2 bytes |
rem |
0x71 + <type> + <value1> + <constant> + <result>
|
1+N+2+M+2 bytes |
rem |
0x72 + <type> + <constant> + <value2> + <result>
|
1+N+M+2+2 bytes |
cmp |
0x80 + <op> + <type> + <value1> + <value2> + <result>
|
1+1+N+2+2+2 bytes |
cmp |
0x81 + <op> + <type> + <value1> + <constant> + <result>
|
1+1+N+2+M+2 bytes |
and |
0x90 + <type> + <value1> + <value2> + <result>
|
1+N+2+2+2 bytes |
and |
0x91 + <type> + <value1> + <constant> + <result>
|
1+N+2+M+2 bytes |
and |
0x92 + <type> + <constant> + <value2> + <result>
|
1+N+M+2+2 bytes |
or |
0xA0 + <type> + <value1> + <value2> + <result>
|
1+N+2+2+2 bytes |
or |
0xA1 + <type> + <value1> + <constant> + <result>
|
1+N+2+M+2 bytes |
or |
0xA2 + <type> + <constant> + <value2> + <result>
|
1+N+M+2+2 bytes |
xor |
0xB0 + <type> + <value1> + <value2> + <result>
|
1+N+2+2+2 bytes |
xor |
0xB1 + <type> + <value1> + <constant> + <result>
|
1+N+2+M+2 bytes |
xor |
0xB2 + <type> + <constant> + <value2> + <result>
|
1+N+M+2+2 bytes |
branch |
0xC0 + <label>
|
1+2 bytes |
branch |
0xC1 + <value> + <label1> + <label2>
|
1+2+2+2 bytes |
call |
0xD0 + <types...> + <name>
|
1+N+M bytes |
call |
0xD1 + <type> + <types...> + <name> + <result>
|
1+N+M+K+2 bytes |
return |
0xE0 |
1 bytes |
return |
0xE1 + <type> + <value>
|
1+N+2 bytes |
The main building block of the function are blocks which contain instructions.
Type | Size | Description |
---|---|---|
instructions | 2+N bytes | list of instructions |
The functions are containers for blocks and instructions. Each function starts with it's name (might not be unique in the module), followed by return type (0x00
for void
) and list of argument types. The last part of the function is a list blocks.
Type | Size | Description |
---|---|---|
string | 2+N bytes | 16-bit size + characters |
type | N bytes | type encoding, 0x00 for void
|
types | 2+N bytes | list of types for arguments |
blocks | 2+N bytes | list of blocks |
One binary file represent a single module. It's the main structure with is used for serialize and deserialize. It's guarded by binary guard in form of 4 bytes of string SHRD
followed by 2 bytes encoding module format version: 1. byte for major and 2. byte for minor version. The next part is for structures definitions. Each structure used in the code must be defined before so it can be referenced in the code by index. The last part is a list of functions defined in the module. Module can contain multiple functions with same name but must have different argument types (function overloading).
Type | Size | Description |
---|---|---|
guard | 4 bytes | SHRD |
version | 2 bytes | 2 bytes for major and minor version |
structures | 2+N bytes | list of structures |
functions | 2+N bytes | list of functions |