Have you ever wanted to write your own compiler? … Yes? … Of course you have! I’ve always wanted to have a go at writing a compiler, and with the recent release of WebAssembly, I had the perfect excuse to have a go.
My original plan was to invent my own programming language, create a compiler that targets WebAssembly, and share my experiences at FullStackNYC. The first part went to plan: I spent many an evening building, tinkering and refining my compiler. Unfortunately the last part of my plan didn’t go quite so well. Long delays, and an eventual cancellation, meant that I was not going to make it to New York after all. 😔😢😭
If you haven’t heard of WebAssembly before, and want a really detailed introduction, I’d thoroughly recommend Lin Clark’s Cartoon Guide.
You’ll learn the ‘what’ of WebAssembly throughout this blog post, but I do want to briefly touch on the ‘why’.
From my perspective, this diagram sums it up quite succinctly:
The top diagram shows a simplified timeline for the execution of some JavaScript code within the browser. From left to right, the code (typically delivered as a minified mess!) is parsed into an AST, initially executed in an interpreter, then progressively optimized / re-optimized until it eventually runs really quite quickly. These days JavaScript is fast – it just takes a while to get up to speed.
The bottom diagram is the WebAssembly equivalent. Code written in a wide variety of languages (Rust, C, C#, etc.) is compiled to WebAssembly that is delivered in a binary format. This is very easily decoded, compiled and executed – giving fast and predictable performance.
So why write your own compiler?
WebAssembly has been causing quite a stir over the last year. So much so, that it was voted the fifth ‘most loved’ language in Stack Overflow’s developer insights survey.
An interesting result, considering that for most people WebAssembly is a compilation target, rather than a language they will use directly.
This was part of my motivation for proposing the FullStackNYC talk in the first place. The technical aspects of WebAssembly are really fascinating (and remind me of 8-bit computers from a few decades back), yet most people will never get the chance to dabble with WebAssembly itself – it will just be a black box that they compile to.
Writing a compiler is a really good opportunity to delve into the details of WebAssembly to find out what it is and how it works. And it’s fun too!
One final point, it was never my aim to create a fully-featured programming language, or one that is actually any good. My goal was to create ‘enough’ of a language to allow me to write a program that renders a mandelbrot set. This language is compiled to WebAssembly using my compiler, which is written in TypeScript and runs in the browser.
Here it is in its full glory:
I ended up calling the language chasm and you can play with it online if you like.
Enough rambling – time for some code!
A minimal wasm module
Before tackling the compiler, we’ll start with something simpler, creating a minimal WebAssembly module.
Here is an emitter (the term used for the part of a compiler that outputs instructions for the target system) that creates the smallest valid WebAssembly module:

    const magicModuleHeader = [0x00, 0x61, 0x73, 0x6d];
    const moduleVersion = [0x01, 0x00, 0x00, 0x00];

    export const emitter: Emitter = () =>
      Uint8Array.from([
        ...magicModuleHeader,
        ...moduleVersion
      ]);

It is composed of two parts: the 'magic' header, which is the ASCII string \0asm, and a version number. These eight bytes form a valid WebAssembly (or wasm) module. More typically these would be delivered to the browser as a .wasm file.
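As a quick sanity check, the eight bytes can be decoded and validated directly. This is a minimal sketch, assuming a runtime that exposes the standard WebAssembly JavaScript API (a browser or a recent Node.js); the variable names are illustrative and not taken from the project.

```typescript
// The same eight bytes the emitter produces: magic header followed by version.
const magicHeader: number[] = [0x00, 0x61, 0x73, 0x6d];
const versionBytes: number[] = [0x01, 0x00, 0x00, 0x00];
const moduleBytes = Uint8Array.from([...magicHeader, ...versionBytes]);

// Decoding the header as ASCII gives a NUL character followed by "asm".
const magic = String.fromCharCode(...magicHeader);

// WebAssembly.validate reports whether the bytes form a valid module.
const isValid = WebAssembly.validate(moduleBytes);

console.log(JSON.stringify(magic), isValid);
```

An empty module is valid because every section is optional; it simply has nothing to export or run.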
In order to execute the WebAssembly module it needs to be instantiated as follows:

    const wasm = emitter();
    const instance = await WebAssembly.instantiate(wasm);

If you run the above you'll find that instance doesn't actually do anything, because our wasm module does not contain any instructions!
If you're interested in trying out this code for yourself, it is all on GitHub - with a commit for each step.
An add function
Let's make the wasm module do something more useful, by implementing a function that adds a couple of floating point numbers together.
WebAssembly is a binary format, which isn't terribly readable (to humans at least), which is why you'll more typically see it written in WebAssembly Text Format (WAT). Here's a module, presented in WAT format, that defines an exported function named $add that takes two floating point parameters, adds them together and returns the result:

    (module
      (func $add (param f32) (param f32) (result f32)
        get_local 0
        get_local 1
        f32.add)
      (export "add" (func 0))
    )

If you just want to experiment with WAT you can use the wat2wasm tool from the WebAssembly Binary Toolkit to compile WAT files into wasm modules.
The above code reveals some interesting details around WebAssembly:
WebAssembly is a low-level language, with a small (approx 82) instruction set, where many of the instructions map quite closely to CPU instructions. This makes it easy to compile wasm modules to CPU-specific machine code.
It has no built-in I/O. There are no instructions for writing to the terminal, screen or network. In order for wasm modules to interact with the outside world they need to do so via their host environment, which in the case of the browser is JavaScript.
WebAssembly is a stack machine. In the above example get_local 0 gets the local variable (in this case the function param) at the zeroth index and pushes it onto the stack, as does the subsequent instruction. The f32.add instruction pops two values from the stack, adds them together, then pushes the result back on the stack.
WebAssembly has just four numeric types: two integer, two float. More on this later…
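The stack-machine behaviour described above can be sketched with a toy interpreter. This is only an illustration of the evaluation model, not how a real engine executes wasm; the `Instruction` type and `execute` function are hypothetical names introduced here.

```typescript
// A toy model of the two instructions used by the add function.
type Instruction =
  | { op: "get_local"; index: number }
  | { op: "f32.add" };

// Runs the instructions against a set of locals and returns the final stack.
const execute = (instructions: Instruction[], locals: number[]): number[] => {
  const stack: number[] = [];
  for (const instruction of instructions) {
    switch (instruction.op) {
      case "get_local":
        // push the local at the given index onto the stack
        stack.push(locals[instruction.index]);
        break;
      case "f32.add": {
        // pop two operands, then push their sum rounded to 32-bit precision
        const right = stack.pop();
        const left = stack.pop();
        if (left === undefined || right === undefined) {
          throw new Error("stack underflow");
        }
        stack.push(Math.fround(left + right));
        break;
      }
    }
  }
  return stack;
};

const addBody: Instruction[] = [
  { op: "get_local", index: 0 },
  { op: "get_local", index: 1 },
  { op: "f32.add" }
];

// With locals 5 and 6, a single value (11) is left on the stack.
const finalStack = execute(addBody, [5, 6]);
console.log(finalStack);
```

Whatever remains on the stack when the function body ends is its return value, which is why the WAT version needs no explicit return.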
Let's update the emitter to output this 'hard coded' WebAssembly module. WebAssembly modules are composed of a pre-defined set of optional sections, each prefixed with a numeric identifier. These include a type section, which encodes type signatures, and a function section, which indicates the type of each function. I’ll not cover how these are constructed here - they are quite dull. If you're interested, look at the next commit in the project.
The interesting part is the code section. Here is how the above add function is created in binary:

    const code = [
      Opcodes.get_local /** 0x20 */,
      ...unsignedLEB128(0),
      Opcodes.get_local /** 0x20 */,
      ...unsignedLEB128(1),
      Opcodes.f32_add /** 0x92 */
    ];

    const functionBody = encodeVector([
      ...encodeVector([]) /** locals */,
      ...code,
      Opcodes.end /** 0x0b */
    ]);

    const codeSection = createSection(
      Section.code /** 0x0a */,
      encodeVector([functionBody])
    );

I've defined an Opcodes enum (I'm using TypeScript), which contains all of the wasm instructions. The unsignedLEB128 function is a standard variable-length encoding which is used for encoding instruction parameters.
The instructions for a function are combined with the function's local variables (of which there are none in this case), and an end opcode that signals the end of a function. Finally all the functions are encoded into a section. The encodeVector function simply prefixes a collection of byte arrays with the total length.
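The post doesn't show the two helper functions, so here is one plausible way they could be written. Treat it as a sketch under assumptions: in particular, this `encodeVector` prefixes the number of items it is given (which equals the byte length when it is handed a flat list of bytes), and the opcode values are written as plain numbers rather than via the Opcodes enum.

```typescript
// Unsigned LEB128: emit 7 bits at a time, least significant group first,
// setting the high bit on every byte except the last.
const unsignedLEB128 = (value: number): number[] => {
  if (!Number.isInteger(value) || value < 0 || value > 0xffffffff) {
    throw new RangeError("expected an unsigned 32-bit integer");
  }
  const bytes: number[] = [];
  let remaining = value;
  do {
    let byte = remaining & 0x7f;
    remaining = remaining >>> 7;
    if (remaining !== 0) {
      byte |= 0x80; // more groups follow
    }
    bytes.push(byte);
  } while (remaining !== 0);
  return bytes;
};

// Prefix the items with their count, then flatten any nested byte arrays.
const encodeVector = (items: (number | number[])[]): number[] => [
  ...unsignedLEB128(items.length),
  ...([] as number[]).concat(...items)
];

// The add function body: get_local 0, get_local 1, f32.add.
const code = [0x20, ...unsignedLEB128(0), 0x20, ...unsignedLEB128(1), 0x92];
const functionBody = encodeVector([...encodeVector([]), ...code, 0x0b]);
const codeSectionContents = encodeVector([functionBody]);

console.log(functionBody, codeSectionContents);
```

Small values such as the local indices encode to a single byte, so LEB128 only costs extra bytes when a number is 128 or larger.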
And there you have it, the complete module, which is about 60 bytes in total.
The JavaScript hosting code can now be updated to invoke this exported function:

    const { instance } = await WebAssembly.instantiate(wasm);
    console.log(instance.exports.add(5, 6));

Interestingly, if you inspect the exported add function with the Chrome Dev Tools it identifies it as a 'native function'.
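For anyone who wants to see the whole module in one place, here is a hand-assembled version of the add module, including the type, function and export sections that the post skips over. The byte layout follows the WebAssembly binary format, but the grouping and comments are mine, and the synchronous `WebAssembly.Module` / `WebAssembly.Instance` constructors are used purely to keep the sketch free of `await`.

```typescript
const addModuleBytes = Uint8Array.from([
  0x00, 0x61, 0x73, 0x6d, // magic header "\0asm"
  0x01, 0x00, 0x00, 0x00, // version 1
  // type section (id 1): one function type, (f32, f32) -> f32
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7d, 0x7d, 0x01, 0x7d,
  // function section (id 3): function 0 uses type 0
  0x03, 0x02, 0x01, 0x00,
  // export section (id 7): export function 0 under the name "add"
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00,
  // code section (id 10): one function body with no locals
  0x0a, 0x09, 0x01, 0x07, 0x00,
  0x20, 0x00, // get_local 0
  0x20, 0x01, // get_local 1
  0x92,       // f32.add
  0x0b        // end
]);

const addModule = new WebAssembly.Module(addModuleBytes);
const addInstance = new WebAssembly.Instance(addModule);

// Exports are untyped from TypeScript's point of view, hence the cast.
const add = addInstance.exports.add as (a: number, b: number) => number;
const sum = add(5, 6);
console.log(sum);
```

In a browser the asynchronous `WebAssembly.instantiate` shown above is the better choice, since synchronous compilation on the main thread is restricted for larger modules.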
You can see the complete code for this step (with unit tests - go me!) on GitHub.
Building a compiler
Now that you’ve seen how to dynamically create wasm modules, it’s time to turn our attention to the task of creating a compiler. We’ll start with a bit of terminology.
Here's some chasm code annotated to show the key components of a language:
Rather than give a 'textbook definition' of each, you'll become familiar with them as the compiler evolves.
The compiler itself will be formed of three parts: the tokenizer, which breaks up the input program (which is a string) into discrete tokens; the parser, which takes these tokens and converts them into an Abstract Syntax Tree (AST); and finally the emitter, which converts the AST into a wasm binary module.
This is a pretty standard compiler architecture:
Rather than dive into a complete implementation, we’ll tackle a small subset of the problem. The goal is to create a compiler for a language that just supports print statements which print simple numeric literals…
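The three-stage architecture can be expressed as a handful of types. This is a hedged sketch: the `Token` and `ProgramNode` shapes, the stage signatures and the `createCompiler` helper are all assumptions made for illustration, and the stand-in stages at the bottom exist only to show how data flows through the pipeline.

```typescript
// Hypothetical shapes for the data passed between the stages.
interface Token { type: string; value: string }
interface ProgramNode { type: string }

type Tokenizer = (input: string) => Token[];
type Parser = (tokens: Token[]) => ProgramNode[];
type Emitter = (ast: ProgramNode[]) => Uint8Array;

// A compiler is simply the three stages applied in order.
const createCompiler =
  (tokenize: Tokenizer, parse: Parser, emit: Emitter) =>
  (source: string): Uint8Array =>
    emit(parse(tokenize(source)));

// Stand-in stages: split on whitespace, one node per token, one byte out.
const compileStub = createCompiler(
  source =>
    source
      .split(/\s+/)
      .filter(word => word.length > 0)
      .map(value => ({ type: "word", value })),
  tokens => tokens.map(() => ({ type: "statement" })),
  ast => Uint8Array.from([ast.length])
);

const stubOutput = compileStub("print 23.1");
console.log(stubOutput);
```

Keeping each stage a plain function makes it easy to unit test them in isolation, which matters once the language grows beyond print statements.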
The Tokenizer
The tokenizer works by advancing through the input string, one character at a time, matching patterns that represent specific token types. The following code creates three matchers (number, keyword, and whitespace), using simple regular expressions:

    const keywords = ["print"];

    // returns a token if the given regex matches at the current index
    const regexMatcher = (regex: string, type: TokenType): Matcher =>
      (input: string, index: number) => {
        const match = input.substring(index).match(regex);
        return (
          match && {
            type,
            value: match[0]
          }
        );
      };

    const matchers = [
      regexMatcher("^[.0-9]+", "number"),
      regexMatcher(`^(${keywords.join("|")})`, "keyword"),
      regexMatcher("^\\s+", "whitespace")
    ];

(Note, these regular expressions are not terribly robust!)
The Matcher interface defines a function that, given an input string and an index, returns a token if a match occurs.
The main body of the tokenizer iterates over the characters of the string, finding the first match and adding the resulting token to the output array:

    export const tokenize: Tokenizer = input => {
      const tokens: Token[] = [];
      let index = 0;
      while (index < input.length) {
        const matches = matchers.map(m => m(input, index)).filter(f => f);
        const match = matches[0];
        if (match.type !== "whitespace") {
          tokens.push(match);
        }
        index += match.value.length;
      }
      return tokens;
    };

For the program "print 23.1", the tokenizer outputs two tokens: a keyword token with the value print, followed by a number token with the value 23.1. As you can see, the tokenizer removes whitespace as it has no meaning (for this specific language); it also ensures that everything in the input string is a valid token. However, it does not make any guarantees about the input being well-formed. For example, the tokenizer will happily handle "print print", which is clearly incorrect.
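To make the tokenizer easy to try in isolation, here is a self-contained sketch of it. The `TokenType`, `Token` and `Matcher` definitions are my assumptions (the post doesn't show them), and the explicit error for unmatched input is an addition for illustration rather than the project's own error handling.

```typescript
type TokenType = "number" | "keyword" | "whitespace";
interface Token { type: TokenType; value: string }
type Matcher = (input: string, index: number) => Token | null;

const keywords = ["print"];

// Returns a token if the given regex matches at the current index.
const regexMatcher = (regex: string, type: TokenType): Matcher =>
  (input, index) => {
    const match = input.substring(index).match(regex);
    return match ? { type, value: match[0] } : null;
  };

const matchers: Matcher[] = [
  regexMatcher("^[.0-9]+", "number"),
  regexMatcher(`^(${keywords.join("|")})`, "keyword"),
  regexMatcher("^\\s+", "whitespace")
];

const tokenize = (input: string): Token[] => {
  const tokens: Token[] = [];
  let index = 0;
  while (index < input.length) {
    // take the first matcher that recognises the text at this position
    const match = matchers
      .map(matcher => matcher(input, index))
      .find(candidate => candidate !== null);
    if (!match) {
      throw new Error(`Unexpected token at index ${index}`);
    }
    if (match.type !== "whitespace") {
      tokens.push(match);
    }
    index += match.value.length;
  }
  return tokens;
};

const printTokens = tokenize("print 23.1");
console.log(printTokens);
```

Note that `tokenize("print print")` succeeds and returns two keyword tokens; rejecting that input is the parser's job.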
The array of tokens is next fed into the parser.
The Parser
The goal of the parser is the creation of an Abstract Syntax Tree (AST), a tree structure that encodes the relationship between these tokens, resulting in a form that could potentially be sent to an interpreter for direct execution.
The parser iterates through the supplied tokens, consuming them via an eatToken function.

    export const parse: Parser = tokens => {
      const iterator = tokens[Symbol.iterator]();
      let currentToken = iterator.next().value;

      const eatToken = () =>
        (currentToken = iterator.next().value);

      [...]

      const nodes: StatementNode[] = [];
      // currentToken becomes undefined once every token has been eaten
      while (currentToken) {
        nodes.push(parseStatement());
      }

      return nodes;
    };

(I've no idea where the concept of eating tokens comes from, it appears to be standard parser terminology, they are clearly hungry beasts!)
The goal of the above parser is to turn the token array into an array of statements, which are the core building blocks of this language. It expects the given tokens to conform to this pattern, and will throw an error (not shown above) if they do not.
The parseStatement function expects each statement to start with a keyword - switching on its value:

    const parseStatement = () => {
      if (currentToken.type === "keyword") {
        switch (currentToken.value) {
          case "print":
            eatToken();
            return {
              type: "printStatement",
              expression: parseExpression()
            };
        }
      }
    };

Currently the only supported keyword is print; in this case it returns an AST node of type printStatement, parsing the associated expression.
And here is the expression parser:

    const parseExpression = () => {
      let node: ExpressionNode;
      switch (currentToken.type) {
        case "number":
          node = {
            type: "numberLiteral",
            value: Number(currentToken.value)
          };
          eatToken();
          return node;
      }
    };

In its present form the language only accepts expressions which are composed of a single number - i.e. a numeric literal. Therefore the above expression parser expects the next token to be a number, and when this matches, it returns a node of type numberLiteral.
Continuing the simple example of the program "print 23.1", the parser outputs the following AST:

    [
      {
        "type": "printStatement",
        "expression": {
          "type": "numberLiteral",
          "value": 23.1
        }
      }
    ]

As you can see, the AST for this language is an array of statement nodes. Parsing guarantees that the input program is syntactically correct, i.e. it is properly constructed, but it does not of course guarantee that it will execute successfully; runtime errors might still be present (although for this simple language they are not possible!).
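Putting the parser pieces together, a self-contained version might look like the following. It is a sketch rather than the project's code: it walks the token array by position instead of using an iterator, the node interfaces are assumed shapes, and the error messages are invented for illustration (the post mentions that errors are thrown but doesn't show them).

```typescript
interface Token { type: "number" | "keyword"; value: string }
interface NumberLiteralNode { type: "numberLiteral"; value: number }
type ExpressionNode = NumberLiteralNode;
interface PrintStatementNode { type: "printStatement"; expression: ExpressionNode }
type StatementNode = PrintStatementNode;

const parse = (tokens: Token[]): StatementNode[] => {
  let position = 0;
  const currentToken = (): Token | undefined => tokens[position];
  const eatToken = (): void => {
    position += 1;
  };
  const describe = (token: Token | undefined): string =>
    token ? token.value : "end of input";

  // An expression is currently just a single numeric literal.
  const parseExpression = (): ExpressionNode => {
    const token = currentToken();
    if (token === undefined || token.type !== "number") {
      throw new Error(`Expected a number but found ${describe(token)}`);
    }
    eatToken();
    return { type: "numberLiteral", value: Number(token.value) };
  };

  // A statement must start with a keyword; print is the only one supported.
  const parseStatement = (): StatementNode => {
    const token = currentToken();
    if (token !== undefined && token.type === "keyword" && token.value === "print") {
      eatToken();
      return { type: "printStatement", expression: parseExpression() };
    }
    throw new Error(`Unexpected token ${describe(token)}`);
  };

  const nodes: StatementNode[] = [];
  while (currentToken() !== undefined) {
    nodes.push(parseStatement());
  }
  return nodes;
};

const ast = parse([
  { type: "keyword", value: "print" },
  { type: "number", value: "23.1" }
]);
console.log(JSON.stringify(ast, null, 2));
```

Feeding it the tokens for "print print" throws, because the second print is not a valid expression.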
We’re onto the final step now…
The Emitter
Currently the emitter outputs a hard-coded add function. It now needs to take this AST and emit the appropriate instructions, as follows:

    const codeFromAst = ast => {
      const code = [];

      const emitExpression = node => {
        switch (node.type) {
          case "numberLiteral":
            code.push(Opcodes.f32_const);
            code.push(...ieee754(node.value));
            break;
        }
      };

      ast.forEach(statement => {
        switch (statement.type) {
          case "printStatement":
            emitExpression(statement.expression);
            code.push(Opcodes.call);
            code.push(...unsignedLEB128(0));
            break;
        }
      });

      return code;
    };

The emitter iterates over the statements that form the ‘root’ of the AST, matching our only statement type - print. Notice that the first thing it does is emit the instructions for the statement's expression. Recall that WebAssembly is a stack machine, so the expression instructions must be processed first, leaving the result on the stack.
The print function is implemented via a call operation, which invokes the function at index zero.
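The bytes emitted for a single print statement can be sketched as follows. The `ieee754` helper isn't shown in the post, so this version (built on `DataView`) is an assumption about how it might work; the opcode values 0x43 and 0x10 are the standard encodings for f32.const and call, written here as plain constants rather than via the Opcodes enum.

```typescript
const f32ConstOpcode = 0x43; // push a 32-bit float constant
const callOpcode = 0x10;     // call the function at the given index

// Encode a number as a little-endian IEEE 754 32-bit float.
const ieee754 = (value: number): number[] => {
  const buffer = new ArrayBuffer(4);
  new DataView(buffer).setFloat32(0, value, true);
  return Array.from(new Uint8Array(buffer));
};

// Instructions for "print <value>": push the constant, then call function 0.
const emitPrint = (value: number): number[] => [
  f32ConstOpcode,
  ...ieee754(value),
  callOpcode,
  0x00 // function index 0, which fits in a single LEB128 byte
];

const printOne = emitPrint(1);
console.log(printOne.map(byte => byte.toString(16).padStart(2, "0")).join(" "));
```

The operand follows its opcode directly in the byte stream, which is why the constant's four bytes sit between f32.const and call.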
Previously we've seen how wasm modules can export functions (as per the add example above). They can also import functions, which are supplied when you instantiate the module. Here we provide an env.print function that logs to the console:

    const instance = await WebAssembly.instantiate(wasm, {
      env: {
        print: console.log
      }
    });

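To see the import mechanism end to end, here is a hand-built module that imports env.print and calls it with a constant. Several things here are assumptions made for the sake of a runnable sketch: the exported function name "run", the helper names, the single-byte length prefixes (valid only because every length is below 128), and the use of the synchronous constructors instead of `await`.

```typescript
// Single-byte length prefixes are valid LEB128 only for lengths below 128.
const encodeString = (text: string): number[] => [
  text.length,
  ...Array.from(text, character => character.charCodeAt(0))
];
const createSection = (id: number, contents: number[]): number[] => [
  id,
  contents.length,
  ...contents
];
const ieee754 = (value: number): number[] => {
  const buffer = new ArrayBuffer(4);
  new DataView(buffer).setFloat32(0, value, true);
  return Array.from(new Uint8Array(buffer));
};

const buildPrintModule = (value: number) => {
  const body = [
    0x00,                    // no locals
    0x43, ...ieee754(value), // f32.const value
    0x10, 0x00,              // call function 0 (the imported print)
    0x0b                     // end
  ];
  return Uint8Array.from([
    0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,
    // types: 0 = (f32) -> (), 1 = () -> ()
    ...createSection(1, [0x02, 0x60, 0x01, 0x7d, 0x00, 0x60, 0x00, 0x00]),
    // imports: env.print is a function of type 0
    ...createSection(2, [0x01, ...encodeString("env"), ...encodeString("print"), 0x00, 0x00]),
    // functions: one locally defined function of type 1
    ...createSection(3, [0x01, 0x01]),
    // exports: "run" is function 1, since imports occupy the lower indices
    ...createSection(7, [0x01, ...encodeString("run"), 0x00, 0x01]),
    // code: the body of function 1
    ...createSection(10, [0x01, body.length, ...body])
  ]);
};

// Record what the module prints as well as logging it.
const printed: number[] = [];
const printModule = new WebAssembly.Module(buildPrintModule(23.1));
const printInstance = new WebAssembly.Instance(printModule, {
  env: {
    print: (value: number) => {
      printed.push(value);
      console.log(value);
    }
  }
});
(printInstance.exports.run as () => void)();
```

The logged value is 23.1 rounded to 32-bit float precision, so it will not print as exactly 23.1.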
An interesting result, considering that for most people WebAssembly is a compilation target, rather than a language they will use directly.
One final point, it was never my aim to create a fully-featured programming language, or one that is actually any good. My goal was to create ‘enough’ of a language to allow me to write a program that renders a mandelbrot set. This language is compiled to WebAssembly using my compiler, which is written in TypeScript and runs in the browser.
play with it online if you like .
Enough rambling – time for some code! A minimal wasm module
Before tackling the compiler, we’ll start with something simpler, creating a minimal WebAssembly module.
Here is an emitter (the term used for the part of a compiler that outputs instructions for the target system), that creates the smallest valid WebAssembly module:
[loop condition] [loop condition] (const) magicModuleHeader =[0x00, 0x61, 0x73, 0x6d]; const (moduleVersion) (=[0x01, 0x00, 0x00, 0x00]; export (const) (emitter) : Emitter =
()
=> Uint8Array . (from) ([ ...magicModuleHeader, ...moduleVersion ]);
It is comprised of two parts, the 'magic' header, which is the ASCII string
0asm , and a version number. These eight bytes form valid WebAssembly (or wasm) module. More typically these would be delivered to the browser as a . wasm file.
In order to execute the WebAssembly module it needs to be instantiated as follows: [loop condition] [loop condition] (const) wasm =(emitter) ();
const (instance) (=(await) WebAssembly . (instantiate) ( (wasm) )
If you run the above you'll find that (instance) does actually do anything because our wasm module does not contain any instructions!
If you're interested in trying out this code for yourself, it is all on GitHub - with a commit for each step
An add function
Let's make the wasm module do something more useful, by implementing a function that adds a couple of floating point numbers together.
WebAssembly is a binary format, which isn't terribly readable (to humans at least), which is why you'll more typically see it written in WebAssembly Text Format (WAT). Here's a module, presented in WAT format, that defines an exported function named ($ add) that takes two floating point parameters, adds them together and returns them:
(module (func $ add (param f (param f [loop condition] ) (result f 069 get_local 0 get_local 1 f . add) (export "add" (func 0)) )
If you just want to experiment with WAT you can use the (wat2wasm) (tool from the WebAssembly Binary Toolkit to compile WAT files into wasm modules.
The above code reveals some interesting details around WebAssembly -
WebAssembly is a low-level language, with a small (approx 82 instruction set, where many of the instructions map quite closely to CPU instructions. This makes it easy to compile wasm modules to CPU-specific machine code.
It has no built in I / O. There are no instructions for writing to the terminal, screen or network. In order to wasm modules to interact with the outside world they need to do so via their host environment, which in the case of the browser is JavaScript.
WebAssembly is a stack machine, in the above example get_local 0 gets the local variable (in this case the function param) at the zeroth index and pushes it onto the stack, as does the subsequent instruction. The
f3.add
instruction pops two values form the stack, adds them together than pushes the value back on the stack.
WebAssembly has just four numeric types, two integer, two floats. More on this later…
Let's update the emitter to output this 'hard coded' WebAssembly module. WebAssembly modules are composed of a pre-defined set of optional sections, each prefixed with a numeric identifier. These include a type section, which encode type signatures, and function section, which indicates the type of each function. I’ll not cover how these are constructed here - they are quite dull. If you're interested, (look at the next commit in the project . [nested statements] The interesting part is the code section. Here is how the above add (function is created in binary:) [loop condition] [loop condition] (const) code =[ Opcodes.get_local /0x20 */, ...unsignedLEB128(0), Opcodes.get_local /0x20 */, ...unsignedLEB128(1), Opcodes.f32_add /0x92 */]; const functionBody (=(encodeVector) () )
/ locals / , ... code [{
type: "printStatement", expression: { type: "binaryExpression", left: { type: "binaryExpression", left: { type: "numberLiteral", value: 42 }, right: { type: "numberLiteral", value: 10 }, operator: " " }, right: { type: "numberLiteral", value: 2 }, operator: "/" }}]
, Opcodes . (end) / 0x0b / ]); const codeSection (=(createSection) () (Section)
. code / 0x0a / , encodeVector ([functionBody])); [loop condition] I've defined an (Opcodes) enum (I'm using TypeScript), which contains all of the wasm instructions. The the unsignedLEB 0726 function is a standard variable length encoding which is used for encoding instruction parameters.
The instructions for a function are combined with the function's local variables (of which there are none in this case), and an (end) (opcode that signals the end of a function. Finally all the functions are encoded into a section. The
encodeVector function simply prefixes a collection of byte arrays with the total length.
And there you have it, the complete module, which is about 60 bytes in total.
The JavaScript hosting code can now be updated to involve this exported function: [loop condition] [loop condition] (const) { (instance) } [Symbol.iterator]=(await) (WebAssembly) . (instantiate)
( (wasm) ); console . (log) ( (instance) . (exports) ) add (5) , (6) ));
Interestingly if you inspect the exported (add) function with the Chrome Dev Tools it identifier it as a 'native function'.
You can see the complete code for this step (with unit tests - go me!) on GitHub . Building a compiler
Now that you’ve seen how to dynamically create wasm modules, it’s time to turn our attention to the task of creating a compiler. We’ll start with a bit of terminology.
Here's some chasm code annotated to show the key components of a language:
[ { "type": "printStatement", "expression": { "type": "numberLiteral", "value": 23.1 } }]
Rather than give a 'textbook definition' of each, you'll become familiar with them as the compiler evolves.
The compiler itself will be formed of three parts, the tokenizer [ Opcodes.get_local /0x20 */, ...unsignedLEB128(0), Opcodes.get_local /0x20 */, ...unsignedLEB128(1), Opcodes.f32_add /0x92 */] which breaks up the input program (which is a string), into discrete tokens, the (parser) that takes these tokens and converts them into an Abstract Syntax Tree (AST), and finally the (emitter) (which converts the AST into wasm binary module.
This is a pretty standard compiler architecture:
[{ type: "printStatement", expression: { type: "binaryExpression", left: { type: "binaryExpression", left: { type: "numberLiteral", value: 42 }, right: { type: "numberLiteral", value: 10 }, operator: " " }, right: { type: "numberLiteral", value: 2 }, operator: "/" }}]
Rather than dive into a complete implementation, we’ll tackle a small subset of the problem. The goal is to create a compiler for a language that just supports print statements which print simple numeric literals…
The Tokenizer
The The tokenizer works by advancing through the input string, one character at a time, matching patterns that represent specific token types. The following code creates three matches ( number , keyword [] , and (whitespace) , using simple regular expressions:
[loop condition] [loop condition] (const) keywords =["print"]; // returns a token if the given regex matches at the current index const (regexMatcher) (=( (regex) : (string) , type : (TokenType)
): Matcher => (input) : (string) , (index) : (number) ( => {{ const (match) (=(input) . (substring) () (index) (match ( regex );
return ( match && ({ type , value : (match) [
0] } ; }; const (matchers) (=[
regexMatcher("^[.0-9] " , [] [
{ "type": "printStatement", "expression": { "type": "numberLiteral", "value": 23.1 } }] (number) " , regexMatcher ([
{ "type": "printStatement", "expression": { "type": "numberLiteral", "value": 23.1 } }] (`^ ( ($ {) (keywords) . (join ( | ) () , “ (keyword [
regexMatcher("^[.0-9] [loop condition] ), regexMatcher ([
{ "type": "printStatement", "expression": { "type": "numberLiteral", "value": 23.1 } }] () ^ \ (s ) , (whitespace) ];
(Note, these regular expressions are not terribly robust!) [ { "type": "printStatement", "expression": { "type": "numberLiteral", "value": 23.1 } }] [nested statements] The (Matcher) interface defines a function that given an input string and an index returns a token if a match occurs.
The main body of the parser iterates over the characters of the string, finding the first match, adding the provided token to the output array :
[loop condition] [loop condition] (export const tokenize : (Tokenizer)
= (input)=> ({ const tokens : (Token) [] =[] []; let (index) (=(0) ; while ([ { "type": "printStatement", "expression": { "type": "numberLiteral", "value": 23.1 } }] (index) (input . (length) ) {{ const matches (=(matchers) . (map) () (m) => (m) ( (input [Symbol.iterator] , (index) (). filter ([loop condition] (f) => f ) const (match) (=matches [0]; if ([ { "type": "printStatement", "expression": { "type": "numberLiteral", "value": 23.1 } }] (match) . (type !==(whitespace) ) ({ tokens . (push) ( match ); } index [] =(match) . (value) . (length) ; }
return tokens ; };
Here is the tokenized output of the program (print) . 1 ": [loop condition] [loop condition] [Symbol.iterator] [] As you can see from the above input, the tokeniser removes whitespace as it has no meaning (for this specific language), it also Ensures that everything in the input string is a valid token. However, it does not make any guarantees about the input being well-formed, for example the tokeniser will happily handle (print print) , which is clearly incorrect.
The array of tokens is next fed into the parser. The Parser
[nested statements] The goal of the parser is the creation of an Abstract Syntax Tree (AST), a tree structure that encodes the relationship between these tokens , resulting in a form that could potentially be sent to an interpreter for direct execution.
The parser iterates through the supplied tokens, consuming them via an eatToken function.
[loop condition] [loop condition] (export const (parse) : (Parser)
==> ({ const (iterator) (=(tokens) [Symbol.iterator] ();
let (currentToken [loop condition] (=(iterator) . (next) (). (value) ; const (eatToken) (=() => (currentToken [loop condition] (=(iterator) . (next) (). (value) ; [...] const (nodes) : (StatementNode) [] =[] []; while ([
{ "type": "printStatement", "expression": { "type": "numberLiteral", "value": 23.1 } }] (index) tokens . (length) ) {{ nodes . (push) ( parseStatement ()); }
return (nodes) ; };
(I've no idea where the concept of eating tokens comes from, it appears to be standard parser terminology, they are clearly hungry beasts!)
[nested statements] The goal of the above parser is to turn the token array into an array of statements, which are the core building blocks of this language . It expects the given tokens to conform to this pattern, and will throw an error (not shown above) if it does not.
[nested statements] The (parseStatement) function expects each statement to start with a keyword - switching on its value:
[loop condition] [loop condition] (const) parseStatement =() [ { "type": "printStatement", "expression": { "type": "numberLiteral", "value": 23.1 } }]
=>
{{ if ([
{ "type": "printStatement", "expression": { "type": "numberLiteral", "value": 23.1 } }] (currentToken) . (type=== keyword ) ({ switch ([ { "type": "printStatement", "expression": { "type": "numberLiteral", "value": 23.1 } }] (currentToken) . (value) ) ({
case [
    const emitter = () =>
      Uint8Array.from([...magicModuleHeader, ...moduleVersion]);
The module comprises two parts: the 'magic' header, which is the ASCII string \0asm, and a version number. These eight bytes form a valid WebAssembly (or wasm) module. More typically these would be delivered to the browser as a .wasm file.
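As a rough illustration of those eight bytes, the sketch below spells them out. The names magicModuleHeader and moduleVersion come from the emitter snippet; the byte values are the standard wasm preamble, while emptyModule and magicText are names chosen here purely for illustration.

```typescript
// The 'magic' header: a NUL byte followed by the ASCII characters "asm".
const magicModuleHeader: number[] = [0x00, 0x61, 0x73, 0x6d];
// Version 1, stored as a 4-byte little-endian integer.
const moduleVersion: number[] = [0x01, 0x00, 0x00, 0x00];

// Hypothetical name: the smallest possible module is just these two parts.
const emptyModule = Uint8Array.from([...magicModuleHeader, ...moduleVersion]);

// Decode the printable part of the header to confirm what it spells.
const magicText = String.fromCharCode(...magicModuleHeader.slice(1));
console.log(emptyModule.length, magicText);
```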
In order to execute the WebAssembly module it needs to be instantiated as follows:

    const wasm = emitter();
    const { instance } = await WebAssembly.instantiate(wasm);

If you run the above you'll find that instance doesn't actually do anything, because our wasm module does not contain any instructions!
If you're interested in trying out this code for yourself, it is all on GitHub - with a commit for each step.
An add function
Let's make the wasm module do something more useful, by implementing a function that adds a couple of floating point numbers together.
WebAssembly is a binary format, which isn't terribly readable (to humans at least), which is why you'll more typically see it written in WebAssembly Text Format (WAT). Here's a module, presented in WAT format, that defines an exported function named $add that takes two floating point parameters, adds them together and returns the result:

    (module
      (func $add (param f32) (param f32) (result f32)
        get_local 0
        get_local 1
        f32.add)
      (export "add" (func 0))
    )

If you just want to experiment with WAT you can use the wat2wasm tool from the WebAssembly Binary Toolkit to compile WAT files into wasm modules.
The above code reveals some interesting details around WebAssembly -
WebAssembly is a low-level language, with a small (approx 82) instruction set, where many of the instructions map quite closely to CPU instructions. This makes it easy to compile wasm modules to CPU-specific machine code.
It has no built-in I/O. There are no instructions for writing to the terminal, screen or network. In order for wasm modules to interact with the outside world they need to do so via their host environment, which in the case of the browser is JavaScript.
WebAssembly is a stack machine. In the above example get_local 0 gets the local variable (in this case the function param) at the zeroth index and pushes it onto the stack, as does the subsequent instruction. The f32.add instruction pops two values from the stack, adds them together, then pushes the result back on the stack.
WebAssembly has just four numeric types: two integer and two floating point. More on this later…
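To make the stack machine point concrete, here is a small sketch of an interpreter for just the two instructions used by the add function. It is an illustration only: the Instruction type and the run function are hypothetical names invented for this example, and a real wasm engine works on binary opcodes rather than objects.

```typescript
// Hypothetical instruction shape for this sketch; real wasm uses binary opcodes.
type Instruction =
  | { op: "get_local"; index: number }
  | { op: "f32.add" };

// Runs the instructions against a value stack and returns the top of the stack.
const run = (instructions: Instruction[], locals: number[]): number => {
  const stack: number[] = [];
  for (const instruction of instructions) {
    switch (instruction.op) {
      case "get_local":
        // Push the local (here, a function parameter) at the given index.
        stack.push(locals[instruction.index]);
        break;
      case "f32.add": {
        // Pop two operands, add them, and push the 32-bit float result back.
        const right = stack.pop() as number;
        const left = stack.pop() as number;
        stack.push(Math.fround(left + right));
        break;
      }
    }
  }
  return stack[stack.length - 1];
};

const addBody: Instruction[] = [
  { op: "get_local", index: 0 },
  { op: "get_local", index: 1 },
  { op: "f32.add" }
];
console.log(run(addBody, [5, 6])); // 11
```

Notice that the operands are pushed before the operator runs, which is why the emitter later in this post emits an expression before the instruction that consumes it.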
Let's update the emitter to output this 'hard coded' WebAssembly module. WebAssembly modules are composed of a pre-defined set of optional sections, each prefixed with a numeric identifier. These include a type section, which encodes type signatures, and a function section, which indicates the type of each function. I’ll not cover how these are constructed here - they are quite dull. If you're interested, look at the next commit in the project.
The interesting part is the code section. Here is how the above add function is created in binary:

    const code = [
      Opcodes.get_local /* 0x20 */,
      ...unsignedLEB128(0),
      Opcodes.get_local /* 0x20 */,
      ...unsignedLEB128(1),
      Opcodes.f32_add /* 0x92 */
    ];

    const functionBody = encodeVector([
      ...encodeVector([]) /* locals */,
      ...code,
      Opcodes.end /* 0x0b */
    ]);

    const codeSection = createSection(
      Section.code /* 0x0a */,
      encodeVector([functionBody])
    );

I've defined an Opcodes enum (I'm using TypeScript), which contains all of the wasm instructions. The unsignedLEB128 function is a standard variable length encoding which is used for encoding instruction parameters.
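The body of unsignedLEB128 is not shown in this post, so the following is one common way to write an unsigned LEB128 encoder, offered as a sketch rather than the project's own implementation. It assumes the input is a non-negative integer that fits in 32 bits.

```typescript
// Encodes a non-negative integer (below 2^32) as unsigned LEB128:
// seven bits per byte, least significant group first, with the high bit
// of each byte set when more bytes follow.
const unsignedLEB128 = (value: number): number[] => {
  const bytes: number[] = [];
  let remaining = value;
  do {
    let byte = remaining & 0x7f; // take the low seven bits
    remaining >>>= 7;            // and shift them out
    if (remaining !== 0) {
      byte |= 0x80;              // continuation flag: more bytes to come
    }
    bytes.push(byte);
  } while (remaining !== 0);
  return bytes;
};

console.log(unsignedLEB128(1));      // small values fit in a single byte
console.log(unsignedLEB128(624485)); // larger values spill into further bytes
```

Small indices such as the 0 and 1 passed to get_local encode to a single byte, which is why the function body stays so compact.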
The instructions for a function are combined with the function's local variables (of which there are none in this case), and an end opcode that signals the end of a function. Finally all the functions are encoded into a section. The encodeVector function simply prefixes a collection of byte arrays with the total length.
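Neither encodeVector nor createSection is shown in this post, so here is a hedged sketch of how they might look. It assumes a vector is written as its item count followed by the items flattened one level, and that a section is its identifier followed by a byte vector. The helper bodies and the addCode name are assumptions made for this example; the opcode values are the ones quoted in the comments above.

```typescript
// Unsigned LEB128, repeated here so the sketch is self-contained.
const unsignedLEB128 = (value: number): number[] => {
  const bytes: number[] = [];
  let remaining = value;
  do {
    let byte = remaining & 0x7f;
    remaining >>>= 7;
    if (remaining !== 0) byte |= 0x80;
    bytes.push(byte);
  } while (remaining !== 0);
  return bytes;
};

// Assumption: item count first, then the items flattened into one byte array.
const encodeVector = (data: (number | number[])[]): number[] => [
  ...unsignedLEB128(data.length),
  ...([] as number[]).concat(...data)
];

// Assumption: a section is its numeric identifier followed by a byte vector.
const createSection = (sectionId: number, data: number[]): number[] => [
  sectionId,
  ...encodeVector(data)
];

// get_local 0, get_local 1, f32.add
const addCode = [0x20, ...unsignedLEB128(0), 0x20, ...unsignedLEB128(1), 0x92];

const functionBody = encodeVector([
  ...encodeVector([]), // no local variables
  ...addCode,
  0x0b // end
]);

const codeSection = createSection(0x0a, encodeVector([functionBody]));
console.log(codeSection);
```

For a flat list of bytes the item count and the byte length are the same thing, which is how the one helper can serve both for the function body and for the list of functions.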
And there you have it, the complete module, which is about 60 bytes in total.
The JavaScript hosting code can now be updated to invoke this exported function:

    const { instance } = await WebAssembly.instantiate(wasm);
    console.log(instance.exports.add(5, 6));

Interestingly, if you inspect the exported add function with the Chrome Dev Tools, it identifies it as a 'native function'.
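If you'd like to see the whole thing end to end without the emitter, the sketch below hand-assembles a module for the add function and calls it. The type, function and export section bytes are my own encoding of the WAT shown earlier rather than output copied from the project, and the names addModuleBytes, wasmApi, addInstance and sum are hypothetical. It uses the synchronous WebAssembly.Module and WebAssembly.Instance constructors instead of instantiate, simply to avoid needing await.

```typescript
// Hand-assembled bytes for a module exporting add(f32, f32) -> f32.
const addModuleBytes = Uint8Array.from([
  0x00, 0x61, 0x73, 0x6d, // magic header "\0asm"
  0x01, 0x00, 0x00, 0x00, // version 1
  // type section: one signature, (f32, f32) -> f32
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7d, 0x7d, 0x01, 0x7d,
  // function section: function 0 uses signature 0
  0x03, 0x02, 0x01, 0x00,
  // export section: export function 0 under the name "add"
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00,
  // code section: get_local 0, get_local 1, f32.add, end
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x92, 0x0b
]);

// Reached through globalThis so the sketch type-checks without DOM typings.
const wasmApi = (globalThis as any).WebAssembly;

// Synchronous compile and instantiate, which is fine for a module this small.
const addInstance = new wasmApi.Instance(new wasmApi.Module(addModuleBytes));
const sum: number = addInstance.exports.add(5, 6);
console.log(sum); // 11
```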
You can see the complete code for this step (with unit tests - go me!) on GitHub.
Building a compiler
Now that you’ve seen how to dynamically create wasm modules, it’s time to turn our attention to the task of creating a compiler. We’ll start with a bit of terminology.
Here's some chasm code annotated to show the key components of a language:
[Figure: chasm code annotated with the key components of a language]
Rather than give a 'textbook definition' of each, you'll become familiar with them as the compiler evolves.
The compiler itself will be formed of three parts: the tokenizer, which breaks up the input program (which is a string) into discrete tokens; the parser, which takes these tokens and converts them into an Abstract Syntax Tree (AST); and finally the emitter, which converts the AST into a wasm binary module.
This is a pretty standard compiler architecture:
[Figure: the compiler pipeline, from tokenizer to parser to emitter]
Rather than dive into a complete implementation, we’ll tackle a small subset of the problem. The goal is to create a compiler for a language that just supports print statements which print simple numeric literals…
The Tokenizer
The tokenizer works by advancing through the input string, one character at a time, matching patterns that represent specific token types. The following code creates three matchers (number, keyword, and whitespace), using simple regular expressions:

    const keywords = ["print"];

    // returns a token if the given regex matches at the current index
    const regexMatcher = (regex: string, type: TokenType): Matcher => (
      input: string,
      index: number
    ) => {
      const match = input.substring(index).match(regex);
      return (
        match && {
          type,
          value: match[0]
        }
      );
    };

    const matchers = [
      regexMatcher("^[.0-9]+", "number"),
      regexMatcher(`^(${keywords.join("|")})`, "keyword"),
      regexMatcher("^\\s+", "whitespace")
    ];

(Note, these regular expressions are not terribly robust!)
The Matcher interface defines a function that, given an input string and an index, returns a token if a match occurs.
The main body of the tokenizer iterates over the characters of the string, finding the first match and adding the resulting token to the output array:

    export const tokenize: Tokenizer = input => {
      const tokens: Token[] = [];
      let index = 0;
      while (index < input.length) {
        const matches = matchers.map(m => m(input, index)).filter(f => f);
        const match = matches[0];
        if (match.type !== "whitespace") {
          tokens.push(match);
        }
        index += match.value.length;
      }
      return tokens;
    };
Here is the tokenized output of the program "print 23.1":

    [
      { "type": "keyword", "value": "print" },
      { "type": "number", "value": "23.1" }
    ]

As you can see from the above output, the tokenizer removes whitespace as it has no meaning (for this specific language). It also ensures that everything in the input string is a valid token. However, it does not make any guarantees about the input being well-formed; for example, the tokenizer will happily handle "print print", which is clearly incorrect.
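If you want to try the tokenizer on its own, here is a self-contained sketch of it. The Token, TokenType and Matcher types are not shown in this post, so the definitions below are assumptions, and the error thrown for unrecognised input is an addition of mine (the snippet above would fail with a less helpful error on matches[0] being undefined).

```typescript
// Assumed type definitions; the post does not show the originals.
type TokenType = "number" | "keyword" | "whitespace";
interface Token {
  type: TokenType;
  value: string;
}
type Matcher = (input: string, index: number) => Token | null;

const keywords = ["print"];

// Returns a token if the given regex matches at the current index.
const regexMatcher = (regex: string, type: TokenType): Matcher => (input, index) => {
  const match = input.substring(index).match(regex);
  return match ? { type, value: match[0] } : null;
};

const matchers: Matcher[] = [
  regexMatcher("^[.0-9]+", "number"),
  regexMatcher(`^(${keywords.join("|")})`, "keyword"),
  regexMatcher("^\\s+", "whitespace")
];

const tokenize = (input: string): Token[] => {
  const tokens: Token[] = [];
  let index = 0;
  while (index < input.length) {
    // Take the first matcher that recognises the text at this position.
    const match = matchers
      .map(m => m(input, index))
      .filter((t): t is Token => t !== null)[0];
    if (!match) {
      // Assumption: unrecognised input is reported rather than skipped.
      throw new Error(`Unexpected token at index ${index}`);
    }
    if (match.type !== "whitespace") {
      tokens.push(match);
    }
    index += match.value.length;
  }
  return tokens;
};

console.log(tokenize("print 23.1"));
console.log(tokenize("print print")); // two keyword tokens, no complaint
```

The second call shows the limitation described above: the tokenizer only checks that each piece is a valid token, not that the sequence of tokens makes sense.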
The array of tokens is next fed into the parser.
The Parser
The goal of the parser is the creation of an Abstract Syntax Tree (AST), a tree structure that encodes the relationship between these tokens, resulting in a form that could potentially be sent to an interpreter for direct execution.
The parser iterates through the supplied tokens, consuming them via an eatToken function.

    export const parse: Parser = tokens => {
      const iterator = tokens[Symbol.iterator]();
      let currentToken = iterator.next().value;

      const eatToken = () => (currentToken = iterator.next().value);

      [...]

      const nodes: StatementNode[] = [];
      // loop until the iterator is exhausted and currentToken is undefined
      while (currentToken) {
        nodes.push(parseStatement());
      }

      return nodes;
    };

(I've no idea where the concept of eating tokens comes from, it appears to be standard parser terminology, they are clearly hungry beasts!)
The goal of the above parser is to turn the token array into an array of statements, which are the core building blocks of this language. It expects the given tokens to conform to this pattern, and will throw an error (not shown above) if they do not.
The parseStatement function expects each statement to start with a keyword - switching on its value:

    const parseStatement = () => {
      if (currentToken.type === "keyword") {
        switch (currentToken.value) {
          case "print":
            eatToken();
            return {
              type: "printStatement",
              expression: parseExpression()
            };
        }
      }
    };

Currently the only supported keyword is print; in this case it returns an AST node of type printStatement, parsing the associated expression.
And here is the expression parser:

    const parseExpression = () => {
      let node: ExpressionNode;
      switch (currentToken.type) {
        case "number":
          node = {
            type: "numberLiteral",
            value: Number(currentToken.value)
          };
          eatToken();
          return node;
      }
    };

In its present form the language only accepts expressions which are composed of a single number - i.e. a numeric literal. Therefore the above expression parser expects the next token to be a number, and when this matches, it returns a node of type numberLiteral.
Continuing the simple example of the program "print 23.1", the parser outputs the following AST:

    [
      {
        "type": "printStatement",
        "expression": {
          "type": "numberLiteral",
          "value": 23.1
        }
      }
    ]

As you can see, the AST for this language is an array of statement nodes. Parsing guarantees that the input program is syntactically correct, i.e. it is properly constructed, but it does not of course guarantee that it will execute successfully; runtime errors might still be present (although for this simple language they are not possible!).
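To see the parser produce that AST, here is a compact, self-contained sketch of it. It is not the project's code: the node interfaces are assumptions, a plain position counter stands in for the iterator, and the error messages are my own, since the post mentions errors but does not show them.

```typescript
// Assumed shapes for tokens and AST nodes.
interface Token {
  type: "number" | "keyword";
  value: string;
}
interface NumberLiteralNode {
  type: "numberLiteral";
  value: number;
}
interface PrintStatementNode {
  type: "printStatement";
  expression: NumberLiteralNode;
}

const parse = (tokens: Token[]): PrintStatementNode[] => {
  let position = 0; // hypothetical cursor, used here instead of an iterator
  const currentToken = (): Token | undefined => tokens[position];
  const eatToken = (): void => {
    position++;
  };

  // An expression is currently just a single numeric literal.
  const parseExpression = (): NumberLiteralNode => {
    const token = currentToken();
    if (!token || token.type !== "number") {
      throw new Error("Expected a number");
    }
    eatToken();
    return { type: "numberLiteral", value: Number(token.value) };
  };

  // A statement must start with a keyword; print is the only one so far.
  const parseStatement = (): PrintStatementNode => {
    const token = currentToken();
    if (token && token.type === "keyword" && token.value === "print") {
      eatToken();
      return { type: "printStatement", expression: parseExpression() };
    }
    throw new Error("Expected a statement keyword");
  };

  const nodes: PrintStatementNode[] = [];
  while (currentToken()) {
    nodes.push(parseStatement());
  }
  return nodes;
};

const ast = parse([
  { type: "keyword", value: "print" },
  { type: "number", value: "23.1" }
]);
console.log(JSON.stringify(ast, null, 2));
```

Feeding it the tokens for "print print" throws at the second token, which is the syntactic check the tokenizer could not provide.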
We’re onto the final step now…
The Emitter
Currently the emitter outputs a hard-coded add function. It now needs to take this AST and emit the appropriate instructions, as follows:

    const codeFromAst = ast => {
      const code = [];

      const emitExpression = node => {
        switch (node.type) {
          case "numberLiteral":
            code.push(Opcodes.f32_const);
            code.push(...ieee754(node.value));
            break;
        }
      };

      ast.forEach(statement => {
        switch (statement.type) {
          case "printStatement":
            emitExpression(statement.expression);
            code.push(Opcodes.call);
            code.push(...unsignedLEB128(0));
            break;
        }
      });

      return code;
    };

The emitter iterates over the statements that form the ‘root’ of the AST, matching our only statement type - print. Notice that the first thing it does is emit the instructions for the statement's expression. Recall that WebAssembly is a stack machine, hence the expression instructions must be processed first, leaving the result on the stack.
The print function is implemented via a call operation, which invokes the function at index zero.
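The emitter leans on two helpers whose bodies are not shown in this post. The sketch below fills them in so that the emitted bytes can be inspected. Treat the helper bodies as assumptions: this ieee754 writes a little-endian 32-bit float via DataView, and unsignedLEB128 is a common textbook form. The opcode values 0x43 (f32.const) and 0x10 (call) are taken from the WebAssembly instruction encoding, and emitted and decoded are names made up for the example.

```typescript
const Opcodes = { call: 0x10, f32_const: 0x43 };

// Assumed helper: a number as four little-endian bytes of a 32-bit float.
const ieee754 = (value: number): number[] => {
  const buffer = new ArrayBuffer(4);
  new DataView(buffer).setFloat32(0, value, true);
  return Array.from(new Uint8Array(buffer));
};

// Assumed helper: unsigned LEB128 for non-negative integers below 2^32.
const unsignedLEB128 = (value: number): number[] => {
  const bytes: number[] = [];
  let remaining = value;
  do {
    let byte = remaining & 0x7f;
    remaining >>>= 7;
    if (remaining !== 0) byte |= 0x80;
    bytes.push(byte);
  } while (remaining !== 0);
  return bytes;
};

interface NumberLiteralNode {
  type: "numberLiteral";
  value: number;
}
interface PrintStatementNode {
  type: "printStatement";
  expression: NumberLiteralNode;
}

const codeFromAst = (ast: PrintStatementNode[]): number[] => {
  const code: number[] = [];
  const emitExpression = (node: NumberLiteralNode): void => {
    // f32.const followed by its immediate value
    code.push(Opcodes.f32_const);
    code.push(...ieee754(node.value));
  };
  ast.forEach(statement => {
    // Operand first, so it is on the stack when the call executes.
    emitExpression(statement.expression);
    code.push(Opcodes.call);
    code.push(...unsignedLEB128(0)); // index of the function to call
  });
  return code;
};

const emitted = codeFromAst([
  { type: "printStatement", expression: { type: "numberLiteral", value: 23.1 } }
]);

// Read the four immediate bytes back to check they round-trip as a float.
const decoded = new DataView(Uint8Array.from(emitted.slice(1, 5)).buffer).getFloat32(0, true);
console.log(emitted, decoded);
```

The decoded value is 23.1 rounded to 32-bit precision, so it will not print as exactly 23.1.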
Previously we've seen how wasm modules can export functions (as per the add example above). They can also import functions, which are supplied when you instantiate the module. Here we provide an env.print function that logs to the console:

    const { instance } = await WebAssembly.instantiate(wasm, {
      env: { print: console.log }
    });