Next in database basics: 2. binary expressions and WHERE filters.

In this series we'll write a rudimentary database from scratch in Go. Project source code is available on Github.
In this first post we'll build enough of a parser to run some simple CREATE, INSERT, and SELECT queries. Then we'll build an in-memory backend supporting TEXT and INT types and write a basic REPL. We'll be able to support the following interaction:

```
$ go run *.go
Welcome to gosql.
# CREATE TABLE users (id INT, name TEXT);
ok
# INSERT INTO users VALUES (1, 'Phil');
ok
# SELECT id, name FROM users;
| id | name |
====================
| 1 | Phil |
ok
# INSERT INTO users VALUES (2, 'Kate');
ok
# SELECT name, id FROM users;
| name | id |
====================
| Phil | 1 |
| Kate | 2 |
ok
```
The first stage will be to map a SQL source into a list of tokens (lexing). Then we'll call parse functions to find individual SQL statements (such as SELECT). These parse functions will in turn call their own helper functions to find patterns of recursively parseable chunks, keywords, symbols (like parentheses), identifiers (like a table name), and numeric or string literals.

This post assumes a basic understanding of parsing concepts. We won't skip any code, but also won't go into great detail on why we structure things the way we do. For a simpler introduction to parsing and parsing concepts, see this post on parsing JSON.

Then, we'll write an in-memory backend to do operations based on an AST. Finally, we'll write a REPL to accept SQL from a CLI and pass it to the in-memory backend.

Lexing
The lexer is responsible for finding every distinct group of characters in source code: tokens. This will consist primarily of identifiers, numbers, strings, and symbols.
The gist of the logic will be to pass control to a helper function for each kind of token. If the helper function succeeds in finding a token, it will return true and the location for the lexer to start at next. It will continue doing this until it reaches the end of the source.
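To make that control flow concrete before we write the real thing, here is a toy lexer (a simplified, hypothetical sketch, not the code we build below) with just two helpers, one for digit runs and one for letter runs, that hands the cursor back and forth in exactly this way:

```go
package main

import "fmt"

// A helper returns the token it found, the next position the lexer
// should resume from, and whether it succeeded.
type helper func(src string, pos int) (string, int, bool)

func lexDigits(src string, pos int) (string, int, bool) {
	end := pos
	for end < len(src) && src[end] >= '0' && src[end] <= '9' {
		end++
	}
	return src[pos:end], end, end > pos
}

func lexLetters(src string, pos int) (string, int, bool) {
	end := pos
	for end < len(src) && src[end] >= 'a' && src[end] <= 'z' {
		end++
	}
	return src[pos:end], end, end > pos
}

// tinyLex tries each helper in turn until the source is consumed,
// or fails if no helper makes progress.
func tinyLex(src string) ([]string, error) {
	helpers := []helper{lexDigits, lexLetters}
	tokens := []string{}
outer:
	for pos := 0; pos < len(src); {
		for _, h := range helpers {
			if tok, next, ok := h(src, pos); ok {
				tokens = append(tokens, tok)
				pos = next
				continue outer
			}
		}
		return nil, fmt.Errorf("unable to lex at %d", pos)
	}
	return tokens, nil
}

func main() {
	tokens, err := tinyLex("abc123def")
	fmt.Println(tokens, err) // [abc 123 def] <nil>
}
```

The real lexer below follows the same shape, except tokens carry a kind and a source location, and the cursor tracks line and column for error messages.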
First off, we'll define a few types and constants for use in lexer.go:

```go
package gosql

import (
	"fmt"
	"strings"
)

type location struct {
	line uint
	col  uint
}

type keyword string

const (
	selectKeyword keyword = "select"
	fromKeyword   keyword = "from"
	asKeyword     keyword = "as"
	tableKeyword  keyword = "table"
	createKeyword keyword = "create"
	insertKeyword keyword = "insert"
	intoKeyword   keyword = "into"
	valuesKeyword keyword = "values"
	intKeyword    keyword = "int"
	textKeyword   keyword = "text"
)

type symbol string

const (
	semicolonSymbol  symbol = ";"
	asteriskSymbol   symbol = "*"
	commaSymbol      symbol = ","
	leftparenSymbol  symbol = "("
	rightparenSymbol symbol = ")"
)

type tokenKind uint

const (
	keywordKind tokenKind = iota
	symbolKind
	identifierKind
	stringKind
	numericKind
)

type token struct {
	value string
	kind  tokenKind
	loc   location
}

type cursor struct {
	pointer uint
	loc     location
}

func (t *token) equals(other *token) bool {
	return t.value == other.value && t.kind == other.kind
}

type lexer func(string, cursor) (*token, cursor, bool)
```
Next we'll write out the main loop:

```go
func lex(source string) ([]*token, error) {
	tokens := []*token{}
	cur := cursor{}

lex:
	for cur.pointer < uint(len(source)) {
		lexers := []lexer{lexKeyword, lexSymbol, lexString, lexNumeric, lexIdentifier}
		for _, l := range lexers {
			if token, newCursor, ok := l(source, cur); ok {
				cur = newCursor

				// Omit nil tokens for valid, but empty syntax like newlines
				if token != nil {
					tokens = append(tokens, token)
				}

				continue lex
			}
		}

		hint := ""
		if len(tokens) > 0 {
			hint = " after " + tokens[len(tokens)-1].value
		}
		return nil, fmt.Errorf("Unable to lex token%s, at %d:%d", hint, cur.loc.line, cur.loc.col)
	}

	return tokens, nil
}
```

Then we'll write a helper for each kind of fundamental token.
Analyzing numbers

Numbers are the most complex. So we'll refer to the PostgreSQL documentation (section 4.1.2.6) for what constitutes a valid number.

```go
func lexNumeric(source string, ic cursor) (*token, cursor, bool) {
	cur := ic

	periodFound := false
	expMarkerFound := false

	for ; cur.pointer < uint(len(source)); cur.pointer++ {
		c := source[cur.pointer]
		cur.loc.col++

		isDigit := c >= '0' && c <= '9'
		isPeriod := c == '.'
		isExpMarker := c == 'e'

		// Must start with a digit or period
		if cur.pointer == ic.pointer {
			if !isDigit && !isPeriod {
				return nil, ic, false
			}

			periodFound = isPeriod
			continue
		}

		if isPeriod {
			if periodFound {
				return nil, ic, false
			}

			periodFound = true
			continue
		}

		if isExpMarker {
			if expMarkerFound {
				return nil, ic, false
			}

			// No periods allowed after expMarker
			periodFound = true
			expMarkerFound = true

			// expMarker must be followed by digits
			if cur.pointer == uint(len(source)-1) {
				return nil, ic, false
			}

			cNext := source[cur.pointer+1]
			if cNext == '-' || cNext == '+' {
				cur.pointer++
				cur.loc.col++
			}

			continue
		}

		if !isDigit {
			break
		}
	}

	// No characters accumulated
	if cur.pointer == ic.pointer {
		return nil, ic, false
	}

	return &token{
		value: source[ic.pointer:cur.pointer],
		loc:   ic.loc,
		kind:  numericKind,
	}, cur, true
}
```

Analyzing strings

Strings must start and end with a single apostrophe. They can contain a single apostrophe if it is followed by another single apostrophe. We'll put this kind of character-delimited lexing logic into a helper function so we can use it again when analyzing identifiers.

```go
func lexCharacterDelimited(source string, ic cursor, delimiter byte) (*token, cursor, bool) {
	cur := ic

	if len(source[cur.pointer:]) == 0 {
		return nil, ic, false
	}

	if source[cur.pointer] != delimiter {
		return nil, ic, false
	}

	cur.loc.col++
	cur.pointer++

	var value []byte
	for ; cur.pointer < uint(len(source)); cur.pointer++ {
		c := source[cur.pointer]

		if c == delimiter {
			// SQL escapes are via double characters, not backslash.
			if cur.pointer+1 >= uint(len(source)) || source[cur.pointer+1] != delimiter {
				// Consume the closing delimiter
				cur.pointer++
				cur.loc.col++
				return &token{
					value: string(value),
					loc:   ic.loc,
					kind:  stringKind,
				}, cur, true
			} else {
				value = append(value, delimiter)
				cur.pointer++
				cur.loc.col++
			}
		}

		value = append(value, c)
		cur.loc.col++
	}

	return nil, ic, false
}

func lexString(source string, ic cursor) (*token, cursor, bool) {
	return lexCharacterDelimited(source, ic, '\'')
}
```

Analyzing symbols and keywords

Symbols and keywords come from a fixed set of strings, so they're easy to compare against. Whitespace should be thrown away.

```go
func lexSymbol(source string, ic cursor) (*token, cursor, bool) {
	c := source[ic.pointer]
	cur := ic
	cur.loc.col++
	cur.pointer++

	switch c {
	// Syntax that should be thrown away
	case '\n':
		cur.loc.line++
		cur.loc.col = 0
		fallthrough
	case '\t':
		fallthrough
	case ' ':
		return nil, cur, true

	// Syntax that should be kept
	case ',':
		fallthrough
	case '(':
		fallthrough
	case ')':
		fallthrough
	case ';':
		fallthrough
	case '*':
		break

	// Unknown character
	default:
		return nil, ic, false
	}

	return &token{
		value: string(c),
		loc:   ic.loc,
		kind:  symbolKind,
	}, cur, true
}

func lexKeyword(source string, ic cursor) (*token, cursor, bool) {
	cur := ic
	keywords := []keyword{
		selectKeyword, insertKeyword, valuesKeyword, tableKeyword,
		createKeyword, fromKeyword, intoKeyword, intKeyword,
		textKeyword, asKeyword,
	}

	var match string
	rest := strings.ToLower(source[cur.pointer:])
	for _, k := range keywords {
		// Keep the longest keyword that prefixes the remaining
		// source, so INTO is preferred over INT
		if strings.HasPrefix(rest, string(k)) && len(k) > len(match) {
			match = string(k)
		}
	}
	if match == "" {
		return nil, ic, false
	}

	cur.pointer += uint(len(match))
	cur.loc.col += uint(len(match))

	return &token{
		value: match,
		kind:  keywordKind,
		loc:   ic.loc,
	}, cur, true
}
```
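As a quick sanity check on the numeric rules above, the accepted forms can be mirrored with a regular expression (an illustrative approximation of the PostgreSQL grammar, not the lexer itself):

```go
package main

import (
	"fmt"
	"regexp"
)

// Approximates PostgreSQL's numeric constant grammar: digits,
// optionally with one period, optionally with an exponent part.
var numeric = regexp.MustCompile(`^(\d+|\d*\.\d+|\d+\.\d*)([eE][+-]?\d+)?$`)

func main() {
	for _, s := range []string{"42", "1.5", ".5", "1e5", "1.5e-2", "1e", "1.2.3"} {
		fmt.Printf("%-7s %v\n", s, numeric.MatchString(s))
	}
}
```

Note that `1e` is rejected because an exponent marker must be followed by digits, matching the check in lexNumeric.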
Analyzing identifiers

An identifier is either a double-quoted string or a group of characters starting with an alphabetical character and possibly containing numbers and underscores.

```go
func lexIdentifier(source string, ic cursor) (*token, cursor, bool) {
	// Handle separately if it is a double-quoted identifier
	if token, newCursor, ok := lexCharacterDelimited(source, ic, '"'); ok {
		return token, newCursor, true
	}

	cur := ic

	c := source[cur.pointer]
	// Other characters count too, but ignoring non-ascii for now
	isAlphabetical := (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
	if !isAlphabetical {
		return nil, ic, false
	}
	cur.pointer++
	cur.loc.col++

	value := []byte{c}
	for ; cur.pointer < uint(len(source)); cur.pointer++ {
		c = source[cur.pointer]

		// Other characters count too, but ignoring non-ascii for now
		isAlphabetical := (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
		isNumeric := c >= '0' && c <= '9'
		if isAlphabetical || isNumeric || c == '$' || c == '_' {
			value = append(value, c)
			cur.loc.col++
			continue
		}

		break
	}

	return &token{
		// Unquoted identifiers are case-insensitive
		value: strings.ToLower(string(value)),
		loc:   ic.loc,
		kind:  identifierKind,
	}, cur, true
}
```
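The unquoted identifier form can likewise be sketched with a regular expression (again an illustrative check, not the lexer):

```go
package main

import (
	"fmt"
	"regexp"
)

// Unquoted identifiers: a letter followed by letters, digits,
// underscores, or dollar signs.
var ident = regexp.MustCompile(`^[a-zA-Z][a-zA-Z0-9_$]*$`)

func main() {
	for _, s := range []string{"users", "user_2", "2users"} {
		fmt.Printf("%-8s %v\n", s, ident.MatchString(s))
	}
}
```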
And that's it for the lexer! If you copy lexer_test.go from the main project, the tests should now pass.

AST model

At the highest level, an AST is a collection of statements:

```go
package main

type Ast struct {
	Statements []*Statement
}
```
A statement, for now, is one of INSERT, CREATE, or SELECT:

```go
type AstKind uint

const (
	SelectKind AstKind = iota
	CreateTableKind
	InsertKind
)

type Statement struct {
	SelectStatement      *SelectStatement
	CreateTableStatement *CreateTableStatement
	InsertStatement      *InsertStatement
	Kind                 AstKind
}
```
INSERT

An insert statement, for now, has a table name and a list of values to insert:

```go
type InsertStatement struct {
	table  token
	values []*expression
}
```

An expression is a literal token or (in the future) a function call or inline operation:

```go
type expressionKind uint

const (
	literalKind expressionKind = iota
)

type expression struct {
	literal *token
	kind    expressionKind
}
```
CREATE

A create statement, for now, has a table name and a list of column names and types:

```go
type columnDefinition struct {
	name     token
	datatype token
}

type CreateTableStatement struct {
	name token
	cols []*columnDefinition
}
```
SELECT

A select statement, for now, has a table name and a list of column names:

```go
type SelectStatement struct {
	item []*expression
	from token
}
```
And that's it for the AST.
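To see the shapes concretely, here is a standalone snippet (with simplified copies of the types re-declared locally so it runs on its own) that builds the AST for `SELECT id, name FROM users` by hand:

```go
package main

import "fmt"

// Trimmed-down, local copies of the AST types from this section.
type token struct{ value string }

type expression struct{ literal token }

type SelectStatement struct {
	item []expression
	from token
}

func main() {
	// The AST for: SELECT id, name FROM users;
	stmt := SelectStatement{
		item: []expression{
			{literal: token{value: "id"}},
			{literal: token{value: "name"}},
		},
		from: token{value: "users"},
	}

	for _, e := range stmt.item {
		fmt.Printf("select item: %s\n", e.literal.value)
	}
	fmt.Printf("from: %s\n", stmt.from.value)
}
```

The parser's job in the next section is just to produce structures like this from the token list.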
Parsing
The Parse entrypoint will take a list of tokens and attempt to parse statements, separated by a semicolon, until it reaches the last token.
In general our strategy will be to increment and pass around a cursor containing the current position of unparsed tokens. Each helper will return the new cursor that the caller should start from.

```go
package main

import (
	"errors"
	"fmt"
)

func tokenFromKeyword(k keyword) token {
	return token{
		kind:  keywordKind,
		value: string(k),
	}
}

func tokenFromSymbol(s symbol) token {
	return token{
		kind:  symbolKind,
		value: string(s),
	}
}

func expectToken(tokens []*token, cursor uint, t token) bool {
	if cursor >= uint(len(tokens)) {
		return false
	}

	return t.equals(tokens[cursor])
}

func helpMessage(tokens []*token, cursor uint, msg string) {
	var c *token
	if cursor < uint(len(tokens)) {
		c = tokens[cursor]
	} else {
		c = tokens[cursor-1]
	}

	fmt.Printf("[%d,%d]: %s, got: %s\n", c.loc.line, c.loc.col, msg, c.value)
}
```
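The cursor-passing discipline is easiest to see in miniature. This toy parser (independent of the real helpers, with hypothetical names) consumes a comma-separated list of numbers and returns the cursor position the caller should resume from, just as the SQL parse helpers do:

```go
package main

import (
	"fmt"
	"strconv"
)

// parseNumber tries to read one numeric token; on success it returns
// the value and the next cursor, otherwise the original cursor.
func parseNumber(tokens []string, cursor int) (int, int, bool) {
	if cursor >= len(tokens) {
		return 0, cursor, false
	}
	n, err := strconv.Atoi(tokens[cursor])
	if err != nil {
		return 0, cursor, false
	}
	return n, cursor + 1, true
}

// parseNumberList consumes "N , N , N", threading the cursor through
// each helper call.
func parseNumberList(tokens []string, initialCursor int) ([]int, int, bool) {
	cursor := initialCursor
	nums := []int{}
	for {
		n, newCursor, ok := parseNumber(tokens, cursor)
		if !ok {
			break
		}
		cursor = newCursor
		nums = append(nums, n)

		if cursor < len(tokens) && tokens[cursor] == "," {
			cursor++
			continue
		}
		break
	}
	return nums, cursor, len(nums) > 0
}

func main() {
	nums, cursor, ok := parseNumberList([]string{"1", ",", "2", ",", "3", ";"}, 0)
	fmt.Println(nums, cursor, ok) // [1 2 3] 5 true
}
```

On failure a helper hands back the cursor it was given, so the caller can try a different production from the same position. That is exactly how parseStatement tries SELECT, then INSERT, then CREATE.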
Parsing statements

Each statement will be one of INSERT, CREATE, or SELECT. The parseStatement helper will call a helper on each of these statement types and return true if one of them succeeds in parsing.

```go
func parseStatement(tokens []*token, initialCursor uint, delimiter token) (*Statement, uint, bool) {
	cursor := initialCursor

	// Look for a SELECT statement
	semicolonToken := tokenFromSymbol(semicolonSymbol)
	slct, newCursor, ok := parseSelectStatement(tokens, cursor, semicolonToken)
	if ok {
		return &Statement{
			Kind:            SelectKind,
			SelectStatement: slct,
		}, newCursor, true
	}

	// Look for an INSERT statement
	inst, newCursor, ok := parseInsertStatement(tokens, cursor, semicolonToken)
	if ok {
		return &Statement{
			Kind:            InsertKind,
			InsertStatement: inst,
		}, newCursor, true
	}

	// Look for a CREATE statement
	crtTbl, newCursor, ok := parseCreateTableStatement(tokens, cursor, semicolonToken)
	if ok {
		return &Statement{
			Kind:                 CreateTableKind,
			CreateTableStatement: crtTbl,
		}, newCursor, true
	}

	return nil, initialCursor, false
}
```
Parsing select statements

Parsing SELECT statements is easy. We'll look for the following token pattern:

```
SELECT
$expression [, ...]
FROM
$table-name
```

Sketching that out we get:

```go
func parseSelectStatement(tokens []*token, initialCursor uint, delimiter token) (*SelectStatement, uint, bool) {
	cursor := initialCursor
	if !expectToken(tokens, cursor, tokenFromKeyword(selectKeyword)) {
		return nil, initialCursor, false
	}
	cursor++

	slct := SelectStatement{}

	exps, newCursor, ok := parseExpressions(tokens, cursor, []token{tokenFromKeyword(fromKeyword), delimiter})
	if !ok {
		return nil, initialCursor, false
	}

	slct.item = exps
	cursor = newCursor

	if expectToken(tokens, cursor, tokenFromKeyword(fromKeyword)) {
		cursor++

		from, newCursor, ok := parseToken(tokens, cursor, identifierKind)
		if !ok {
			helpMessage(tokens, cursor, "Expected FROM token")
			return nil, initialCursor, false
		}

		slct.from = *from
		cursor = newCursor
	}

	return &slct, cursor, true
}
```
The parseToken helper will look for a token of a particular token kind.

```go
func parseToken(tokens []*token, initialCursor uint, kind tokenKind) (*token, uint, bool) {
	cursor := initialCursor

	if cursor >= uint(len(tokens)) {
		return nil, initialCursor, false
	}

	current := tokens[cursor]
	if current.kind == kind {
		return current, cursor + 1, true
	}

	return nil, initialCursor, false
}
```

The parseExpressions helper will look for tokens separated by a comma until a delimiter is found. It will use existing helpers plus parseExpression.

```go
func parseExpressions(tokens []*token, initialCursor uint, delimiters []token) ([]*expression, uint, bool) {
	cursor := initialCursor

	exps := []*expression{}
outer:
	for {
		if cursor >= uint(len(tokens)) {
			return nil, initialCursor, false
		}

		// Look for delimiter
		current := tokens[cursor]
		for _, delimiter := range delimiters {
			if delimiter.equals(current) {
				break outer
			}
		}

		// Look for comma
		if len(exps) > 0 {
			if !expectToken(tokens, cursor, tokenFromSymbol(commaSymbol)) {
				helpMessage(tokens, cursor, "Expected comma")
				return nil, initialCursor, false
			}

			cursor++
		}

		// Look for expression
		exp, newCursor, ok := parseExpression(tokens, cursor, tokenFromSymbol(commaSymbol))
		if !ok {
			helpMessage(tokens, cursor, "Expected expression")
			return nil, initialCursor, false
		}
		cursor = newCursor

		exps = append(exps, exp)
	}

	return exps, cursor, true
}
```

The parseExpression helper (for now) will look for a numeric, string, or identifier token.

```go
func parseExpression(tokens []*token, initialCursor uint, _ token) (*expression, uint, bool) {
	cursor := initialCursor

	kinds := []tokenKind{identifierKind, numericKind, stringKind}
	for _, kind := range kinds {
		t, newCursor, ok := parseToken(tokens, cursor, kind)
		if ok {
			return &expression{
				literal: t,
				kind:    literalKind,
			}, newCursor, true
		}
	}

	return nil, initialCursor, false
}
```

And that's it for parsing a SELECT statement!

Parsing insert statements
We'll look for the following token pattern:

```
INSERT
INTO
$table-name
VALUES
(
$expression [, ...]
)
```
With the existing helpers, this is straightforward to sketch out:

```go
func parseInsertStatement(tokens []*token, initialCursor uint, delimiter token) (*InsertStatement, uint, bool) {
	cursor := initialCursor

	// Look for INSERT
	if !expectToken(tokens, cursor, tokenFromKeyword(insertKeyword)) {
		return nil, initialCursor, false
	}
	cursor++

	// Look for INTO
	if !expectToken(tokens, cursor, tokenFromKeyword(intoKeyword)) {
		helpMessage(tokens, cursor, "Expected into")
		return nil, initialCursor, false
	}
	cursor++

	// Look for table name
	table, newCursor, ok := parseToken(tokens, cursor, identifierKind)
	if !ok {
		helpMessage(tokens, cursor, "Expected table name")
		return nil, initialCursor, false
	}
	cursor = newCursor

	// Look for VALUES
	if !expectToken(tokens, cursor, tokenFromKeyword(valuesKeyword)) {
		helpMessage(tokens, cursor, "Expected VALUES")
		return nil, initialCursor, false
	}
	cursor++

	// Look for left paren
	if !expectToken(tokens, cursor, tokenFromSymbol(leftparenSymbol)) {
		helpMessage(tokens, cursor, "Expected left paren")
		return nil, initialCursor, false
	}
	cursor++

	// Look for expression list
	values, newCursor, ok := parseExpressions(tokens, cursor, []token{tokenFromSymbol(rightparenSymbol)})
	if !ok {
		return nil, initialCursor, false
	}
	cursor = newCursor

	// Look for right paren
	if !expectToken(tokens, cursor, tokenFromSymbol(rightparenSymbol)) {
		helpMessage(tokens, cursor, "Expected right paren")
		return nil, initialCursor, false
	}
	cursor++

	return &InsertStatement{
		table:  *table,
		values: values,
	}, cursor, true
}
```
And that's it for parsing an INSERT statement!

Parsing create statements
Finally, for create statements we'll look for the following token pattern:

```
CREATE
TABLE
$table-name
(
[$column-name $column-type [, ...]]
)
```
Sketching that out with a new parseColumnDefinitions helper we get:

```go
func parseCreateTableStatement(tokens []*token, initialCursor uint, delimiter token) (*CreateTableStatement, uint, bool) {
	cursor := initialCursor

	if !expectToken(tokens, cursor, tokenFromKeyword(createKeyword)) {
		return nil, initialCursor, false
	}
	cursor++

	if !expectToken(tokens, cursor, tokenFromKeyword(tableKeyword)) {
		return nil, initialCursor, false
	}
	cursor++

	name, newCursor, ok := parseToken(tokens, cursor, identifierKind)
	if !ok {
		helpMessage(tokens, cursor, "Expected table name")
		return nil, initialCursor, false
	}
	cursor = newCursor

	if !expectToken(tokens, cursor, tokenFromSymbol(leftparenSymbol)) {
		helpMessage(tokens, cursor, "Expected left parenthesis")
		return nil, initialCursor, false
	}
	cursor++

	cols, newCursor, ok := parseColumnDefinitions(tokens, cursor, tokenFromSymbol(rightparenSymbol))
	if !ok {
		return nil, initialCursor, false
	}
	cursor = newCursor

	if !expectToken(tokens, cursor, tokenFromSymbol(rightparenSymbol)) {
		helpMessage(tokens, cursor, "Expected right parenthesis")
		return nil, initialCursor, false
	}
	cursor++

	return &CreateTableStatement{
		name: *name,
		cols: cols,
	}, cursor, true
}
```
The parseColumnDefinitions helper will look for column names followed by column types, separated by a comma and ending with some delimiter:

```go
func parseColumnDefinitions(tokens []*token, initialCursor uint, delimiter token) ([]*columnDefinition, uint, bool) {
	cursor := initialCursor

	cds := []*columnDefinition{}
	for {
		if cursor >= uint(len(tokens)) {
			return nil, initialCursor, false
		}

		// Look for a delimiter
		current := tokens[cursor]
		if delimiter.equals(current) {
			break
		}

		// Look for a comma
		if len(cds) > 0 {
			if !expectToken(tokens, cursor, tokenFromSymbol(commaSymbol)) {
				helpMessage(tokens, cursor, "Expected comma")
				return nil, initialCursor, false
			}

			cursor++
		}

		// Look for a column name
		id, newCursor, ok := parseToken(tokens, cursor, identifierKind)
		if !ok {
			helpMessage(tokens, cursor, "Expected column name")
			return nil, initialCursor, false
		}
		cursor = newCursor

		// Look for a column type
		ty, newCursor, ok := parseToken(tokens, cursor, keywordKind)
		if !ok {
			helpMessage(tokens, cursor, "Expected column type")
			return nil, initialCursor, false
		}
		cursor = newCursor

		cds = append(cds, &columnDefinition{
			name:     *id,
			datatype: *ty,
		})
	}

	return cds, cursor, true
}
```

And that's it for parsing! If you copy
parser_test.go from the main project, the tests should now pass.

An in-memory backend

Our in-memory backend should conform to a general backend interface that allows a user to create, select, and insert data:

```go
package main

import "errors"

type ColumnType uint

const (
	TextType ColumnType = iota
	IntType
)

type Cell interface {
	AsText() string
	AsInt() int32
}

type Results struct {
	Columns []struct {
		Type ColumnType
		Name string
	}
	Rows [][]Cell
}

var (
	ErrTableDoesNotExist  = errors.New("Table does not exist")
	ErrColumnDoesNotExist = errors.New("Column does not exist")
	ErrInvalidSelectItem  = errors.New("Select item is not valid")
	ErrInvalidDatatype    = errors.New("Invalid datatype")
	ErrMissingValues      = errors.New("Missing values")
)

type Backend interface {
	CreateTable(*CreateTableStatement) error
	Insert(*InsertStatement) error
	Select(*SelectStatement) (*Results, error)
}
```
This leaves us room in the future for a disk-backed backend.
Memory layout

Our in-memory backend should store a list of tables. Each table will have a list of columns and rows. Each column will have a name and type. Each row will have a list of byte arrays.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"strconv"
)

type MemoryCell []byte

func (mc MemoryCell) AsInt() int32 {
	var i int32
	err := binary.Read(bytes.NewBuffer(mc), binary.BigEndian, &i)
	if err != nil {
		panic(err)
	}

	return i
}

func (mc MemoryCell) AsText() string {
	return string(mc)
}

type table struct {
	columns     []string
	columnTypes []ColumnType
	rows        [][]MemoryCell
}

type MemoryBackend struct {
	tables map[string]*table
}

func NewMemoryBackend() *MemoryBackend {
	return &MemoryBackend{
		tables: map[string]*table{},
	}
}
```
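The cell encoding is just Go's encoding/binary over a bytes buffer. This standalone snippet shows the round trip a MemoryCell performs for integers (the helper names here are illustrative, not part of the backend):

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeInt writes an int32 as big-endian bytes, the same storage
// format the in-memory backend uses for INT cells.
func encodeInt(i int32) []byte {
	buf := new(bytes.Buffer)
	if err := binary.Write(buf, binary.BigEndian, i); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

// decodeInt reads the big-endian bytes back into an int32.
func decodeInt(b []byte) int32 {
	var i int32
	if err := binary.Read(bytes.NewBuffer(b), binary.BigEndian, &i); err != nil {
		panic(err)
	}
	return i
}

func main() {
	cell := encodeInt(1)
	fmt.Printf("% x -> %d\n", cell, decodeInt(cell))
}
```

Every INT cell is therefore a fixed four bytes, while TEXT cells are just the raw string bytes.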
Implementing create table support

When creating a table, we'll make a new entry in the backend tables map. Then we'll create columns as specified by the AST.

```go
func (mb *MemoryBackend) CreateTable(crt *CreateTableStatement) error {
	t := table{}
	mb.tables[crt.name.value] = &t
	if crt.cols == nil {
		return nil
	}

	for _, col := range crt.cols {
		t.columns = append(t.columns, col.name.value)

		var dt ColumnType
		switch col.datatype.value {
		case "int":
			dt = IntType
		case "text":
			dt = TextType
		default:
			return ErrInvalidDatatype
		}

		t.columnTypes = append(t.columnTypes, dt)
	}

	return nil
}
```

Implementing insert support

Keeping things simple, we'll assume the value passed can be correctly mapped to the type of the column specified. We'll reference a helper for mapping values to internal storage, tokenToCell:
```go
func (mb *MemoryBackend) Insert(inst *InsertStatement) error {
	table, ok := mb.tables[inst.table.value]
	if !ok {
		return ErrTableDoesNotExist
	}

	if inst.values == nil {
		return nil
	}

	row := []MemoryCell{}

	if len(inst.values) != len(table.columns) {
		return ErrMissingValues
	}

	for _, value := range inst.values {
		if value.kind != literalKind {
			fmt.Println("Skipping non-literal.")
			continue
		}

		row = append(row, mb.tokenToCell(value.literal))
	}

	table.rows = append(table.rows, row)
	return nil
}
```
The tokenToCell helper will write numbers as binary bytes and will write strings as bytes:

```go
func (mb *MemoryBackend) tokenToCell(t *token) MemoryCell {
	if t.kind == numericKind {
		buf := new(bytes.Buffer)
		i, err := strconv.Atoi(t.value)
		if err != nil {
			panic(err)
		}

		err = binary.Write(buf, binary.BigEndian, int32(i))
		if err != nil {
			panic(err)
		}
		return MemoryCell(buf.Bytes())
	}

	if t.kind == stringKind {
		return MemoryCell(t.value)
	}

	return nil
}
```

Implementing select support

Finally, for select we'll iterate over each row in the table and return the cells according to the columns specified by the AST.

```go
func (mb *MemoryBackend) Select(slct *SelectStatement) (*Results, error) {
	table, ok := mb.tables[slct.from.value]
	if !ok {
		return nil, ErrTableDoesNotExist
	}

	results := [][]Cell{}
	columns := []struct {
		Type ColumnType
		Name string
	}{}

	for i, row := range table.rows {
		result := []Cell{}
		isFirstRow := i == 0

		for _, exp := range slct.item {
			if exp.kind != literalKind {
				// Unsupported, does not currently exist, ignore.
				fmt.Println("Skipping non-literal expression.")
				continue
			}

			lit := exp.literal
			if lit.kind == identifierKind {
				found := false
				for i, tableCol := range table.columns {
					if tableCol == lit.value {
						if isFirstRow {
							columns = append(columns, struct {
								Type ColumnType
								Name string
							}{
								Type: table.columnTypes[i],
								Name: lit.value,
							})
						}

						result = append(result, row[i])
						found = true
						break
					}
				}

				if !found {
					return nil, ErrColumnDoesNotExist
				}

				continue
			}

			return nil, ErrColumnDoesNotExist
		}

		results = append(results, result)
	}

	return &Results{
		Columns: columns,
		Rows:    results,
	}, nil
}
```
The REPL

At last, we're ready to wrap the parser and in-memory backend in a REPL. The most complex part is displaying the table of results from a select query.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"

	"github.com/eatonphil/gosql"
)

func main() {
	mb := gosql.NewMemoryBackend()

	reader := bufio.NewReader(os.Stdin)
	fmt.Println("Welcome to gosql.")
	for {
		fmt.Print("# ")
		text, err := reader.ReadString('\n')
		if err != nil {
			panic(err)
		}
		text = strings.Replace(text, "\n", "", -1)

		ast, err := gosql.Parse(text)
		if err != nil {
			panic(err)
		}

		for _, stmt := range ast.Statements {
			switch stmt.Kind {
			case gosql.CreateTableKind:
				err = mb.CreateTable(stmt.CreateTableStatement)
				if err != nil {
					panic(err)
				}
				fmt.Println("ok")
			case gosql.InsertKind:
				err = mb.Insert(stmt.InsertStatement)
				if err != nil {
					panic(err)
				}
				fmt.Println("ok")
			case gosql.SelectKind:
				results, err := mb.Select(stmt.SelectStatement)
				if err != nil {
					panic(err)
				}

				for _, col := range results.Columns {
					fmt.Printf("| %s ", col.Name)
				}
				fmt.Println("|")

				for i := 0; i < 20; i++ {
					fmt.Print("=")
				}
				fmt.Println()

				for _, result := range results.Rows {
					fmt.Print("|")

					for i, cell := range result {
						typ := results.Columns[i].Type

						s := ""
						switch typ {
						case gosql.IntType:
							s = fmt.Sprintf("%d", cell.AsInt())
						case gosql.TextType:
							s = cell.AsText()
						}

						fmt.Printf(" %s |", s)
					}

					fmt.Println()
				}

				fmt.Println("ok")
			}
		}
	}
}
```

Putting it all together:

```
$ go run *.go
Welcome to gosql.
# CREATE TABLE users (id INT, name TEXT);
ok
# INSERT INTO users VALUES (1, 'Phil');
ok
# SELECT id, name FROM users;
| id | name |
====================
| 1 | Phil |
ok
# INSERT INTO users VALUES (2, 'Kate');
ok
# SELECT name, id FROM users;
| name | id |
====================
| Phil | 1 |
| Kate | 2 |
ok
```
And we've got a very simple SQL database! Next up we'll get into filtering, sorting, and indexing.
Further reading

Writing a simple JSON parser. This post goes into a little more detail about the theory and basics of parsing.
Database Systems: A Practical Approach to Design, Implementation and Management A giant book, but an excellent and very easy introduction to database theory.
Please reply on Twitter with questions or comments.