in ,

Micro C, Part 3: Generating LLVM, Hacker News

[passing, failing, parsing] [passing, failing, parsing] ([Operand] (part 2 , we completed the user facing part of the compiler with semantic analysis. Now we’re free to focus on the backend, generating LLVM. To do so, we’ll use the llvm-hs-pure library, which embeds LLVM into a Haskell ADT, allowing us to manipulate it without having to set up any FFI. We will then pretty print the generated bytecode with llvm-hs-pretty and call clang on it to generate machine code for our preferred target. Some things to keep in mind when writing LLVM generating code:

There is very little type safety. The usual haskell guarantees of "if it typechecks it probably works" do not apply here at all. After all, we are writing basically writing assembly, so even if we get our code through the LLVM type checker, there's still a good chance that it won do what we want it to. This is where having a comprehensive test suite is invaluable.

Documentation for LLVM in general is quite scant. The

  • official language reference manual is comprehensive, but can be quite terse and provides little in the way of examples, although this gets better all the time and is not nearly as bad now as it was a few years ago. Documentation for llvm-hs specifically is even scarcer: basically just some
  • tiny examples

    and the haskell port of the (kaleidoscope tutorial) , which is by now very outdated. This post will hopefully help matters somewhat but it leaves many parts of the LLVM IR and ecosystem unexplored. Let clang [(String, [AST.Type] help you. You can run it with - emit-llvm on a C file to see what it generates. This is helpful even when you aren't writing a C-like language as you can see how certain high-level language concepts map onto LLVM. [charStar] LLVM Basics [rhs'] The following is a simplified version of how LLVM works in order to create a basic mental model. There are many omissions. LLVM programs are split into (modules) containing toplevel definitions like functions and global variables. For our purposes, we are concerned with emitting a single module containing all of our functions, global variables, and typedefs for our structs. Within functions , code is contained in basic blocks. A basic block is a list of sequential instructions in SSA (Single Static Assignment) form that is terminated in a (conditional or unconditional branch) to another block or in a (return instruction [AST.double, AST.double] [charStar] )

    LLVM, unlike many forms of assembly, is (typed) . Its type system resembles C's in many ways, but there are some notable differences. Instead of C's confusing and target dependent integer types short , long , long long , etc., LLVM has arbitrary N bit width integers (denoted) iN Also, void (is banned in LLVM, so) char or i8 is used instead. LLVM also has native (vector types) for SIMD instructions , but those won't factor into our discussion. There are other exotic features in the type system that help optimization like (poison values) But we won't discuss them here. Codegen.hs

    With that out of the way, we can start writing code. This module has quite a few imports. {- # LANGUAGE RecursiveDo # -} {- # LANGUAGE TupleSections # -}

      {- # LANGUAGE FlexibleContexts # -} 
      - We need these to write a ConvertibleStrings instance for  - ShortByteString   {- # LANGUAGE MultiParamTypeClasses # -} 

    {- # OPTIONS_GHC -fno-warn-orphans # -} (module) Microc.Codegen   ( (codegenProgram)   ) (where (import) (qualified) LLVM.AST.IntegerPredicate (as) IP (import) (qualified) LLVM.AST.FloatingPointPredicate                                                 (as) FP (import) (LLVM.AST) (Operand ) (import) (qualified) LLVM.AST (as) AST (import) (qualified) LLVM.AST.Type (as) AST (import) (qualified) LLVM.AST.Constant (as) C (import) LLVM.AST.Name (import) (LLVM.AST.Typed) (typeOf) ) (import) (qualified) LLVM.IRBuilder.Module (as) L (import) (qualified) LLVM.IRBuilder.Monad (as) L (import) (qualified) LLVM.IRBuilder.Instruction (as) L (import) (qualified) LLVM.IRBuilder.Constant (as) L (import) (LLVM.Prelude) ShortByteString [rhs''] ) (import) (qualified) (Data.Map) (as) M (import) Control.Monad.State (import) (Data.String) fromString ) (import) Microc.Sast (import) (Microc.Ast) (Type) (

    ..
                                                     ,  (Op) ( [(rhs', enclosing), (nextExpt, continueBlock)] )                                                 ,  (Uop) ( [(rhs', enclosing), (nextExpt, continueBlock)] )                                                 ,  Bind ( [(rhs', enclosing), (nextExpt, continueBlock)] )                                                 ,  (Struct) ( [(rhs', enclosing), (nextExpt, continueBlock)] )                                                 )   (import) Data.String.Conversions  (import) (qualified) (Data.Text) (as) (import) (Data.Text)  (Text) )  (import) (Data.Word)  (Word) 

    (import) (Data.List) find ) http://blog.josephmorag.com/images/compiler-if.png (In llvm-hs , values ​​that can be passed as arguments to LLVM instructions
    have type (Operand) . This basically includes all variables as well as declared functions. Just as in Semant [Operand] , we define an [charStar] (Env) type to hold information about our generated operands and how they correspond to the names of their corresponding source variables. Besides the operands, we need to keep track of all struct declarations and all string literals in the source, as we emit a unique global variable for each unique string literal.
    (data) (Env)

    =Env ({operands) :: :: (M) [(rhs', enclosing), (nextExpt, continueBlock)] (Map) (Text) (Operand)                , structs :: :: [ Struct ]                , strings :: :: (M) [charStar] (Map) (Text) (Operand)                }    (deriving) ([charStar] Eq , (Show) (registerOperand) :: :: (MonadState) (Env) (m)=> (Text)

    -> (Operand) -> m () (registerOperand) (name op)=   modify ($)
       env  ->  (env {operands [ ("printbig"     , [AST.i32]=  (M) . (insert name op (operands env)}  
      [(Type, ParameterName)] Utilities 
    Working with LLVM bindings in other languages ​​usually involves passing mutable builder and module context objects to all instruction-emitting functions in order to ensure that all variables have unique names and to maintain the integrity of the module. This is important to ensure that code remains in SSA form.

    Of course, since we're not working in other languages, this approach of passing around mutable objects would be severely un-ergonomic, at best. Fortunately, llvm-hs provides us with monads that emulate this behavior, (ModuleBuilderT) for the module context, and (IRBuilderT for the builder object. We'll establish two type synonyms, LLVM for generating top level entities and (Codegen) for generating basic blocks. (type) (LLVM) (=(L)

    .  ModuleBuilderT  (State) (Env)   (type) (codegen) =(L)   IRBuilderT   (LLVM)   
    We'll also write some utilities to query struct fields defined in our (Env) to convert from MicroC types to LLVM types, and to calculate the sizes of MicroC types. For structs, we emit packed fields , which is pretty bad for performance, but makes calculating sizes very easy. Note that by this phase of the compiler, we no longer report errors to the user, so if anything goes wrong, we'll just crash. (getFields) : :: (MonadState) (Env) (m)=> (Text) -> m [Bind] (getFields) (name)= (do)   ss ( gets structs    (case) (find) s -> (structName s)==(name) ss (of)      (Nothing) -> (error) ("Internal error - struct not found")      (Just) ([charStar] (Struct [] (_) binds [AST.double, AST.double] -> pure binds (charStar) :: :: (AST) [(rhs', enclosing), (nextExpt, continueBlock)] (Type) (charStar)= (AST) . ptr (AST) [(rhs', enclosing), (nextExpt, continueBlock)] i8 - llvm-hs uses ShortByteString for names, but we want - easy conversion to Text with cs from Data.String.Conversions
    (instance) (ConvertibleStrings) (Text) (ShortByteString) (where)   convertString (=fromString . (T) unpack (ltypeOfTyp) :: :: (MonadState) (Env) (m)=> (Type)

    -> (m) AST
    .  (Type) (ltypeOfTyp)=    (case)     (TyVoid) ->  pure  (AST)  void    (TyInt) ->  pure  (AST) (i)    (TyChar) ->  pure  (AST)  i8    (TyFloat) ->  pure  (AST)  double    (TyBool) ->  pure  (AST)  i1    - ((void  is invalid LLVM 
       (Pointer) (TyVoid) -> (pure) $ charStar    - special case to handle recursively defined structures    - TODO: add real cycle checking so that improperly defined
        - recursive types case the compiler to hang forever 
        (Pointer) ([charStar] (TyStruct)  n)  ->      pure  ($) (AST)   (ptr) [AST.double, AST.double] AST   [passing, failing, parsing] NamedTypeReference  (mkName 
  • ($) (cs) ("struct.") (n)))    (Pointer) (t) -> fmap (AST) [(rhs', enclosing), (nextExpt, continueBlock)] ptr (ltypeOfTyp t)    (TyStruct) (n) -> (do)     fields ( getFields n     typs ( (mapM) ltypeOfTyp [(rhs', enclosing), (nextExpt, continueBlock)] bindType) fields      - Packed structs aren't great for performance but very easy to code for now     pure ($) (AST) (StructureType) { (AST) . (isPacked)=(True) , (AST) . elementTypes=typs} (sizeof) :: :: (MonadState) (Env) (m)=> (Type)

    -> (m) Word 2017 (sizeof)= (case)    (TyBool) -> pure (1)    (TyChar) -> pure (1)    (TyInt) -> pure (4)    (TyFloat) -> pure (8)    (TyVoid) -> pure (0)    (Pointer) [(Type, ParameterName)] -> (pure) (8)    (TyStruct) (n) -> (do)     fields ( getFields n     sizes ( mapM (sizeof [AST.double, AST.double] [(rhs', enclosing), (nextExpt, continueBlock)] bindType) fields     pure (sum sizes) [(Type, ParameterName)]

    Expressions (LVals) Now, we 're ready to generate code for expressions. First, LVal s. When generating an LVal , we generate an [charStar] Operand corresponding to the

  • address of the value. That way, we can use it as an argument to the store instruction . For variables, we simply look up the variable name in the Env .
    (codegenLVal) : :: (LValue)

     ->  Codege n  (Operand) (codegenLVal) ([charStar] (SId)  name)  (=(gets (( (M)  . (name)  (operands)   Since we are generating addresses, dereferencing is essentially the inverse of this, so we just generate code for the underlying expression.  
    (codegenLVal) SDeref (e)=(codegenSexpr e) [rhs']
    For struct access, we get to use the fascinating (read: confusing) getelementptr instruction [rhs'] . The instruction only calculates addresses, it does not load memory, so it's a perfect fit for the semantics of codegenLVal . We generate the address of the left hand side of the access and then have to pass two arguments to gep , a zero to access the memory pointed to by the address we just calculated, then the offset of the struct field we want to access, which we calculated in semant. Note that getelementptr handles calculating alignment, so we don't need to do it ourselves. (codegenLVal) (SAccess) (ei)

    =
     do 
       e ' (  (L) [(rhs', enclosing), (nextExpt, continueBlock)] (int) (0)        offset  offset (=(L)   (int)  (fromIntegral i)    (L) .  gep e '[zero, offset]    [(Type, ParameterName)] (Literals) Most literals, as usual, are straightforward.   [(Type, ParameterName)] (codegenSexpr) 

    :: (SExpr) -> (Codegen) (Operand) (codegenSexpr) ([charStar] (TyInt) , (SLiteral) (i)= pure ($) (L) [".mc"] (int) (fromIntegral i) (codegenSexpr) ([charStar] (TyFloat) , (SFliteral) (f)= pure ($) (L) [".mc"] double f (codegenSexpr) ([charStar] (TyBool) , (SBoolLit) (b)= pure ($) (L) [".mc"] (bit) (if) b (then) (1) (else) (0) ) (codegenSexpr) ([charStar] (TyChar) , (SCharLit) (c)= pure ($) (L) [".mc"] (int8 (fromIntegral c)

    Strings, however, are not. We look up the string literal in the Env to see if we've generated a global variable for it before. If so, we just return that. Otherwise, we use globalStringPtr to generate a pointer to a global string variable. We name each variable "0.str", "1.str" etc., since [(Type, ParameterName)] (mkName [AST.double, AST.double] (crashes with non -ASCII input, which we haven't explicitly forbidden in our string literals. Note that globalStringPtr (returns a) Constant which is distinct from an (Operand) , so we need to promote it with AST.ConstantOperand . [Operand] [rhs'] (codegenSexpr) (( (Pointer) (TyChar) , SStrLit (s)=(do    - Generate a new unique global variable for every string literal we see   strs ( gets strings    (case) (M) [charStar] lookup s strs [AST.double, AST.double] (of)      (Nothing) -> (do)        (let) (nm)= mkName (show) (M)

     (size strs)    [(Type, ParameterName)] ". str" 
           op  ( (L)   globalStringPtr (cs s) nm       modify  ($) 
       env  ->  (env {strings)=  (M) . (insert s)  (AST) 
    .  () (ConstantOperand) (op) strs}       pure ( (AST) .   (ConstantOperand)      (Just) (op)  ->  pure op    Null pointers are generated with  inttoptr    (codegenSexpr) () t,  (SNull) ) 
    =  (L) [(rhs', enclosing), (nextExpt, continueBlock)] (inttoptr)  (L) (int) (0) )=Sizeof  is calculated with the  (sizeof) function we wrote earlier.   codegenSexpr  (TyInt) ,  SSizeof [AST.i32] (t) 
    =(L) [(Type, ParameterName)] .  (int)   [charStar] (fromIntegral) (sizeof t) (The)  &  operator finds the address of an  (LVal) , which is already taken care of by  codegenLVal   

    codegenSexpr
     _ 
    ,  SAddr  (e) 
    =(codegenLVal e)   

    (Binary operators)

    For assignment, we calculate the address of the left hand side, the (value) of the right hand side, and then store said value at the address, returning the value . [(Type, ParameterName)] (codegenSexpr) ( (_) , SAssign lhs rhs

    =do   rhs' ( (For the) (Binop) (constructor, we begin by generating code for the left and right and sides. (codegenSexpr) () t, (SBinop) (op lhs rhs)==(do)   lhs' (Operand] (For addition on) (int) s and (float) s, we simply generate the corresponding machine instruction. For pointer addition, getElementPtr [(Type, ParameterName)] takes care of calculating the offset for each pointer type so we don't have to worry about it. () (Add) [(Type, ParameterName)] -> (case) () fst lhs, fst rhs) (of)       ( (Pointer) (_) , (TyInt) [AST.double, AST.double] -> (L) gep lhs' [rhs']       ( (TyInt) , (Pointer) (_) [AST.double, AST.double] -> (L) gep rhs' [lhs']       ( (TyInt) , (TyInt) ) -> (L) . add lhs 'rhs'       ( (TyFloat) , (TyFloat) ) -> (L) . fadd lhs 'rhs'        _ -> (error)

     "Internal error - semant failed"  [".mc"]   
    For For pointer subtraction, we do actually have to calculate the pointer width ourselves and divide the difference in addresses by it. [(Type, ParameterName)] (Sub) -> (let) (zero)==(L) [(Type, ParameterName)] (int) (0) in (case) (fst lhs, fst rhs) (of)       ( (Pointer) (typ,) (Pointer) (typ ') ->
              (if) (typ ') /=(typ) (then) (error)   “Internal error - semant failed”   else  do            lhs ''  ( (L)  .  (ptrtoint lhs') (AST)   [".mc"] (i)            rhs ''  ( (L)  .  (ptrtoint rhs') (AST)   [".mc"] (i)           diff  ( (L)   sub lhs '' rhs ''           width  ( (L)   (int)  

    . (fromIntegral) sizeof typ            (L) . sdiv diff width (Subtracting) (int) s from pointers is similar to adding them, except that we negate the (int) before passing it to getElementPtr
    ([(String, [AST.Type] Pointer [rhs'] (_) , (TyInt) ) -> [(Type, ParameterName)] (do) ​​         rhs '' ( (L) . sub zero rhs'          (L) . gep lhs' [rhs''] (For) int s and (float) s, we again dispatch to the corresponding machine instruction. [(Type, ParameterName)] ([passing, failing, parsing] (TyInt) , (TyInt) ) [charStar] -> (L) sub lhs 'rhs'       ( (TyFloat) , (TyFloat) ) -> (L) . fsub lhs 'rhs'        _ -> (error)
     "Internal error - semant failed"  [".mc"]   (Multiplication and division are easy.)    

    (Mult) [(Type, ParameterName)] -> (case) (t) (of)        (TyInt) -> (L) [(rhs', enclosing), (nextExpt, continueBlock)] mul lhs 'rhs'        (TyFloat) -> (L) [(rhs', enclosing), (nextExpt, continueBlock)] fmul lhs 'rhs'        _ -> (error)
     "Internal error - semant failed"       (Div) ->   (case) (t) of         (TyInt) ->   (L) [(rhs', enclosing), (nextExpt, continueBlock)]  sdiv lhs 'rhs'        (TyFloat) ->   (L) [(rhs', enclosing), (nextExpt, continueBlock)]  fdiv lhs 'rhs'        _  ->   (error) 
     "Internal error - semant failed"  [".mc"]   (For the exponentiation operator, all remaining cases are raising [AST.i32] (int [".mc"] s to  int  s. We can take this opportunity to write some non-trivial LLVM and implement exponentiation as repeated multiplication directly in the IR. In haskell, the algorithm would be   

    - We can obviously be more terse but this form maps better onto LLVM (raise) lhs rhs= go (1) (rhs) where   go acc expt (=)      (if) (expt)== (0) then acc      (else) let nextAcc (=) (lhs) acc              nextExpt (=(expt) - (1)           (in) go nextAcc nextExpt
    In order to marry SSA with conditionals, LLVM uses (phi nodes.) Phi nodes must all appear at the very beginning of a basic block. There cannot be any non-phi instructions preceding them. The phi instruction takes a list of pairs. The first element of each pair is a value and the second element is the label of a basic block which has an outgoing branch to the block with phi nodes. First, we need to get the label of the enclosing block so that we can start our new block. We then set acc and expt to phi nodes, such that if control flow proceeds into the loop_pow block from the enclosing scope, they are initialized to 1 and (rhs) , respectively, and if control flow is from continue , they are set to (nextAcc) and (nextExpt) . The if [(String, [AST.Type] (clause is handled by issuing a) (condBr) (if the) (expt) has reached 0, at which point we either return the acc or jump back to loop_pow . [Operand] Note that we use mdo , courtesy of {- # LANGUAGE RecursiveDo # -} , instead of do , as we need to forward-reference the doneBlock and continueBlock in our branch instruction. We can't define our blocks with L.block and then branch to them because calling L.block ends the current block and starts a new one. When using other LLVM bindings, one usually has to create all of the blocks and then manually position the builder at the correct location before emitting instructions. However, haskell's laziness allows us to avoid this inelegance and write branching code much more naturally. (Power) -> mdo       enclosing ( (L) currentBlock        (L) . br loop       loop ( (L) (block `) (L) . (named` [(String, [AST.Type] “loop_pow       acc ( (L) (phi [(L.int32 1, enclosing), (nextAcc, continueBlock)] ` (L) . (named`) "acc"
           expt  ( (L)   (phi [(rhs', enclosing), (nextExpt, continueBlock)] ` (L) . (named` 
     "expt"        done  ( (L)   (icmp) (IP)   [passing, failing, parsing] EQ (expt)  (L) .  (int)   (0)         (L) .  condBr done doneBlock continueBlock       continueBlock  ( (L)   (block `) (L)  .  (named` [(String, [AST.Type] continue        nextAcc  ( (L)   (mul acc lhs``) (L)   [".mc"] (named` [(Type, ParameterName)] (“next_acc”)        nextExpt  ( (L)   (sub expt)  (L)  

    . (int) (1) ) (L) . (named`) “next_expt”        (L) . br loop       doneBlock ( (L) (block `) (L) . (named` [(String, [AST.Type] it is done       pure acc

    (It is left as an exercise for the reader to implement a

    more efficient

    exponentiation algorithm in LLVM.)

    The remaining binary operators all map directly onto their LLVM counterparts. [rhs'] (Equal) (-> (case) (fst lhs) of        (TyInt) -> (L)

    ) (icmp) (IP) .  EQ [(Type, ParameterName)]  lhs 'rhs'        (TyBool) ->   (L) [(rhs', enclosing), (nextExpt, continueBlock)] (icmp) (IP) .   (EQ) lhs 'rhs'        (TyChar) ->   (L) [(rhs', enclosing), (nextExpt, continueBlock)] (icmp) (IP) .   (EQ) lhs 'rhs'        (Pointer) [(Type, ParameterName)]   ->  (L)   [charStar] (icmp) (IP)  .  EQ lhs 'rhs'        (TyFloat) ->   (L) [(rhs', enclosing), (nextExpt, continueBlock)] (fcmp) (FP) .   (OEQ) lhs 'rhs'        _  ->   (error) 
     "Internal error - semant failed"       Neq ->   (case) (fst lhs) (of)         (TyInt) ->   (L) [(rhs', enclosing), (nextExpt, continueBlock)] (icmp) (IP) .   (NE) lhs 'rhs'        (TyBool) ->   (L) [(rhs', enclosing), (nextExpt, continueBlock)] (icmp) (IP) .   (NE) lhs 'rhs'        (TyChar) ->   (L) [(rhs', enclosing), (nextExpt, continueBlock)] (icmp) (IP) .   (NE) lhs 'rhs'        (Pointer) [(Type, ParameterName)]   ->  (L)   [charStar] (icmp) (IP)  .  (NE) lhs 'rhs'        (TyFloat) ->   (L) [(rhs', enclosing), (nextExpt, continueBlock)] (fcmp) (FP) .   (ONE) lhs 'rhs'        _  ->   (error) 
     "Internal error - semant failed"       (Less) ->   (case) (fst lhs) (of)         (TyInt) ->   (L) [(rhs', enclosing), (nextExpt, continueBlock)] (icmp) (IP) .   (SLT) lhs 'rhs'        (TyBool) ->   (L) [(rhs', enclosing), (nextExpt, continueBlock)] (icmp) (IP) .   (SLT) lhs 'rhs'        (TyChar) ->   (L) [(rhs', enclosing), (nextExpt, continueBlock)] (icmp) (IP) .   (ULT) lhs 'rhs'        (TyFloat) ->   (L) [(rhs', enclosing), (nextExpt, continueBlock)] (fcmp) (FP) .   (OLT) lhs 'rhs'        _  ->   (error) 
     "Internal error - semant failed"       (Leq) ->   (case) (fst lhs) (of)         (TyInt) ->   (L) [(rhs', enclosing), (nextExpt, continueBlock)] (icmp) (IP) .   (SLE) lhs 'rhs'        (TyBool) ->   (L) [(rhs', enclosing), (nextExpt, continueBlock)] (icmp) (IP) .   (SLE) lhs 'rhs'        (TyChar) ->   (L) [(rhs', enclosing), (nextExpt, continueBlock)] (icmp) (IP) .   (ULE) lhs 'rhs'        (TyFloat) ->   (L) [(rhs', enclosing), (nextExpt, continueBlock)] (fcmp) (FP) .   (OLE) lhs 'rhs'        _  ->   (error) 
     "Internal error - semant failed"       (Greater) ->   (case) (fst lhs) (of)         (TyInt) ->   (L) [(rhs', enclosing), (nextExpt, continueBlock)] (icmp) (IP) .   (SGT) lhs 'rhs'        (TyBool) ->   (L) [(rhs', enclosing), (nextExpt, continueBlock)] (icmp) (IP) .   (SGT) lhs 'rhs'        (TyChar) ->   (L) [(rhs', enclosing), (nextExpt, continueBlock)] (icmp) (IP) .   (UGT) lhs 'rhs'        (TyFloat) ->   (L) [(rhs', enclosing), (nextExpt, continueBlock)] (fcmp) (FP) .   (OGT) lhs 'rhs'        _  ->   (error)