How to build a compiler with LLVM and MLIR
- Episode 1 - Introduction
- Episode 2 - Basic Setup
- Episode 3 - Overview
- Episode 4 - The reader
- Episode 5 - The Abstract Syntax Tree
- Episode 6 - The Semantic Analyzer
- Episode 7 - The Context and Namespace
- Episode 8 - MLIR Basics
- Episode 9 - IR (SLIR) generation
DONE Episode 1 - Introduction
What is it all about?
- Create a programming lang
- Guide for contributors
- An LLVM/MLIR guide
The Plan
- Git branches
- No live coding
- Feel free to contribute
Serene and a bit of history
- Other Implementations
Requirements
- C++ 14
- CMake
- Repository: https://devheroes.codes/Serene
- Website: lxsameer.com Email: lxsameer@gnu.org
DONE Episode 2 - Basic Setup
CLOSED: [2021-07-10 Sat 09:04]
Installing Requirements
LLVM and Clang
- mlir-tblgen
ccache (optional)
Building Serene and the builder
- git hooks
Source tree structure
dev.org
resources and TODOs
DONE Episode 3 - Overview
CLOSED: [2021-07-19 Mon 09:41]
Generic Compiler
- Modern Compiler Implementation in ML: Basic Techniques
- Compilers: Principles, Techniques, and Tools (The Dragon Book)
Common Steps
- Frontend
  - Lexical analyzer (Lexer)
  - Syntax analyzer (Parser)
  - Semantic analyzer
- Middleend
  - Intermediate code generation
  - Code optimizer
- Backend
  - Target code generation
LLVM
llvm.org
Watch Introduction to LLVM
Quick overview
Summarized from https://www.aosabook.org/en/llvm.html
- It's a set of libraries to create a compiler.
- Well engineered.
- We can focus only on the frontend of the compiler and on what is actually important to us, and leave the tricky stuff to LLVM.
- LLVM IR enables us to use multiple languages together.
- It supports many targets.
- We can benefit from ready-made IR-level optimizers.
- ….
MLIR
mlir.llvm.org
- MLIR dialects provide higher level semantics than LLVM IR.
- It's easier to reason about higher level IR that is modeled after the AST rather than a low level IR.
- We can use the pass infrastructure to efficiently process and transform the IR.
- With many ready to use dialects we can really focus on our language and use the other dialects whenever necessary.
- …
Serene
A Compiler frontend
Flow
- serenec parses the command line args
- reader reads the input file and generates an AST
- semantic analyzer walks the AST and generates a new AST and rewrites the necessary nodes.
- slir generator generates slir dialect code from AST.
- We lower slir to other dialects of the MLIR which we call the result mlir.
- Then, we lower everything to the LLVMIR dialect and call it lir (lowered IR).
- Finally we fully lower lir to LLVM IR and pass it to the object generator to generate object files.
- Call the default c compiler to link the object files and generate the machine code.
DONE Episode 4 - The reader
CLOSED: [2021-07-27 Tue 22:50]
What is a Parser?
To put it simply, a parser converts the source code to an AST.
Algorithms
- LL(k)
- LR
- LALR
- PEG
- …..
Read More:
Our Parser
- We have a hand-written, LL(1.5)-like parser/lexer, since Lisp already has a structure.
;; pseudo code
(def some-fn (fn (x y)
(+ x y)))
(defn main ()
(println "Result: " (some-fn 3 8)))
- LL(1.5)?
- O(n)
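A minimal sketch of a hand-written reader for this kind of syntax. It is not Serene's actual reader: the names `Node`, `Reader` and `readAll` are hypothetical, it only peeks at one character to decide what to read, and it ignores string escapes and source locations. Every character is consumed once, which is where the linear running time comes from.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node type: an atom keeps its text, a list keeps its children.
struct Node {
  enum class Kind { Symbol, Number, String, List };
  Kind kind = Kind::Symbol;
  std::string text;
  std::vector<std::unique_ptr<Node>> items;
};

class Reader {
public:
  explicit Reader(std::string source) : src(std::move(source)) {}

  // Reads all top-level forms in one left-to-right pass over the input.
  std::vector<std::unique_ptr<Node>> readAll() {
    std::vector<std::unique_ptr<Node>> forms;
    for (skipBlanks(); pos < src.size(); skipBlanks())
      forms.push_back(readExpr());
    return forms;
  }

private:
  std::string src;
  std::size_t pos = 0;

  static bool isBlank(char c) {
    return std::isspace(static_cast<unsigned char>(c)) != 0;
  }
  static bool isDelimiter(char c) {
    return isBlank(c) || c == '(' || c == ')' || c == '"' || c == ';';
  }

  // Skips whitespace and `;` comments.
  void skipBlanks() {
    while (pos < src.size()) {
      if (src[pos] == ';') {
        while (pos < src.size() && src[pos] != '\n') ++pos;
      } else if (isBlank(src[pos])) {
        ++pos;
      } else {
        return;
      }
    }
  }

  // Decides what to read by peeking at a single character.
  std::unique_ptr<Node> readExpr() {
    assert(pos < src.size() && "caller must skip blanks first");
    if (src[pos] == '(') return readList();
    if (src[pos] == '"') return readString();
    if (src[pos] == ')') throw std::runtime_error("unexpected ')'");
    return readAtom();
  }

  std::unique_ptr<Node> readList() {
    auto list = std::make_unique<Node>();
    list->kind = Node::Kind::List;
    ++pos; // consume '('
    for (skipBlanks(); pos < src.size() && src[pos] != ')'; skipBlanks())
      list->items.push_back(readExpr());
    if (pos >= src.size()) throw std::runtime_error("missing ')'");
    ++pos; // consume ')'
    return list;
  }

  std::unique_ptr<Node> readString() {
    auto str = std::make_unique<Node>();
    str->kind = Node::Kind::String;
    ++pos; // consume the opening quote
    while (pos < src.size() && src[pos] != '"') str->text += src[pos++];
    if (pos >= src.size()) throw std::runtime_error("unterminated string");
    ++pos; // consume the closing quote
    return str;
  }

  std::unique_ptr<Node> readAtom() {
    auto atom = std::make_unique<Node>();
    while (pos < src.size() && !isDelimiter(src[pos])) atom->text += src[pos++];
    bool numeric = std::isdigit(static_cast<unsigned char>(atom->text[0])) != 0;
    atom->kind = numeric ? Node::Kind::Number : Node::Kind::Symbol;
    return atom;
  }
};
```

Feeding it the pseudo code above with `Reader(source).readAll()` yields one `List` node per top-level form.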
DONE Episode 5 - The Abstract Syntax Tree
CLOSED: [2021-07-30 Fri 14:01]
What is an AST?
An AST is a tree representation of the abstract syntactic structure of the source code. It's just a tree of nodes, where each node is a data structure describing a piece of the syntax.
;; pseudo code
(def main (fn () 4))
(prn (main))
The Expression abstract class
Expressions
- Expressions vs Statements
- Serene(Lisp) and expressions
Node & AST
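A minimal sketch of how an abstract expression class, a node and an AST can fit together. The class names, the `Node`/`Ast` aliases and the `toString` format are assumptions made for illustration, not a copy of Serene's code.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

enum class ExprKind { Symbol, Number, List };

// Abstract base class: every node of the tree is some kind of expression.
class Expression {
public:
  virtual ~Expression() = default;
  virtual ExprKind kind() const = 0;
  virtual std::string toString() const = 0;
};

using Node = std::shared_ptr<Expression>; // one node of the tree
using Ast = std::vector<Node>;            // a sequence of nodes

class Symbol : public Expression {
public:
  explicit Symbol(std::string n) : name(std::move(n)) {}
  ExprKind kind() const override { return ExprKind::Symbol; }
  std::string toString() const override { return "<Symbol " + name + ">"; }

private:
  std::string name;
};

class Number : public Expression {
public:
  explicit Number(long v) : value(v) {}
  ExprKind kind() const override { return ExprKind::Number; }
  std::string toString() const override {
    return "<Number " + std::to_string(value) + ">";
  }

private:
  long value;
};

class List : public Expression {
public:
  explicit List(Ast elems = {}) : elements(std::move(elems)) {}
  void append(Node n) {
    assert(n && "a list cannot hold a null node");
    elements.push_back(std::move(n));
  }
  ExprKind kind() const override { return ExprKind::List; }
  std::string toString() const override {
    std::string out = "<List";
    for (const Node &e : elements) out += " " + e->toString();
    return out + ">";
  }

private:
  Ast elements;
};
```

Since everything in a Lisp is an expression, a single base class is enough here; a language with statements would likely need a second hierarchy.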
DONE Episode 6 - The Semantic Analyzer
CLOSED: [2021-08-21 Sat 18:44]
Qs
- Why didn't we implement a linked list?
- Why are we using std::vector instead of LLVM collections?
What is Semantic Analysis?
- Semantic Analysis makes sure that the given program is semantically correct.
- The type checker works as part of this step as well.
;; pseudo code
(4 main)
Semantic Analysis and rewrites
We need to reform the AST to reflect the semantics of Serene closely.
;; pseudo code
(def main (fn () 4))
(prn (main))
Let's run the compiler to see the semantic analysis in action.
Let's check out the code
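A minimal sketch of the rewrite idea, assuming a simplified node type. The reader only produces generic lists; the analyzer turns `(def ...)` and `(fn ...)` lists into dedicated nodes, treats every other list as a call, and rejects a form like `(4 main)`. `Expr`, `make` and `analyze` are hypothetical names, and the rules are far simpler than a real analyzer's.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

struct Expr;
using Node = std::shared_ptr<Expr>;

struct Expr {
  enum class Kind { Symbol, Number, List, Def, Fn, Call };
  Kind kind;
  std::string text;        // symbol name, number literal, or name bound by a Def
  std::vector<Node> items; // children
};

Node make(Expr::Kind kind, std::string text = "", std::vector<Node> items = {}) {
  auto node = std::make_shared<Expr>();
  node->kind = kind;
  node->text = std::move(text);
  node->items = std::move(items);
  return node;
}

// Rewrites generic lists into nodes that carry their meaning.
Node analyze(const Node &node) {
  assert(node && "analyze expects a non-null node");
  if (node->kind != Expr::Kind::List || node->items.empty()) return node;

  const Node &head = node->items[0];
  bool headIsSymbol = head->kind == Expr::Kind::Symbol;

  if (headIsSymbol && head->text == "def") {
    if (node->items.size() != 3 || node->items[1]->kind != Expr::Kind::Symbol)
      throw std::runtime_error("def expects a symbol and a value");
    return make(Expr::Kind::Def, node->items[1]->text,
                {analyze(node->items[2])});
  }

  if (headIsSymbol && head->text == "fn") {
    if (node->items.size() < 2 || node->items[1]->kind != Expr::Kind::List)
      throw std::runtime_error("fn expects a parameter list");
    std::vector<Node> parts{node->items[1]}; // parameters are kept as they are
    for (std::size_t i = 2; i < node->items.size(); ++i)
      parts.push_back(analyze(node->items[i]));
    return make(Expr::Kind::Fn, "", std::move(parts));
  }

  // Semantic error: the program parses fine but makes no sense.
  if (head->kind == Expr::Kind::Number)
    throw std::runtime_error("a number is not callable: " + head->text);

  // Anything else is a call; analyze the target and the arguments.
  std::vector<Node> parts;
  for (const Node &item : node->items) parts.push_back(analyze(item));
  return make(Expr::Kind::Call, "", std::move(parts));
}
```

The input tree is left untouched and a new tree is returned, which matches the "generates a new AST" step of the flow.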
DONE Episode 7 - The Context and Namespace
CLOSED: [2021-09-04 Sat 10:53]
Namespaces
Unit of compilation
Usually maps to a file
Keeps the state and environment
SereneContext vs LLVM Context vs MLIR Context
The compiler's global state
The owner of LLVM/MLIR contexts
Holds the namespace table
Probably will contain the primitive types as well
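A minimal sketch of the relation between the two, assuming hypothetical `Namespace` and `Context` types. The context holds the namespace table and tracks the current namespace. A real context would also own the LLVM and MLIR contexts; they are left out here so the sketch has no dependencies.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// A namespace is the unit of compilation. It usually maps to one file and
// keeps its own environment (here just symbol -> integer value).
struct Namespace {
  std::string name;
  std::string filename;
  std::map<std::string, long> env;
};

// Hypothetical compiler-wide state that owns the namespace table.
class Context {
public:
  std::shared_ptr<Namespace> makeNamespace(const std::string &name,
                                           const std::string &filename) {
    auto ns = std::make_shared<Namespace>();
    ns->name = name;
    ns->filename = filename;
    namespaces[name] = ns; // replaces an existing namespace of the same name
    return ns;
  }

  // Returns nullptr when the namespace is unknown.
  std::shared_ptr<Namespace> getNamespace(const std::string &name) const {
    auto it = namespaces.find(name);
    if (it == namespaces.end()) return nullptr;
    return it->second;
  }

  // Returns false (and changes nothing) when the namespace is unknown.
  bool setCurrentNS(const std::string &name) {
    if (!getNamespace(name)) return false;
    current = name;
    return true;
  }

  Namespace &currentNS() const {
    auto ns = getNamespace(current);
    assert(ns && "no current namespace set");
    return *ns;
  }

private:
  std::map<std::string, std::shared_ptr<Namespace>> namespaces;
  std::string current;
};
```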
DONE Episode 8 - MLIR Basics
CLOSED: [2021-09-17 Fri 10:18]
Serene Changes
- Introducing a SourceManager
- Reader changes
- serenec cli interface is changing
Disclaimer
I'm not an expert in MLIR
Why?
- A bit of history
- LLVM IR is too low level
- We need an IR to implement high level concepts and flows
MLIR is a framework to build a compiler with your own IR. kinda :P
- Reusability
- …
Language
Overview
- SSA Based (https://en.wikipedia.org/wiki/Static_single_assignment_form)
- Typed
- Context free (for lack of a better word)
Dialects
- A collection of operations
- Custom types
- Meta data
- We can use a mixture of different dialects
builtin dialects:
- std
- llvm
- math
- async
- …
Operations
- Higher level of abstraction
- Not instructions
- SSA forms
- Tablegen backend
- Verifiers and printers
Attributes
Blocks & Regions
Types
- Extensible
Pass Infrastructure
Analysis and transformation infrastructure
- We will implement most of our semantic analysis logic and type checker as passes
Pattern Rewriting
- Tablegen backed
Operation Definition Specification
Examples
Note: You need mlir-mode and llvm-mode available to you for the code highlighting of the following code blocks. Both of those are distributed with LLVM.
General syntax
%result:2 = "somedialect.blah"(%x#2) { some.attribute = true, other_attribute = 3 }
: (!somedialect<"example_type">) -> (!somedialect<"foo_s">, i8)
loc(callsite("main" at "main.srn":10:8))
Blocks and Regions
func @simple(i64, i1) -> i64 {
^bb0(%a: i64, %cond: i1): // Code dominated by ^bb0 may refer to %a
  cond_br %cond, ^bb1, ^bb2
^bb1:
  br ^bb3(%a: i64) // Branch passes %a as the argument
^bb2:
  %b = addi %a, %a : i64
  br ^bb3(%b: i64) // Branch passes %b as the argument
// ^bb3 receives an argument, named %c, from predecessors
// and passes it on to bb4 along with %a. %a is referenced
// directly from its defining operation and is not passed through
// an argument of ^bb3.
^bb3(%c: i64):
  //br ^bb4(%c, %a : i64, i64)
  "serene.ifop"(%c) ({ // if %a is in-scope in the containing region...
      // then %a is in-scope here too.
      %new_value = "another_op"(%c) : (i64) -> (i64)
    ^someblock(%new_value):
      %x = "some_other_op"() {value = 4 : i64} : () -> i64
  }) : (i64) -> (i64)
^bb4(%d : i64, %e : i64):
  %0 = addi %d, %e : i64
  return %0 : i64 // Return is also a terminator.
}
SLIR example
Command line arguments to emit slir
./builder run --build-dir ./build -emit slir `pwd`/docs/examples/hello_world.srn
Output:
module @user {
  %0 = "serene.fn"() ( {
    %2 = "serene.value"() {value = 0 : i64} : () -> i64
    return %2 : i64
  }) {args = {}, name = "main", sym_visibility = "public"} : () -> i64
  %1 = "serene.fn"() ( {
    %2 = "serene.value"() {value = 0 : i64} : () -> i64
    return %2 : i64
  }) {args = {n = i64, v = i64, y = i64}, name = "main1", sym_visibility = "public"} : () -> i64
}
Serene's MLIR (maybe we need a better name)
Command line arguments to emit mlir
./builder run --build-dir ./build -emit mlir `pwd`/docs/examples/hello_world.srn
Output:
module @user {
  func @main() -> i64 {
    %c3_i64 = constant 3 : i64
    return %c3_i64 : i64
  }
  func @main1(%arg0: i64, %arg1: i64, %arg2: i64) -> i64 {
    %c3_i64 = constant 3 : i64
    return %c3_i64 : i64
  }
}
LIR
Command line arguments to emit lir
./builder run --build-dir ./build -emit lir `pwd`/docs/examples/hello_world.srn
Output:
module @user {
  llvm.func @main() -> i64 {
    %0 = llvm.mlir.constant(3 : i64) : i64
    llvm.return %0 : i64
  }
  llvm.func @main1(%arg0: i64, %arg1: i64, %arg2: i64) -> i64 {
    %0 = llvm.mlir.constant(3 : i64) : i64
    llvm.return %0 : i64
  }
}
LLVMIR
Command line arguments to emit llvmir
./builder run --build-dir ./build -emit ir `pwd`/docs/examples/hello_world.srn
Output:
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
declare i8* @malloc(i64 %0)
declare void @free(i8* %0)
define i64 @main() !dbg !3 {
  ret i64 3, !dbg !7
}
define i64 @main1(i64 %0, i64 %1, i64 %2) !dbg !9 {
  ret i64 3, !dbg !10
}
!llvm.dbg.cu = !{!0}
!llvm.module.flags = !{!2}
!0 = distinct !DICompileUnit(language: DW_LANG_C, file: !1, producer: "mlir", isOptimized: true, runtimeVersion: 0, emissionKind: FullDebug)
!1 = !DIFile(filename: "LLVMDialectModule", directory: "/")
!2 = !{i32 2, !"Debug Info Version", i32 3}
!3 = distinct !DISubprogram(name: "main", linkageName: "main", scope: null, file: !4, type: !5, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !0, retainedNodes: !6)
!4 = !DIFile(filename: "REPL", directory: "/home/lxsameer/src/serene/serene/build")
!5 = !DISubroutineType(types: !6)
!6 = !{}
!7 = !DILocation(line: 0, column: 10, scope: !8)
!8 = !DILexicalBlockFile(scope: !3, file: !4, discriminator: 0)
!9 = distinct !DISubprogram(name: "main1", linkageName: "main1", scope: null, file: !4, line: 1, type: !5, scopeLine: 1, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !0, retainedNodes: !6)
!10 = !DILocation(line: 1, column: 11, scope: !11)
!11 = !DILexicalBlockFile(scope: !9, file: !4, discriminator: 0)
Episode 9 - IR (SLIR) generation
Updates:
- Source manager
- Diagnostic Engine
- JIT
There will be an episode dedicated to each of these
How does IR generation work
- Pass around MLIR context
- Create Builder objects that create operations in specific locations
- ModuleOp
- Namespace
How to define a new dialect
- Pure C++
- Tablegen
SLIR
The SLIR goal
- An IR that follows the AST
- Rename?
Steps
- Define the new dialect
- Setup the tablegen
- Define the operations
- Walk the AST and generate the operations
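A toy sketch of the last step, walking an already analyzed AST and emitting one operation per node. It does not use MLIR at all: `TextBuilder` is a hypothetical stand-in for a builder object that only appends text at the current insertion point and hands out fresh SSA names. The text format imitates the `-emit slir` output shown in Episode 8. `FnDef`, `Namespace` and `generate` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical analyzed AST: a namespace holds function definitions whose
// body is a single integer literal.
struct FnDef {
  std::string name;
  std::int64_t value;
};

struct Namespace {
  std::string name;
  std::vector<FnDef> defs;
};

// Toy stand-in for a builder: operations are appended at the end of the
// buffer, and every result gets a fresh SSA name.
class TextBuilder {
public:
  std::string freshValue() { return "%" + std::to_string(next++); }
  void emit(const std::string &line) {
    out += std::string(depth * 2, ' ') + line + "\n";
  }
  void indent() { ++depth; }
  void dedent() {
    assert(depth > 0 && "unbalanced dedent");
    --depth;
  }
  const std::string &str() const { return out; }

private:
  std::string out;
  unsigned next = 0;
  unsigned depth = 0;
};

// One function definition becomes a "serene.fn" operation whose region
// holds a "serene.value" operation and a return.
void generate(const FnDef &fn, TextBuilder &b) {
  std::string fnValue = b.freshValue();
  b.emit(fnValue + " = \"serene.fn\"() ( {");
  b.indent();
  std::string v = b.freshValue();
  b.emit(v + " = \"serene.value\"() {value = " + std::to_string(fn.value) +
         " : i64} : () -> i64");
  b.emit("return " + v + " : i64");
  b.dedent();
  b.emit("}) {args = {}, name = \"" + fn.name +
         "\", sym_visibility = \"public\"} : () -> i64");
}

// One namespace becomes one module.
std::string generate(const Namespace &ns) {
  TextBuilder b;
  b.emit("module @" + ns.name + " {");
  b.indent();
  for (const FnDef &fn : ns.defs) generate(fn, b);
  b.dedent();
  b.emit("}");
  return b.str();
}
```

With the real infrastructure the same walk would create operations through MLIR's builder inside a ModuleOp instead of concatenating strings. The structure of the walk, one module per namespace and one operation per node, is the part this sketch tries to show.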