How to build a compiler with LLVM and MLIR
- Episode 1 - Introduction
- Episode 2 - Basic Setup
- Episode 3 - Overview
- Episode 4 - The reader
- Episode 5 - The Abstract Syntax Tree
- Episode 6 - The Semantic Analyzer
- Episode 7 - The Context and Namespace
- Episode 8 - MLIR Basics
- Episode 9 - IR (SLIR) generation
- Episode 10 - Pass Infrastructure
- The next Step
- Updates:
- What is a Pass
- Passes are the unit of abstraction for optimization and transformation in LLVM/MLIR
- Compilation is all about transforming the input data and producing an output
- Almost like a function composition
- The big picture
- Pass Managers (Pipelines) are made out of a collection of passes and can be nested
- Most of the interesting parts of the compiler reside in passes.
- We will probably spend most of our time working with passes
- Pass Infrastructure
- ODS or C++
- Operation is the main abstract unit of transformation
- OperationPass is the base class for all the passes.
- We need to override runOnOperation
- There are some rules you need to follow when defining your Pass
- Passes are either OpSpecific or OpAgnostic
- How does transformation work?
- Analyses and Passes
- Pass management and nested pass managers
- Episode 11 - Lowering SLIR
- Episode 12 - Target code generation
- Episode 13 - Source Managers
DONE Episode 1 - Introduction
What is it all about?
- Create a programming lang
- Guide for contributors
- A LLVM/MLIR guide
The Plan
- Git branches
- No live coding
- Feel free to contribute
Serene and a bit of history
- Other Implementations
Requirements
- C++ 14
- CMake
- Repository: https://devheroes.codes/Serene
- Website: lxsameer.com Email: lxsameer@gnu.org
DONE Episode 2 - Basic Setup
CLOSED: [2021-07-10 Sat 09:04]
Installing Requirements
LLVM and Clang
- mlir-tblgen
ccache (optional)
Building Serene and the builder
- git hooks
Source tree structure
dev.org
resources and TODOs
DONE Episode 3 - Overview
CLOSED: [2021-07-19 Mon 09:41]
Generic Compiler
- Modern Compiler Implementation in ML: Basic Techniques
- Compilers: Principles, Techniques, and Tools (The Dragon Book)
Common Steps
Frontend
- Lexical analyzer (Lexer)
- Syntax analyzer (Parser)
- Semantic analyzer
Middleend
- Intermediate code generation
- Code optimizer
Backend
- Target code generation
LLVM
/Serene/serene/src/commit/76a9106559fde13730d13ae44f193b5980ffa1ef/docs/llvm.org
Watch Introduction to LLVM
Quick overview
Adapted from https://www.aosabook.org/en/llvm.html
- It's a set of libraries to create a compiler.
- Well engineered.
- We can focus only on the frontend of the compiler and on what is actually important to us, and leave the tricky stuff to LLVM.
- LLVM IR enables us to use multiple languages together.
- It supports many targets.
- We can benefit from already made IR level optimizers.
- ….
MLIR
/Serene/serene/src/commit/76a9106559fde13730d13ae44f193b5980ffa1ef/docs/mlir.llvm.org
- MLIR dialects provide higher-level semantics than LLVM IR.
- It's easier to reason about higher level IR that is modeled after the AST rather than a low level IR.
- We can use the pass infrastructure to efficiently process and transform the IR.
- With many ready-to-use dialects we can really focus on our language and use the other dialects whenever necessary.
- …
Serene
A Compiler frontend
Flow
- serenec parses the command line args
- reader reads the input file and generates an AST
- semantic analyzer walks the AST, generates a new AST, and rewrites the necessary nodes
- slir generator generates slir dialect code from the AST
- We lower slir to the other MLIR dialects and call the result mlir
- Then, we lower everything to the LLVMIR dialect and call it lir (lowered IR)
- Finally, we fully lower lir to LLVM IR and pass it to the object generator to generate object files
- Call the default C compiler to link the object files and generate the machine code
DONE Episode 4 - The reader
CLOSED: [2021-07-27 Tue 22:50]
What is a Parser?
To put it simply, a parser converts the source code to an AST.
Algorithms
- LL(k)
- LR
- LALR
- PEG
- …..
Read More:
Our Parser
- We have a hand-written LL(1.5)-like parser/lexer, since Lisp already has a structure.
;; pseudo code
(def some-fn (fn (x y)
(+ x y)))
(defn main ()
(println "Result: " (some-fn 3 8)))
- LL(1.5)?
- O(n)
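As a sketch of why a Lisp needs so little lookahead, here is a hypothetical minimal s-expression reader (not Serene's actual implementation): a single character of lookahead, '(' versus anything else, decides which production to use, and each character is consumed once, hence O(n).

```cpp
#include <cassert>
#include <cctype>
#include <memory>
#include <string>
#include <vector>

// A node is either an atom (symbol/number) or a list of nodes.
struct Node {
  std::string atom;                        // non-empty for atoms
  std::vector<std::unique_ptr<Node>> list; // children for lists
  bool isList = false;
};

// Recursive-descent read of one expression starting at `pos`.
static std::unique_ptr<Node> readExpr(const std::string &src, size_t &pos) {
  while (pos < src.size() && std::isspace((unsigned char)src[pos]))
    pos++;
  auto node = std::make_unique<Node>();
  if (pos < src.size() && src[pos] == '(') {
    node->isList = true;
    ++pos; // consume '('
    while (true) {
      while (pos < src.size() && std::isspace((unsigned char)src[pos]))
        pos++;
      if (pos >= src.size() || src[pos] == ')')
        break;
      node->list.push_back(readExpr(src, pos));
    }
    if (pos < src.size())
      ++pos; // consume ')'
  } else {
    // An atom runs until whitespace or a paren.
    while (pos < src.size() && !std::isspace((unsigned char)src[pos]) &&
           src[pos] != '(' && src[pos] != ')')
      node->atom += src[pos++];
  }
  return node;
}

std::unique_ptr<Node> read(const std::string &src) {
  size_t pos = 0;
  return readExpr(src, pos);
}
```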
DONE Episode 5 - The Abstract Syntax Tree
CLOSED: [2021-07-30 Fri 14:01]
What is an AST?
An AST is a tree representation of the abstract syntactic structure of source code. It's just a tree made of nodes, in which each node is a data structure describing a piece of syntax.
;; pseudo code
(def main (fn () 4))
(prn (main))
The Expression
abstract class
Expressions
- Expressions vs Statements
- Serene(Lisp) and expressions
Node & AST
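A minimal sketch of how the Expression abstract class and the Node/AST aliases might fit together (names are illustrative; the real definitions live in the Serene repo):

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Every syntactic construct derives from this abstract class.
class Expression {
public:
  virtual ~Expression() = default;
  // A textual representation, mostly useful for debugging.
  virtual std::string toString() const = 0;
};

// A Node is a pointer to any expression; an AST is just a list of
// top-level expressions.
using Node = std::shared_ptr<Expression>;
using Ast = std::vector<Node>;

// Example concrete node: a symbol such as `def` or `main`.
class Symbol : public Expression {
  std::string name;

public:
  explicit Symbol(std::string n) : name(std::move(n)) {}
  std::string toString() const override { return name; }
};
```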
DONE Episode 6 - The Semantic Analyzer
CLOSED: [2021-08-21 Sat 18:44]
Qs
- Why didn't we implement a linked list?
- Why are we using std::vector instead of the LLVM collections?
What is Semantic Analysis?
- Semantic Analysis makes sure that the given program is semantically correct.
- The type checker works as part of this step as well.
;; pseudo code
(4 main)
Semantic Analysis and rewrites
We need to reform the AST to closely reflect the semantics of Serene.
;; pseudo code
(def main (fn () 4))
(prn (main))
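One such rewrite: the fn in `(def main (fn () 4))` is syntactically anonymous, but semantically it is the function `main`. A toy version of the analyzer's rewrite (hypothetical node shapes, not the real code) might look like:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical AST node, just enough to show the rewrite.
struct Fn {
  std::string name; // anonymous fns start out with an empty name
};

// The analyzer walks the AST and, for a `def` binding an anonymous fn,
// produces a NEW node carrying the bound name, leaving the old AST intact.
std::shared_ptr<Fn> analyzeDef(const std::string &boundName,
                               const std::shared_ptr<Fn> &anonymousFn) {
  auto named = std::make_shared<Fn>(*anonymousFn); // copy, don't mutate
  named->name = boundName;
  return named;
}
```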
Let's run the compiler to see the semantic analysis in action.
Let's check out the code
DONE Episode 7 - The Context and Namespace
CLOSED: [2021-09-04 Sat 10:53]
Namespaces
Unit of compilation
Usually maps to a file
Keeps the state and environment
SereneContext vs LLVM Context vs MLIR Context
Compiler's global state
The owner of LLVM/MLIR contexts
Holds the namespace table
Probably will contain the primitive types as well
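Putting those responsibilities together, a sketch of such a context could look like this (the context types are stand-ins; in the real compiler they would be llvm::LLVMContext and mlir::MLIRContext, and the member names are assumptions):

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Stand-ins for the real LLVM/MLIR context classes.
struct LLVMContext {};
struct MLIRContext {};
struct Namespace {
  std::string name;
};

// The compiler's global state: owns the LLVM/MLIR contexts and
// holds the table of loaded namespaces.
class SereneContext {
  std::map<std::string, std::shared_ptr<Namespace>> namespaces;

public:
  LLVMContext llvmContext;
  MLIRContext mlirContext;

  void insertNS(std::shared_ptr<Namespace> ns) { namespaces[ns->name] = ns; }

  std::shared_ptr<Namespace> getNS(const std::string &name) {
    auto it = namespaces.find(name);
    return it == namespaces.end() ? nullptr : it->second;
  }
};
```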
DONE Episode 8 - MLIR Basics
CLOSED: [2021-09-17 Fri 10:18]
Serene Changes
- Introducing a SourceManager
- Reader changes
- serenec CLI interface is changing
Disclaimer
I'm not an expert in MLIR
Why?
- A bit of history
- LLVM IR is too low level
- We need an IR to implement high-level concepts and flows. MLIR is a framework to build a compiler with your own IR. Kinda :P
- Reusability
- …
Language
Overview
- SSA Based (https://en.wikipedia.org/wiki/Static_single_assignment_form)
- Typed
- Context free (for lack of a better word)
Dialects
- A collection of operations
- Custom types
- Meta data
- We can use a mixture of different dialects
builtin dialects:
- std
- llvm
- math
- async
- …
Operations
- Higher level of abstraction
- Not instructions
- SSA forms
- Tablegen backend
- Verifiers and printers
Attributes
Blocks & Regions
Types
- Extensible
Pass Infrastructure
Analysis and transformation infrastructure
- We will implement most of our semantic analysis logic and type checker as passes
Pattern Rewriting
- Tablegen-backed
Operation Definition Specification
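As a taste of ODS, a TableGen definition for a hypothetical serene.value-like operation might look roughly like this (illustrative only; the trait and type names are assumptions, see the MLIR ODS docs for the real details):

```tablegen
def ValueOp : Serene_Op<"value", [NoSideEffect]> {
  let summary = "An operation representing a constant i64 value";

  // One attribute in, one SSA result out.
  let arguments = (ins I64Attr:$value);
  let results = (outs I64:$result);
}
```

mlir-tblgen then generates the C++ class, accessors, verifier hooks, and printer/parser boilerplate from this description.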
Examples
Note: You need mlir-mode and llvm-mode available to you for the code highlighting of the following code blocks. Both of those are distributed with LLVM.
General syntax
%result:2 = "somedialect.blah"(%x#2) { some.attribute = true, other_attribute = 3 }
: (!somedialect<"example_type">) -> (!somedialect<"foo_s">, i8)
loc(callsite("main" at "main.srn":10:8))
Blocks and Regions
func @simple(i64, i1) -> i64 {
^bb0(%a: i64, %cond: i1): // Code dominated by ^bb0 may refer to %a
cond_br %cond, ^bb1, ^bb2
^bb1:
br ^bb3(%a: i64) // Branch passes %a as the argument
^bb2:
%b = addi %a, %a : i64
br ^bb3(%b: i64) // Branch passes %b as the argument
// ^bb3 receives an argument, named %c, from predecessors
// and passes it on to bb4 along with %a. %a is referenced
// directly from its defining operation and is not passed through
// an argument of ^bb3.
^bb3(%c: i64):
//br ^bb4(%c, %a : i64, i64)
"serene.ifop"(%c) ({ // if %a is in-scope in the containing region...
// then %a is in-scope here too.
%new_value = "another_op"(%c) : (i64) -> (i64)
^someblock(%new_value):
%x = "some_other_op"() {value = 4 : i64} : () -> i64
}) : (i64) -> (i64)
^bb4(%d : i64, %e : i64):
%0 = addi %d, %e : i64
return %0 : i64 // Return is also a terminator.
}
SLIR example
Command line arguments to emit slir
./builder run --build-dir ./build -emit slir `pwd`/docs/examples/hello_world.srn
Output:
module @user {
%0 = "serene.fn"() ( {
%2 = "serene.value"() {value = 0 : i64} : () -> i64
return %2 : i64
}) {args = {}, name = "main", sym_visibility = "public"} : () -> i64
%1 = "serene.fn"() ( {
%2 = "serene.value"() {value = 0 : i64} : () -> i64
return %2 : i64
}) {args = {n = i64, v = i64, y = i64}, name = "main1", sym_visibility = "public"} : () -> i64
}
Serene's MLIR (maybe we need a better name)
Command line arguments to emit mlir
./builder run --build-dir ./build -emit mlir `pwd`/docs/examples/hello_world.srn
Output:
module @user {
func @main() -> i64 {
%c3_i64 = constant 3 : i64
return %c3_i64 : i64
}
func @main1(%arg0: i64, %arg1: i64, %arg2: i64) -> i64 {
%c3_i64 = constant 3 : i64
return %c3_i64 : i64
}
}
LIR
Command line arguments to emit lir
./builder run --build-dir ./build -emit lir `pwd`/docs/examples/hello_world.srn
Output:
module @user {
llvm.func @main() -> i64 {
%0 = llvm.mlir.constant(3 : i64) : i64
llvm.return %0 : i64
}
llvm.func @main1(%arg0: i64, %arg1: i64, %arg2: i64) -> i64 {
%0 = llvm.mlir.constant(3 : i64) : i64
llvm.return %0 : i64
}
}
LLVMIR
Command line arguments to emit llvmir
./builder run --build-dir ./build -emit ir `pwd`/docs/examples/hello_world.srn
Output:
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
declare i8* @malloc(i64 %0)
declare void @free(i8* %0)
define i64 @main() !dbg !3 {
ret i64 3, !dbg !7
}
define i64 @main1(i64 %0, i64 %1, i64 %2) !dbg !9 {
ret i64 3, !dbg !10
}
!llvm.dbg.cu = !{!0}
!llvm.module.flags = !{!2}
!0 = distinct !DICompileUnit(language: DW_LANG_C, file: !1, producer: "mlir", isOptimized: true, runtimeVersion: 0, emissionKind: FullDebug)
!1 = !DIFile(filename: "LLVMDialectModule", directory: "/")
!2 = !{i32 2, !"Debug Info Version", i32 3}
!3 = distinct !DISubprogram(name: "main", linkageName: "main", scope: null, file: !4, type: !5, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !0, retainedNodes: !6)
!4 = !DIFile(filename: "REPL", directory: "/home/lxsameer/src/serene/serene/build")
!5 = !DISubroutineType(types: !6)
!6 = !{}
!7 = !DILocation(line: 0, column: 10, scope: !8)
!8 = !DILexicalBlockFile(scope: !3, file: !4, discriminator: 0)
!9 = distinct !DISubprogram(name: "main1", linkageName: "main1", scope: null, file: !4, line: 1, type: !5, scopeLine: 1, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !0, retainedNodes: !6)
!10 = !DILocation(line: 1, column: 11, scope: !11)
!11 = !DILexicalBlockFile(scope: !9, file: !4, discriminator: 0)
DONE Episode 9 - IR (SLIR) generation
CLOSED: [2021-10-01 Fri 18:56]
Updates:
- Source manager
- Diagnostic Engine
- JIT
There will be an episode dedicated to each of these
How does IR generation work
- Pass around MLIR context
- Create Builder objects that create operations in specific locations
- ModuleOp
- Namespace
How to define a new dialect
- Pure C++
- Tablegen
SLIR
The SLIR goal
- An IR that follows the AST
- Rename?
Steps
- Define the new dialect
- Setup the tablegen
- Define the operations
- Walk the AST and generate the operations
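The last step, walking the AST and generating operations, can be sketched with a toy builder (standing in for mlir::OpBuilder; the node and method names are made up for illustration):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy AST: a function definition holding a constant return value.
struct DefNode {
  std::string name;
  long value;
};

// Toy "builder": appends textual operations to a module, standing in
// for mlir::OpBuilder creating ops at a specific insertion point.
struct Builder {
  std::vector<std::string> module;
  void createFnOp(const std::string &name, long value) {
    module.push_back("serene.fn @" + name + " { serene.value " +
                     std::to_string(value) + " }");
  }
};

// Walk the AST and emit one operation per top-level node.
void generate(const std::vector<DefNode> &ast, Builder &b) {
  for (const auto &node : ast)
    b.createFnOp(node.name, node.value);
}
```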
DONE Episode 10 - Pass Infrastructure
CLOSED: [2021-10-15 Fri 14:17]
The next Step
Updates:
CMake changes
What is a Pass
Passes are the unit of abstraction for optimization and transformation in LLVM/MLIR
Compilation is all about transforming the input data and producing an output
Source code -> IR X -> IR Y -> IR Z -> … -> Target Code
Almost like a function composition
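The composition analogy can be made literal: each pass is a function from IR to IR, and running a pipeline is folding those functions over the input (a toy illustration where "IR" is just text, not MLIR's API):

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

using IR = std::string; // pretend the IR is just text
using Pass = std::function<IR(IR)>;

// Running the pipeline is the composition:
// target = (passN . ... . pass2 . pass1)(source)
IR runPipeline(const std::vector<Pass> &passes, IR input) {
  for (const auto &pass : passes)
    input = pass(std::move(input));
  return input;
}
```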
The big picture
Pass Managers (Pipelines) are made out of a collection of passes and can be nested
Most of the interesting parts of the compiler reside in passes.
We will probably spend most of our time working with passes
Pass Infrastructure
ODS or C++
Operation is the main abstract unit of transformation
OperationPass is the base class for all the passes.
We need to override runOnOperation
There are some rules you need to follow when defining your Pass
Must not maintain any global mutable state
Must not modify the state of another operation not nested within the current operation being operated on
…
Passes are either OpSpecific or OpAgnostic
OpSpecific
struct MyFunctionPass : public PassWrapper<MyFunctionPass,
OperationPass<FuncOp>> {
void runOnOperation() override {
// Get the current FuncOp operation being operated on.
FuncOp f = getOperation();
// Walk the operations within the function.
f.walk([](Operation *inst) {
// ....
});
}
};
/// Register this pass so that it can be built from a textual pass pipeline.
/// (Pass registration is discussed more below)
void registerMyPass() {
PassRegistration<MyFunctionPass>();
}
OpAgnostic
struct MyOperationPass : public PassWrapper<MyOperationPass, OperationPass<>> {
void runOnOperation() override {
// Get the current operation being operated on.
Operation *op = getOperation();
// ...
}
};
How does transformation work?
Analyses and Passes
Pass management and nested pass managers
// Create a top-level `PassManager` class. If an operation type is not
// explicitly specified, the default is the builtin `module` operation.
PassManager pm(ctx);
// Note: We could also create the above `PassManager` this way.
PassManager pm(ctx, /*operationName=*/"builtin.module");
// Add a pass on the top-level module operation.
pm.addPass(std::make_unique<MyModulePass>());
// Nest a pass manager that operates on `spirv.module` operations nested
// directly under the top-level module.
OpPassManager &nestedModulePM = pm.nest<spirv::ModuleOp>();
nestedModulePM.addPass(std::make_unique<MySPIRVModulePass>());
// Nest a pass manager that operates on functions within the nested SPIRV
// module.
OpPassManager &nestedFunctionPM = nestedModulePM.nest<FuncOp>();
nestedFunctionPM.addPass(std::make_unique<MyFunctionPass>());
// Run the pass manager on the top-level module.
ModuleOp m = ...;
if (failed(pm.run(m))) {
// Handle the failure
}
DONE Episode 11 - Lowering SLIR
CLOSED: [2021-11-01 Mon 15:14]
Overview
- What is a Pass?
- Pass Manager
Dialect lowering
Why?
Transforming a dialect to another dialect or LLVM IR
The goal is to lower SLIR to LLVM IR directly or indirectly.
Dialect Conversions
This framework allows for transforming a set of illegal operations to a set of legal ones.
Target Conversion
Rewrite Patterns
Type Converter
Full vs Partial Conversion
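The full-conversion idea can be modeled in a few lines (a toy model of the framework, not MLIR's real ConversionTarget/RewritePattern API): every op that is not already legal must be rewritten by some pattern into a legal one, otherwise the whole conversion fails.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// A toy operation: just a dialect name and an op name.
struct ToyOp {
  std::string dialect, name;
};

// A pattern returns true if it managed to rewrite the op.
using Pattern = std::function<bool(ToyOp &)>;

// Full conversion: every op must end up in the legal dialect.
bool applyFullConversion(std::vector<ToyOp> &ops,
                         const std::string &legalDialect,
                         const std::vector<Pattern> &patterns) {
  for (auto &op : ops) {
    if (op.dialect == legalDialect)
      continue; // already legal
    bool rewritten = false;
    for (const auto &pattern : patterns)
      if ((rewritten = pattern(op)))
        break;
    if (!rewritten || op.dialect != legalDialect)
      return false; // an illegal op survived -> full conversion fails
  }
  return true;
}
```

A partial conversion would differ only in the failure case: surviving illegal ops are simply left in place.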
DONE Episode 12 - Target code generation
CLOSED: [2021-11-04 Thu 00:57]
Updates:
JIT work
Emacs dev mode
So far….
Next Step
Compile to object files
Link object files to create an executable
End of wiring for static compilers
What is an object file?
Symbols
- A pair of a name and a value
- The value of a defined symbol is an offset in the Content
- Undefined symbols
Relocations
Are computations to perform on the Content. For example, "set this location in the contents to the value of this symbol plus this addend".
The linker will apply all the relocations in an object file at link time, and if it cannot resolve an undefined symbol, most of the time it will raise an error (depending on the relocation and the symbol).
Contents
- Are what memory should look like during the execution
- Have a size
- Have a type
- Have an array of bytes
- Has sections like:
- .text: The target code generated by the compiler
- .data: The values of initialized variables
- .rdata: Static unnamed data like literal strings, protocol tables and ….
- .bss: Uninitialized variables (the content can be omitted or stripped and assumed to contain only zeros)
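The section names above map directly onto C++ definitions; compilers typically place them as follows (a sketch, since exact placement is implementation-defined):

```cpp
#include <cassert>
#include <cstring>

int initialized = 42;          // .data: initialized global variable
int uninitialized[1024];       // .bss: zero-initialized, takes no file space
const char *message = "hello"; // the literal "hello" lives in .rodata/.rdata

// The machine code for add() lives in .text.
int add(int a, int b) { return a + b; }
```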
Linking process
During the linking process, the linker assigns an address to each defined symbol and tries to resolve undefined symbols.
The linker will:
- read the object files
- read the contents
  - as raw data
  - figure out the length
- read the symbols and create a symbol table
- link undefined symbols to their definitions (possibly from other object files or libs)
- decide where all the content should go in memory
  - sort them based on the type
  - concat them together
- apply relocations
- write the result to a file as an executable
AOT vs JIT
Let's look at some code
Resources:
Episode 13 - Source Managers
FAQ:
- What tools are you using?
Updates:
- Still JIT
- We're going to start the JIT discussion from next EP
Forgot to showcase the code generation
I didn't show it in action
What is a source manager
- It owns and manages all the source buffers
- All of our interactions with source files will happen through the source manager
- Including reading files
- Loading namespaces
- Including namespaces
- …
- LLVM provides a SourceMgr class, but we're not using it
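A minimal sketch of what such a source manager could look like (hypothetical shape, not the actual Serene class; in the real compiler loadNamespace would read the file from a load path rather than take the text from the caller):

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Owns every loaded source buffer; the rest of the compiler asks it
// (never the filesystem directly) for source text, e.g. when loading
// a namespace.
class SourceManager {
  std::map<std::string, std::string> buffers; // namespace name -> source

public:
  void loadNamespace(const std::string &ns, std::string source) {
    buffers[ns] = std::move(source);
  }

  std::optional<std::string> getSource(const std::string &ns) const {
    auto it = buffers.find(ns);
    if (it == buffers.end())
      return std::nullopt;
    return it->second;
  }
};
```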