Apply doc patch from PR136.

llvm-svn: 10198
Brian Gaeke 2003-11-24 17:03:38 +00:00
parent bb718f14e0
commit 31715d3a42
1 changed files with 347 additions and 59 deletions


@ -6,9 +6,21 @@
</head>
<body>
<div class="doc_title">Stacker: An Example Of Using LLVM</div>
<hr>
<ol>
<li><a href="#abstract">Abstract</a></li>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#lessons">Lessons I Learned About LLVM</a>
<ol>
<li><a href="#value">Everything's a Value!</a></li>
<li><a href="#terminate">Terminate Those Blocks!</a></li>
<li><a href="#blocks">Concrete Blocks</a></li>
<li><a href="#push_back">push_back Is Your Friend</a></li>
<li><a href="#gep">The Wily GetElementPtrInst</a></li>
<li><a href="#linkage">Getting Linkage Types Right</a></li>
<li><a href="#constants">Constants Are Easier Than That!</a></li>
</ol>
</li>
<li><a href="#lexicon">The Stacker Lexicon</a>
<ol>
<li><a href="#stack">The Stack</a>
@ -18,12 +30,24 @@
<li><a href="#builtins">Built-Ins</a>
</ol>
</li>
<li><a href="#directory">The Directory Structure </a>
<li><a href="#example">Prime: A Complete Example</a></li>
<li><a href="#internal">Internal Code Details</a>
<ol>
<li><a href="#directory">The Directory Structure </a></li>
<li><a href="#lexer">The Lexer</a></li>
<li><a href="#parser">The Parser</a></li>
<li><a href="#compiler">The Compiler</a></li>
<li><a href="#runtime">The Runtime</a></li>
<li><a href="#driver">Compiler Driver</a></li>
<li><a href="#tests">Test Programs</a></li>
</ol>
</li>
</ol>
<div class="doc_text">
<p><b>Written by <a href="mailto:rspencer@x10sys.com">Reid Spencer</a> </b></p>
<p> </p>
</div>
<hr>
<!-- ======================================================================= -->
<div class="doc_section"> <a name="abstract">Abstract </a></div>
<div class="doc_text">
@ -80,31 +104,266 @@ written Stacker definitions have that characteristic. </p>
<p>Exercise for the reader: how could you make this a one line program?</p>
</div>
<!-- ======================================================================= -->
<div class="doc_section"><a name="stack"></a>Lessons Learned About LLVM</div>
<div class="doc_section"><a name="lessons"></a>Lessons I Learned About LLVM</div>
<div class="doc_text">
<p>Stacker was written for two purposes: (a) to get the author over the
learning curve and (b) to provide a simple example of how to write a compiler
using LLVM. During the development of Stacker, many lessons about LLVM were
learned. Those lessons are described in the following subsections.</p>
</div>
<!-- ======================================================================= -->
<div class="doc_subsection"><a name="value"></a>Everything's a Value!</div>
<div class="doc_text">
<p>Although I knew that LLVM uses a Static Single Assignment (SSA) format,
it wasn't obvious to me how prevalent this idea was in LLVM until I really
started using it. Reading the Programmer's Manual and Language Reference, I
noted that most of the important LLVM IR (Intermediate Representation) C++
classes were derived from the Value class. I only appreciated the full power
of that simple design once I started constructing executable expressions for
Stacker.</p>
<p>This really makes your programming go faster. Think about compiling code
for the following C/C++ expression: (a|b)*((x+1)/(y+1)). You could write a
function using LLVM that does exactly that, this way:</p>
<pre><code>
Value*
expression(BasicBlock* bb, Value* a, Value* b, Value* x, Value* y )
{
  // Insert every new instruction just before the block's terminator.
  Instruction* tail = bb->getTerminator();
  ConstantSInt* one = ConstantSInt::get( Type::IntTy, 1 );
  // BinaryOperator::create is a static factory method, so no "new" is needed.
  BinaryOperator* or1 =
    BinaryOperator::create( Instruction::Or, a, b, "", tail );
  BinaryOperator* add1 =
    BinaryOperator::create( Instruction::Add, x, one, "", tail );
  BinaryOperator* add2 =
    BinaryOperator::create( Instruction::Add, y, one, "", tail );
  BinaryOperator* div1 =
    BinaryOperator::create( Instruction::Div, add1, add2, "", tail );
  BinaryOperator* mult1 =
    BinaryOperator::create( Instruction::Mul, or1, div1, "", tail );
  return mult1;
}
</code></pre>
<p>"Okay, big deal," you say. It is a big deal. Here's why. Note that I didn't
have to tell this function which kinds of Values are being passed in. They could be
instructions, Constants, Global Variables, etc. Furthermore, if you specify Values
that are incorrect for this sequence of operations, LLVM will either notice right
away (at compilation time) or the LLVM Verifier will pick up the inconsistency
when the compiler runs. In no case will you make a type error that gets passed
through to the generated program. This <em>really</em> helps you write a compiler
that always generates correct code!</p>
<p>The second point is that we don't have to worry about branching, registers,
stack variables, saving partial results, etc. The instructions we create
<em>are</em> the values we use. Note that all that was created in the above
code is a Constant value and five operators. Each of the instructions <em>is</em>
the resulting value of that instruction.</p>
<p>The lesson is this: <em>SSA form is very powerful: there is no difference
between a value and the instruction that created it.</em> This is fully
enforced by the LLVM IR. Use it to your best advantage.</p>
</div>
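<div class="doc_text">
<p>To see the "instruction <em>is</em> the value" idea in isolation, here is a
minimal, self-contained sketch that rebuilds the same expression with toy
classes. The names <code>Constant</code>, <code>BinaryOp</code>,
<code>constant</code>, <code>binop</code> and <code>eval</code> are assumptions
of this sketch and are not LLVM's classes; the toy simply evaluates the tree
rather than generating code.</p>

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Toy stand-ins for illustration only; these are NOT LLVM's classes.
struct Value {
    virtual ~Value() = default;
    virtual long eval() const = 0;  // the value this node stands for
};
using ValuePtr = std::shared_ptr<Value>;

// A constant is a Value.
struct Constant : Value {
    long v;
    explicit Constant(long v) : v(v) {}
    long eval() const override { return v; }
};

// An instruction is also a Value: the node itself is its own result.
struct BinaryOp : Value {
    char op;
    ValuePtr lhs, rhs;
    BinaryOp(char op, ValuePtr l, ValuePtr r)
        : op(op), lhs(std::move(l)), rhs(std::move(r)) {
        assert(lhs && rhs);
        assert(op == '|' || op == '+' || op == '/' || op == '*');
    }
    long eval() const override {
        long a = lhs->eval(), b = rhs->eval();
        switch (op) {
            case '|': return a | b;
            case '+': return a + b;
            case '/': return a / b;  // caller must avoid a zero divisor
            default:  return a * b;  // '*'
        }
    }
};

ValuePtr constant(long v) { return std::make_shared<Constant>(v); }

ValuePtr binop(char op, ValuePtr l, ValuePtr r) {
    return std::make_shared<BinaryOp>(op, std::move(l), std::move(r));
}

// Builds (a|b)*((x+1)/(y+1)) without caring what kind of Value each operand is.
ValuePtr expression(ValuePtr a, ValuePtr b, ValuePtr x, ValuePtr y) {
    ValuePtr one  = constant(1);
    ValuePtr or1  = binop('|', a, b);
    ValuePtr add1 = binop('+', x, one);
    ValuePtr add2 = binop('+', y, one);
    ValuePtr div1 = binop('/', add1, add2);
    return binop('*', or1, div1);
}
```

<p>Because <code>expression</code> only asks for Values, a caller can pass a
constant for one operand and the result of another operation for the next,
with no change to the function.</p>
</div>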
<!-- ======================================================================= -->
<div class="doc_subsection"><a name="terminate"></a>Terminate Those Blocks!</div>
<div class="doc_text">
<p>I had to learn about terminating blocks the hard way: using the debugger
to figure out what the LLVM verifier was trying to tell me and begging for
help on the LLVMdev mailing list. I hope you avoid this experience.</p>
<p>Emblazon this rule in your mind:</p>
<ul>
<li><em>All</em> <code>BasicBlock</code>s in your compiler <b>must</b> be
terminated with a terminating instruction (branch, return, etc.).
</li>
</ul>
<p>Terminating instructions are a semantic requirement of the LLVM IR. There
is no facility for implicitly chaining together blocks placed into a function
in the order they occur. Indeed, in the general case, blocks will not be
added to the function in the order of execution because of the recursive
way compilers are written.</p>
<p>Furthermore, if you don't terminate your blocks, your compiler code will
compile just fine. You won't find out about the problem until you're running
the compiler and the module you just created fails on the LLVM Verifier.</p>
</div>
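<div class="doc_text">
<p>The rule can be pictured with a small, self-contained check. This is only a
sketch of the kind of test a verifier performs, written against toy types
(<code>ToyInst</code>, <code>ToyBlock</code>) that are assumptions of this
example; it is not the LLVM Verifier or its API.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model for illustration; not LLVM's Instruction or BasicBlock.
struct ToyInst {
    std::string text;
    bool is_terminator;  // true for branch, return, etc.
};
using ToyBlock = std::vector<ToyInst>;

// A block passes if it is non-empty, ends in a terminator,
// and has no terminator before its last instruction.
bool block_is_terminated(const ToyBlock& bb) {
    if (bb.empty() || !bb.back().is_terminator) return false;
    for (std::size_t i = 0; i + 1 < bb.size(); ++i)
        if (bb[i].is_terminator) return false;
    return true;
}

// Every block in the function must pass the check.
bool function_is_well_formed(const std::vector<ToyBlock>& blocks) {
    assert(!blocks.empty());
    for (const ToyBlock& bb : blocks)
        if (!block_is_terminated(bb)) return false;
    return true;
}
```

<p>Note that nothing in the toy types stops you from building an unterminated
block; the mistake only shows up when the check runs, which mirrors the
experience described above.</p>
</div>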
<!-- ======================================================================= -->
<div class="doc_subsection"><a name="blocks"></a>Concrete Blocks</div>
<div class="doc_text">
<p>After a little initial fumbling around, I quickly caught on to how blocks
should be constructed. The use of the standard template library really helps
simplify the interface. In general, here's what I learned:</p>
<ol>
<li><em>Create your blocks early.</em> While writing your compiler, you
will encounter several situations where you know a priori that you will
need several blocks. For example, if-then-else, switch, while and for
statements in C/C++ all need multiple blocks for expression in LLVM.
The rule is, create them early.</li>
<li><em>Terminate your blocks early.</em> This reduces the chance that you
forget to terminate a block, which is required (see
<a href="#terminate">Terminate Those Blocks!</a> for more).</li>
<li><em>Use getTerminator() for instruction insertion.</em> I noticed early on
that many of the constructors for the Instruction classes take an optional
<code>insert_before</code> argument. At first, I thought this was a mistake
because clearly the normal mode of inserting instructions would be one at
a time <em>after</em> some other instruction, not <em>before</em>. However,
if you hold on to your terminating instruction (or use the handy dandy
<code>getTerminator()</code> method on a <code>BasicBlock</code>), it can
always be used as the <code>insert_before</code> argument to your instruction
constructors. This causes the instruction to automatically be inserted in
the RightPlace&trade;, just before the terminating instruction. The
nice thing about this design is that you can pass blocks around and insert
new instructions into them without ever knowing what instructions came
before. This makes for some very clean compiler design.</li>
</ol>
<p>The foregoing is such an important principle, it's worth making an idiom:</p>
<pre>
<code>
BasicBlock* bb = new BasicBlock();
bb->getInstList().push_back( new BranchInst( ... ) );
new Instruction(..., bb->getTerminator() );
</code>
</pre>
<p>To make this clear, consider the typical if-then-else statement
(see StackerCompiler::handle_if() method). We can set this up
in a single function using LLVM in the following way: </p>
<pre>
using namespace llvm;
BasicBlock*
MyCompiler::handle_if( BasicBlock* bb, SetCondInst* condition )
{
  // Create the blocks to contain code in the structure of if/then/else.
  // ("else" is a C++ keyword, so the blocks carry a _bb suffix.)
  BasicBlock* then_bb = new BasicBlock();
  BasicBlock* else_bb = new BasicBlock();
  BasicBlock* exit_bb = new BasicBlock();
  // Insert the branch instruction for the "if"
  bb->getInstList().push_back( new BranchInst( then_bb, else_bb, condition ) );
  // Set up the terminating instructions
  then_bb->getInstList().push_back( new BranchInst( exit_bb ) );
  else_bb->getInstList().push_back( new BranchInst( exit_bb ) );
  // Fill in the then part .. details excised for brevity
  this->fill_in( then_bb );
  // Fill in the else part .. details excised for brevity
  this->fill_in( else_bb );
  // Return a block to the caller that can be filled in with the code
  // that follows the if/then/else construct.
  return exit_bb;
}
</pre>
<p>In the foregoing, the calls to the "fill_in" method would presumably add
the instructions for the "then" and "else" parts. They would use the third part
of the idiom almost exclusively (inserting new instructions before the
terminator). Furthermore, they could even recurse back to <code>handle_if</code>
should they encounter another if/then/else statement and it will all "just work".
</p>
<p>Note how cleanly this all works out. In particular, note the push_back
methods on the <code>BasicBlock</code>'s instruction list. These are lists of
<code>Instruction</code>s, which also happen to be <code>Value</code>s. To create
the "if" branch we merely instantiate a <code>BranchInst</code> that takes as
arguments the blocks to branch to and the condition to branch on. The blocks
act like branch labels! This new <code>BranchInst</code> terminates
the <code>BasicBlock</code> provided as an argument. To give the caller a way
to keep inserting after calling <code>handle_if</code> we create an "exit" block
which is returned to the caller. Note that the "exit" block is the branch
target of the terminators of both the "then" and the "else" blocks. This
guarantees that no matter what else "handle_if" or "fill_in" does, they end up
at the "exit" block.
</p>
</div>
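<div class="doc_text">
<p>The three-step idiom (create, terminate, then insert before the terminator)
can be sketched without LLVM at all. The following toy uses a
<code>std::list</code> of instruction names in place of a real instruction
list; <code>SketchBlock</code>, <code>make_terminated_block</code> and
<code>insert_before_terminator</code> are hypothetical names invented for this
example.</p>

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <string>

// Toy block: a list of instruction names whose last entry is the terminator.
// Illustration only; this is not LLVM's BasicBlock.
using SketchBlock = std::list<std::string>;

// Steps 1 and 2 of the idiom: create the block and terminate it right away.
SketchBlock make_terminated_block(const std::string& terminator) {
    SketchBlock bb;
    bb.push_back(terminator);
    return bb;
}

// Step 3: insert just before the terminator, whatever came before it.
void insert_before_terminator(SketchBlock& bb, const std::string& inst) {
    assert(!bb.empty());  // the block must already be terminated
    bb.insert(std::prev(bb.end()), inst);
}
```

<p>Successive calls keep program order: each new instruction lands after the
ones inserted earlier and before the terminator, so code that receives a block
never needs to know what is already in it.</p>
</div>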
<!-- ======================================================================= -->
<div class="doc_subsection"><a name="push_back"></a>push_back Is Your Friend</div>
<div class="doc_text">
<p>
One of the first things I noticed is the frequent use of the "push_back"
method on the various lists. This is so common that it is worth mentioning.
The "push_back" method appends a value to the end of an STL list, vector, etc.
The method might have also been named "insert_tail" or "append".
Although I've used STL quite frequently, my use of push_back wasn't very
high in other programs. In LLVM, you'll use it all the time.
</p>
</div>
<!-- ======================================================================= -->
<div class="doc_subsection"><a name="gep"></a>The Wily GetElementPtrInst</div>
<div class="doc_text">
<p>
It took a little getting used to and several rounds of postings to the LLVM
mailing list to wrap my head around this instruction correctly. Even though I had
read the Language Reference and Programmer's Manual a couple of times each, I
still missed a few <em>very</em> key points:
</p>
<ul>
<li>GetElementPtrInst gives you back a Value for the last thing indexed.</li>
<li>All global variables in LLVM are <em>pointers</em>.</li>
<li>Pointers must also be dereferenced with the GetElementPtrInst instruction.</li>
</ul>
<p>This means that when you look up an element in the global variable (assuming
it's a struct or array), you <em>must</em> dereference the pointer first! For many
things, this leads to the idiom:
</p>
<pre><code>
std::vector&lt;Value*&gt; index_vector;
index_vector.push_back( ConstantSInt::get( Type::LongTy, 0 ) );
// ... push other indices ...
GetElementPtrInst* gep = new GetElementPtrInst( ptr, index_vector );
</code></pre>
<p>For example, suppose we have a global variable whose type is [24 x int]. The
variable itself represents a <em>pointer</em> to that array. To subscript the
array, we need two indices, not just one. The first index (0) dereferences the
pointer. The second index subscripts the array. If you're a "C" programmer, this
will run against your grain because you'll naturally think of the global array
variable and the address of its first element as the same. That tripped me up
for a while until I realized that they really do differ .. by <em>type</em>.
Remember that LLVM is a strongly typed language itself. Absolutely everything
has a type. The "type" of the global variable is [24 x int]*. That is, it's
a pointer to an array of 24 ints. When you dereference that global variable with
a single index, you now have a "[24 x int]" type; the pointer is gone. Although
the pointer value of the dereferenced global and the address of the zero'th element
in the array will be the same, they differ in their type. The zero'th element has
type "int" while the pointer value has type "[24 x int]".</p>
<p>Get this one aspect of LLVM right in your head and you'll save yourself
a lot of compiler writing headaches down the road.</p>
</div>
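<div class="doc_text">
<p>The role of the two indices is easier to see as address arithmetic. The
sketch below is not LLVM code; it just computes the byte offset that a
two-index lookup selects for the [24 x int] example, under the assumption
(made for this example only) that an int occupies 4 bytes. The name
<code>element_offset</code> is hypothetical.</p>

```cpp
#include <cassert>
#include <cstddef>

// Assumed layout for the sketch: a global of type [24 x int] with 4-byte ints.
constexpr std::size_t kIntSize   = 4;
constexpr std::size_t kArrayLen  = 24;
constexpr std::size_t kArraySize = kArrayLen * kIntSize;  // 96 bytes

// Byte offset selected by a two-index lookup through the global's pointer.
// The first index steps over whole [24 x int] objects (0 means "this one");
// the second index selects an element inside the array.
std::size_t element_offset(std::size_t pointer_index, std::size_t element_index) {
    assert(element_index < kArrayLen);
    return pointer_index * kArraySize + element_index * kIntSize;
}
```

<p>With a first index of 0 the offset depends only on the array subscript,
which is why the leading zero looks redundant to a C programmer yet is
required: it is the step that goes <em>through</em> the pointer.</p>
</div>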
<!-- ======================================================================= -->
<div class="doc_subsection"><a name="linkage"></a>Getting Linkage Types Right</div>
<div class="doc_text"><p>To be completed.</p></div>
<div class="doc_subsection"><a name="linkage"></a>Everything's a Value!</div>
<div class="doc_text"><p>To be completed.</p></div>
<div class="doc_subsection"><a name="linkage"></a>The Wily GetElementPtrInst</div>
<div class="doc_text"><p>To be completed.</p></div>
<div class="doc_subsection"><a name="linkage"></a>Constants Are Easier Than That!</div>
<div class="doc_text"><p>To be completed.</p></div>
<div class="doc_subsection"><a name="linkage"></a>Terminate Those Blocks!</div>
<div class="doc_text"><p>To be completed.</p></div>
<div class="doc_subsection"><a name="linkage"></a>new,get,create .. Its All The Same</div>
<div class="doc_text"><p>To be completed.</p></div>
<div class="doc_subsection"><a name="linkage"></a>Utility Functions To The Rescue</div>
<div class="doc_text"><p>To be completed.</p></div>
<div class="doc_subsection"><a name="linkage"></a>push_back Is Your Friend</div>
<div class="doc_text"><p>To be completed.</p></div>
<div class="doc_subsection"><a name="linkage"></a>Block Heads Come First</div>
<div class="doc_text"><p>To be completed.</p></div>
<div class="doc_text">
<p>Linkage types in LLVM can be a little confusing, especially if your compiler
writing mind has attached very firm meanings to particular words like "weak",
"external", "global", "linkonce", etc. LLVM does <em>not</em> use the precise
definitions of, say, ELF or GCC, even though they share common terms. To be fair,
the concepts are related and similar but not precisely the same. This can lead
you to think you know what a linkage type represents when in fact it is slightly
different. I recommend you read the
<a href="LangRef.html#linkage"> Language Reference on this topic</a> very
carefully.</p>
<p>Here are some handy tips that I discovered along the way:</p>
<ul>
<li>Uninitialized means external. That is, the symbol is declared in the current
module and can be used by that module but it is not defined by that module.</li>
<li>Setting an initializer changes a global's linkage type from whatever it was
to a normal, defined global (not external). You'll need to call the setLinkage()
method to reset it if you specify the initializer after the GlobalValue has been
constructed. This is important for LinkOnce and Weak linkage types.</li>
<li>Appending linkage can be used to keep track of compilation information at
runtime. It could be used, for example, to build a full table of all the C++
virtual tables or hold the C++ RTTI data, or whatever. Appending linkage can
only be applied to arrays. The arrays are concatenated together at link time.</li>
</ul>
</div>
<!-- ======================================================================= -->
<div class="doc_subsection"><a name="constants"></a>Constants Are Easier Than That!</div>
<div class="doc_text">
<p>
Constants in LLVM took a little getting used to until I discovered a few utility
functions in the LLVM IR that make things easier. Here's what I learned: </p>
<ul>
<li>Constants are Values like anything else and can be operands of instructions.</li>
<li>Integer constants, which are frequently needed, can be created using the
static "get" methods of the ConstantInt, ConstantSInt, and ConstantUInt classes.
The nice thing about these is that you can "get" any kind of integer quickly.</li>
<li>There's a special method on the Constant class which allows you to get the
null constant for <em>any</em> type. This is really handy for initializing large
arrays or structures, etc.</li>
</ul>
</div>
<!-- ======================================================================= -->
<div class="doc_section"> <a name="lexicon">The Stacker Lexicon</a></div>
<div class="doc_subsection"><a name="stack"></a>The Stack</div>
@ -184,7 +443,7 @@ depending on what they do. The groups are as follows:</p>
their operands. <br/> The words are: ABS NEG + - * / MOD */ ++ -- MIN MAX</li>
<li><em>Stack</em>These words manipulate the stack directly by moving
its elements around.<br/> The words are: DROP DUP SWAP OVER ROT DUP2 DROP2 PICK TUCK</li>
<li><em>Memory></em>These words allocate, free and manipulate memory
<li><em>Memory</em>These words allocate, free and manipulate memory
areas outside the stack.<br/>The words are: MALLOC FREE GET PUT</li>
<li><em>Control</em>These words alter the normal left to right flow
of execution.<br/>The words are: IF ELSE ENDIF WHILE END RETURN EXIT RECURSE</li>
@ -696,39 +955,19 @@ using the following construction:</p>
</table>
</div>
<!-- ======================================================================= -->
<div class="doc_section"> <a name="directory">Directory Structure</a></div>
<div class="doc_section"> <a name="example">Prime: A Complete Example</a></div>
<div class="doc_text">
<p>The source code, test programs, and sample programs can all be found
under the LLVM "projects" directory. You will need to obtain the LLVM sources
to find it, either via anonymous CVS or a tarball (see the
<a href="GettingStarted.html">Getting Started</a> document).</p>
<p>Under the "projects" directory there is a directory named "stacker". That
directory contains everything, as follows:</p>
<ul>
<li><em>lib</em> - contains most of the source code
<ul>
<li><em>lib/compiler</em> - contains the compiler library</li>
<li><em>lib/runtime</em> - contains the runtime library</li>
</ul></li>
<li><em>test</em> - contains the test programs</li>
<li><em>tools</em> - contains the Stacker compiler main program, stkrc
<ul>
<li><em>tools/stkrc</em> - contains the Stacker compiler main program</li>
</ul></li>
<li><em>sample</em> - contains the sample programs</li>
</ul>
</div>
<!-- ======================================================================= -->
<div class="doc_section"> <a name="directory">Prime: A Complete Example</a></div>
<div class="doc_text">
<p>The following fully documented program highlights many of features of both
the Stacker language and what is possible with LLVM. The program simply
prints out the prime numbers until it reaches
<p>The following fully documented program highlights many features of both
the Stacker language and what is possible with LLVM. The program has two modes
of operation. If you provide numeric arguments to the program, it checks to see
if those arguments are prime numbers and prints out the results. Without any
arguments, the program prints out any prime numbers it finds between 1 and one
million (there's a lot of them!). The source code comments below tell the
remainder of the story.
</p>
</div>
<div class="doc_text">
<p><code>
<![CDATA[
<pre><code>
################################################################################
#
# Brute force prime number generator
@ -964,19 +1203,68 @@ prints out the prime numbers until it reaches
ENDIF
0 ( push return code )
;
]]>
</code>
</p>
</pre>
</div>
<!-- ======================================================================= -->
<div class="doc_section"> <a name="lexicon">Internals</a></div>
<div class="doc_text"><p>To be completed.</p></div>
<div class="doc_subsection"><a name="stack"></a>The Lexer</div>
<div class="doc_subsection"><a name="stack"></a>The Parser</div>
<div class="doc_subsection"><a name="stack"></a>The Compiler</div>
<div class="doc_subsection"><a name="stack"></a>The Stack</div>
<div class="doc_subsection"><a name="stack"></a>Definitions Are Functions</div>
<div class="doc_subsection"><a name="stack"></a>Words Are BasicBlocks</div>
<div class="doc_section"> <a name="internal">Internals</a></div>
<div class="doc_text">
<p><b>This section is under construction.</b></p>
<p>In the meantime, you can always read the code! It has comments!</p>
</div>
<!-- ======================================================================= -->
<div class="doc_subsection"> <a name="directory">Directory Structure</a></div>
<div class="doc_text">
<p>The source code, test programs, and sample programs can all be found
under the LLVM "projects" directory. You will need to obtain the LLVM sources
to find it, either via anonymous CVS or a tarball (see the
<a href="GettingStarted.html">Getting Started</a> document).</p>
<p>Under the "projects" directory there is a directory named "stacker". That
directory contains everything, as follows:</p>
<ul>
<li><em>lib</em> - contains most of the source code
<ul>
<li><em>lib/compiler</em> - contains the compiler library</li>
<li><em>lib/runtime</em> - contains the runtime library</li>
</ul></li>
<li><em>test</em> - contains the test programs</li>
<li><em>tools</em> - contains the Stacker compiler main program, stkrc
<ul>
<li><em>tools/stkrc</em> - contains the Stacker compiler main program</li>
</ul></li>
<li><em>sample</em> - contains the sample programs</li>
</ul>
</div>
<!-- ======================================================================= -->
<div class="doc_subsection"><a name="lexer"></a>The Lexer</div>
<div class="doc_text">
<p>See projects/Stacker/lib/compiler/Lexer.l</p>
</div>
<!-- ======================================================================= -->
<div class="doc_subsection"><a name="parser"></a>The Parser</div>
<div class="doc_text">
<p>See projects/Stacker/lib/compiler/StackerParser.y</p>
</div>
<!-- ======================================================================= -->
<div class="doc_subsection"><a name="compiler"></a>The Compiler</div>
<div class="doc_text">
<p>See projects/Stacker/lib/compiler/StackerCompiler.cpp</p>
</div>
<!-- ======================================================================= -->
<div class="doc_subsection"><a name="runtime"></a>The Runtime</div>
<div class="doc_text">
<p>See projects/Stacker/lib/runtime/stacker_rt.c</p>
</div>
<!-- ======================================================================= -->
<div class="doc_subsection"><a name="driver"></a>Compiler Driver</div>
<div class="doc_text">
<p>See projects/Stacker/tools/stkrc/stkrc.cpp</p>
</div>
<!-- ======================================================================= -->
<div class="doc_subsection"><a name="tests"></a>Test Programs</div>
<div class="doc_text">
<p>See projects/Stacker/test/*.st</p>
</div>
<!-- ======================================================================= -->
<hr>
<div class="doc_footer">