[docs 3/5] Add per-directory high-level docs

This commit introduces a module.md file for several CProver directories.
Each of these is turned into a page under the Modules section in the
generated Doxygen documentation.

The intention is that developers wishing to contribute to one specific
aspect of CProver can get a high-level architectural overview of a
particular directory; the documentation describes the input to and
output from that directory, and introduces the main classes or entry
points.

By way of a "table of contents," the file cbmc/module.md contains a
diagram describing how each of the directories is invoked by CBMC in
order, and the nodes of the diagram hyperlink to the appropriate
documentation. The intention is that developers wishing to contribute to
CBMC as a whole can understand the entire process, from source files to
bug reports and counterexample production.

This documentation is derived from Mark Tuttle's notes on a talk given
by Michael Tautschnig.
This commit is contained in:
Kareem Khazem 2017-07-05 08:51:42 +01:00
parent f5be7f1190
commit 3a58226cf4
7 changed files with 554 additions and 2 deletions

View File

@ -1,8 +1,11 @@
CProver Documentation
=====================
These pages contain both user tutorials and automatically-generated API
documentation. Users can download CProver tools from the
\author Kareem Khazem
These pages contain user tutorials, automatically-generated API
documentation, and higher-level architectural overviews for the
CProver codebase. Users can download CProver tools from the
<a href="http://www.cprover.org/">CProver website</a>; contributors
should use the <a href="https://github.com/diffblue/cbmc">repository</a>
hosted on GitHub.
@ -21,4 +24,12 @@ hosted on GitHub.
members in the search bar at top-right or use one of the links in the
sidebar.
* For higher-level architectural information, each of the pages under
the "Modules" link in the sidebar gives an overview of a directory in
the CProver codebase.
* The \ref module_cbmc "CBMC guided tour" is a good start for new
contributors to CBMC. It describes the stages through which CBMC
transforms source files into bug reports and counterexamples, linking
to the relevant documentation for each stage.
\defgroup module_hidden _hidden

114
src/ansi-c/module.md Normal file
View File

@ -0,0 +1,114 @@
\ingroup module_hidden
\defgroup module_ansi-c ANSI-C Language Front-end
\author Kareem Khazem
\section preprocessing Preprocessing & Parsing
In the \ref ansi-c and \ref java_bytecode directories
**Key classes:**
* \ref languaget and its subclasses
* ansi_c_parse_treet
\dot
digraph G {
node [shape=box];
rankdir="LR";
1 [shape=none, label=""];
2 [label="preprocessing & parsing"];
3 [shape=none, label=""];
1 -> 2 [label="Command line options, file names"];
2 -> 3 [label="Parse tree"];
}
\enddot
---
\section type-checking Type-checking
In the \ref ansi-c and \ref java_bytecode directories.
**Key classes:**
* \ref languaget and its subclasses
* \ref irept
* \ref irep_idt
* \ref symbolt
* symbol_tablet
\dot
digraph G {
node [shape=box];
rankdir="LR";
1 [shape=none, label=""];
2 [label="type checking"];
3 [shape=none, label=""];
1 -> 2 [label="Parse tree"];
2 -> 3 [label="Symbol table"];
}
\enddot
This stage generates a symbol table, mapping identifiers to symbols;
\ref symbolt "symbols" are tuples of (value, type, location, flags).
This is a good point to introduce the \ref irept ("internal
representation") class---the base type of many of CBMC's hierarchical
data structures. In particular, \ref exprt "expressions",
\ref typet "types" and \ref codet "statements" are all subtypes of
\ref irept.
An irep is a tree of ireps. A subtlety is that an irep is actually the
root of _three_ (possibly empty) trees, i.e. it has three disjoint sets
of children: \ref irept::get_sub() returns a list of children, and
\ref irept::get_named_sub() and \ref irept::get_comments() each return an
association from names to children. **Most clients never use these
functions directly**, as subtypes of irept generally provide more
descriptive functions. For example, the operands of an
\ref exprt "expression" (\ref exprt::op0() "op0", op1 etc) are
really that expression's children; the
\ref code_assignt::lhs() "left-hand" and right-hand side of an
\ref code_assignt "assignment" are the children of that assignment.
The \ref irept::pretty() function provides a descriptive string
representation of an irep.
\ref irep_idt "irep_idts" ("identifiers") are strings that use sharing
to improve memory consumption. A common pattern is a map from irep_idts
to ireps. A goto-program contains a single symbol table (with a single
scope), meaning that the names of identifiers in the target program are
lightly mangled in order to make them globally unique. If there is an
identifier `foo` in the target program, the `name` field of `foo`'s
\ref symbolt "symbol" in the goto-program will be
* `foo` if it is global;
* <code>bar\::foo</code> if it is a parameter to a function `bar()`;
* <code>bar\::3\::foo</code> if it is a local variable in a function
`bar()`, where `3` is a counter that is incremented every time a
newly-scoped `foo` is encountered in that function.
The use of *sharing* to save memory is a pervasive design decision in
the implementation of ireps and identifiers. Sharing makes equality
comparisons fast (as there is no need to traverse entire trees), and
this is especially important given the large number of map lookups
throughout the codebase. More importantly, the use of sharing saves vast
amounts of memory, as there is plenty of duplication within the
goto-program data structures. For example, every statement, and every
sub-expression of a statement, contains a \ref source_locationt
that indicates the source file and location that it came from. Every
symbol in every expression has a field indicating its type and location;
etc. Although each of these are constructed as separate objects, the
values that they eventually point to are shared throughout the codebase,
decreasing memory consumption dramatically.
The Type Checking stage turns a parse tree into a
\ref symbol_tablet "symbol table". In this context, the 'symbols'
consist of code statements as well as what might more traditionally be
called symbols. Thus, for example:
* The statement `int foo = 11;` is converted into a symbol whose type is
integer_typet and value is the \ref constant_exprt
"constant expression" `11`; that symbol is stored in the symbol table
using the mangled name of `foo` as the key;
* The function definition `void foo(){ int x = 11; bar(); }` is
converted into a symbol whose type is \ref code_typet (not to be
confused with \ref typet or \ref codet!); the code_typet contains the
parameter and return types of the function. The value of the symbol is
the function's body (a \ref codet), and the symbol is stored in the
symbol table with `foo` as the key.

47
src/cbmc/module.md Normal file
View File

@ -0,0 +1,47 @@
\ingroup module_hidden
\defgroup module_cbmc CBMC tour
\author Kareem Khazem
CBMC takes C code or a goto-binary as input and tries to emit traces of
executions that lead to crashes or undefined behaviour. The diagram
below shows the intermediate steps in this process.
\dot
digraph G {
rankdir="TB";
node [shape=box, fontcolor=blue];
subgraph top {
rank=same;
1 -> 2 -> 3 -> 4;
}
subgraph bottom {
rank=same;
5 -> 6 -> 7 -> 8 -> 9;
}
/* shift bottom subgraph over */
9 -> 1 [color=white];
4 -> 5;
1 [label="command line\nparsing" URL="\ref cbmc_parse_optionst"];
2 [label="preprocessing,\nparsing" URL="\ref preprocessing"];
3 [label="language\ntype-checking" URL="\ref type-checking"];
4 [label="goto\nconversion" URL="\ref goto-conversion"];
5 [label="instrumentation" URL="\ref instrumentation"];
6 [label="symbolic\nexecution" URL="\ref symbolic-execution"];
7 [label="SAT/SMT\nencoding" URL="\ref sat-smt-encoding"];
8 [label="decision\nprocedure" URL="\ref decision-procedure"];
9 [label="counter example\nproduction" URL="\ref counter-example-production"];
}
\enddot
The \ref cprover-manual "CProver Manual" describes CBMC from a user
perspective. Each node in the diagram above links to the appropriate
class or module documentation, describing that particular stage in the
CBMC pipeline.

273
src/goto-programs/module.md Normal file
View File

@ -0,0 +1,273 @@
\ingroup module_hidden
\defgroup module_goto-programs Goto Conversion & Instrumentation
\author Kareem Khazem
\section goto-conversion Goto Conversion
In the \ref goto-programs directory.
**Key classes:**
* goto_programt
* goto_functionst
* \ref goto_program_templatet::instructiont
\dot
digraph G {
node [shape=box];
rankdir="LR";
1 [shape=none, label=""];
2 [label="goto conversion"];
3 [shape=none, label=""];
1 -> 2 [label="Symbol table"];
2 -> 3 [label="goto-programs, goto-functions, symbol table"];
}
\enddot
At this stage, CBMC constructs a goto-program from a symbol table. It
does not use the parse tree or the source file at all for this step. This
may seem surprising, because the symbols are stored in a hash table and
therefore have no intrinsic order; nevertheless, every \ref symbolt is
associated with a \ref source_locationt, allowing CBMC to figure out the
lexical order.
The structure of what is informally called a goto-program follows. The
entire target program is converted to a single \ref goto_functionst
object. The goto functions contains a set of \ref goto_programt objects;
each of these correspond to a "function" or "method" in the target
program. Each goto_program contains a list of
\ref goto_program_templatet::instructiont "instructions"; each
instruction contains an associated statement---these are subtypes of
\ref codet. Each instruction also contains a "target", which will be
empty for now.
\dot
digraph G{
graph [nojustify=true];
node [shape=box];
compound=true;
subgraph cluster_src {
1 [shape="none", label="source files"];
2 [label="file1.c\n-----------\nint main(){
int x = 5;
if(x > 7){
x = 9; } }
void foo(){}"];
1 -> 2 [color=white];
100 [label="file2.c\n--------\nchar bar(){ }"];
2 -> 100 [color=white];
}
1 -> 3 [label="corresponds to", lhead=cluster_goto,
ltail=cluster_src];
subgraph cluster_goto {
3 [label="a\ngoto_functionst", URL="\ref goto_functionst", shape=none];
4 [label="function_map\n(a map from irep_idt\nto goto_programt)"];
}
4 -> 5 [lhead=cluster_funmap, label="value:"];
subgraph cluster_funmap {
9 [label="bar\n(an irep_idt)", URL="\ref irep_idt"];
10 [label="a goto_programt", URL="\ref goto_programt"];
9->10 [label="maps to"];
5 [label="main\n(an irep_idt)", URL="\ref irep_idt"];
7 [label="foo\n(an irep_idt)", URL="\ref irep_idt"];
8 [label="a goto_programt", URL="\ref goto_programt"];
7->8 [label="maps to"];
subgraph cluster_goto_program {
11 [shape=none, label="a\ngoto_programt", URL="\ref goto_programt"];
12 [label="instructions\n(a list of instructiont)"];
}
5 -> 11 [lhead=cluster_goto_program, label="maps to:"];
}
12 -> target1 [lhead=cluster_instructions];
subgraph cluster_instructions {
subgraph cluster_ins1{
code1 [label="code", URL="\ref codet"];
target1 [label="target"];
}
target1 -> target2 [color=white,lhead=cluster_ins2,
ltail=cluster_ins1];
subgraph cluster_ins2{
code2 [label="code", URL="\ref codet"];
target2 [label="target"];
}
target2 -> target3 [color=white,lhead=cluster_ins3,
ltail=cluster_ins2];
subgraph cluster_ins3{
code3 [label="code", URL="\ref codet"];
target3 [label="target"];
}
target3 -> target4 [color=white,lhead=cluster_ins4,
ltail=cluster_ins3];
subgraph cluster_ins4{
code4 [label="code", URL="\ref codet"];
target4 [label="target"];
}
}
subgraph cluster_decl {
decl1 [label="type:\ncode_declt", URL="\ref code_declt",
shape=none];
subgraph cluster_decl_in{
cluster_decl_in_1 [label="symbol()", shape=none];
cluster_decl_in_2 [label="x"];
cluster_decl_in_1 -> cluster_decl_in_2 [color=white];
}
decl1 -> cluster_decl_in_1 [lhead="cluster_decl_in", color=white];
}
code1 -> decl1 [lhead=cluster_decl];
subgraph cluster_assign1 {
assign1 [label="type:\ncode_assignt", URL="\ref code_assignt",
shape=none];
subgraph cluster_assign1_in{
cluster_assign1_in_1 [label="lhs()", shape=none];
cluster_assign1_in_2 [label="x"];
cluster_assign1_in_1 -> cluster_assign1_in_2 [color=white];
cluster_assign1_in_3 [label="rhs()", shape=none];
cluster_assign1_in_4 [label="5"];
cluster_assign1_in_3 -> cluster_assign1_in_4 [color=white];
}
assign1 -> cluster_assign1_in_1 [lhead="cluster_assign1_in", color=white];
}
code2 -> assign1 [lhead=cluster_assign1];
subgraph cluster_if {
if [label="type:\ncode_ifthenelset", URL="\ref code_ifthenelset",
shape=none];
if_body [label="..."];
if -> if_body [color=white];
}
code3 -> if [lhead=cluster_if];
subgraph cluster_assign2 {
assign2 [label="type:\ncode_assignt", URL="\ref code_assignt",
shape=none];
subgraph cluster_assign2_in{
cluster_assign2_in_1 [label="lhs()", shape=none];
cluster_assign2_in_2 [label="x"];
cluster_assign2_in_1 -> cluster_assign2_in_2 [color=white];
cluster_assign2_in_3 [label="rhs()", shape=none];
cluster_assign2_in_4 [label="9"];
cluster_assign2_in_3 -> cluster_assign2_in_4 [color=white];
}
assign2 -> cluster_assign2_in_1 [lhead="cluster_assign2_in", color=white];
}
code4 -> assign2 [lhead=cluster_assign2];
}
\enddot
This is not the final form of the goto-functions, since the lists of
instructions will be 'normalized' in the next step (Instrumentation),
which removes some instructions and adds targets to others.
Note that goto_programt and goto_functionst are each template
instantiations; they are currently the *only* specialization of
goto_program_templatet and goto_functions_templatet, respectively. This
means that the generated Doxygen documentation can be somewhat obtuse
about the actual types of things, and is unable to generate links to the
correct classes. Note that the
\ref goto_program_templatet::instructiont::code "code" member of a
goto_programt's instruction has type \ref codet (its type in the
goto_program_templatet documentation is given as "codeT", as this is the
name of the template's type parameter); similarly, the type of a guard
of an instruction is \ref guardt.
---
\section instrumentation Instrumentation
In the \ref goto-programs directory.
**Key classes:**
* goto_programt
* goto_functionst
* \ref goto_program_templatet::instructiont
\dot
digraph G {
node [shape=box];
rankdir="LR";
1 [shape=none, label=""];
2 [label="goto conversion"];
3 [shape=none, label=""];
1 -> 2 [label="goto-programs, goto-functions, symbol table"];
2 -> 3 [label="transformed goto-programs"];
}
\enddot
This stage applies several transformations to the goto-programs from the
previous stage:
* The diagram in the previous stage showed a goto_programt with four
instructions, but real programs usually yield instruction lists that
are littered with \ref code_skipt "skip" statements. The
instrumentation stage removes the majority of these.
* Function pointers are removed. They are turned into switch statements
(but see the next point; switch statements are further transformed).
* Compound blocks are eliminated. There are several subclasses of
\ref codet that count as 'compound blocks;' therefore, later stages in
the CBMC pipeline that switch over the codet subtype of a particular
instruction should not need to consider these types. In particular:
* code_ifthenelset is turned into GOTOs. In particular, the bodies of
the conditionals are flattened into lists of instructions, inline
with the rest of the instruction in the goto_programt. The guard of
the conditional is placed onto the
\ref goto_program_templatet::instructiont::guard "guard" member of
an instruction whose code member is of type \ref code_gotot. The
\ref goto_program_templatet::instructiont::targets "targets" member
of that instruction points to the appropriate branch of the
conditional. (Note that although instructions have a list of
targets, in practice an instruction should only ever have at most
one target; you should check this invariant with an assertion if you
rely on it).
The order of instructions in a list of instructions---as well as the
targets of GOTOs---are both displayed as arrows when viewing a
goto-program as a Graphviz DOT file with `goto-instrument --dot`.
The semantics of a goto-program is: the next instruction is the next
instruction in the list, unless the current instruction has a
target; in that case, check the guard of the instruction, and jump
to the target if the guard evaluates to true.
* switch statements, for and while loops, and try-catches are also
transformed into lists of instructions guarded by GOTOs.
* \ref code_blockt "code blocks" are transformed into lists of
instructions.
* \ref code_returnt "return statements" are transformed into
(unconditional) GOTOs whose target is the \ref END_FUNCTION
instruction. Each goto_programt should have precisely one such
instruction. Note the presence of \ref code_deadt, which has a
\ref code_deadt::symbol() "symbol()" member. Deads mark symbols that
have just gone out of scope; typically, a GOTO that jumps to an
END_FUNCTION instruction is preceded by a series of deads. Deads also
follow sequences of instructions that were part of the body of a
block (loop, conditional etc.) if there were symbols declared in that
block.
This stage concludes the *analysis-independent* program transformations.

44
src/goto-symex/module.md Normal file
View File

@ -0,0 +1,44 @@
\ingroup module_hidden
\defgroup module_goto-symex Symbolic Execution & Counterexample Production
\author Kareem Khazem
**Key classes:**
* goto_symex_statet
* goto_symext
\dot
digraph G {
node [shape=box];
rankdir="LR";
1 [shape=none, label=""];
2 [label="goto conversion"];
3 [shape=none, label=""];
1 -> 2 [label="goto-programs, goto-functions, symbol table"];
2 -> 3 [label="equations"];
}
\enddot
---
\section counter-example-production Counter Example Production
In the \ref goto-symex directory.
**Key classes:**
* symex_target_equationt
* prop_convt
* \ref bmct
* fault_localizationt
* counterexample_beautificationt
\dot
digraph G {
node [shape=box];
rankdir="LR";
1 [shape=none, label=""];
2 [label="goto conversion"];
3 [shape=none, label=""];
1 -> 2 [label="solutions"];
2 -> 3 [label="counter-examples"];
}
\enddot

49
src/solvers/module.md Normal file
View File

@ -0,0 +1,49 @@
\ingroup module_hidden
\defgroup module_solvers SAT/SMT Encoding and Decision Procedure
\author Kareem Khazem
\section sat-smt-encoding SAT/SMT Encoding
In the \ref solvers directory.
**Key classes:**
* \ref literalt
* \ref boolbvt
* \ref propt
\dot
digraph G {
node [shape=box];
rankdir="LR";
1 [shape=none, label=""];
2 [label="goto conversion"];
3 [shape=none, label=""];
1 -> 2 [label="equations"];
2 -> 3 [label="propositional variables as bitvectors, constraints"];
}
\enddot
---
\section decision-procedure Decision Procedure
In the \ref solvers directory.
**Key classes:**
* symex_target_equationt
* \ref propt
* \ref bmct
\dot
digraph G {
node [shape=box];
rankdir="LR";
1 [shape=none, label=""];
2 [label="goto conversion"];
3 [shape=none, label=""];
1 -> 2 [label="propositional variables as bitvectors, constraints"];
2 -> 3 [label="solutions"];
}
\enddot

View File

@ -0,0 +1,14 @@
\ingroup module_hidden
\defgroup module_command-line-parsing Command Line Parsing
\dot
digraph G {
node [shape=box];
rankdir="LR";
1 [shape=none, label=""];
2 [label="command line parsing"];
3 [shape=none, label=""];
1 -> 2 [label="C files or goto-binaries"];
2 -> 3 [label="Command line options, file names"];
}
\enddot