2 \page cbmc-guide CBMC Guide
9 First off; read the \ref cprover-manual "CProver Manual". It describes
10 how to get, build and use CBMC and SATABS. This document covers the
11 internals of the system and how to get started on development.
16 Apart from the (user-orientated) CPROVER manual and this document, most
17 of the rest of the documentation is inline in the code as `doxygen` and
18 some comments. A man page for CBMC, goto-cc and goto-instrument is
19 contained in the `doc/` directory and gives some options for these
20 tools. All of these could be improved and patches are very welcome. In
21 some cases the algorithms used are described in the relevant papers.
26 CPROVER is structured in a similar fashion to a compiler. It has
27 language specific front-ends which perform limited syntactic analysis
28 and then convert to an intermediate format. The intermediate format can
29 be output to files (this is what `goto-cc` does) and are (informally)
30 referred to as “goto binaries” or “goto programs”. The back-end are
31 tools process this format, either directly from the front-end or from
32 it’s saved output. These include a wide range of analysis and
33 transformation tools (see Section \[section:other-apps\]).
38 CPROVER is written in a fairly minimalist subset of C++; templates and
39 meta-programming are avoided except where necessary. The standard
40 library is used but in many cases there are alternatives provided in
41 `util/` (see Section \[section:util\]) which are preferred. Boost is
44 Patches should be formatted so that code is indented with two space
45 characters, not tab and wrapped to 75 or 72 columns. Headers for doxygen
46 should be given (and preferably filled!) and the author will be the
47 person who first created the file.
49 Identifiers should be lower case with underscores to separate words.
50 Types (classes, structures and typedefs) names must[^1] end with a `t`.
51 Types that model types (i.e. C types in the program that is being
52 interpreted) are named with `_typet`. For example `ui_message_handlert`
53 rather than `UI_message_handlert` or `UIMessageHandler` and
59 Fixes, changes and enhancements to the CPROVER code base should be
60 developed against the `trunk` version and submitted to Daniel as patches
61 produced by `diff -Naur` or `svn diff`. Entire applications are best
62 developed independently (`git svn` is a popular choice for tracking the
63 main trunk but also having local development) until it is clear what
64 their utility, future and maintenance is likely to be.
66 Other Useful Code {#section:other-apps}
69 The CPROVER subversion archive contains a number of separate programs.
70 Others are developed separately as patches or separate
71 branches.Interfaces are have been and are continuing to stablise but
72 older code may require work to compile and function correctly.
76 * `CBMC`: A bounded model checking tool for C and C++. See Section
79 * `goto-cc`: A drop-in, flag compatible replacement for GCC and other
80 compilers that produces goto-programs rather than executable binaries.
81 See Section \[section:goto-cc\].
83 * `goto-instrument`: A collection of functions for instrumenting and
84 modifying goto-programs. See Section \[section:goto-instrument\].
86 Model checkers and similar tools:
88 * `SatABS`: A CEGAR model checker using predicate abstraction. Is
89 roughly 10,000 lines of code (on top of the CPROVER code base) and is
90 developed in its own subversion archive. It uses an external model
91 checker to find potentially feasible paths. Key limitations are
92 related to code with pointers and there is scope for significant
95 * `Scratch`: Alistair Donaldson’s k-induction based tool. The
96 front-end is in the old project CVS and some of the functionality is
99 * `Wolverine`: An implementation of Ken McMillan’s IMPACT algorithm
100 for sequential programs. In the old project CVS.
102 * `C-Impact`: An implementation of Ken McMillan’s IMPACT algorithm for
103 parallel programs. In the old project CVS.
105 * `LoopFrog`: A loop summarisation tool.
107 * `???`: Christoph’s termination analyser.
109 Test case generation:
111 * `cover`: A basic test-input generation tool. In the old
114 * `FShell`: A test-input generation tool that allows the user to
115 specify the desired coverage using a custom language (which includes
116 regular expressions over paths). It uses incremental SAT and is thus
117 faster than the naïve “add assertions one at a time and use the
118 counter-examples” approach. Is developed in its own subversion.
120 Alternative front-ends and input translators:
122 * `Scoot`: A System-C to C translator. Probably in the old
125 * `???`: A Simulink to C translator. In the old project CVS.
127 * `???`: A Verilog front-end. In the old project CVS.
129 * `???`: A converter from Codewarrior project files to Makefiles. In
134 * `ai`: Leo’s hybrid abstract interpretation / CEGAR tool.
136 * `DeltaCheck?`: Ajitha’s slicing tool, aimed at locating changes and
137 differential verification. In the old project CVS.
139 There are tools based on the CPROVER framework from other research
140 groups which are not listed here.
145 This section walks through the code bases in a rough order of interest /
146 comprehensibility to the new developer.
151 At the moment just contains the CBMC man page.
156 The regression tests are currently being moved from CVS. The
157 `regression/` directory contains all of those that have
158 been moved. They are grouped into directories for each of the tools.
159 Each of these contains a directory per test case, containing a C or C++
160 file that triggers the bug and a `.dsc` file that describes
161 the tests, expected output and so on. There is a Perl script,
162 `test.pl` that is used to invoke the tests as:
164 ../test.pl -c PATH_TO_CBMC
166 The `–help` option gives instructions for use and the
167 format of the description files.
172 The source code is divided into a number of sub-directories, each
173 containing the code for a different part of the system. In the top level
174 files there are only a few files:
176 * `config.inc`: The user-editable configuration parameters for the
177 build process. The main use of this file is setting the paths for the
178 various external SAT solvers that are used. As such, anyone building
179 from source will likely need to edit this.
181 * `Makefile`: The main systems Make file. Parallel builds are
182 supported and encouraged; please don’t break them!
184 * `common`: System specific magic required to get the system to build.
185 This should only need to be edited if porting CBMC to a new platform /
188 * `doxygen.cfg`: The config file for doxygen.cfg
190 ### `util/` {#section:util}
192 `util/` contains the low-level data structures and
193 manipulation functions that are used through-out the CPROVER code-base.
194 For almost any low-level task, the code required is probably in
195 `util/`. Key files include:
197 * `irep.h`: This contains the definition of `irept`, the basis of many
198 of the data structures in the project. They should not be used
199 directly; one of the derived classes should be used. For more
200 information see Section \[section:irept\].
202 * `expr.h`: The parent class for all of the expressions. Provides a
203 number of generic functions, `exprt` can be used with these but when
204 creating data, subclasses of `exprt` should be used.
206 * `std_expr.h`: Provides subclasses of `exprt` for common kinds of
207 expression for example `plus_exprt`, `minus_exprt`,
208 `dereference_exprt`. These are the intended interface for creating
211 * `std_types.h`: Provides subclasses of `typet` (a subclass of
212 `irept`) to model C and C++ types. This is one of the preferred
213 interfaces to `irept`. The front-ends handle type promotion and most
214 coercision so the type system and checking goto-programs is simpler
217 * `dstring.h`: The CPROVER string class. This enables sharing between
218 strings which significantly reduces the amount of memory required and
219 speeds comparison. `dstring` should not be used directly, `irep_idt`
220 should be used instead, which (dependent on build options) is an alias
223 * `mp_arith.h`: The wrapper class for multi-precision arithmetic
224 within CPROVER. Also see `arith_tools.h`.
226 * `ieee_float.h`: The arbitrary precision float model used within
227 CPROVER. Based on `mp_integer`s.
229 * `context.h`: A generic container for symbol table like constructs
230 such as namespaces. Lookup gives type, location of declaration, name,
231 ‘pretty name’, whether it is static or not.
233 * `namespace.h`: The preferred interface for the context class. The
234 key function is `lookup` which converts a string (`irep_idt`) to a
235 symbol which gives the scope of declaration, type and so on. This
236 works for functions as well as variables.
240 This contains the basic interfaces and support classes for programming
241 language front ends. Developers only really need look at this if they
242 are adding support for a new language. It’s main users are the two (in
243 trunk) language front-ends; `ansi-c/` and
248 Contains the front-end for ANSI C, plus a variety of common extensions.
249 This parses the file, performs some basic sanity checks (this is one
250 area in which the UI could be improved; patches most welcome) and then
251 produces a goto-program (see below). The parser is a traditional Flex /
254 `internal_addition.c` contains the implementation of various ‘magic’
255 functions that are that allow control of the analysis from the source
256 code level. These include assertions, assumptions, atomic blocks, memory
257 fences and rounding modes.
259 The `library/` subdirectory contains versions of some of the C standard
260 header files that make use of the CPROVER built-in functions. This
261 allows CPROVER programs to be ‘aware’ of the functionality and model it
262 correctly. Examples include `stdio.c`, `string.c`, `setjmp.c` and
263 various threading interfaces.
267 This directory contains the C++ front-end. It supports the subset of C++
268 commonly found in embedded and system applications. Consequentially it
269 doesn’t have full support for templates and many of the more advanced
270 and obscure C++ features. The subset of the language that can be handled
271 is being extended over time so bug reports of programs that cannot be
274 The functionality is very similar to the ANSI C front end; parsing the
275 code and converting to goto-programs. It makes use of code from
276 `langapi` and `ansi-c`.
280 Goto programs are the intermediate representation of the CPROVER tool
281 chain. They are language independent and similar to many of the compiler
282 intermediate languages. Section \[section:goto-programs\] describes the
283 `goto_programt` and `goto_functionst` data structures in detail. However
284 it useful to understand some of the basic concepts. Each function is a
285 list of instructions, each of which has a type (one of 18 kinds of
286 instruction), a code expression, a guard expression and potentially some
287 targets for the next instruction. They are not natively in static
288 single-assign (SSA) form. Transitions are nondeterministic (although in
289 practise the guards on the transitions normally cover form a disjoint
290 cover of all possibilities). Local variables have non-deterministic
291 values if they are not initialised. Variables and data within the
292 program is commonly one of three types (parameterised by width):
293 `unsignedbv_typet`, `signedbv_typet` and `floatbv_typet`, see
294 `util/std_types.h` for more information. Goto programs can be serialised
295 in a binary (wrapped in ELF headers) format or in XML (see the various
296 `_serialization` files).
298 The `cbmc` option `–show-goto-programs` is often a good starting point
299 as it outputs goto-programs in a human readable form. However there are
300 a few things to be aware of. Functions have an internal name (for
301 example `c::f00`) and a ‘pretty name’ (for example `f00`) and which is
302 used depends on whether it is internal or being presented to the user.
303 The `main` method is the ‘logical’ main which is not necessarily the
304 main method from the code. In the output `NONDET` is use to represent a
305 nondeterministic assignment to a variable. Likewise `IF` as a beautified
306 `GOTO` instruction where the guard expression is used as the condition.
307 `RETURN` instructions may be dropped if they precede an `END_FUNCTION`
308 instruction. The comment lines are generated from the `locationt` field
309 of the `instructiont` structure.
311 `goto-programs/` is one of the few places in the CPROVER codebase that
312 templates are used. The intention is to allow the general architecture
313 of program and functions to be used for other formalisms. At the moment
314 most of the templates have a single instantiation; for example
315 `goto_functionst` and `goto_function_templatet` and `goto_programt` and
316 `goto_program_templatet`.
320 This directory contains a symbolic evaluation system for goto-programs.
321 This takes a goto-program and translates it to an equation system by
322 traversing the program, branching and merging and unwinding loops as
323 needed. Each reverse goto has a separate counter (the actual counting is
324 handled by `cbmc`, see the `–unwind` and `–unwind-set` options). When a
325 counter limit is reach, an assertion can be added to explicitly show
326 when analysis is incomplete. The symbolic execution includes constant
327 folding so loops that have a constant number of iterations will be
328 handled completely (assuming the unwinding limit is sufficient).
330 The output of the symbolic execution is a system of equations; an object
331 containing a list of `symex_target_elements`, each of which are
332 equalities between `expr` expressions. See `symex_target_equation.h`.
333 The output is in static, single assignment (SSA) form, which is *not*
334 the case for goto-programs.
336 ### `pointer-analysis/`
338 To perform symbolic execution on programs with dereferencing of
339 arbitrary pointers, some alias analysis is needed. `pointer-analysis`
340 contains the three levels of analysis; flow and context insensitive,
341 context sensitive and flow and context sensitive. The code needed is
342 subtle and sophisticated and thus there may be bugs.
346 The `solvers/` directory contains interfaces to a number of
347 different decision procedures, roughly one per directory.
349 * prop/: The basic and common functionality. The key file is
350 `prop_conv.h` which defines `prop_convt`. This is the base class that
351 is used to interface to the decision procedures. The key functions are
352 `convert` which takes an `exprt` and converts it to the appropriate,
353 solver specific, data structures and `dec_solve` (inherited from
354 `decision_proceduret`) which invokes the actual decision procedures.
355 Individual decision procedures (named `*_dect`) objects can be created
356 but `prop_convt` is the preferred interface for code that uses them.
358 * flattening/: A library that converts operations to bit-vectors,
359 including calling the conversions in `floatbv` as necessary. Is
360 implemented as a simple conversion (with caching) and then a
361 post-processing function that adds extra constraints. This is not used
362 by the SMT or CVC back-ends.
364 * dplib/: Provides the `dplib_dect` object which used the decision
365 procedure library from “Decision Procedures : An Algorithmic Point of
368 * cvc/: Provides the `cvc_dect` type which interfaces to the old (pre
369 SMTLib) input format for the CVC family of solvers. This format is
370 still supported by depreciated in favour of SMTLib 2.
372 * smt1/: Provides the `smt1_dect` type which converts the formulae to
373 SMTLib version 1 and then invokes one of Boolector, CVC3, OpenSMT,
374 Yices, MathSAT or Z3. Again, note that this format is depreciated.
376 * smt2/: Provides the `smt2_dect` type which functions in a similar
377 way to `smt1_dect`, calling Boolector, CVC3, MathSAT, Yices or Z3.
378 Note that the interaction with the solver is batched and uses
379 temporary files rather than using the interactive command supported by
380 SMTLib 2. With the `–fpa` option, this output mode will not flatten
381 the floating point arithmetic and instead output the proposed SMTLib
382 floating point standard.
384 * qbf/: Back-ends for a variety of QBF solvers. Appears to be no
385 longer used or maintained.
387 * sat/: Back-ends for a variety of SAT solvers and DIMACS output.
389 ### `cbmc/` {#section:CBMC}
391 This contains the first full application. CBMC is a bounded model
392 checker that uses the front ends (`ansi-c`, `cpp`, goto-program or
393 others) to create a goto-program, `goto-symex` to unwind the loops the
394 given number of times and to produce and equation system and finally
395 `solvers` to find a counter-example (technically, `goto-symex` is then
396 used to construct the counter-example trace).
398 ### `goto-cc/` {#section:goto-cc}
400 `goto-cc` is a compiler replacement that just performs the first step of
401 the process; converting C or C++ programs to goto-binaries. It is
402 intended to be dropped in to an existing build procedure in place of the
403 compiler, thus it emulates flags that would affect the semantics of the
404 code produced. Which set of flags are emulated depends on the naming of
405 the `goto-cc/` binary. If it is called `goto-cc` then it emulates GCC
406 flags, `goto-armcc` emulates the ARM compiler, `goto-cl` emulates VCC
407 and `goto-cw` emulates the Code Warrior compiler. The output of this
408 tool can then be used with `cbmc` or `goto-instrument`.
410 ### `goto-instrument/` {#section:goto-instrument}
412 The `goto-instrument/` directory contains a number of tools, one per
413 file, that are built into the `goto-instrument` program. All of them
414 take in a goto-program (produced by `goto-cc`) and either modify it or
415 perform some analysis. Examples include `nondet_static.cpp` which
416 initialises static variables to a non-deterministic value,
417 `nondet_volatile.cpp` which assigns a non-deterministic value to any
418 volatile variable before it is read and `weak_memory.h` which performs
419 the necessary transformations to reason about weak memory models. The
420 exception to the “one file for each piece of functionality” rule are the
421 program instrumentation options (mostly those given as “Safety checks”
422 in the `goto-instrument` help text) which are included in the
423 `goto-program/` directory. An example of this is
424 `goto-program/stack_depth.h` and the general rule seems to be that
425 transformations and instrumentation that `cbmc` uses should be in
426 `goto-program/`, others should be in `goto-instrument`.
428 `goto-instrument` is a very good template for new analysis tools. New
429 developers are advised to copy the directory, remove all files apart
430 from `main.*`, `parseoptions.*` and the `Makefile` and use these as the
431 skeleton of their application. The `doit()` method in `parseoptions.cpp`
432 is the preferred location for the top level control for the program.
436 Probably the code to emulate a linker. This allows multiple ‘object
437 files’ (goto-programs) to be linked into one ‘executable’ (another
438 goto-program), thus allowing existing build systems to be used to build
439 complete goto-program binaries.
443 CPROVER is distributed with its own multi-precision arithmetic library;
444 mainly for historical and portability reasons. The library is externally
445 developed and thus `big-int` contains the source as it is distributed.
446 This should not be used directly, see `util/mp_arith.h` for the CPROVER
451 CPROVER has optional XML output for results and there is an XML format
452 for goto-programs. It is used to interface to various IDEs. The
453 `xmllang/` directory contains the parser and helper functions for
454 handling this format.
458 This library contains the code that is used to convert floating point
459 variables (`floatbv`) to bit vectors (`bv`). This is referred to as
460 ‘bit-blasting’ and is called in the `solver` code during conversion to
461 SAT or SMT. It also contains the abstraction code described in the
467 This section discusses some of the key data-structures used in the
470 `irept` {#section:irept}
471 ------------------------
473 There are a large number of kind of tree structured or tree-like data in
474 CPROVER. `irept` provides a single, unified representation for all of
475 these, allowing structure sharing and reference counting of data. As
476 such `irept` is the basic unit of data in CPROVER. Each `irept`
477 contains[^2] a basic unit of data (of type `dt`) which contains four
480 * `data`: A string[^3], which is returned when the `id()` function is
483 * `named_sub`: A map from `irep_namet` (a string) to an `irept`. This
484 is used for named children, i.e. subexpressions, parameters, etc.
486 * `comments`: Another map from `irep_namet` to `irept` which is used
487 for annotations and other ‘non-semantic’ information
489 * `sub`: A vector of `irept` which is used to store ordered but
492 The `irept::pretty` function outputs the contents of an `irept` directly
493 and can be used to understand an debug problems with `irept`s.
495 On their own `irept`s do not “mean” anything; they are effectively
496 generic tree nodes. Their interpretation depends on the contents of
497 result of the `id` function (the `data`) field. `util/irep_ids.txt`
498 contains the complete list of `id` values. During the build process it
499 is used to generate `util/irep_ids.h` which gives constants for each id
500 (named `ID_`). These can then be used to identify what kind of data
501 `irept` stores and thus what can be done with it.
503 To simplify this process, there are a variety of classes that inherit
504 from `irept`, roughly corresponding to the ids listed (i.e. `ID_or`
505 (the string `"or”`) corresponds to the class `or_exprt`). These give
506 semantically relevant accessor functions for the data; effectively
507 different APIs for the same underlying data structure. None of these
508 classes add fields (only methods) and so static casting can be used. The
509 inheritance graph of the subclasses of `irept` is a useful starting
510 point for working out how to manipulate data.
512 There are three main groups of classes (or APIs); those derived from
513 `typet`, `codet` and `exprt` respectively. Although all of these inherit
514 from `irept`, these are the most abstract level that code should handle
515 data. If code is manipulating plain `irept`s then something is wrong
516 with the architecture of the code.
518 Many of the key descendent of `exprt` are declared in `std_expr.h`. All
519 expressions have a named subfield / annotation which gives the type of
520 the expression (slightly simplified from C/C++ as `unsignedbv_typet`,
521 `signedbv_typet`, `floatbv_typet`, etc.). All type conversions are
522 explicit with an expression with `id() == ID_typecast` and an ‘interface
523 class’ named `typecast_exprt`. One key descendent of `exprt` is
524 `symbol_exprt` which creates `irept` instances with the id of “symbol”.
525 These are used to represent variables; the name of which can be found
526 using the `get_identifier` accessor function.
528 `codet` inherits from `exprt` and is defined in `std_code.h`. They
529 represent executable code; statements in C rather than expressions. In
530 the front-end there are versions of these that hold whole code blocks,
531 but in goto-programs these have been flattened so that each `irept`
532 represents one sequence point (almost one line of code / one
533 semi-colon). The most common descendents of `codet` are `code_assignt`
534 so a common pattern is to cast the `codet` to an assignment and then
535 recurse on the expression on either side.
537 `goto-programs` {#section:goto-programs}
538 ----------------------------------------
540 The common starting point for working with goto-programs is the
541 `read_goto_binary` function which populates an object of
542 `goto_functionst` type. This is defined in `goto_functions.h` and is an
543 instantiation of the template `goto_functions_templatet` which is
544 contained in `goto_functions_template.h`. They are wrappers around a map
545 from strings to `goto_programt`’s and iteration macros are provided.
546 Note that `goto_function_templatet` (no `s`) is defined in the same
547 header as `goto_functions_templatet` and is gives the C type for the
548 function and Boolean which indicates whether the body is available
549 (before linking this might not always be true). Also note the slightly
550 counter-intuitive naming; `goto_functionst` instances are the top level
551 structure representing the program and contain `goto_programt` instances
552 which represent the individual functions. At the time of writing
553 `goto_functionst` is the only instantiation of the template
554 `goto_functions_templatet` but other could be produced if a different
555 data-structures / kinds of models were needed for functions.
557 `goto_programt` is also an instantiation of a template. In a similar
558 fashion it is `goto_program_templatet` and allows the types of the guard
559 and expression used in instructions to be parameterised. Again, this is
560 currently the only use of the template. As such there are only really
561 helper functions in `goto_program.h` and thus `goto_program_template.h`
562 is probably the key file that describes the representation of (C)
563 functions in the goto-program format. It is reasonably stable and
564 reasonably documented and thus is a good place to start looking at the
567 An instance of `goto_program_templatet` is effectively a list of
568 instructions (and inner template called `instructiont`). It is important
569 to use the copy and insertion functions that are provided as iterators
570 are used to link instructions to their predecessors and targets and
571 careless manipulation of the list could break these. Likewise there are
572 helper macros for iterating over the instructions in an instance of
573 `goto_program_templatet` and the use of these is good style and strongly
576 Individual instructions are instances of type `instructiont`. They
577 represent one step in the function. Each has a type, an instance of
578 `goto_program_instruction_typet` which denotes what kind of instruction
579 it is. They can be computational (such as `ASSIGN` or `FUNCTION_CALL`),
580 logical (such as `ASSUME` and `ASSERT`) or informational (such as
581 `LOCATION` and `DEAD`). At the time of writing there are 18 possible
582 values for `goto_program_instruction_typet` / kinds of instruction.
583 Instructions also have a guard field (the condition under which it is
584 executed) and a code field (what the instruction does). These may be
585 empty depending on the kind of instruction. In the default
586 instantiations these are of type `exprt` and `codet` respectively and
587 thus covered by the previous discussion of `irept` and its descendents.
588 The next instructions (remembering that transitions are guarded by
589 non-deterministic) are given by the list `targets` (with the
590 corresponding list of labels `labels`) and the corresponding set of
591 previous instructions is get by `incoming_edges`. Finally `instructiont`
592 have informational `function` and `location` fields that indicate where
593 they are in the code.
595 [^1]: There are a couple of exceptions, including the graph classes
597 [^2]: Or references, if reference counted data sharing is enabled. It is
598 enabled by default; see the `SHARING` macro.
600 [^3]: When `USE_DSTRING` is enabled (it is by default), this is actually
601 a `dstring` and thus an integer which is a reference into a string table