\page background-concepts Background Concepts

\author Martin Brain, Peter Schrammel, Johannes Kloos

The purpose of this section is to explain several key concepts used throughout
the CPROVER framework at a high level, ignoring the details of the actual
implementation. In particular, we will discuss different ways to represent
programs in memory, two important analysis methods and some commonly used
terms.

\section representations_section Representations

One of the first questions we should be considering is how we represent programs
in such a way that we can easily analyze and reason about them.

As it turns out, the best way to do this is to use a variety of
different representations, each representing a different level of
abstraction. These representations are designed in such a way that for
each analysis we want to perform, there is an appropriate
representation, and it is easy to go from representations that are
close to the source code to representations that focus on specific
semantic aspects of the program.

The representations that the CPROVER framework uses mirror those
used in modern compilers such as LLVM and gcc. We will point out the
places where the CPROVER framework does things differently, attempting
to give rationales wherever possible.

One in-depth resource for most of this section is the classic
[compiler construction text book ''Compilers: Principles, Techniques and Tools''](https://en.wikipedia.org/wiki/Compilers:_Principles,_Techniques,_and_Tools)
by Aho, Lam, Sethi and Ullman.

To illustrate the different concepts, we will consider a small example
program. While the program is in C, the general ideas apply to other
languages as well - see later sections of this manual to understand how
the specific features of those languages are handled. Our running
example will be a program that calculates factorials.

```c
/*
 * For simplicity's sake, we just give the forward
 * declarations of atoi and printf.
 */
int atoi(const char *);
int printf(const char *, ...);

unsigned long factorial(unsigned n) {
  unsigned long fac = 1;
  for (unsigned int i = 1; i <= n; i++) {
    fac *= i;
  }
  return fac;
}

/* Error handling elided - this is just for illustration. */
int main(int argc, const char **argv) {
  unsigned n = atoi(argv[1]);
  printf("%u! = %lu\n", n, factorial(n));
  return 0;
}
```

The question of this first section is: how do we represent this program
in memory so that we can do something useful with it?

One possibility would be to just store the program as a string, but this
is clearly impractical: even finding whether there is an assignment to a
specific variable would require significant parsing effort. For this
reason, the first step is to parse the program text into a more abstract
representation.

\subsection AST_section AST

The first step in representing a program in memory is to parse the
program, at the same time checking for syntax errors, and store the
parsing result in memory.

The key data structure that stores the result of this step is known as an
**Abstract Syntax Tree**, or **AST** for short
(cf. [Wikipedia](https://en.wikipedia.org/wiki/Abstract_syntax_tree)).
ASTs are still relatively
close to the source code, and represent the structure of the source code
while abstracting away from syntactic details, e.g., dropping
parentheses, semicolons and braces as long as those are only used for
grouping.

Considering the example of the C program given above, we first notice
that the program describes (in C terms) a single *translation unit*,
consisting of four top-level *declarations* (the two function forward
declarations of `atoi` and `printf`, and the function definitions of
`factorial` and `main`). Let us start considering the specification of
`atoi`. This gives rise to a subtree modeling that we have a function
called `atoi` whose return type is `int`, with an unnamed argument of type
`const char *`. We can represent this using a tree that has, for instance, the
following structure (this is a simplified version of the tree that the
CPROVER framework actually produces):

\dot "AST for the `atoi` declaration"
digraph {
global [label="Global - code\nType: int\nName: atoi",shape=rectangle]
parameters [label="Parameters",shape=rectangle]
parameter0 [label="Parameter\nType: const char *",shape=rectangle]
global -> parameters -> parameter0
}
\enddot

This graph shows the (simplified) AST structure for the `atoi` function.
The top level says that this is a global entity, namely one that has
code (i.e., a function), called `atoi` and yielding `int`. Furthermore,
it has a child node initiating a parameter list, and there is a node in
the parameter list for each parameter, giving its type and name, if a
name is given.
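
To make this more concrete, here is a minimal sketch of how such tree nodes
could be laid out in C. The field names and the flat, string-based encoding
are purely illustrative and much simpler than the data structures the CPROVER
framework actually uses:

```c
/* A minimal AST node - illustrative only, not the CPROVER type. */
struct ast_node {
  const char *kind;           /* e.g. "global-code", "parameters", "parameter" */
  const char *type;           /* e.g. "int" or "const char *", NULL if absent */
  const char *name;           /* e.g. "atoi", NULL for unnamed parameters */
  struct ast_node **children; /* ordered child nodes */
  unsigned num_children;
};
```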

Extending this idea, we can represent the structure of the `factorial`
function using ASTs. The idea here is that the code itself has a
hierarchical structure. In the case of C, this starts with the block
structure of the code: at the top, we start with a block of code, having
three children, each being a ''statement'' node:
1. `unsigned long fac = 1`
2. `for (unsigned int i = 1; i <= n; i++) { fac *= i }`
3. `return fac`

The first statement is already a basic statement: we represent it as a
local declaration (similar to the global declarations above) of a
variable `fac` of type `unsigned long`, with initial value 1:

\dot "AST for `unsigned long fac = 1`"
digraph {
global [label="Code\nStatement: Declaration\nType: unsigned long\nName: fac\nInitial Value: 1",shape=rectangle]
}
\enddot

The second statement is a compound statement, which we can decompose
further. At the top level, we have a node stating that this is a `for`
statement, with four child nodes:
1. Another declaration node, declaring variable `i`.
2. An expression node, with operator `<=` and two children
   giving the LHS as variable `i` and the RHS as variable `n`.
3. An expression node with post-fix operator `++` and a child
   giving the variable `i` as argument.
4. A block node, starting a new code block. This node has one child:
   1. An expression node with top-level operator `*=` and two child
      nodes giving the LHS as variable `fac` and the RHS as variable
      `i`.

All in all, the AST for this piece of code looks like this:

\dot "AST for the `for` loop"
digraph {
node [shape=rectangle]
for [label="Code\nStatement: for"]
decli [label="Code\nStatement: Declaration\nType: unsigned\nName: i\nInitial Value: 1"]
cond [label="Expression: <="]
condlhs [label="Expression: Variable i"]
condrhs [label="Expression: Variable n"]
inc [label="Code\nExpression: ++ - postfix"]
incarg [label="Expression: Variable i"]
body [label="Code\nStatement: Block"]
mult [label="Code\nExpression: *="]
multlhs [label="Expression: Variable fac"]
multrhs [label="Expression: Variable i"]
for -> {decli,cond,inc,body}
cond -> {condlhs,condrhs}
inc -> incarg
body -> mult -> {multlhs, multrhs}
}
\enddot

Finally, the third statement is again simple: it consists of a return
statement node, with a child node for the variable expression `fac`.
Since the AST is very similar to the first AST above, we omit it here.
All in all, the AST for the function body looks like this:

\dot "AST for the body of `factorial`"
digraph {
node [shape=rectangle]
factorial [label="Code\nStatement: Block"]
declfac [label="Code\nStatement: Declaration\nType: unsigned long\nName: fac\nInitial Value: 1"]
for [label="Code\nStatement: for"]
decli [label="Code\nStatement: Declaration\nType: unsigned\nName: i\nInitial Value: 1"]
cond [label="Expression: <="]
condlhs [label="Expression: Variable i"]
condrhs [label="Expression: Variable n"]
inc [label="Code\nExpression: ++ - postfix"]
incarg [label="Expression: Variable i"]
body [label="Code\nStatement: Block"]
mult [label="Code\nExpression: *="]
multlhs [label="Expression: Variable fac"]
multrhs [label="Expression: Variable i"]
return [label="Code\nStatement: return"]
returnexpr [label="Expression: Variable fac"]
factorial -> {declfac, for, return}
for -> {decli,cond,inc,body}
cond -> {condlhs,condrhs}
inc -> incarg
body -> mult -> {multlhs, multrhs}
return -> returnexpr
}
\enddot

Using the AST for the function body, we can easily produce the
definition of the `factorial` function:

\dot "AST for `factorial` (full definition)"
digraph {
node [shape=rectangle]
globalfactorial [label="Global - code\nType: unsigned long\nName: factorial"]
parameters [label="Parameters"]
parameter [label="Parameter\nType: unsigned\nName: n"]
factorial [label="Code\nStatement: Block"]
declfac [label="Code\nStatement: Declaration\nType: unsigned long\nName: fac\nInitial Value: 1"]
for [label="Code\nStatement: for"]
decli [label="Code\nStatement: Declaration\nType: unsigned\nName: i\nInitial Value: 1"]
cond [label="Expression: <="]
condlhs [label="Expression: Variable i"]
condrhs [label="Expression: Variable n"]
inc [label="Code\nExpression: ++ - postfix"]
incarg [label="Expression: Variable i"]
body [label="Code\nStatement: Block"]
mult [label="Code\nExpression: *="]
multlhs [label="Expression: Variable fac"]
multrhs [label="Expression: Variable i"]
return [label="Code\nStatement: return"]
returnexpr [label="Expression: Variable fac"]
factorial -> {declfac, for, return}
for -> {decli,cond,inc,body}
cond -> {condlhs,condrhs}
inc -> incarg
body -> mult -> {multlhs, multrhs}
return -> returnexpr
globalfactorial -> {parameters,factorial}
parameters -> parameter
}
\enddot

In the end, we produce a sequence of trees modeling each declaration
in the translation unit (i.e., the file `factorial.c`).

This data structure is already useful: at this level, we can easily
derive simple information such as ''which functions are being defined?'',
''what are the arguments to a given function?'' and so on.

\subsubsection symbol_table_section Symbol tables and variable disambiguation

Nevertheless, for many analyses, this representation needs to be
transformed further. In general, the first step is to **resolve variable
names**. This is done using an auxiliary data structure known as the
**symbol table**.

The issue that this step addresses is that the meaning of a variable
name depends on a given *scope* - for instance, in a C program, we can
define the variable `i` as a local variable in two different functions,
so that the name `i` refers to two different objects, depending on which
function is being executed at a given point in time.

To resolve this problem, a first transformation step is performed,
changing (short) variable names to some form of unique identifier. This
ensures that each variable identifier is used only once, and all
accesses to a given variable can be easily found by just looking for the
variable identifier. For instance, we could change each variable name
into a pair of the name and a serial number. The serial number gets
increased whenever a variable of that name is declared. In the example,
the ASTs for `factorial` and `main` after resolving variable names would
look roughly like this:

\dot "ASTs with resolved variables"
digraph {
node [shape=rectangle]
globalfactorial [label="Global - code\nType: unsigned long\nName: factorial"]
parameters [label="Parameters"]
parameter [label="Parameter\nType: unsigned\nName: n.1"]
factorial [label="Code\nStatement: Block"]
declfac [label="Code\nStatement: Declaration\nType: unsigned long\nName: fac.1\nInitial Value: 1"]
for [label="Code\nStatement: for"]
decli [label="Code\nStatement: Declaration\nType: unsigned\nName: i.1\nInitial Value: 1"]
cond [label="Expression: <="]
condlhs [label="Expression: Variable i.1"]
condrhs [label="Expression: Variable n.1"]
inc [label="Code\nExpression: ++ - postfix"]
incarg [label="Expression: Variable i.1"]
body [label="Code\nStatement: Block"]
mult [label="Code\nExpression: *="]
multlhs [label="Expression: Variable fac.1"]
multrhs [label="Expression: Variable i.1"]
return [label="Code\nStatement: return"]
returnexpr [label="Expression: Variable fac.1"]
factorial -> {declfac, for, return}
for -> {decli,cond,inc,body}
cond -> {condlhs,condrhs}
inc -> incarg
body -> mult -> {multlhs, multrhs}
return -> returnexpr
globalfactorial -> {parameters,factorial}
parameters -> parameter

globalmain [label="Global - code\nType: int\nName: main"]
parametersmain [label="Parameters"]
parameterargc [label="Parameter\nType: int\nName: argc.1"]
parameterargv [label="Parameter\nType: const char **\nName: argv.1"]
mainbody [label="Code\nStatement: Block"]
main0 [label="Code\nStatement: Declaration\nType: unsigned\nName: n.2\nInitial Value: Expression"]
main1 [label="Code\nExpression: Call"]
main1fun [label="Expression: Global atoi"]
main1arg [label="Arguments"]
main1arg1 [label="Expression: Index"]
main1arg11 [label="Expression: Variable argv.1"]
main1arg12 [label="Expression: 1"]
main2 [label="Code\nExpression: Call"]
main2fun [label="Expression: Global printf"]
main2arg [label="Arguments"]
main2arg1 [label="Expression: ..."]
main2arg2 [label="Expression: Variable n.2"]
main2arg3 [label="Expression: Call"]
main2arg3fun [label="Expression: Global factorial"]
main2arg3arg [label="Arguments"]
main2arg3arg1 [label="Expression: Variable n.2"]
main3 [label="Code\nStatement: return"]
main3expr [label="Expression: 0"]
globalmain -> {parametersmain,mainbody}
parametersmain -> {parameterargc, parameterargv}
mainbody -> {main0,main1,main2,main3}
main1 -> {main1fun, main1arg}
main1arg -> main1arg1 -> {main1arg11,main1arg12}
main2 -> {main2fun, main2arg}
main2arg -> {main2arg1,main2arg2,main2arg3}
main2arg3 -> {main2arg3fun, main2arg3arg}
main2arg3arg -> main2arg3arg1
main3 -> main3expr
}
\enddot

Note that the parameter `n` in `factorial` and the local variable `n` in
`main` are now disambiguated as `n.1` and `n.2`; furthermore, we leave
the names of global objects as-is. In the CPROVER framework, a more
elaborate system is used: local variables are prefixed with the function
names, and further disambiguation is performed by adding indices. For
brevity, we will stick to the simpler naming scheme in this section.
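
The renaming itself only needs a per-name counter. The following sketch
illustrates the idea; the fixed-size table and the function names are
hypothetical, not the CPROVER symbol table:

```c
#include <stdio.h>
#include <string.h>

/* Map each base name to the number of declarations seen so far. */
struct entry { char name[32]; int count; };
static struct entry table[100];
static int entries;

/* Produce "name.k" for the k-th declaration of a given name. */
void disambiguate(const char *name, char *out, size_t out_size) {
  for (int i = 0; i < entries; i++)
    if (strcmp(table[i].name, name) == 0) {
      snprintf(out, out_size, "%s.%d", name, ++table[i].count);
      return;
    }
  snprintf(table[entries].name, sizeof table[entries].name, "%s", name);
  table[entries].count = 1;
  entries++;
  snprintf(out, out_size, "%s.1", name);
}

int main(void) {
  char buf[40];
  disambiguate("n", buf, sizeof buf); printf("%s\n", buf); /* prints n.1 */
  disambiguate("n", buf, sizeof buf); printf("%s\n", buf); /* prints n.2 */
  return 0;
}
```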

Further information on ASTs can be found in the
[Wikipedia page](https://en.wikipedia.org/wiki/Abstract_syntax_tree)
and the materials linked there. Additionally, there is an
[interesting discussion on StackOverflow](https://stackoverflow.com/questions/1888854/what-is-the-difference-between-an-abstract-syntax-tree-and-a-concrete-syntax-tree)
about abstract versus concrete syntax trees.

At this level, we can already perform a number of interesting analyses,
such as basic type checking. But for more advanced analyses, other
representations are better: the AST contains many different kinds of nodes
(essentially, one per type of program construct), and has a rather
complicated structure.

\subsection IR_section Intermediate Representations (IR)

The ASTs in the previous section represent the syntax of a program, including
all the features of a given programming language. But in practice, most
programming languages have a large amount of ''syntactic sugar'': constructs
that are, technically speaking, redundant, but make programming a lot easier.
For analysis, this means that if we immediately try to work on the initial
AST, we would have to handle all these various cases.

To simplify analysis, it pays off to bring the AST into simpler forms,
known as **intermediate representations** (short **IR**). An IR is usually
given as some form of AST, using a more restricted subset or a variant of the
original language that is easier to analyze than the original version.

Taking the example from above, we rewrite the program into a simpler form of
the C language: instead of allowing powerful control constructs such as
`while` and `for`, we reduce everything to `if` and `goto`. In fact, we even
restrict `if` statements: an `if` statement should always be of the form
`if (condition) goto target else goto target;`. As it turns out, this is
sufficient to represent every C program. The factorial function in our
example program can then be rewritten as follows:

```c
unsigned long factorial(unsigned n) {
  unsigned long fac = 1;
  // Replace the for loop with if and goto
  unsigned int i = 1;
for_loop_start:
  if (i <= n) goto for_loop_entry else goto for_loop_end;
for_loop_entry:
  fac *= i;
  i++;
  goto for_loop_start;
for_loop_end:
  return fac;
}
```

We leave it up to the reader to verify that both versions of the function
behave the same way, and to draw the function as an AST.

In the CPROVER framework, a number of different IRs are employed to
simplify the program under analysis into a simple core language step-by-step.
In particular, expressions are brought into much simpler forms.
This sequence of transformations is described in later chapters.

TODO: Link to corresponding sections.

\subsection CFG_section Control Flow Graphs (CFG)

Another important representation of a program can be gained by transforming
the program structure into a
[**control flow graph**](https://en.wikipedia.org/wiki/Control_flow_graph),
short **CFG**. While the
AST focuses more on the syntactic structure of the program, keeping
constructs like while loops and similar forms of structured control flow,
the CFG uses a unified graph representation for control flow.

In general, for analyses based around Abstract Interpretation
(see \ref abstract_interpretation_section), it
is usually preferable to use a CFG representation, while other analyses,
such as variable scope detection, may be easier to perform on ASTs.

The general idea is to present the program as a graph. The nodes of the graph
are instructions or sequences of instructions. In general, the nodes are
**basic blocks**: a basic block is a sequence of statements that is always
executed in order from beginning to end. The edges of the graph describe how
the program execution may move from one basic block to the next.
Note that single statements are always basic blocks; this is the representation
used inside the CPROVER framework. In the examples below, we try to use maximal
basic blocks (i.e., basic blocks that are as large as possible); this can be
advantageous for some analyses.

Let us consider the factorial function as an example. As a reminder,
here is the code, in IR:

```c
unsigned long fac = 1;
unsigned int i = 1;
for_loop_start:
if (i <= n) goto for_loop_entry else goto for_loop_end;
for_loop_entry:
fac *= i;
i++;
goto for_loop_start;
for_loop_end:
return fac;
```

We rewrite the code with disambiguated variables (building the
AST from it is left as an exercise):

```c
unsigned long fac.1 = 1;
unsigned int i.1 = 1;
for_loop_start:
if (i.1 <= n.1) goto for_loop_entry else goto for_loop_end;
for_loop_entry:
fac.1 *= i.1;
i.1 ++;
goto for_loop_start;
for_loop_end:
return fac.1;
```

This function consists of four basic blocks:
1. `unsigned long fac.1 = 1;`<br/>
   `unsigned int i.1 = 1;`
2. `if (i.1 <= n.1) goto for_loop_entry else goto for_loop_end`
   (this block has a label, `for_loop_start`).
3. `fac.1 *= i.1`<br/>
   `i.1 ++`<br/>
   `goto for_loop_start`
   (this block has a label, `for_loop_entry`).
4. `return fac.1`
   (this block has a label, `for_loop_end`).

One way to understand which instructions form basic blocks is to consider the
successors of each instruction. If we have two instructions A and B, we say
that B is a *successor* of A if, after executing A, we can execute B without any
intervening instructions. For instance, in the example above, the loop
initialization statement `unsigned int i.1 = 1` is a successor of
`unsigned long fac.1 = 1`. On the other hand, `return fac.1` is not a successor
of `unsigned long fac.1 = 1`: we always have to execute some other intermediate
statements to reach the return statement.

Now, consider the `if` statement,
`if (i.1 <= n.1) goto for_loop_entry else goto for_loop_end`.
This statement has *two* successors: `fac.1 *= i.1` and `return fac.1`.

Similarly, we say that A is a *predecessor* of B if B is a successor of A.
We find that the `if` statement has two predecessors,
`unsigned int i.1 = 1` and `goto for_loop_start`.

A basic block is a sequence of instructions with the following property:
1. If B comes directly after A in the sequence, B must be the sole successor of A.
2. If A comes directly before B in the sequence, A must be the sole predecessor of B.

In particular, each member of the sequence but the first must have exactly one
predecessor, and each member of the sequence but the last must have exactly one
successor.

These criteria explain why we have the basic blocks described above.
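
These criteria translate into the classic two-pass construction: first mark
the *leaders* (instructions that must begin a new basic block), then cut the
instruction list at each leader. The sketch below assumes a hypothetical flat
instruction encoding with at most two branch targets:

```c
#include <stdbool.h>

/* Hypothetical flat instruction: branches carry up to two target indices. */
struct instr {
  bool is_branch; /* goto, conditional goto, or return */
  int target;     /* index of the first branch target, or -1 */
  int target2;    /* index of the second target (conditional), or -1 */
};

/* A leader is the first instruction, any branch target, and any
   instruction that directly follows a branch. */
void mark_leaders(const struct instr *code, int n, bool *leader) {
  for (int i = 0; i < n; i++)
    leader[i] = false;
  if (n > 0)
    leader[0] = true;
  for (int i = 0; i < n; i++) {
    if (code[i].is_branch) {
      if (code[i].target >= 0)
        leader[code[i].target] = true;
      if (code[i].target2 >= 0)
        leader[code[i].target2] = true;
      if (i + 1 < n)
        leader[i + 1] = true;
    }
  }
}
```

Running this on the IR above marks the `if`, the `fac.1 *= i.1` and the
`return fac.1` instructions as leaders, which yields exactly the four blocks
listed.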

Putting everything together, we get a control flow graph like this:

\dot "Control flow graph for factorial"
digraph {
node [shape=rectangle]
bb1 [label="Declaration: unsigned long fac.1, value 1\nDeclaration: unsigned i.1, value 1",peripheries=2]
bb2 [label="Expression: i.1 <= n.1"]
bb3 [label="Expression: fac.1 *= i.1\nExpression: i.1 ++"]
bb4 [label="Return: fac.1",style=filled,fillcolor=gray80]
bb1 -> bb2
bb2 -> bb3 [label="true"]
bb2 -> bb4 [label="false"]
bb3 -> bb2
}
\enddot

The graph can be read as follows: each node corresponds to a basic
block. The initial basic block (where the function is entered) is marked
with a double border, while those basic blocks that leave the function
have a gray background. An edge from a basic block B to a basic block B'
means that if the execution reaches the end of B, execution may continue in B'.
Some edges are labeled: the edge leaving the comparison
basic block with the label `true`, for instance, can only be taken if
the comparison did, in fact, return `true`.

Note that this representation makes it very easy to interpret the
program, keeping just two pieces of state: the current position (which
basic block and which line), and the values of all variables (in real
software, this would also include parts of the heap and the call stack).
Execution proceeds as follows: as long as there are still instructions
to execute in the current basic block, run the next instruction and move
the current position to right after that instruction. At the end of a
basic block, take one of the available outgoing edges to another basic block.

In the CPROVER framework, we often do not construct CFGs explicitly, but instead
use an IR that is constructed in a way very similar to CFGs, known as
''GOTO programs''. This IR is used to implement a number
of static analyses, as described in sections \ref analyses-frameworks
and \ref analyses-specific-analyses.

\subsection SSA_section SSA

While control flow graphs are already quite useful for static analysis,
some techniques benefit from a further transformation to a
representation known as **static single assignment**, short **SSA**. The
point of this step is to ensure that we can talk about the entire history
of assignments to a given variable. This is achieved by renaming
variables again: whenever we assign to a variable, we *clone* this
variable by giving it a new name. This ensures that each variable
appearing in the resulting program is written to exactly once (but it
may be read arbitrarily many times). In this
way, we can refer to earlier values of the variable by just referencing
the name of an older incarnation of the variable.

We illustrate this transformation by first showing how the body of the
`for` loop of `factorial` is transformed. We currently have:

```
expression: fac.1 *= i.1
expression: i.1 ++
```

We now give a second number to each variable, counting the number of
assignments so far. Thus, the SSA version of this code turns out to be

```
expression: fac.1.2 = fac.1.1 * i.1.1
expression: i.1.2 = i.1.1 + 1
```

This representation now allows us to state facts such as
''`i` is increasing'' by writing `i.1.1 < i.1.2`.

At this point, we run into a complication. Consider the following piece
of code for illustration:

```c
// Given some integers a, b
int x = a;
if (a < b) {
  x = b;
}
return x;
```

The corresponding control flow graph looks like this:

\dot "CFG for maximum function"
digraph {
node [shape=rectangle]
first [label="declare x\nx = a\na < b"]
second [label="x = b"]
third [label="return x"]
first -> second [label="true"]
first -> third [label="false"]
second -> third
}
\enddot

When we try to transform to SSA, we get:

\dot "CFG for maximum function - SSA, attempt"
digraph {
node [shape=rectangle]
first [label="declare x\nx.1 = a\na < b"]
second [label="x.2 = b"]
third [label="return ???"]
first -> second [label="true"]
first -> third [label="false"]
second -> third
}
\enddot

Depending on which path the execution takes, we have to return either
`x.1` or `x.2`! The way to make this work is to introduce a function
Φ that selects the right instance of the variable; in the
example, we would have

\dot "CFG for maximum function - SSA using Φ"
digraph {
node [shape=rectangle]
first [label="declare x\nx.1 = a\na < b"]
second [label="x.2 = b"]
third [label="return Φ(x.1,x.2)"]
first -> second [label="true"]
first -> third [label="false"]
second -> third
}
\enddot

In the CPROVER framework, we provide a precise implementation of Φ, using explicitly tracked
information about which branches were taken by the program.
There are also some differences in how loops are
handled (finite unrolling in CPROVER, versus a Φ-based approach in
compilers); this approach will be discussed in a later chapter.

For the time being, let us come back to `factorial`. We can now give
an SSA using Φ functions:

\dot "Control flow graph in SSA for factorial"
digraph {
node [shape=rectangle]
bb1 [label="unsigned long fac.1.1 = 1\nunsigned i.1.1 = 1",peripheries=2]
bb2 [label="i.1.2 = Φ(i.1.1, i.1.3)\nfac.1.2 = Φ(fac.1.1, fac.1.3)\ni.1.2 <= n.1.1"]
bb3 [label="fac.1.3 = fac.1.2 * i.1.2\ni.1.3 = i.1.2 + 1"]
bb4 [label="Return fac.1.2",style=filled,fillcolor=gray80]
bb1 -> bb2
bb2 -> bb3 [label="true"]
bb2 -> bb4 [label="false"]
bb3 -> bb2
}
\enddot

The details of SSA construction, plus some discussion of how it is
used in compilers, can be found in the
[original paper](https://dl.acm.org/citation.cfm?doid=115372.115320).

The SSA is an extremely helpful representation when one
wishes to perform model checking on the program (see next section),
since it is much easier to extract the logic formulas used in this
technique from an SSA compared to a CFG (or, even worse, an AST). That
being said, the CPROVER framework takes a different route, opting to
convert to an intermediate representation known as GOTO programs instead.

\section analysis_techniques_section Analysis techniques

\subsection BMC_section Bounded model checking

One of the most important analysis techniques provided by the CPROVER
framework, implemented in the CBMC (and JBMC) tools, is **bounded model
checking**, a specific instance of a method known as
[Model Checking](https://en.wikipedia.org/wiki/Model_checking).

The basic question that model checking tries to answer is: given some system
(in our case, a program) and some property, can we find an execution of the
system such that it reaches a state where the property holds?
If yes, we would like to know how the program reaches this state - at the very
least, we want to see what inputs are required, but in general, we would prefer
having a **trace**, which shows what statements are executed and in which order.

In general, a trace describes which statements of the program were executed,
and which intermediate states were reached. Often, it is sufficient to
only provide part of the intermediate states (omitting some entirely, and
mentioning only those parts of others that cannot easily be reconstructed).

As it turns out, model checking for programs is, in general, a hard problem.
Part of the reason for this is that many model checking algorithms strive for
a form of ''completeness'' where they either find a trace or return a proof
that such a trace cannot possibly exist.

Since we are interested in generating test cases, we prefer a different
approach: it may be that a certain target state is reachable only after
a very long execution, or not at all, but this information does not help
us in constructing test cases. For this reason, we introduce an
**execution bound** that describes how deep we go when analyzing a program.

Model checking techniques using such execution bounds are known as
**bounded model checking**; they will return either a trace, or a statement
that says ''the target state could not be reached in n steps'', for a
given bound n. Thus, for a given bound, we always get an
**underapproximation** of all states that can be reached: we
can certainly find those reachable within the given bound, but we may
miss states that can be reached only with more steps. Conversely, we will
never claim that a state is not reachable within a certain bound if there is,
in fact, a way of reaching this state.

The bounded model checking techniques used by the CPROVER framework are
based on *symbolic model checking*, a family of model checking techniques that
work on sets of program states and use advanced tools such as SAT solvers (more
on that below) to calculate the set of reachable states. The key step here
is to encode both the program and the set of states using an appropriate logic,
mostly *propositional logic* and (fragments of) *first-order logic*.

In the following, we will quickly discuss propositional logic, in combination
with SAT solving, and show how to build a simple bounded model checker.
Actual bounded model checking for software requires a number of additional
steps and concepts, which will be introduced as required.

\subsubsection propositional_logic_subsubsection Propositional logic and SAT solving

Many of the concepts in this section can be found in more detail in the
Wikipedia article on [Propositional logic](https://en.wikipedia.org/wiki/Propositional_calculus).

Let us start by looking at **propositional formulas**. A propositional formula
consists of **propositional variables**, say *x*, *y* and *z*, that can
take the Boolean values **true** and **false**, connected together with
logical operators (often called *junctors*), namely *and*, *or* and *not*.
Sometimes, one introduces additional junctors, such as *xor* or *implies*,
but these can be defined in terms of the three basic junctors just described.

Examples of propositional formulas would be ''*x* and *y*'' or
''not *x* or *y* or *z*''. We can **evaluate** formulas by setting each
variable to a Boolean value and reducing using the following rules:
- *x* and **false** = **false** and *x* = **false** or **false** = not **true** = **false**
- *x* or **true** = **true** or *x* = **true** and **true** = not **false** = **true**

An important related question is: given a propositional formula,
is there a variable assignment that makes it evaluate to true?
This is known as the [**SAT problem**](https://en.wikipedia.org/wiki/Boolean_satisfiability_problem).
The most important things to know about SAT are:
1. It forms the basis for bounded model checking algorithms;
2. It is a very hard problem to solve in general: it is NP-complete,
   meaning that it is easy to check a solution, but (as far as we know) hard
   to find one;
3. There has been impressive research in
   [**SAT solvers**](https://en.wikipedia.org/wiki/Boolean_satisfiability_problem#Algorithms_for_solving_SAT)
   that work well in practice for the kinds of formulas that we encounter in
   model checking. A commonly-used SAT solver is [minisat](http://minisat.se).
4. SAT solvers use a specific input format for propositional formulas, known
   as the [**conjunctive normal form**](https://en.wikipedia.org/wiki/Conjunctive_normal_form).
   For details, see the linked Wikipedia page;
   roughly, a conjunctive normal form formula is a propositional
   formula with a specific shape: at the lowest level are
   *atoms*, which are propositional variables ''*x*'' and
   negated propositional variables ''not *x*'';
   the next layer above are *clauses*, which are sequences
   of atoms connected with ''or'', e.g. ''not *x* or *y* or *z*''.
   The top layer consists of sequences of clauses,
   connected with ''and''.

As an example of how to use a SAT solver, consider the following
formula (in conjunctive normal form):

''(*x* or *y*) and (*x* or not *y*) and *x* and *y*''

We can represent this formula in the minisat input format as:

```
p cnf 2 4
1 2 0
1 -2 0
1 0
2 0
```

Here, the variable *x* is encoded as `1` and *y* as `2`; a negated variable
gets a minus sign, and each clause is terminated with `0`.

Compare the [Minisat user guide](https://www.dwheeler.com/essays/minisat-user-guide.html).
Try to run minisat on this example. What would you expect, and what result do you get?
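
For a formula this small, we can also decide satisfiability by brute force,
which makes the nature of the problem concrete (and shows why this approach
does not scale: each additional variable doubles the search space):

```c
#include <stdio.h>

/* Brute-force check of (x or y) and (x or not y) and x and y. */
int main(void) {
  for (int x = 0; x <= 1; x++)
    for (int y = 0; y <= 1; y++)
      if ((x || y) && (x || !y) && x && y)
        printf("satisfiable: x=%d y=%d\n", x, y);
  return 0;
}
```

This prints `satisfiable: x=1 y=1`, matching the satisfying assignment that
minisat finds.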

Next, try running minisat on the following formula:

''(*x* or *y*) and (*x* or not *y*) and (not *x*) and *y*''

\subsubsection bitvector_subsubsection Using SAT to reason about data: Bit vectors

So far, we have seen how to reason about simple propositional formulas. This
seems to be quite far from our goal of reasoning about the behavior of programs.
As a first step, let us see how we can lift SAT-based reasoning to reasoning about
data such as machine integers.

Remember the `factorial` function. It starts with the line

```c
unsigned long fac = 1;
```

Now, suppose that `fac` will be represented as a binary number with 64 bits (this
is standard on, e.g., Linux). So,
if we wish to reason about the contents of the variable `fac`, we might as well
represent it as a vector of 64 propositional variables, say `fac`<sub>0</sub> to
`fac`<sub>63</sub>, where
`fac` = 2<sup>63</sup> `fac`<sub>63</sub> + ... + 2<sup>0</sup> `fac`<sub>0</sub>.
We can then assert that `fac`=1 using the propositional formula
`fac`<sub>63</sub> = 0 and ... and `fac`<sub>1</sub> = 0 and `fac`<sub>0</sub> = 1,
where we define the formula A = B as ''(A or not B) and (B or not A)''.

We call this a *bit vector* representation. Compare the Wikipedia page on
[Binary numbers](https://en.wikipedia.org/wiki/Binary_number).

Bit vector representations can also be used to describe operations on binary
numbers. For instance, suppose we have two four-bit numbers
A<sub>3</sub>, ..., A<sub>0</sub> (representing a number A) and
B<sub>3</sub>, ..., B<sub>0</sub> (representing a number B)
and wish to add them. As detailed on the page on
[Binary adders](https://en.wikipedia.org/wiki/Adder_(electronics)),
we define three additional bit vectors, the *carries* C<sub>0</sub>, ..., C<sub>3</sub>,
the *partial sum* P<sub>0</sub>, ..., P<sub>4</sub> and
the *sum* S<sub>0</sub>, ..., S<sub>4</sub>, representing a number S such that
S = A + B.

Note that the sum vector has one bit more - why? How
is this related to arithmetic overflow in C?

For convenience, we first define the *half-adder*. This is given as two
formulas, HA_S(A,B) = (A and not B) or (B and not A), which gives the
sum of the bits A and B, and HA_C(A,B) = A and B, which indicates
whether the result of the sum was too big to fit into one bit (so we carry
a one to the next bit).

Using the half-adder formulas, we can define the *full adder*, again given as
two formulas, one for sum and the other for carry. We have
FA_S(A,B,C_in) = HA_S(HA_S(A,B),C_in), giving the sum of A, B and C_in,
and FA_C(A,B,C_in) = HA_C(A,B) or (C_in and HA_S(A,B)), which states
whether the result is too big to fit into a bit. Note that the full adder
has an additional input for carries; in the following step, we will use it
to chain the full adders together to compute the actual sum.

Using the helper variables we have above, we can give the sum of the two
four-bit numbers as:
> C<sub>0</sub> = FA_C(A<sub>0</sub>,B<sub>0</sub>,0) and C<sub>1</sub> = FA_C(A<sub>1</sub>,B<sub>1</sub>,C<sub>0</sub>) and
> C<sub>2</sub> = FA_C(A<sub>2</sub>,B<sub>2</sub>,C<sub>1</sub>) and C<sub>3</sub> = FA_C(A<sub>3</sub>,B<sub>3</sub>,C<sub>2</sub>) and
> S<sub>0</sub> = FA_S(A<sub>0</sub>,B<sub>0</sub>,0) and S<sub>1</sub> = FA_S(A<sub>1</sub>,B<sub>1</sub>,C<sub>0</sub>) and
> S<sub>2</sub> = FA_S(A<sub>2</sub>,B<sub>2</sub>,C<sub>1</sub>) and S<sub>3</sub> = FA_S(A<sub>3</sub>,B<sub>3</sub>,C<sub>2</sub>)
> and S<sub>4</sub> = FA_S(0,0,C<sub>3</sub>).
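
The half- and full-adder formulas translate directly into executable code.
The following sketch evaluates them over explicit bits for one concrete input;
it illustrates the encoding, and is not the bit-blasting code used inside the
CPROVER framework:

```c
#include <stdbool.h>
#include <stdio.h>

/* The half- and full-adder formulas from the text, over single bits. */
static bool ha_s(bool a, bool b) { return (a && !b) || (b && !a); }
static bool ha_c(bool a, bool b) { return a && b; }
static bool fa_s(bool a, bool b, bool c) { return ha_s(ha_s(a, b), c); }
static bool fa_c(bool a, bool b, bool c) { return ha_c(a, b) || (c && ha_s(a, b)); }

int main(void) {
  bool A[4] = {1, 0, 1, 0}; /* 5, least significant bit first */
  bool B[4] = {1, 1, 0, 0}; /* 3 */
  bool S[5];
  bool carry = 0;
  for (int i = 0; i < 4; i++) { /* chain the full adders */
    S[i] = fa_s(A[i], B[i], carry);
    carry = fa_c(A[i], B[i], carry);
  }
  S[4] = carry; /* the extra bit that catches the overflow */
  for (int i = 4; i >= 0; i--)
    printf("%d", S[i]);
  printf("\n"); /* prints 01000, i.e., 8 = 5 + 3 */
  return 0;
}
```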

Other arithmetic operations on binary numbers can be expressed using propositional
logic as well; the details can be found in the linked articles, as well
as [Two's complement](https://en.wikipedia.org/wiki/Two%27s_complement) for
handling signed integers and [IEEE 754](https://en.wikipedia.org/wiki/IEEE_754)
for floating point numbers.

In the following, we will simply write formulas such as S=A+B, with the
understanding that this is internally represented using the appropriate bit
vector formulas.

\subsubsection bmc_subsubsection How bounded model checking works

At this point, we can start considering how to express program behavior in
terms of propositional logic.

Let us start with a very simple program:

```c
int sum(int a, int b)
{
  return a + b;
}
```

To describe the behavior of this program, we introduce the appropriately-sized
bit vectors A and B, and an additional helper vector return. The A and
B bit vectors reflect the values of the parameters `a` and `b`, while
return contains the return value of the function. As we have seen above, we
can describe the value of `a+b` as A+B -- remember that this is an
abbreviation for a moderately complex formula on bit vectors!

From the semantics of the `return` instruction, we know that this program will
return the value `a+b`, so we can describe its behavior as return = A+B.

Let us consider a slightly more complex program.

```c
int calculate(int x)
{
  int y = x * x;
  y = y + x;
  return y;
}
```

We again introduce several bit vectors. For the parameter `x`, we introduce a
bit vector X, and for the return value, return. But we also have to deal
with the (local) variable `y`, which gets two assignments. Furthermore, we
now have a program with three instructions.

Thinking about the approach so far, we find that we cannot directly translate
`y=y+x` into a propositional formula: On the left-hand side of the `=`, we
have the ''new'' value of `y`, while the right-hand side uses the ''old'' value.
But propositional formulas do not distinguish between ''old'' and ''new'' values,
so we need to find a way to distinguish between the two. The easiest solution
is to use the Static Single Assignment form described in section \ref SSA_section.
We transform the program into SSA (slightly simplifying the notation):

```c
int calculate(int x.1)
{
  int y.1 = x.1 * x.1;
  y.2 = y.1 + x.1;
  return y.2;
}
```

In this form, we know that each variable is assigned to at most once.
To capture the behavior of this program, we translate it statement-by-statement
into a propositional formula. We introduce two bit vectors Y1 and Y2 to
stand for `y.1` and `y.2` (we map X to `x.1` and return to the return
value). `int y.1 = x.1 * x.1` becomes Y1 = X * X, `y.2 = y.1 + x.1` becomes
Y2 = Y1 + X and `return y.2` becomes return = Y2.

To tie the three formulas together into a description of the whole program,
we note that the three instructions form a single basic block, so we know they
are always executed as a unit. In this case, it is sufficient to simply connect
them with ''and'': Y1 = X * X and Y2 = Y1 + X and return = Y2. Note that thanks
to SSA, we do not need to pay attention to control flow here.

One example of non-trivial control flow is the `if` statement. Consider the
following function:

```c
int max(int a, int b)
{
  int result;
  if (a < b)
    result = b;
  else
    result = a;
  return result;
}
```

Bringing this into SSA form, we have the following program (we write `Phi` for Φ):

```c
int max(int a, int b)
{
  /* if (a < b) */
  result.1 = b;
  /* else */
  result.2 = a;
  return Phi(result.1,result.2);
}
```

We again introduce bit vectors A (for parameter `a`), B (for parameter `b`),
R1 (for `result.1`), R2 (for `result.2`) and
return (for the return value). The interesting question in this case is
how we can handle the Φ node: so far, it is a ''magic'' operator that selects
the correct value depending on the branch that was taken.

As a first step, we modify the SSA form slightly by introducing an additional
propositional variable C that tracks which branch of the `if` was taken.
We call this variable the *code guard variable*, or *guard* for short.
Additionally, we add C to the Φ node as a new first parameter, describing
which input to use as a result.
The corresponding program looks something like this:

```c
int max(int a, int b)
{
  bit C; /* Track which branch was taken */
  C = a < b;
  /* if (C) - not needed anymore thanks to SSA */
  result.1 = b;
  result.2 = a;
  return Phi(C,result.1,result.2);
}
```

For the encoding of the program, we introduce the implication junctor, written
⇒, where ''A ⇒ B'' is equivalent to ''(not A) or B''.
It can be understood as ''if A holds, B must hold as well''.

With these ingredients, we can encode the program. First of all, we translate
the basic statements of the program:
- `C = a<b` maps to C = A<B, for an appropriate formula A<B.
- `result.1 = b` becomes R1 = B, and `result.2 = a` becomes R2 = A.

Finally, the Φ node is again resolved using the ⇒ junctor: we
can encode the `return` statement as
(C ⇒ return = R1) and ((not C) ⇒ return = R2).

At this point, it remains to tie the statements together; we find that we can
again simply connect them with ''and''. We get:
> C = A<B and R1 = B and R2 = A and
> (C ⇒ return = R1) and ((not C) ⇒ return = R2).
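
As a quick sanity check that this encoding really captures `max`, we can
compare it against the original function on a few concrete inputs; the C
ternary operator plays the role of the two implications. This is just a
self-check for the reader, not a step of the actual BMC pipeline:

```c
#include <stdio.h>

int max(int a, int b) { return a < b ? b : a; }

int main(void) {
  for (int a = -2; a <= 2; a++)
    for (int b = -2; b <= 2; b++) {
      /* C = A<B, R1 = B, R2 = A; the implications pick R1 or R2. */
      int C = a < b, R1 = b, R2 = a;
      int ret = C ? R1 : R2;
      if (ret != max(a, b))
        printf("mismatch at a=%d b=%d\n", a, b);
    }
  printf("check done\n"); /* prints only "check done": the encoding matches */
  return 0;
}
```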

So far, we have only discussed how to encode the behavior of programs
as propositional formulas. To actually reason about programs, we also need
a way to describe the property we want to prove. To do this, we introduce a
primitive `ASSERT`. Let `e` be some expression; then `ASSERT(e)` is supposed
to do nothing if `e` evaluates to true, and to abort the program if `e`
evaluates to false.

For instance, we can try to prove that `max(a,b) <= a` by modifying the `max`
function as follows:

```c
int max(int a, int b)
{
  int result;
  if (a < b)
    result = b;
  else
    result = a;
  ASSERT(result <= a);
  return result;
}
```

The corresponding SSA would be

```c
int max(int a, int b)
{
  bit C; /* Track which branch was taken */
  C = a < b;
  result.1 = b;
  result.2 = a;
  ASSERT(Phi(C,result.1,result.2) <= a);
  return Phi(C,result.1,result.2);
}
```

We translate `ASSERT(Phi(C,result.1,result.2) <= a)` into
> Φ(C,result.1,result.2) <= a

which, resolved using the guard C, becomes
> (C ⇒ R1 <= A) and ((not C) ⇒ R2 <= A).

The resulting formula would be
> C = A<B and R1 = B and R2 = A and
> (C ⇒ R1 <= A) and ((not C) ⇒ R2 <= A) and
> (C ⇒ return = R1) and ((not C) ⇒ return = R2).

We can extend this approach quite straightforwardly to other constructs, but
one obvious problem remains: We have not described how to handle loops. This
turns out to be a substantial issue for theoretical reasons: Programs without
loops (and without recursive functions) always run for a limited number of
execution steps, so we can easily write down formulas that, essentially,
track their entire execution history. But programs with loops can take
arbitrarily many steps, or even run into infinite loops, so we cannot describe
their behavior using a finite propositional formula in the way we have
done so far.

There are multiple approaches to deal with this problem, all with different
trade-offs. CBMC chooses bounded model checking as the underlying approach.
The idea is that we only consider program
executions that are, in some measure, ''shorter'' than a given bound. This bound
then implies an upper bound on the size of the required formulas, which makes
the problem tractable.

To make this concrete, let's go back to the factorial example. We consider
the function in the following form (in CPROVER, we actually use a `goto`-based
IR, but the `while`-based version is a bit simpler to understand):

```c
unsigned long factorial(unsigned n) {
  unsigned long fac = 1;
  unsigned int i = 1;
  while (i <= n) {
    fac *= i;
    i++;
  }
  return fac;
}
```

We set the following ''length bound'': The loops in the program are allowed to
be executed at most three times; we will ignore all other executions. With this
in mind, we observe that the following two classes of programs behave in the
same way:

```c
/* Execute this loop at most n+1 times */
while (condition) {
  body;
}
```

```c
if (condition) {
  body;
  /* Execute this loop at most n times */
  while (condition) {
    body;
  }
}
```

This transformation is known as [loop unrolling](https://en.wikipedia.org/wiki/Loop_unrolling):
we can always replace a loop by an `if` that checks if we should execute,
a copy of the loop body and the loop itself.

So, to reason about the `factorial` function, we unroll the loop three times
and then replace the loop with `ASSERT(!(condition))`. We get:

```c
unsigned long factorial(unsigned n) {
  unsigned long fac = 1;
  unsigned int i = 1;
  if (i <= n) {
    fac *= i;
    i++;
    if (i <= n) {
      fac *= i;
      i++;
      if (i <= n) {
        fac *= i;
        i++;
      }
    }
  }
  ASSERT(!(i <= n));
  return fac;
}
```

Since this program is now in a simpler form without loops, we can use the
techniques we learned above to transform it into a propositional formula.
First, we transform into SSA (with code guard variables already included):

```c
unsigned long factorial(unsigned n) {
  unsigned long fac.1 = 1;
  unsigned int i.1 = 1;
  bit C1 = i.1 <= n;
  /* if (C1) */
  fac.2 = fac.1 * i.1;
  i.2 = i.1 + 1;
  bit C2 = i.2 <= n;
  /* if (C2) */
  fac.3 = fac.2 * i.2;
  i.3 = i.2 + 1;
  bit C3 = i.3 <= n;
  /* if (C3) */
  fac.4 = fac.3 * i.3;
  i.4 = i.3 + 1;
  ASSERT(!(i.4 <= n));
  return Phi(C1, Phi(C2, Phi(C3, fac.4, fac.3), fac.2), fac.1);
}
```

Note that we may be missing possible executions of the program due to this
translation; we come back to this point below.

The corresponding propositional formula can then be written as (check
that this is equivalent to the formula you would be getting by following
the translation procedure described above):

> fac.1 = 1 and i.1 = 1 and C1 = i.1 <= n and
> fac.2 = fac.1 * i.1 and i.2 = i.1 + 1 and C2 = i.2 <= n and
> fac.3 = fac.2 * i.2 and i.3 = i.2 + 1 and C3 = i.3 <= n and
> fac.4 = fac.3 * i.3 and i.4 = i.3 + 1 and C4 = i.4 <= n and
> not (i.4 <= n) and
> ((C1 and C2 and C3) ⇒ result = fac.4) and
> ((C1 and C2 and not C3) ⇒ result = fac.3) and
> ((C1 and not C2) ⇒ result = fac.2) and
> ((not C1) ⇒ result = fac.1)

In the following, we reference this formula as FA(n, result).

At this point, we know how to encode programs as propositional formulas.
Our goal was to reason about programs, and in particular, to check whether
a certain property holds. Suppose, for example, that we want to check if there
is a way that the `factorial` function returns `6`. One way to do this is to
look at the following propositional formula: FA(n, result) and result = 6.
If this formula has a model (i.e., if we can find a satisfying assignment to
all variables, and in particular, to n), we can extract the required value
for the parameter `n` from that model. As we have discussed above, this can
be done using a SAT solver: If you run, say, MiniSAT on this formula, you will
get a model that translates to n=3.
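
We can mimic this check in plain C by brute force over small values of `n`:
run the three-times-unrolled function, discard executions that would need a
fourth iteration (the role of the `ASSERT(!(i <= n))` conjunct), and test the
result. A real BMC engine instead hands FA(n, result) and the constraint to a
SAT solver, but for this tiny example the outcome is the same:

```c
#include <stdio.h>

/* Three-times-unrolled factorial; *within_bound mirrors ASSERT(!(i <= n)). */
unsigned long factorial_unrolled3(unsigned n, int *within_bound) {
  unsigned long fac = 1;
  unsigned int i = 1;
  if (i <= n) { fac *= i; i++; }
  if (i <= n) { fac *= i; i++; }
  if (i <= n) { fac *= i; i++; }
  *within_bound = !(i <= n);
  return fac;
}

int main(void) {
  for (unsigned n = 0; n < 8; n++) {
    int ok;
    unsigned long result = factorial_unrolled3(n, &ok);
    if (ok && result == 6)
      printf("n = %u\n", n); /* prints only n = 3 */
  }
  return 0;
}
```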

Be aware that this method has very clear limits: We know that the factorial of
`5` is `120`, but with the formula above, evaluating
''FA(n, result) and result=120'' would yield ''unsatisfiable''! This is because
we limited the number of loop iterations, and to reach 120, we have to execute
the loop more than three times. In particular, a ''VERIFICATION SUCCESSFUL''
message, as output by CBMC, must always be interpreted keeping the bounds
that were used in mind.

In the case that we found a model, we can get even more information: We can
even reconstruct the program execution that would lead to the requested result.
This can be achieved by tracing the SSA, using the guard variables to decide
which branches of `if` statements to take. In this way, we can simulate the
behavior of the program. Writing down the steps of this simulation yields a
*trace*, which describes what the corresponding program execution would look like.

\subsubsection smt_etc_subsubsection Where to go from here

The above section gives only a superficial overview on how SAT solving and
bounded model checking work. Inside the CPROVER framework, we use a
significantly more advanced engine, with numerous optimizations to the
basic algorithms presented above. One feature that stands out is that
we do not reduce everything to propositional logic, but instead use a more
powerful logic, namely quantifier-free first-order logic. The main
difference is that instead of propositional variables, we allow expressions
that return Boolean values, such as comparisons between numbers or string
matching expressions. This gives us a richer logic to express properties.
Of course, a simple SAT solver cannot deal with such formulas, which is why we
go to [*SMT solvers*](https://en.wikipedia.org/wiki/Satisfiability_modulo_theories)
instead - these solvers can deal with specific classes
of first-order formulas (like the ones we produce).
One well-known SMT solver is [Z3](https://github.com/Z3Prover/z3).

\subsection static_analysis_section Static analysis

While BMC analyzes the program by transforming everything to logic
formulas and, essentially, running the program on sets of concrete
states, another approach to learn about a program is based on the idea
of interpreting an abstract version of the program. This is known
as [**abstract interpretation**](https://en.wikipedia.org/wiki/Abstract_interpretation).
Abstract interpretation is one of the
main methods in the area of **static analysis**.

\subsubsection abstract_interpretation_section Abstract Interpretation

The key idea is that instead of looking at concrete program states
(e.g., ''variable x contains value 1''), we look at some
sufficiently-precise abstraction (e.g., ''variable x is odd'', or
''variable x is positive''), and perform interpretation of the program
using such abstract values. Coming back to our running example, we wish
to prove that the factorial function never returns 0.

An abstract interpretation is made up from four ingredients:
1. An **abstract domain**, which represents the analysis results.
2. A family of **transformers**, which describe how basic programming
   language constructs modify the state.
3. A map that takes a pair of a program location (e.g., a position in the
   program code) and a variable name and yields a value in the abstract domain.
4. An algorithm to compute a ''fixed point'', computing a map as described
   in the previous step that describes the behavior of the program.

The first ingredient we need for abstract interpretation is the
**abstract domain**.
The domain allows us to express what we know about a given variable or
value at a given program location; in our example, whether it is zero or not.
The way we use the abstract domain is: for each program point, we have a
map from visible variables to elements of the abstract domain,
describing what we know about the values of the variables at this point.

For instance, consider the `factorial` example again. After running the
first basic block, we know that `fac` and `i` both contain 1, so we have
a map that associates both `fac` and `i` to "not 0".

An abstract domain is a set $D$ (or, if you
prefer, a data type) with the following properties:
- There is a function merge that takes two elements of $D$ and returns
  an element of $D$. This function is associative (merge(x, merge(y,z))
  = merge(merge(x,y), z)), commutative (merge(x,y) = merge(y,x)) and
  idempotent (merge(x,x) = x).
- There is an element bottom of $D$ such that merge(x, bottom) = x.

Algebraically speaking, $D$ needs to be a semi-lattice.

For our example, we use the following domain:
- D contains the elements "bottom" (nothing is known),
  "equals 0", "not 0" and "could be 0".
- merge is defined as follows:
  - merge(bottom, x) = x
  - merge("could be 0", x) = "could be 0"
  - merge(x, x) = x
  - merge("equals 0", "not 0") = "could be 0"
- bottom is bottom, obviously.

It is easy but tedious to check that all conditions hold.
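
This domain is small enough to write down directly. Here is an illustrative C
sketch together with a few of the required properties as assertions; the
abstract domains in the CPROVER framework are, of course, far more general:

```c
#include <assert.h>

/* The zero/non-zero abstract domain. */
enum dom { BOTTOM, EQUALS_0, NOT_0, COULD_BE_0 };

enum dom merge(enum dom a, enum dom b) {
  if (a == BOTTOM) return b;  /* bottom is neutral */
  if (b == BOTTOM) return a;
  if (a == b) return a;       /* idempotence */
  return COULD_BE_0;          /* any disagreement loses information */
}

int main(void) {
  assert(merge(EQUALS_0, NOT_0) == COULD_BE_0);
  assert(merge(NOT_0, EQUALS_0) == merge(EQUALS_0, NOT_0)); /* commutative */
  assert(merge(BOTTOM, NOT_0) == NOT_0);
  return 0;
}
```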

The second ingredient we need are the **abstract state transformers**.
An abstract state transformer describes how a specific expression or
statement processes abstract values. For the example, we need to define
abstract state transformers for multiplication and addition.

Let us start with multiplication, so let us look at the expression
`x*y`. We know that if `x` or `y` is 0, `x*y` is zero. Thus,
if `x` or `y` is "equals 0", the result of `x*y` is also "equals 0".
If `x` or `y` is "could be 0" (but neither is "equals 0"), we simply
don't know what the result is - it could be zero or not. Thus, in this
case, the result is "could be 0".

What if `x` and `y` are both "not 0"? In a mathematically ideal world,
we would have `x*y` be non-zero as well. But in a C program,
multiplication could overflow, yielding 0! So, to be correct, we have to
yield "could be 0" in this case.

Finally, when `x` is bottom, we can just return whatever value we had
assigned to `y`, and vice versa.

For addition, the situation looks like this: consider `x+y`.
If neither `x` nor `y` is "not 0", but
at least one is "could be 0", the result is "could be 0". If both are
"equals 0", the result is "equals 0". What if both are "not 0"? It seems
that, since the variables are unsigned and not zero, it should be
"not 0". Sadly, overflow strikes again, and we have to make do with
"could be 0". The bottom cases can be handled just like for
multiplication.
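
Both transformers can be written in the same style; note how overflow forces
the conservative answer in the "not 0" cases. Again a sketch, repeating the
`enum dom` from above so that it compiles on its own:

```c
/* Same domain as in the previous sketch. */
enum dom { BOTTOM, EQUALS_0, NOT_0, COULD_BE_0 };

/* Abstract transformer for x * y. */
enum dom abstract_mul(enum dom x, enum dom y) {
  if (x == BOTTOM) return y;
  if (y == BOTTOM) return x;
  if (x == EQUALS_0 || y == EQUALS_0) return EQUALS_0;
  return COULD_BE_0; /* even "not 0" * "not 0" may overflow to 0 */
}

/* Abstract transformer for x + y (conservative: "not 0" + "equals 0"
   could be refined to "not 0", since adding 0 cannot overflow). */
enum dom abstract_add(enum dom x, enum dom y) {
  if (x == BOTTOM) return y;
  if (y == BOTTOM) return x;
  if (x == EQUALS_0 && y == EQUALS_0) return EQUALS_0;
  return COULD_BE_0; /* overflow again: "not 0" + "not 0" may wrap to 0 */
}
```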

The way of defining the transformation functions above showcases another
important property: they must reflect the actual behavior of the
underlying program instructions. There is a formal description of this
property, using [*Galois connections*](https://en.wikipedia.org/wiki/Galois_connection);
for the details, it is best to
look at the literature.

The third ingredient is straightforward: we use a simple map
from program locations and variable names to values in the abstract
domain. In more complex analyses, more involved forms of maps may be
used (e.g., to handle arbitrary procedure calls, or to account for the
heap).

At this point, we have almost all the ingredients we need to set up an abstract
interpretation. To actually analyze a function, we take its CFG and
perform a *fixpoint algorithm*.

Concretely, let us consider the CFG for factorial again. This time, we
have named the basic blocks, and simplified the variable names.

\dot "Control flow graph for factorial - named basic blocks"
digraph {
node [shape=rectangle]
bb1 [label="BB1\nDeclaration: unsigned long fac, value 1\nDeclaration: unsigned i, value 1",peripheries=2]
bb2 [label="BB2\nExpression: i <= n"]
bb3 [label="BB3\nExpression: fac *= i\nExpression: i ++"]
bb4 [label="BB4\nReturn: fac",style=filled,fillcolor=gray80]
bb1 -> bb2
bb2 -> bb3 [label="true"]
bb2 -> bb4 [label="false"]
bb3 -> bb2
}
\enddot

We provide an initial variable map: `n` is "could be 0" before BB1.
As a first step, we analyze BB1 and find that the final variable map is:
- `n` "could be 0"
- `fac` "not 0" (it is, in fact, 1, but our domain does not allow us to
  express this)
- `i` "not 0"

Let us call this state N.

At this point, we look at all the outgoing edges of BB1 - we wish to
propagate our new results to all blocks that follow BB1. There is only
one such block, BB2. We analyze BB2 and find that it doesn't change any
variables. Furthermore, the result of `i <= n` does not allow us
to infer anything about the values in the variable map, so we get the
same variable map at the end of BB2 as before.

Again, we look at the outgoing edges. There are two successor blocks,
BB3 and BB4. The information in our variable map does not allow us to
rule out either of the branches, so we need to propagate to both blocks.
We start with BB3 and remember that we need to visit BB4.

At this point, we know that `n` "could be 0", while `fac` and `i` are "not 0".
Applying the abstract transformers, we learn that afterwards, both `fac`
and `i` "could be 0" (`fac` ends up as "could be 0" since both `fac` and `i`
were "not 0" initially, `i` ends up as "could be 0" because both `i` and 1
are "not 0"). So, the final variable map at the end of BB3 is
- `n`, `fac` and `i` "could be 0".

Let us call this state S.

At this point, we propagate again, this time to BB2. But wait, we have
propagated to BB2 before! The way this is handled is as follows:
We first calculate the result of running BB2 from S; this yields S
again. Now, we know that at the end of BB2, we can be either in state S
or N. To get a single state out of these two, we *merge*: for each
variable, merge the mapping of that variable in S and N.
- `n` maps to merge("could be 0", "could be 0") = "could be 0"
- `fac` maps to merge("could be 0", "not 0") = "could be 0"
- `i` maps to merge("could be 0", "not 0") = "could be 0"

In other words, we arrive at S again.

At this point, we propagate to BB3 and BB4 again. Running BB3, we again
end up with state S at the end of BB3, so things don't change. We detect
this situation and figure out that we do not need to propagate from BB3
to BB2 - this would not change anything! Thus, we can now propagate to
BB4. The state at the end of BB4 is also S.

Now, since we know the variable map at the end of BB4, we can look up
the properties of the return value: in S, `fac` maps to "could be 0", so
we could not prove that the function never returns 0. In fact, this is
correct: calculating `factorial(200)` will yield 0, since the 64-bit
integer `fac` overflows.

Nevertheless, let us consider what would happen in a mathematically ideal
world (e.g., if we used big integers). In that case, we would get
"not 0" * "not 0" = "not 0", "not 0" + x = "not 0" and
x + "not 0" = "not 0". Running the abstract interpretation with these
semantics, we find that if we start BB3 with variable map N, we get
variable map N at the end as well, so we end up with variable map N
at the end of BB4 - but this means that `fac` maps to "not 0"!

The algorithm sketched above is called a fixpoint algorithm because we
keep propagating until we find that applying the transformers does not
yield any new results (which, in a mathematically precise way, can be
shown to be equivalent to calculating, for a specific function f, an x
such that f(x) = x, i.e., a *fixed point* of f).
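
For the factorial CFG, this fixpoint computation is small enough to spell out
in code. The sketch below hard-codes the loop of the example: it repeatedly
applies the BB3 transformers and merges the result back into the loop head
until nothing changes. It is meant only to make the algorithm tangible:

```c
#include <stdbool.h>
#include <stdio.h>

enum dom { BOTTOM, EQUALS_0, NOT_0, COULD_BE_0 };

enum dom merge(enum dom a, enum dom b) {
  if (a == BOTTOM) return b;
  if (b == BOTTOM) return a;
  return a == b ? a : COULD_BE_0;
}

int main(void) {
  /* State N at the loop head (entry of BB2), as computed from BB1. */
  enum dom fac = NOT_0, i = NOT_0;
  bool changed = true;
  while (changed) {
    /* BB3: fac = fac * i and i = i + 1, both of which may overflow. */
    enum dom fac_bb3 =
      (fac == EQUALS_0 || i == EQUALS_0) ? EQUALS_0 : COULD_BE_0;
    enum dom i_bb3 = COULD_BE_0;
    /* Back edge to BB2: merge with the state we already had there. */
    enum dom new_fac = merge(fac, fac_bb3);
    enum dom new_i = merge(i, i_bb3);
    changed = new_fac != fac || new_i != i;
    fac = new_fac;
    i = new_i;
  }
  /* BB4 sees the stable loop-head state: fac is COULD_BE_0 (value 3). */
  printf("fac at BB4: %d\n", fac);
  return 0;
}
```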

This overview only describes the most basic way of performing abstract
interpretation. For one, it only works on the procedure level (we say it is an
*intra-procedural analysis*); there are various ways of extending
it to work across function boundaries, yielding *inter-procedural analysis*.
Additionally, there are situations where it can be helpful to make a variable
map more abstract (widening) or use information that can be gained from
various sources to make it more precise (narrowing); these advanced
topics can be found in the literature as well.

\section Glossary_section Glossary

\subsection instrument_subsection Instrument

To instrument a piece of code means to modify it by (usually) inserting new
fragments of code that, when executed, tell us something useful about the code
that has been instrumented.

For instance, imagine you are given the following function:

```c
int aha(int x)
{
  int result = 1;
  if (x > 3)
    result = x;
  return result;
}
```

and you want to design an analysis that figures out which lines of code will
be covered when `aha` is executed with the input `5`.

We can *instrument* the code so that the function will look like:

```c
int aha(int x)
{
  __runtime_line_seen(0);
  int result = 1;
  __runtime_line_seen(1);
  if (x > 3) {
    __runtime_line_seen(2);
    result = x;
  }
  __runtime_line_seen(3);
  return result;
}
```

All we have to do now is to implement the function
`void __runtime_line_seen(int line)` so that when executed it logs somewhere
what line is being visited. Finally we execute the instrumented version of
`aha` and collect the desired information from the log.
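
A minimal implementation could simply log to `stderr`; any logging or
coverage-recording mechanism would do:

```c
#include <stdio.h>

/* Record that the probe for the given line was reached. */
void __runtime_line_seen(int line)
{
  fprintf(stderr, "line %d visited\n", line);
}
```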

More generally speaking, and especially within the CPROVER code base,
instrumenting the code often refers to modifying its behavior in a manner that
makes it easier for a given analysis to do its job, regardless of whether that
is achieved by executing the instrumented code or by just analyzing it in some
other way.

\subsection flattening_lowering_subsection Flattening and Lowering

As we have seen above, we often operate on many different representations
of programs, such as ASTs, control flow graphs, SSA programs, logical formulas
in BMC and so on. Each of these forms is good for certain kinds of analyses,
transformations or optimizations.

One important kind of step in dealing with program representations is
going from one representation to the other. Often, such steps are going
from a more ''high-level'' representation (closer to the source code)
to a more ''low-level'' representation. Such transformation steps are
known as **flattening** or **lowering** steps, and tend to be
more-or-less mechanical.

An example of a lowering step is the transformation from ASTs to the GOTO IR,
as described in later chapters.

\subsection verification_condition_subsection Verification Condition

In the CPROVER framework, the term **verification condition** is used
in a somewhat non-standard way. Let a program and a set of assertions
be given. We transform the program into an (acyclic) SSA (i.e., an SSA
with all loops unrolled a finite number of times) and turn it into a
logical formula, as described above. Note that in this case, the
formula will also contain information about what the program does
after the assertion is reached: this part of the formula is, in fact,
irrelevant for deciding whether the program can satisfy the assertion or
not. The *verification condition* is the part of the formula that only
covers the program execution until the line that checks the assertion
has been executed, with everything that comes after it removed.