cprover
goto-programs/README.md
Go to the documentation of this file.
1 \ingroup module_hidden
2 \defgroup goto-programs goto-programs
3 
4 # Folder goto-programs
5 
6 \author Kareem Khazem, Martin Brain
7 
8 \section goto-programs-overview Overview
9 Goto programs are the intermediate representation of the CPROVER tool
10 chain. They are language independent and similar to many of the compiler
11 intermediate languages. Each function is a
12 list of instructions, each of which has a type (one of 19 kinds of
13 instruction), a code expression, a guard expression and potentially some
14 targets for the next instruction. They are not natively in static
15 single-assign (SSA) form. Transitions are nondeterministic (although in
16 practise the guards on the transitions normally cover form a disjoint
17 cover of all possibilities). Local variables have non-deterministic
18 values if they are not initialised. Goto programs can be serialised
19 in a binary (wrapped in ELF headers) format or in XML (see the various
20 `_serialization` files).
21 
22 The `cbmc` option `--show-goto-programs` is often a good starting point
23 as it outputs goto-programs in a human readable form. However there are
24 a few things to be aware of. Functions have an internal name (for
25 example `c::f00`) and a ‘pretty name’ (for example `f00`) and which is
26 used depends on whether it is internal or being presented to the user.
27 `NONDET(some_type)` is use to represent a nondeterministic value.
28 `IF guard GOTO x` represents a GOTO instruction with a guard, not a distinct
29 `IF` instruction.
30 The comment lines are generated from the `source_location` field
31 of the `instructiont` structure.
32 
33 GOTO programs are commonly produced using the `initialize_goto_model` function,
34 which combines the complete process from command-line arguments specifying
35 source code files, through parsing and generating a symbol table, to finally
36 producing GOTO functions.
37 
38 Alternatively `lazy_goto_modelt` can be used, which facilitates producing GOTO
39 functions on demand. See \ref section-lazy-conversion for more details on this
40 method.
41 
42 \section goto_data_structures Data Structures
43 
44 A \ref goto_functionst object contains a set of GOTO programs. Note the
45 counter-intuitive naming: `goto_functionst` instances are the top level
46 structure representing the program and contain `goto_programt` instances
47 which represent the individual functions.
48 
49 An instance of \ref goto_programt is effectively a list of
50 instructions (a nested class called \ref goto_programt::instructiont).
51 It is important
52 to use the copy and insertion functions that are provided as iterators
53 are used to link instructions to their predecessors and targets and
54 careless manipulation of the list could break these.
55 
56 Individual instructions are instances of type \ref goto_programt::instructiont.
57 They represent one step in the function. Each has a type, an instance of
58 \ref goto_program_instruction_typet which denotes what kind of instruction
59 it is. They can be computational (such as `ASSIGN` or `FUNCTION_CALL`),
60 logical (such as `ASSUME` and `ASSERT`) or informational (such as
61 `LOCATION` and `DEAD`). At the time of writing there are 19 possible
62 values for `goto_program_instruction_typet` / kinds of instruction.
63 Instructions also have a guard field (the condition under which it is
64 executed) and a code field (what the instruction does). These may be
65 empty depending on the kind of instruction.
66 These are of type \ref exprt and \ref codet respectively.
67 The next instructions (remembering that transitions may be
68 non-deterministic) are given by the list `targets` (with the
69 corresponding list of labels `labels`) and the corresponding set of
70 previous instructions is get by `incoming_edges`. Finally `instructiont`
71 has informational `function` and `source_location` fields that indicate where
72 they are in the source code.
73 
74 \section goto-conversion Goto Conversion
75 
76 In the \ref goto-programs directory.
77 
78 **Key classes:**
79 * goto_programt
80 * goto_functionst
81 * \ref goto_programt::instructiont
82 
83 \dot
84 digraph G {
85  node [shape=box];
86  rankdir="LR";
87  1 [shape=none, label=""];
88  2 [label="goto conversion"];
89  3 [shape=none, label=""];
90  1 -> 2 [label="Symbol table"];
91  2 -> 3 [label="goto-programs, goto-functions, symbol table"];
92 }
93 \enddot
94 
95 At this stage, CBMC constructs a goto-program from a symbol table.
96 Each symbol in the symbol table with function type (that is, the symbol's `type`
97 field contains a \ref code_typet) will be converted to a corresponding GOTO
98 program. The parse tree and source file are not used at all for this step.
99 
100 The conversion happens in two phases:
101 
102 1. Function `goto_convertt::convert` turns each `codet` in the symbol table
103  into corresponding GOTO instructions.
104 2. `goto_convertt::finish_gotos` and others populate the `GOTO` and `CATCH`
105  instructions' `targets` members, pointing to their possible successors.
106  `DEAD` instructions are also added when GOTO instructions branch out of one
107  or more lexical blocks.
108 
109 Here is an example of a simple C program being converted into GOTO code (note
110 the `targets` member of each instruction isn't populated yet):
111 
112 \dot
113 digraph G{
114  graph [nojustify=true];
115  node [shape=box];
116  compound=true;
117 
118  subgraph cluster_src {
119  1 [shape="none", label="source files"];
120  2 [label="file1.c\n-----------\nint main(){
121  int x = 5;
122  if(x > 7){
123  x = 9; } }
124 
125 void foo(){}"];
126  1 -> 2 [color=white];
127 
128  100 [label="file2.c\n--------\nchar bar(){ }"];
129  2 -> 100 [color=white];
130  }
131 
132  1 -> 3 [label="corresponds to", lhead=cluster_goto,
133  ltail=cluster_src];
134 
135  subgraph cluster_goto {
136  3 [label="a\ngoto_functionst", URL="\ref goto_functionst", shape=none];
137  4 [label="function_map\n(a map from irep_idt\nto goto_programt)"];
138  }
139  4 -> 5 [lhead=cluster_funmap, label="value:"];
140 
141  subgraph cluster_funmap {
142  9 [label="bar\n(an irep_idt)", URL="\ref irep_idt"];
143  10 [label="a goto_programt", URL="\ref goto_programt"];
144  9->10 [label="maps to"];
145 
146  5 [label="main\n(an irep_idt)", URL="\ref irep_idt"];
147 
148  7 [label="foo\n(an irep_idt)", URL="\ref irep_idt"];
149  8 [label="a goto_programt", URL="\ref goto_programt"];
150  7->8 [label="maps to"];
151 
152  subgraph cluster_goto_program {
153  11 [shape=none, label="a\ngoto_programt", URL="\ref goto_programt"];
154  12 [label="instructions\n(a list of instructiont)"];
155  }
156  5 -> 11 [lhead=cluster_goto_program, label="maps to:"];
157 
158  }
159 
160  12 -> target1 [lhead=cluster_instructions];
161 
162  subgraph cluster_instructions {
163  subgraph cluster_ins1{
164  code1 [label="code", URL="\ref codet"];
165  target1 [label="target"];
166  }
167 
168  target1 -> target2 [color=white,lhead=cluster_ins2,
169  ltail=cluster_ins1];
170 
171  subgraph cluster_ins2{
172  code2 [label="code", URL="\ref codet"];
173  target2 [label="target"];
174  }
175 
176  target2 -> target3 [color=white,lhead=cluster_ins3,
177  ltail=cluster_ins2];
178 
179  subgraph cluster_ins3{
180  code3 [label="code", URL="\ref codet"];
181  target3 [label="target"];
182  }
183 
184  target3 -> target4 [color=white,lhead=cluster_ins4,
185  ltail=cluster_ins3];
186 
187  subgraph cluster_ins4{
188  code4 [label="code", URL="\ref codet"];
189  target4 [label="target"];
190  }
191 
192  }
193 
194  subgraph cluster_decl {
195  decl1 [label="type:\ncode_declt", URL="\ref code_declt",
196  shape=none];
197  subgraph cluster_decl_in{
198  cluster_decl_in_1 [label="symbol()", shape=none];
199  cluster_decl_in_2 [label="x"];
200  cluster_decl_in_1 -> cluster_decl_in_2 [color=white];
201  }
202  decl1 -> cluster_decl_in_1 [lhead="cluster_decl_in", color=white];
203  }
204  code1 -> decl1 [lhead=cluster_decl];
205 
206  subgraph cluster_assign1 {
207  assign1 [label="type:\ncode_assignt", URL="\ref code_assignt",
208  shape=none];
209  subgraph cluster_assign1_in{
210  cluster_assign1_in_1 [label="lhs()", shape=none];
211  cluster_assign1_in_2 [label="x"];
212  cluster_assign1_in_1 -> cluster_assign1_in_2 [color=white];
213 
214  cluster_assign1_in_3 [label="rhs()", shape=none];
215  cluster_assign1_in_4 [label="5"];
216  cluster_assign1_in_3 -> cluster_assign1_in_4 [color=white];
217  }
218  assign1 -> cluster_assign1_in_1 [lhead="cluster_assign1_in", color=white];
219  }
220  code2 -> assign1 [lhead=cluster_assign1];
221 
222  subgraph cluster_if {
223  if [label="type:\ncode_ifthenelset", URL="\ref code_ifthenelset",
224  shape=none];
225  if_body [label="..."];
226  if -> if_body [color=white];
227  }
228  code3 -> if [lhead=cluster_if];
229 
230  subgraph cluster_assign2 {
231  assign2 [label="type:\ncode_assignt", URL="\ref code_assignt",
232  shape=none];
233  subgraph cluster_assign2_in{
234  cluster_assign2_in_1 [label="lhs()", shape=none];
235  cluster_assign2_in_2 [label="x"];
236  cluster_assign2_in_1 -> cluster_assign2_in_2 [color=white];
237 
238  cluster_assign2_in_3 [label="rhs()", shape=none];
239  cluster_assign2_in_4 [label="9"];
240  cluster_assign2_in_3 -> cluster_assign2_in_4 [color=white];
241  }
242  assign2 -> cluster_assign2_in_1 [lhead="cluster_assign2_in", color=white];
243  }
244  code4 -> assign2 [lhead=cluster_assign2];
245 
246 }
247 \enddot
248 
249 The `codet` representation of method bodies that is stored in the symbol table
250 resembles C code quite strongly: it uses lexical variable scoping and has
251 complex control-flow constructs such as `code_whilet` and `code_ifthenelset`.
252 In converting to a GOTO program this structure is changed in several ways:
253 
254 * Variable lifespan implied by \ref code_declt instructions and lexical scopes
255  described by \ref code_blockt nodes is replaced by `DECL` and corresponding
256  `DEAD` instructions. There are therefore no lexical scopes in GOTO programs
257  (not even local variable death on function exit is enforced).
258 
259 * Expressions with side-effects are explicitly ordered so that there is one
260  effect per instruction (apart from function call instructions, which can
261  still have many). For example, `y = f() + x++` will have become something like
262  `tmp1 = f(); y = tmp1 + x; x = x + 1;`
263 
264 * Compound blocks are eliminated. There are several subclasses of
265  \ref codet that count as 'compound blocks;' therefore, later stages in
266  the CBMC pipeline that switch over the codet subtype of a particular
267  instruction should not need to consider these types. In particular:
268 
269  * code_ifthenelset is turned into GOTOs. In particular, the bodies of
270  the conditionals are flattened into lists of instructions, inline
271  with the rest of the instruction in the goto_programt. The guard of
272  the conditional is placed onto the
273  \ref goto_programt::instructiont::guard "guard" member of
274  an instruction whose code member is of type \ref code_gotot. The
275  \ref goto_programt::instructiont::targets "targets" member
276  of that instruction points to the appropriate branch of the
277  conditional. (Note that although instructions have a list of
278  targets, in practice an instruction should only ever have at most
279  one target; you should check this invariant with an assertion if you
280  rely on it).
281 
282  The order of instructions in a list of instructions---as well as the
283  targets of GOTOs---are both displayed as arrows when viewing a
284  goto-program as a Graphviz DOT file with `goto-instrument --dot`.
285  The semantics of a goto-program is: the next instruction is the next
286  instruction in the list, unless the current instruction has a
287  target; in that case, check the guard of the instruction, and jump
288  to the target if the guard evaluates to true.
289 
290  * switch statements, for and while loops, and try-catches are also
291  transformed into lists of instructions guarded by GOTOs.
292 
293  * \ref code_blockt "code blocks" are transformed into lists of
294  instructions.
295 
296 * \ref code_returnt "return statements" are transformed into
297  (unconditional) GOTOs whose target is the \ref END_FUNCTION
298  instruction. Each goto_programt should have precisely one such
299  instruction.
300 
301 \section section-goto-transforms Subsequent Transformation
302 
303 It is normal for a program that calls `goto_convert` to immediately pass the
304 GOTO functions through one or more transformations. For example, the ubiquitous
305 `remove_returns` transformation turns the method return sequence generated by
306 `goto_convert` (something like `RETURN x; GOTO end;`) into
307 `function_name#return_value = x; GOTO end;`, and a callsite
308 `y = function_name();` into `function_name(); y = function_name#return_value;`.
309 
310 Some subsequent parts of the CBMC pipeline assume that one or more of these
311 transforms has been applied, making them effectively mandatory in that case.
312 
313 Common transformations include:
314 
315 * Removing superfluous SKIP instructions (`remove_skip`)
316 * Converting indirect function calls into dispatch tables over direct calls
317  (`remove_virtual_functions` and `remove_function_pointers`)
318 * Adding checks for null-pointer dereferences, arithmetic overflow and other
319  undefined behaviour (`goto_check`)
320 * Adding code coverage assertions (`instrument_cover_goals`).
321 
322 By convention these post-convert transformations are applied by a driver program
323 in a function named `process_goto_program` or `process_goto_functions`; examples
324 of these can be found in `cbmc_parse_optionst`, `goto_analyzer_parse_optionst`
325 and `goto_instrument_parse_optionst` among others.
326 
327 \section section-lazy-conversion Lazy GOTO Conversion
328 
329 The conventional source-code-to-GOTO-program pipeline is horizontally
330 structured: all source-file information is converted into a symbol table, then
331 each symbol-table method is converted into a GOTO program, then each transform
332 is run over the whole GOTO-model in turn. This is fine when the GOTO model
333 consumer will use most or all of the results of conversion, but if not then
334 lazy conversion may be beneficial.
335 
336 Lazy conversion is coordinated by `lazy_goto_modelt`. Early source code parsing
337 proceeds as usual, but while the same symbol table symbols are created as using
338 `initialize_goto_model`, method bodies in the symbol table are not populated.
339 `lazy_goto_modelt::get_goto_function` then allows each function to be converted
340 on demand, passing individually through symbol-table `codet` production, GOTO
341 conversion and driver-program transformation.
342 
343 `goto_symext` is able to use such a lazy GOTO model in place of the usual
344 `goto_modelt`, enabling it to create GOTO programs on demand, and thus avoid
345 spending time and memory generating GOTO programs that symex determines cannot
346 in fact be reached. This is particularly useful when the source program contains
347 many indirect calls (virtual function calls, or calls via function pointer), as
348 the `remove_virtual_functions` and `remove_function_pointers` passes treat these
349 very conservatively, producing large dispatch tables of possible callees which
350 symex often finds to be largely unreachable.
351 
352 JBMC exhibits typical usage of `lazy_goto_modelt` when it is called with
353 `--symex-driven-lazy-loading`. It defines a function
354 `jbmc_parse_optionst::process_goto_function` which performs driver-program-
355 specific post-GOTO-convert transformation. The lazy GOTO model is then passed to
356 `goto_symext`, which will trigger symbol-table population, GOTO conversion
357 and post-processing as needed.
358 
359 \section section-goto-binary Binary Representation
360 
361 An instance of `::goto_modelt` can be serialised to a binary stream (which is
362 typically a file on the disk), and later deserialised from that stream back to
363 an equivalent `::goto_modelt` instance.
364 
365 \subsection subsection-goto-binary-serialisation Serialisation
366 
367 The serialisation is implemented in C++ modules:
368  - `write_goto_binary.h`
369  - `write_goto_binary.cpp`
370 
371 To serialise a `::goto_modelt` instance `gm` to a stream `ostr` call the
372 function `::write_goto_binary`, e.g. `write_goto_binary(ostr, gm)`.
373 
374 The content of the written stream will have this structure:
375  - The header:
376  - A magic number: byte `0x7f` followed by 3 characters `GBF`.
377  - A version number written in the 7-bit encoding (see [number serialisation](\ref irep-serialization-numbers)). Currently, only version `4` is supported.
378  - The symbol table:
379  - The number of symbols in the table in the 7-bit encoding.
380  - The array of individual symbols in the table. Each written symbol `s` has this structure:
381  - The `::irept` instance `s.type`.
382  - The `::irept` instance `s.value`.
383  - The `::irept` instance `s.location`.
384  - The string `s.name`.
385  - The string `s.module`.
386  - The string `s.base_name`.
387  - The string `s.mode`.
388  - The string `s.pretty_name`.
389  - The number `0` in the 7-bit encoding.
390  - The flags word in the 7-bit encoding. The bits in the flags word correspond to the following `Boolean` fields (from the most significant bit):
391  - `s.is_weak`
392  - `s.is_type`
393  - `s.is_property`
394  - `s.is_macro`
395  - `s.is_exported`
396  - `s.is_input`
397  - `s.is_output`
398  - `s.is_state_var`
399  - `s.is_parameter`
400  - `s.is_auxiliary`
401  - `0` (corresponding to `s.binding`, i.e. we always clear this info)
402  - `s.is_lvalue`
403  - `s.is_static_lifetime`
404  - `s.is_thread_local`
405  - `s.is_file_local`
406  - `s.is_extern`
407  - `s.is_volatile`
408  - The functions with bodies, i.e. those missing a body are skipped.
409  - The number of functions with bodies in the 7-bit encoding.
410  - The array of individual functions with bodies. Each written function has this structure:
411  - The string with the name of the function.
412  - The number of instructions in the body of the function in the 7-bit encoding.
413  - The array of individual instructions in function's body. Each written instruction `I` has this structure:
414  - The `::irept` instance `I.code`, i.e. data of the instruction, like arguments.
415  - The string `I.function`, i.e. the name of the function this instruction belongs to.
416  - The `::irept` instance `I.source_location`, i.e. the reference to the original source code (file, line).
417  - The word in the 7-bit encoding `I.type`, i.e. the op-code of the instruction.
418  - The `::irept` instance `I.guard`.
419  - The empty string (representing former `I.event`).
420  - The word in the 7-bit encoding `I.target_number`, i.e. the jump target to this instruction from other instructions.
421  - The word in the 7-bit encoding `I.targets.size()`, i.e. the count of jump targets from this instruction.
422  - The array of individual jump targets from this instruction, each written as a word in the 7-bit encoding.
423  - The word in the 7-bit encoding `I.labels.size()`.
424  - The array of individual labels, each written as a word in the 7-bit encoding.
425 
426 An important propery of the serialisation is that each serialised `::irept`
427 instance occurs in the stream exactly once. Namely, in the position of
428 its first serialisation query. All other such queries save only a hash
429 code (i.e. reference) of the `::irept` instance.
430 
431 A similar strategy is used for serialisation of string constants
432 shared amongst `::irept` instances. Such a string is fully saved only in
433 the first serialisation query of an `::irept` instance it appears in and
434 all other queries only save its integer hash code.
435 
436 Details about serialisation of `::irept` instances, strings, and words in
437 7-bit encoding can be found [here](\ref irep-serialization).
438 
439 \subsection subsection-goto-binary-deserialisation Deserialisation
440 
441 The deserialisation is implemented in C++ modules:
442  - `read_goto_binary.h`
443  - `read_goto_binary.cpp`
444  - `read_bin_goto_object.h`
445  - `read_bin_goto_object.cpp`
446 
447 The first two modules are responsible for location of the stream with the
448 serialised data within a passed file. And the remaining two modules
449 perform the actual deserialisation of a `::goto_modelt` instance from
450 the located stream.
451 
452 To deserialise a `::goto_modelt` instance `gm` from a file
453 `/some/path/name.gbf` call the function `::read_goto_binary`, e.g.
454 `read_goto_binary("/some/path/name.gbf", gm, message_handler)`, where
455 `message_handler` must be an instance of `::message_handlert` and serves
456 for reporting issues during the process.
457 
458 The passed binary file is assumed to have the same structure as described in
459 the [previous subsection](\ref subsection-goto-binary-serialisation).
460 The process of the deserialisation does not involve any seeking in the file.
461 The content is read linearly from the beginning to the end. `::irept` instances
462 and their string constants are deserialised into the memory only once at their
463 first occurrences in the stream. All subsequent deserialisation queries are
464 resolved as in-memory references to already deserialised the `::irept`
465 instances and/or strings, based on hash code matching.
466 
467 NOTE: The first deserialisation is detected so that the loaded hash code
468 is new. That implies that the full definition follows right after the hash.
469 
470 Details about serialisation of `::irept` instances, strings, and words in
471 7-bit encoding can be found [here](\ref irep-serialization).
472 
473 \subsubsection subsection-goto-binary-deserialisation-from-elf Deserialisation from ELF image
474 
475 One can decide to store the serialised stream as a separate section, named
476 `goto-cc`, into an ELF image. Then the deserialisation has a support of
477 automatic detection of that section in an ELF file and the deserialisation
478 will be automatically started from that section.
479 
480 For reading the ELF images there is used an instance of `::elf_readert`
481 implemented in the C++ module:
482  - `elf_reader.h`
483  - `elf_reader.cpp`
484 
485 \subsubsection subsection-goto-binary-deserialisation-from-mach-o-fat-image Deserialisation from Mach-O fat image
486 
487 One can decide to store the serialised stream into Mach-O fat image as a
488 separate non-empty section with flags `CPU_TYPE_HPPA` and
489 `CPU_SUBTYPE_HPPA_7100LC`. Then the deserialisation has a support of
490 automatic detection of that section in a Mach-O fat image, extraction
491 of the section from the emage into a temporary file (this is done by
492 calling `lipo` utility with `-thin hppa7100LC` option), and the
493 deserialisation will be automatically started from that temporary
494 file.
495 
496 For reading the Mach-O fat images there is used an instance of
497 `::osx_fat_readert` implemented in the C++ module:
498  - `osx_fat_reader.h`
499  - `osx_fat_reader.cpp`
500 
501 NOTE: This functionality is available only when the modules are built
502 on a MacOS machine.
503 
504 \subsubsection subsection-goto-binary-is-binary-file Checking file type
505 
506 You can use function `::is_goto_binary` to check whether a passed file contains
507 a deserialised `::goto_modelt` instance or not. This is done by checking the
508 magic number of the stream (see subsection
509 [Serialisation](\ref subsection-goto-binary-serialisation)). However, in the
510 case when the deserialised data were stored into ELF or Mach-O fat image, then
511 only the check for presence of the concrete section in the image is performed.
512 
513 \subsubsection subsection-goto-binary-deserialisation-linking Linking Goto Models
514 
515 Similar to linking of object files together by C/C++ linker, the module
516 provides linking of a dereserialised `::goto_modelt` instance into a given
517 (i.e. previously deserialised or otherwise created) `::goto_modelt` instance.
518 
519 This is implemented in function `::read_object_and_link`. The function first
520 deserialises the passed file into a temporary `::goto_modelt` instance, and
521 then it performs 'linking' of the temporary into a passed destination
522 `::goto_modelt` instance.
523 
524 Details about linking of `::goto_modelt` instances can be found
525 [here](\ref section-linking-goto-models).
526 
527 
528 \section section-linking-goto-models Linking Goto Models
529 
530 C++ modules:
531  - `link_goto_model.h`
532  - `link_goto_model.cpp`
533 
534 Dependencies:
535  - [linking folder](\ref linking).
536  - [typecheck](\ref section-goto-typecheck).