2 \defgroup goto-programs goto-programs
6 \author Kareem Khazem, Martin Brain
8 \section goto-programs-overview Overview
9 Goto programs are the intermediate representation of the CPROVER tool
10 chain. They are language independent and similar to many of the compiler
11 intermediate languages. Each function is a
12 list of instructions, each of which has a type (one of 19 kinds of
13 instruction), a code expression, a guard expression and potentially some
14 targets for the next instruction. They are not natively in static
15 single-assign (SSA) form. Transitions are nondeterministic (although in
16 practise the guards on the transitions normally cover form a disjoint
17 cover of all possibilities). Local variables have non-deterministic
18 values if they are not initialised. Goto programs can be serialised
19 in a binary (wrapped in ELF headers) format or in XML (see the various
20 `_serialization` files).
22 The `cbmc` option `--show-goto-programs` is often a good starting point
23 as it outputs goto-programs in a human readable form. However there are
24 a few things to be aware of. Functions have an internal name (for
25 example `c::f00`) and a ‘pretty name’ (for example `f00`) and which is
26 used depends on whether it is internal or being presented to the user.
27 `NONDET(some_type)` is use to represent a nondeterministic value.
28 `IF guard GOTO x` represents a GOTO instruction with a guard, not a distinct
30 The comment lines are generated from the `source_location` field
31 of the `instructiont` structure.
33 GOTO programs are commonly produced using the `initialize_goto_model` function,
34 which combines the complete process from command-line arguments specifying
35 source code files, through parsing and generating a symbol table, to finally
36 producing GOTO functions.
38 Alternatively `lazy_goto_modelt` can be used, which facilitates producing GOTO
39 functions on demand. See \ref section-lazy-conversion for more details on this
42 \section goto_data_structures Data Structures
44 A \ref goto_functionst object contains a set of GOTO programs. Note the
45 counter-intuitive naming: `goto_functionst` instances are the top level
46 structure representing the program and contain `goto_programt` instances
47 which represent the individual functions.
49 An instance of \ref goto_programt is effectively a list of
50 instructions (a nested class called \ref goto_programt::instructiont).
52 to use the copy and insertion functions that are provided as iterators
53 are used to link instructions to their predecessors and targets and
54 careless manipulation of the list could break these.
56 Individual instructions are instances of type \ref goto_programt::instructiont.
57 They represent one step in the function. Each has a type, an instance of
58 \ref goto_program_instruction_typet which denotes what kind of instruction
59 it is. They can be computational (such as `ASSIGN` or `FUNCTION_CALL`),
60 logical (such as `ASSUME` and `ASSERT`) or informational (such as
61 `LOCATION` and `DEAD`). At the time of writing there are 19 possible
62 values for `goto_program_instruction_typet` / kinds of instruction.
63 Instructions also have a guard field (the condition under which it is
64 executed) and a code field (what the instruction does). These may be
65 empty depending on the kind of instruction.
66 These are of type \ref exprt and \ref codet respectively.
67 The next instructions (remembering that transitions may be
68 non-deterministic) are given by the list `targets` (with the
69 corresponding list of labels `labels`) and the corresponding set of
70 previous instructions is get by `incoming_edges`. Finally `instructiont`
71 has informational `function` and `source_location` fields that indicate where
72 they are in the source code.
74 \section goto-conversion Goto Conversion
76 In the \ref goto-programs directory.
81 * \ref goto_programt::instructiont
87 1 [shape=none, label=""];
88 2 [label="goto conversion"];
89 3 [shape=none, label=""];
90 1 -> 2 [label="Symbol table"];
91 2 -> 3 [label="goto-programs, goto-functions, symbol table"];
95 At this stage, CBMC constructs a goto-program from a symbol table.
96 Each symbol in the symbol table with function type (that is, the symbol's `type`
97 field contains a \ref code_typet) will be converted to a corresponding GOTO
98 program. The parse tree and source file are not used at all for this step.
100 The conversion happens in two phases:
102 1. Function `goto_convertt::convert` turns each `codet` in the symbol table
103 into corresponding GOTO instructions.
104 2. `goto_convertt::finish_gotos` and others populate the `GOTO` and `CATCH`
105 instructions' `targets` members, pointing to their possible successors.
106 `DEAD` instructions are also added when GOTO instructions branch out of one
107 or more lexical blocks.
109 Here is an example of a simple C program being converted into GOTO code (note
110 the `targets` member of each instruction isn't populated yet):
114 graph [nojustify=true];
118 subgraph cluster_src {
119 1 [shape="none", label="source files"];
120 2 [label="file1.c\n-----------\nint main(){
126 1 -> 2 [color=white];
128 100 [label="file2.c\n--------\nchar bar(){ }"];
129 2 -> 100 [color=white];
132 1 -> 3 [label="corresponds to", lhead=cluster_goto,
135 subgraph cluster_goto {
136 3 [label="a\ngoto_functionst", URL="\ref goto_functionst", shape=none];
137 4 [label="function_map\n(a map from irep_idt\nto goto_programt)"];
139 4 -> 5 [lhead=cluster_funmap, label="value:"];
141 subgraph cluster_funmap {
142 9 [label="bar\n(an irep_idt)", URL="\ref irep_idt"];
143 10 [label="a goto_programt", URL="\ref goto_programt"];
144 9->10 [label="maps to"];
146 5 [label="main\n(an irep_idt)", URL="\ref irep_idt"];
148 7 [label="foo\n(an irep_idt)", URL="\ref irep_idt"];
149 8 [label="a goto_programt", URL="\ref goto_programt"];
150 7->8 [label="maps to"];
152 subgraph cluster_goto_program {
153 11 [shape=none, label="a\ngoto_programt", URL="\ref goto_programt"];
154 12 [label="instructions\n(a list of instructiont)"];
156 5 -> 11 [lhead=cluster_goto_program, label="maps to:"];
160 12 -> target1 [lhead=cluster_instructions];
162 subgraph cluster_instructions {
163 subgraph cluster_ins1{
164 code1 [label="code", URL="\ref codet"];
165 target1 [label="target"];
168 target1 -> target2 [color=white,lhead=cluster_ins2,
171 subgraph cluster_ins2{
172 code2 [label="code", URL="\ref codet"];
173 target2 [label="target"];
176 target2 -> target3 [color=white,lhead=cluster_ins3,
179 subgraph cluster_ins3{
180 code3 [label="code", URL="\ref codet"];
181 target3 [label="target"];
184 target3 -> target4 [color=white,lhead=cluster_ins4,
187 subgraph cluster_ins4{
188 code4 [label="code", URL="\ref codet"];
189 target4 [label="target"];
194 subgraph cluster_decl {
195 decl1 [label="type:\ncode_declt", URL="\ref code_declt",
197 subgraph cluster_decl_in{
198 cluster_decl_in_1 [label="symbol()", shape=none];
199 cluster_decl_in_2 [label="x"];
200 cluster_decl_in_1 -> cluster_decl_in_2 [color=white];
202 decl1 -> cluster_decl_in_1 [lhead="cluster_decl_in", color=white];
204 code1 -> decl1 [lhead=cluster_decl];
206 subgraph cluster_assign1 {
207 assign1 [label="type:\ncode_assignt", URL="\ref code_assignt",
209 subgraph cluster_assign1_in{
210 cluster_assign1_in_1 [label="lhs()", shape=none];
211 cluster_assign1_in_2 [label="x"];
212 cluster_assign1_in_1 -> cluster_assign1_in_2 [color=white];
214 cluster_assign1_in_3 [label="rhs()", shape=none];
215 cluster_assign1_in_4 [label="5"];
216 cluster_assign1_in_3 -> cluster_assign1_in_4 [color=white];
218 assign1 -> cluster_assign1_in_1 [lhead="cluster_assign1_in", color=white];
220 code2 -> assign1 [lhead=cluster_assign1];
222 subgraph cluster_if {
223 if [label="type:\ncode_ifthenelset", URL="\ref code_ifthenelset",
225 if_body [label="..."];
226 if -> if_body [color=white];
228 code3 -> if [lhead=cluster_if];
230 subgraph cluster_assign2 {
231 assign2 [label="type:\ncode_assignt", URL="\ref code_assignt",
233 subgraph cluster_assign2_in{
234 cluster_assign2_in_1 [label="lhs()", shape=none];
235 cluster_assign2_in_2 [label="x"];
236 cluster_assign2_in_1 -> cluster_assign2_in_2 [color=white];
238 cluster_assign2_in_3 [label="rhs()", shape=none];
239 cluster_assign2_in_4 [label="9"];
240 cluster_assign2_in_3 -> cluster_assign2_in_4 [color=white];
242 assign2 -> cluster_assign2_in_1 [lhead="cluster_assign2_in", color=white];
244 code4 -> assign2 [lhead=cluster_assign2];
249 The `codet` representation of method bodies that is stored in the symbol table
250 resembles C code quite strongly: it uses lexical variable scoping and has
251 complex control-flow constructs such as `code_whilet` and `code_ifthenelset`.
252 In converting to a GOTO program this structure is changed in several ways:
254 * Variable lifespan implied by \ref code_declt instructions and lexical scopes
255 described by \ref code_blockt nodes is replaced by `DECL` and corresponding
256 `DEAD` instructions. There are therefore no lexical scopes in GOTO programs
257 (not even local variable death on function exit is enforced).
259 * Expressions with side-effects are explicitly ordered so that there is one
260 effect per instruction (apart from function call instructions, which can
261 still have many). For example, `y = f() + x++` will have become something like
262 `tmp1 = f(); y = tmp1 + x; x = x + 1;`
264 * Compound blocks are eliminated. There are several subclasses of
265 \ref codet that count as 'compound blocks;' therefore, later stages in
266 the CBMC pipeline that switch over the codet subtype of a particular
267 instruction should not need to consider these types. In particular:
269 * code_ifthenelset is turned into GOTOs. In particular, the bodies of
270 the conditionals are flattened into lists of instructions, inline
271 with the rest of the instruction in the goto_programt. The guard of
272 the conditional is placed onto the
273 \ref goto_programt::instructiont::guard "guard" member of
274 an instruction whose code member is of type \ref code_gotot. The
275 \ref goto_programt::instructiont::targets "targets" member
276 of that instruction points to the appropriate branch of the
277 conditional. (Note that although instructions have a list of
278 targets, in practice an instruction should only ever have at most
279 one target; you should check this invariant with an assertion if you
282 The order of instructions in a list of instructions---as well as the
283 targets of GOTOs---are both displayed as arrows when viewing a
284 goto-program as a Graphviz DOT file with `goto-instrument --dot`.
285 The semantics of a goto-program is: the next instruction is the next
286 instruction in the list, unless the current instruction has a
287 target; in that case, check the guard of the instruction, and jump
288 to the target if the guard evaluates to true.
290 * switch statements, for and while loops, and try-catches are also
291 transformed into lists of instructions guarded by GOTOs.
293 * \ref code_blockt "code blocks" are transformed into lists of
296 * \ref code_returnt "return statements" are transformed into
297 (unconditional) GOTOs whose target is the \ref END_FUNCTION
298 instruction. Each goto_programt should have precisely one such
301 \section section-goto-transforms Subsequent Transformation
303 It is normal for a program that calls `goto_convert` to immediately pass the
304 GOTO functions through one or more transformations. For example, the ubiquitous
305 `remove_returns` transformation turns the method return sequence generated by
306 `goto_convert` (something like `RETURN x; GOTO end;`) into
307 `function_name#return_value = x; GOTO end;`, and a callsite
308 `y = function_name();` into `function_name(); y = function_name#return_value;`.
310 Some subsequent parts of the CBMC pipeline assume that one or more of these
311 transforms has been applied, making them effectively mandatory in that case.
313 Common transformations include:
315 * Removing superfluous SKIP instructions (`remove_skip`)
316 * Converting indirect function calls into dispatch tables over direct calls
317 (`remove_virtual_functions` and `remove_function_pointers`)
318 * Adding checks for null-pointer dereferences, arithmetic overflow and other
319 undefined behaviour (`goto_check`)
320 * Adding code coverage assertions (`instrument_cover_goals`).
322 By convention these post-convert transformations are applied by a driver program
323 in a function named `process_goto_program` or `process_goto_functions`; examples
324 of these can be found in `cbmc_parse_optionst`, `goto_analyzer_parse_optionst`
325 and `goto_instrument_parse_optionst` among others.
327 \section section-lazy-conversion Lazy GOTO Conversion
329 The conventional source-code-to-GOTO-program pipeline is horizontally
330 structured: all source-file information is converted into a symbol table, then
331 each symbol-table method is converted into a GOTO program, then each transform
332 is run over the whole GOTO-model in turn. This is fine when the GOTO model
333 consumer will use most or all of the results of conversion, but if not then
334 lazy conversion may be beneficial.
336 Lazy conversion is coordinated by `lazy_goto_modelt`. Early source code parsing
337 proceeds as usual, but while the same symbol table symbols are created as using
338 `initialize_goto_model`, method bodies in the symbol table are not populated.
339 `lazy_goto_modelt::get_goto_function` then allows each function to be converted
340 on demand, passing individually through symbol-table `codet` production, GOTO
341 conversion and driver-program transformation.
343 `goto_symext` is able to use such a lazy GOTO model in place of the usual
344 `goto_modelt`, enabling it to create GOTO programs on demand, and thus avoid
345 spending time and memory generating GOTO programs that symex determines cannot
346 in fact be reached. This is particularly useful when the source program contains
347 many indirect calls (virtual function calls, or calls via function pointer), as
348 the `remove_virtual_functions` and `remove_function_pointers` passes treat these
349 very conservatively, producing large dispatch tables of possible callees which
350 symex often finds to be largely unreachable.
352 JBMC exhibits typical usage of `lazy_goto_modelt` when it is called with
353 `--symex-driven-lazy-loading`. It defines a function
354 `jbmc_parse_optionst::process_goto_function` which performs driver-program-
355 specific post-GOTO-convert transformation. The lazy GOTO model is then passed to
356 `goto_symext`, which will trigger symbol-table population, GOTO conversion
357 and post-processing as needed.
359 \section section-goto-binary Binary Representation
361 An instance of `::goto_modelt` can be serialised to a binary stream (which is
362 typically a file on the disk), and later deserialised from that stream back to
363 an equivalent `::goto_modelt` instance.
365 \subsection subsection-goto-binary-serialisation Serialisation
367 The serialisation is implemented in C++ modules:
368 - `write_goto_binary.h`
369 - `write_goto_binary.cpp`
371 To serialise a `::goto_modelt` instance `gm` to a stream `ostr` call the
372 function `::write_goto_binary`, e.g. `write_goto_binary(ostr, gm)`.
374 The content of the written stream will have this structure:
376 - A magic number: byte `0x7f` followed by 3 characters `GBF`.
377 - A version number written in the 7-bit encoding (see [number serialisation](\ref irep-serialization-numbers)). Currently, only version `4` is supported.
379 - The number of symbols in the table in the 7-bit encoding.
380 - The array of individual symbols in the table. Each written symbol `s` has this structure:
381 - The `::irept` instance `s.type`.
382 - The `::irept` instance `s.value`.
383 - The `::irept` instance `s.location`.
384 - The string `s.name`.
385 - The string `s.module`.
386 - The string `s.base_name`.
387 - The string `s.mode`.
388 - The string `s.pretty_name`.
389 - The number `0` in the 7-bit encoding.
390 - The flags word in the 7-bit encoding. The bits in the flags word correspond to the following `Boolean` fields (from the most significant bit):
401 - `0` (corresponding to `s.binding`, i.e. we always clear this info)
403 - `s.is_static_lifetime`
404 - `s.is_thread_local`
408 - The functions with bodies, i.e. those missing a body are skipped.
409 - The number of functions with bodies in the 7-bit encoding.
410 - The array of individual functions with bodies. Each written function has this structure:
411 - The string with the name of the function.
412 - The number of instructions in the body of the function in the 7-bit encoding.
413 - The array of individual instructions in function's body. Each written instruction `I` has this structure:
414 - The `::irept` instance `I.code`, i.e. data of the instruction, like arguments.
415 - The string `I.function`, i.e. the name of the function this instruction belongs to.
416 - The `::irept` instance `I.source_location`, i.e. the reference to the original source code (file, line).
417 - The word in the 7-bit encoding `I.type`, i.e. the op-code of the instruction.
418 - The `::irept` instance `I.guard`.
419 - The empty string (representing former `I.event`).
420 - The word in the 7-bit encoding `I.target_number`, i.e. the jump target to this instruction from other instructions.
421 - The word in the 7-bit encoding `I.targets.size()`, i.e. the count of jump targets from this instruction.
422 - The array of individual jump targets from this instruction, each written as a word in the 7-bit encoding.
423 - The word in the 7-bit encoding `I.labels.size()`.
424 - The array of individual labels, each written as a word in the 7-bit encoding.
426 An important propery of the serialisation is that each serialised `::irept`
427 instance occurs in the stream exactly once. Namely, in the position of
428 its first serialisation query. All other such queries save only a hash
429 code (i.e. reference) of the `::irept` instance.
431 A similar strategy is used for serialisation of string constants
432 shared amongst `::irept` instances. Such a string is fully saved only in
433 the first serialisation query of an `::irept` instance it appears in and
434 all other queries only save its integer hash code.
436 Details about serialisation of `::irept` instances, strings, and words in
437 7-bit encoding can be found [here](\ref irep-serialization).
439 \subsection subsection-goto-binary-deserialisation Deserialisation
441 The deserialisation is implemented in C++ modules:
442 - `read_goto_binary.h`
443 - `read_goto_binary.cpp`
444 - `read_bin_goto_object.h`
445 - `read_bin_goto_object.cpp`
447 The first two modules are responsible for location of the stream with the
448 serialised data within a passed file. And the remaining two modules
449 perform the actual deserialisation of a `::goto_modelt` instance from
452 To deserialise a `::goto_modelt` instance `gm` from a file
453 `/some/path/name.gbf` call the function `::read_goto_binary`, e.g.
454 `read_goto_binary("/some/path/name.gbf", gm, message_handler)`, where
455 `message_handler` must be an instance of `::message_handlert` and serves
456 for reporting issues during the process.
458 The passed binary file is assumed to have the same structure as described in
459 the [previous subsection](\ref subsection-goto-binary-serialisation).
460 The process of the deserialisation does not involve any seeking in the file.
461 The content is read linearly from the beginning to the end. `::irept` instances
462 and their string constants are deserialised into the memory only once at their
463 first occurrences in the stream. All subsequent deserialisation queries are
464 resolved as in-memory references to already deserialised the `::irept`
465 instances and/or strings, based on hash code matching.
467 NOTE: The first deserialisation is detected so that the loaded hash code
468 is new. That implies that the full definition follows right after the hash.
470 Details about serialisation of `::irept` instances, strings, and words in
471 7-bit encoding can be found [here](\ref irep-serialization).
473 \subsubsection subsection-goto-binary-deserialisation-from-elf Deserialisation from ELF image
475 One can decide to store the serialised stream as a separate section, named
476 `goto-cc`, into an ELF image. Then the deserialisation has a support of
477 automatic detection of that section in an ELF file and the deserialisation
478 will be automatically started from that section.
480 For reading the ELF images there is used an instance of `::elf_readert`
481 implemented in the C++ module:
485 \subsubsection subsection-goto-binary-deserialisation-from-mach-o-fat-image Deserialisation from Mach-O fat image
487 One can decide to store the serialised stream into Mach-O fat image as a
488 separate non-empty section with flags `CPU_TYPE_HPPA` and
489 `CPU_SUBTYPE_HPPA_7100LC`. Then the deserialisation has a support of
490 automatic detection of that section in a Mach-O fat image, extraction
491 of the section from the emage into a temporary file (this is done by
492 calling `lipo` utility with `-thin hppa7100LC` option), and the
493 deserialisation will be automatically started from that temporary
496 For reading the Mach-O fat images there is used an instance of
497 `::osx_fat_readert` implemented in the C++ module:
499 - `osx_fat_reader.cpp`
501 NOTE: This functionality is available only when the modules are built
504 \subsubsection subsection-goto-binary-is-binary-file Checking file type
506 You can use function `::is_goto_binary` to check whether a passed file contains
507 a deserialised `::goto_modelt` instance or not. This is done by checking the
508 magic number of the stream (see subsection
509 [Serialisation](\ref subsection-goto-binary-serialisation)). However, in the
510 case when the deserialised data were stored into ELF or Mach-O fat image, then
511 only the check for presence of the concrete section in the image is performed.
513 \subsubsection subsection-goto-binary-deserialisation-linking Linking Goto Models
515 Similar to linking of object files together by C/C++ linker, the module
516 provides linking of a dereserialised `::goto_modelt` instance into a given
517 (i.e. previously deserialised or otherwise created) `::goto_modelt` instance.
519 This is implemented in function `::read_object_and_link`. The function first
520 deserialises the passed file into a temporary `::goto_modelt` instance, and
521 then it performs 'linking' of the temporary into a passed destination
522 `::goto_modelt` instance.
524 Details about linking of `::goto_modelt` instances can be found
525 [here](\ref section-linking-goto-models).
528 \section section-linking-goto-models Linking Goto Models
531 - `link_goto_model.h`
532 - `link_goto_model.cpp`
535 - [linking folder](\ref linking).
536 - [typecheck](\ref section-goto-typecheck).