6 \author Martin Brain, Owen Jones, Chris Smowton
8 \section util_data_structures Data Structures
10 \ref util contains some of the key data-structures used in the
13 \subsection irept_section irept: a general-purpose polymorphic tree
15 See detailed documentation at \ref irept.
17 [irept](\ref irept)s are generic tree nodes. You
18 should think of each node as holding a single string ([data](\ref irept::data),
19 actually an \ref irep_idt) and lots of child nodes, some of which are numbered
20 ([sub](\ref irept::dt::sub)) and some of which are labelled, and the label
21 can either start with a “\#” ([comments](\ref irept::dt::comments)) or without
22 one ([named_sub](\ref irept::dt::named_sub)). The meaning of the “\#” is that
23 this child shouldn't be counted when comparing two [irept](\ref irept)s for
24 equality; this is usually used when making an advisory annotation which does
25 not alter the semantics of the program.
27 They are used to represent many kinds of structured objects throughout the
28 CPROVER codebase, such as expressions, types and code. An \ref exprt represents
29 expressions, including for example \ref equal_exprt, an equality predicate, or
30 \ref dereference_exprt, which represents the `*` unary dereference operator
31 found in C. A \ref typet represents a type, and may have other \ref typet nodes
32 as children: for example, \ref array_typet represents an array, and has a single
33 numbered child that gives the type of the array elements. Finally, \ref codet
34 represents imperative statements, such as \ref code_assignt, which represents an
35 imperative assignment. It has two numbered operands, its left- and right-hand
38 Note that \ref exprt, \ref codet and similar classes deriving from \ref irept
39 do so in order to add constraints to the general tree (for example, providing
40 accessor methods with user-friendly names, or enforcing invariants that a node
41 must have a certain number of children), but do not override virtual methods or
44 The implementation of \ref irept allows saving memory by sharing instances of
45 its internal storage using a
46 [copy-on-write](https://en.wikipedia.org/wiki/Copy-on-write) scheme. For
47 example, the statement `irep1.sub()[0] = irep2;` will share rather than copy
48 `irep2` and its children, saving memory unless either irept is subsequently
49 modified, at which point a copy is made.
51 \subsection irept_discussion_section Discussion
53 Many other compiler infrastructures represent a polymorphic tree using nodes
54 specialised to a particular expression type: for example, perhaps a binary
55 addition operator could be represented using a tag plus two pointers to child
56 expressions, rather than allocating a whole irept (including a variable-length
57 expression vector and empty maps for storing named subexpressions). This may
58 save memory, but inhibits ad-hoc annotations such as tagging the addition
59 "does not overflow" without adding that field to addition operations globally
60 or maintaining a shadow data structure to track that information. This is easier
61 with a general irept structure that can store an arbitrary number of
62 arbitrarily-named child nodes.
64 Another optimisation commonly seen when storing polymorphic trees is to use a
65 uniform node data structure (like irept) but to keep the node compact, for
66 example storing at most four pointers and transparently allocating extension
67 nodes when necessary for an unusually large expression. This provides the best
68 of both worlds, obtaining compact storage for common cases such as unannotated
69 binary expressions while permitting deviation at times. The downside is that
70 implementing such a system is more complex than straightforwardly using C++
71 standard data structures as in irept.
73 \subsection irep_idt_section Strings: dstringt, the string_container and the ID_*
75 Within cbmc, strings are represented using \ref irep_idt, or \ref irep_namet
76 for keys to [named_sub](\ref irept::dt::named_sub) or
77 [comments](\ref irept::dt::comments). By default these are both
78 typedefed to \ref dstringt. For debugging purposes you can set `USE_STD_STRING`,
79 in which case they are both typedefed to `std::string`. You can also easily
80 convert an [irep_idt](\ref irep_idt) or [irep_namet](\ref irep_namet) to a
81 `std::string` using the [id2string](\ref id2string) or
82 [name2string](\ref name2string) function, respectively, or either of them to a
83 `char*` using the [c_str()](\ref dstringt::c_str) member function.
85 \ref dstringt stores a string as an index into a large
86 static table of strings. This makes it easy to compare if two
87 [irep_idt](\ref irep_idt)s are equal (just compare the index) and it makes it
88 efficient to store many copies of the same string. The static list of strings
89 is initially populated from `irep_ids.def`, so for example the fourth entry
90 in `irep_ids.def` is `“IREP_ID_ONE(type)”`, so the string “type” has index 3.
91 You can refer to this \ref irep_idt as `ID_type`. The other kind of line you
92 see is \c "IREP_ID_TWO(C_source_location, #source_location)", which means the
93 \ref irep_idt for the string “\#source_location” can be referred to as
94 `ID_C_source_location`. The “C” is for comment, meaning that it should be
95 stored in the [comments](\ref irept::dt::comments). Any strings that need
96 to be stored as [irep_idt](\ref irep_idt)s which aren't in `irep_ids.def`
97 are added to the end of the table when they are first encountered, and the
98 same index is used for all instances.
100 See documentation at \ref dstringt.
102 \subsection typet_section typet
104 \ref typet represents the type of an expression. Types may have subtypes,
105 stored in two [sub](\ref irept::dt::sub)s named “subtype” (a single type) and
106 “subtypes” (a vector of types). For pre-defined types see `std_types.h` and
107 `mathematical_types.h`.
109 \subsubsection symbol_typet_section symbol_typet
111 \ref symbol_typet is a type used to store a reference to the symbol table. The
112 full \ref symbolt can be retrieved using the identifier of \ref symbol_typet.
114 \subsection exprt_section exprt
116 \ref exprt is the class to represent an expression. It inherits from \ref
117 irept, and the only things it adds to it are that (1) every \ref exprt has
118 an entry in [named_sub](\ref irept::dt::named_sub) containing its type and
119 (2) everything in the [sub](\ref irept::dt::sub) of an \ref exprt is again an
120 \ref exprt, not just an \ref irept. You can think of \ref exprt as a node in
121 the abstract syntax tree for an expression. There are many classes that
122 inherit from \ref exprt and which represent more specific things. For
123 example, there is \ref minus_exprt, which has a [sub](\ref irept::dt::sub)
124 of size 2 (for the two arguments of minus).
126 Recall that every \ref irept has one piece of data of its own, i.e. its
127 [id()](\ref irept::id()), and all other information is in its
128 [named_sub](\ref irept::dt::named_sub), [comments](\ref irept::dt::comments)
129 or [sub](\ref irept::dt::sub). For [exprt](\ref exprt)s, the
130 [id()](\ref irept::id()) is used to specify why kind of \ref exprt this is,
131 so a \ref minus_exprt has `ID_minus` as its [id()](\ref irept::id()). This
132 means that a \ref minus_exprt can be passed wherever an \ref exprt is
133 expected, and if you want to check if the expression you are looking at is
134 a minus expression then you have to check its [id()](\ref irept::id()) (or use
135 [can_cast_expr](\ref can_cast_expr)`<minus_exprt>`).
137 \subsection codet_section codet
139 See documentation at \ref codet.
141 \ref exprt represents expressions and \ref codet represents statements.
142 \ref codet inherits from \ref exprt, so all [codet](\ref codet)s are
143 [exprt](\ref exprt)s, with [id()](\ref irept::id()) `ID_code`.
144 Many different kinds of statements inherit from \ref codet, and they are
145 distinguished by their [statement](\ref codet::get_statement()). For example,
146 a while loop would be represented by a \ref code_whilet, which has
147 [statement](\ref codet::get_statement()) `ID_while`. \ref code_whilet has
148 two operands in its [sub](\ref irept::dt::sub), and helper functions to make
149 it easier to use: [cond()](\ref code_whilet::cond()) returns the first
150 subexpression, which is the condition for the while loop, and
151 [body()](\ref code_whilet::body()) returns the second subexpression, which
152 is the body of the while loop - this has to be a \ref codet, because the body
153 of a while loop is a statement.
155 \subsection symbolt_section symbolt, symbol_tablet, and namespacet
157 A symbol table is a mapping from symbol names to \ref symbolt objects, which
158 store a symbol's name, attributes, type and perhaps value. They are used to
159 describe local and global variables, type definitions and function prototypes
162 All symbols store a type (an instance of \ref typet). For function or method
163 symbols these are \ref code_typet instances.
165 Global variable symbols may have a value (an \ref exprt), in which case it is
166 used to initialise the global.
168 Method or function symbols may also have a value, in which case it is a
169 \ref codet and gives the function definition. A method or function symbol
170 without a value is a prototype (for example, it might be an `extern` declaration
171 in C). A function symbol that has been converted to a GOTO function *may* be
172 replaced with a special "compiled" value, but this varies from driver program to
173 program -- at the time of writing, only \ref goto-cc does this.
175 Local variables' symbol values are always ignored;
176 any initialiser must be explicitly assigned after they are instantiated by a
177 declaration (\ref code_declt).
179 Symbol expressions (\ref symbol_exprt) and types (\ref symbol_typet) refer to
180 symbols stored in a symbol table. Symbol expressions can be thought of as
181 referring to the table for more detail about a symbol (for example, is it a
182 local or a global variable, or perhaps a function?), and have a type which must
183 match the type given in the symbol table. Symbol types can be thought of as
184 shorthands or aliases for a type given in more detail in the symbol table, for
185 example permitting a shorthand for a large structure type, as well as permitting
186 the construction of expressions referring to that type before its full
189 Note the implementation of \ref symbol_tablet is split into a base interface
190 (\ref symbol_table_baset) and an implementation (\ref symbol_tablet). There is
191 one alternate implementation (\ref journalling_symbol_tablet) which additionally
192 maintains a log or journal of symbol creation, modification and deletions.
194 Namespaces (\ref namespacet) provide a read-only view on one or more symbol
195 tables, and provide helper functions that aid accessing the table. A namespace
196 may layer one or more symbol tables, in which case any lookup operation checks
197 the 'top' symbol table before moving down the layers towards the 'bottom' symbol
198 table, looking up the target symbol name in each successive table until one is
199 found. Note class \ref multi_namespacet can layer arbitrary numbers of symbol
200 tables, while for historical reasons \ref namespacet can layer up to two.
202 The namespace wrapper class also provides the \ref namespacet::follow
203 operation, which dereferences a `symbol_typet` to retrieve the type it refers
204 to, including following a symbol_typet which refers to another symbol which
205 eventually refers to a 'real' type.
207 \subsubsection symbolt_lifetime Symbol lifetimes
209 Symbols with \ref symbolt::is_static_lifetime set are initialised before a
210 program's entry-point is called and live until it ends. Such long-lived
211 variables are used to implement globals, but also C's procedure-local static
212 variables, which have restricted visiblity but the lifetime of a global.
213 They may be marked dead using a \ref code_deadt instruction, but this does not
214 render the variable inaccessible, it only indicates that the variable's current
215 value will not be read without an intervening write.
217 Non-type, non-global symbols (those with \ref symbolt::is_static_lifetime and
218 \ref symbolt::is_type not set) are local variables, and their lifespan
219 description varies depending on context.
221 In symbol table function definitions (the values of function-typed symbols),
222 local variables are created using a \ref code_declt instruction, and live until
223 their enclosing \ref code_blockt ends (similar to the C idiom
224 `{ int x; ... } // x lifespan ends`). By contrast in GOTO programs locals are
225 declared using a DECL instruction and live until a DEAD instruction referring
226 to the same symbol. Explicit DEAD instructions are always used rather than
227 killing implicitly by exiting a function.
229 Multiple instances of the same local may be live at the same time by virtue of
230 recursive function calls; a dead instruction or scope end always targets the
231 most recently allocated instance.
233 Scoping rules are particular to source languages and are not enforced by
234 CPROVER. For example, in standard C it is not possible to refer to a local
235 variable across functions without using a pointer, but in some possible source
236 languages this is permitted.
238 \subsubsection symbolt_details Symbol details
241 * A mode, which indicates the source language frontend responsible for creating
242 them. This is mainly used in pretty-printing the symbol table, to indicate
243 the appropriate language frontend to use rendering the symbol's value and/or
244 type. For example, mode == ID_C == "C" indicates that \ref ansi_c_languaget,
245 the C front-end, should be used to pretty-print, which in turn delegates to
247 * A base-name and pretty-name, which are a short and user-friendly version of
248 the symbol's definitive name respectively.
249 * Several flags (see \ref symbolt for full details), including
250 \ref symbolt::is_static_lifetime (is this a global variable symbol?),
251 \ref symbolt::is_type (is this a type definition),
252 \ref symbolt::is_thread_local (of a variable, are there implicitly instances
253 of this variable per-thread?).
263 1 [shape=none, label=""];
264 2 [label="command line parsing"];
265 3 [shape=none, label=""];
266 1 -> 2 [label="C files or goto-binaries"];
267 2 -> 3 [label="Command line options, file names"];
271 \subsection ast-examples-section Examples: how to represent the AST of statements, in C and in java
273 \subsubsection ast-example-1-section x = y + 123
277 \subsubsection ast-example-2-section if (x > 10) { y = 2 } else { v[3] = 4 }
281 \subsubsection ast-example-3-section Java arrays: struct Array { int length, int *data };
285 \subsection section-goto-typecheck Goto Model Typecheck
289 \subsection irep-serialization `irept` Serialization
291 The module provides serialisation and deserialisation of
292 integer numbers, strings, and `irept` instances.
294 This is implemented in C++ modules:
295 - `irep_serialization.h`
296 - `irep_serialization.cpp`
297 - `irep_hash_container.h` (applies only to `irept` instances)
298 - `irep_hash_container.cpp` (applies only to `irept` instances)
300 \subsubsection irep-serialization-numbers Serialization of Numbers
302 A number is serialiased in 7-bit encoding. For example, given a 2-byte
303 number in base 2, like `10101010 01010101`, then it is saves in 3 bytes,
304 where each byte takes only 7 bits from the number, reading from the
305 left. The 8th bit in each output byte is set to 1 except in the last
306 byte, where the bit is 0. That 0 bit indicates the end of the
307 encoding of the number. So, the output bytes look like this:
308 `11010101 11010100 00000010`.
310 This is implmented in the function `::write_gb_word`.
312 The deserialisation does the oposite process and it is implemented in
313 `irep_serializationt::read_gb_word`.
315 \subsubsection irep-serialization-strings Serialization of Strings
317 A string is encoded as classic 0-terminated C string. However,
318 characters `0` and `\\` are escaped by writing additional `\\`
321 This is implmented in the function `::write_gb_string` and the
322 deserialisation is implemented in `irep_serializationt::read_gb_string`.
324 Each string which is stored inside an `::irept` instance is saved (meaining
325 its characters) into the ouptut stream, only in the first serialisation
326 query of the string. In that case, before the string there is also saved
327 a computed integer hash code of the string. Then, all subsequent
328 serialisation queries save only that integer hash code of the string.
330 \subsubsection irep-serialization-ireps Serialization of `irept` Instances
333 \subsection CProver-output CProver output - printing.
335 CProver output can be in plain text, json or xml format.
336 All of CProver output should use the built-in messaging infrastructure,
337 which filters messages by 'verbosity'. Default verbosity filtering is set
338 to the maximum level (10), i.e. all messages are printed. Error messages
339 have the lowest level of verbosity (level 1) while debug messages have
340 the highest level (level 10). Intermediate levels (in order of
341 increasing 'verbosity') are for warnings, results, status/phase information,
342 statistical information and progress information
343 (more information on these verbosity levels can be found in
344 \ref messaget in the `message.h` header).
346 Key classes related to the messaging infrastructure can be found in
347 the `message.h` header. \ref messaget provides messages (with a built-in
348 verbosity level), which are then processed by subclasses
349 of \ref message_handlert. These filter messages according to verbosity
350 level and direct output to an appropriate location.
351 An important group of subclasses is \ref stream_message_handlert and
352 its subtypes, which direct messages to an output stream
353 (`std::cout` & `std::cerr`, `std::clog` is not used). In particular,
354 messages of verbosity level less than 3 are directed to `std::cerr`,
355 while others go to `std::cout` (this may change, but you should be
356 aware that not all messages are necessarily directed to only
357 `std::cout` or `std::cerr`).
358 Another key \ref message_handlert subclass is
359 ui_message_handlert - which provides output in plain text,
362 \subsubsection CProver-legacy-output CProver legacy output
363 Although all output should use the messaging interface - there are a
364 few locations where this is not yet implemented. These should
365 _not_ be extended - but it may be helpful to be aware of where this happens.
367 * Output from invariants / exceptions
369 Invariants output to `std::cerr` - and provide a backtrace and optionally
370 some diagnostic information.
371 For more information on invariants, see `invariant.h`
373 Exceptions have a standard `what()` interface.
374 Best current practice for exceptions is for the output of `what()` to be
375 routed via a message with verbosity level 1 (as returned by
376 `messaget::error()`). The message is then processed by a message_handler.
377 Where plain text output (versus json or xml output) is
378 chosen, exceptions print (indirectly) to `std::cerr`. Json and xml
379 output always goes to `std::cout`. There are still a few locations where
380 exceptions print directly to std::cerr. These should _not_ be extended.
381 More information on exceptions can be found in `exception_utils.h`.
383 * Direct output via `std::cout` & `std::cerr`
385 These are in the process of being removed - no new output should go
386 via `std::cout` or `std::cerr`, but should instead use the
387 \ref messaget and \ref message_handlert infrastructure.