cprover
util/README.md
Go to the documentation of this file.
1 \ingroup module_hidden
2 \defgroup util util
3 
4 # Folder util
5 
6 \author Martin Brain, Owen Jones, Chris Smowton
7 
8 \section util_data_structures Data Structures
9 
10 \ref util contains some of the key data-structures used in the
11 CPROVER codebase.
12 
13 \subsection irept_section irept: a general-purpose polymorphic tree
14 
15 See detailed documentation at \ref irept.
16 
17 [irept](\ref irept)s are generic tree nodes. You
18 should think of each node as holding a single string ([data](\ref irept::data),
19 actually an \ref irep_idt) and lots of child nodes, some of which are numbered
20 ([sub](\ref irept::dt::sub)) and some of which are labelled, and the label
21 can either start with a “\#” ([comments](\ref irept::dt::comments)) or without
22 one ([named_sub](\ref irept::dt::named_sub)). The meaning of the “\#” is that
23 this child shouldn't be counted when comparing two [irept](\ref irept)s for
24 equality; this is usually used when making an advisory annotation which does
25 not alter the semantics of the program.
26 
27 They are used to represent many kinds of structured objects throughout the
28 CPROVER codebase, such as expressions, types and code. An \ref exprt represents
29 expressions, including for example \ref equal_exprt, an equality predicate, or
30 \ref dereference_exprt, which represents the `*` unary dereference operator
31 found in C. A \ref typet represents a type, and may have other \ref typet nodes
32 as children: for example, \ref array_typet represents an array, and has a single
33 numbered child that gives the type of the array elements. Finally, \ref codet
34 represents imperative statements, such as \ref code_assignt, which represents an
35 imperative assignment. It has two numbered operands, its left- and right-hand
36 sides.
37 
38 Note that \ref exprt, \ref codet and similar classes deriving from \ref irept
39 do so in order to add constraints to the general tree (for example, providing
40 accessor methods with user-friendly names, or enforcing invariants that a node
41 must have a certain number of children), but do not override virtual methods or
42 add fields.
43 
44 The implementation of \ref irept allows saving memory by sharing instances of
45 its internal storage using a
46 [copy-on-write](https://en.wikipedia.org/wiki/Copy-on-write) scheme. For
47 example, the statement `irep1.sub()[0] = irep2;` will share rather than copy
48 `irep2` and its children, saving memory unless either irept is subsequently
49 modified, at which point a copy is made.
50 
51 \subsection irept_discussion_section Discussion
52 
53 Many other compiler infrastructures represent a polymorphic tree using nodes
54 specialised to a particular expression type: for example, perhaps a binary
55 addition operator could be represented using a tag plus two pointers to child
56 expressions, rather than allocating a whole irept (including a variable-length
57 expression vector and empty maps for storing named subexpressions). This may
58 save memory, but inhibits ad-hoc annotations such as tagging the addition
59 "does not overflow" without adding that field to addition operations globally
60 or maintaining a shadow data structure to track that information. This is easier
61 with a general irept structure that can store an arbitrary number of
62 arbitrarily-named child nodes.
63 
64 Another optimisation commonly seen when storing polymorphic trees is to use a
65 uniform node data structure (like irept) but to keep the node compact, for
66 example storing at most four pointers and transparently allocating extension
67 nodes when necessary for an unusually large expression. This provides the best
68 of both worlds, obtaining compact storage for common cases such as unannotated
69 binary expressions while permitting deviation at times. The downside is that
70 implementing such a system is more complex than straightforwardly using C++
71 standard data structures as in irept.
72 
73 \subsection irep_idt_section Strings: dstringt, the string_container and the ID_*
74 
75 Within cbmc, strings are represented using \ref irep_idt, or \ref irep_namet
76 for keys to [named_sub](\ref irept::dt::named_sub) or
77 [comments](\ref irept::dt::comments). By default these are both
78 typedefed to \ref dstringt. For debugging purposes you can set `USE_STD_STRING`,
79 in which case they are both typedefed to `std::string`. You can also easily
80 convert an [irep_idt](\ref irep_idt) or [irep_namet](\ref irep_namet) to a
81 `std::string` using the [id2string](\ref id2string) or
82 [name2string](\ref name2string) function, respectively, or either of them to a
83 `char*` using the [c_str()](\ref dstringt::c_str) member function.
84 
85 \ref dstringt stores a string as an index into a large
86 static table of strings. This makes it easy to compare if two
87 [irep_idt](\ref irep_idt)s are equal (just compare the index) and it makes it
88 efficient to store many copies of the same string. The static list of strings
89 is initially populated from `irep_ids.def`, so for example the fourth entry
90 in `irep_ids.def` is `“IREP_ID_ONE(type)”`, so the string “type” has index 3.
91 You can refer to this \ref irep_idt as `ID_type`. The other kind of line you
92 see is \c "IREP_ID_TWO(C_source_location, #source_location)", which means the
93 \ref irep_idt for the string “\#source_location” can be referred to as
94 `ID_C_source_location`. The “C” is for comment, meaning that it should be
95 stored in the [comments](\ref irept::dt::comments). Any strings that need
96 to be stored as [irep_idt](\ref irep_idt)s which aren't in `irep_ids.def`
97 are added to the end of the table when they are first encountered, and the
98 same index is used for all instances.
99 
100 See documentation at \ref dstringt.
101 
102 \subsection typet_section typet
103 
104 \ref typet represents the type of an expression. Types may have subtypes,
105 stored in two [sub](\ref irept::dt::sub)s named “subtype” (a single type) and
106 “subtypes” (a vector of types). For pre-defined types see `std_types.h` and
107 `mathematical_types.h`.
108 
109 \subsubsection symbol_typet_section symbol_typet
110 
111 \ref symbol_typet is a type used to store a reference to the symbol table. The
112 full \ref symbolt can be retrieved using the identifier of \ref symbol_typet.
113 
114 \subsection exprt_section exprt
115 
116 \ref exprt is the class to represent an expression. It inherits from \ref
117 irept, and the only things it adds to it are that (1) every \ref exprt has
118 an entry in [named_sub](\ref irept::dt::named_sub) containing its type and
119 (2) everything in the [sub](\ref irept::dt::sub) of an \ref exprt is again an
120 \ref exprt, not just an \ref irept. You can think of \ref exprt as a node in
121 the abstract syntax tree for an expression. There are many classes that
122 inherit from \ref exprt and which represent more specific things. For
123 example, there is \ref minus_exprt, which has a [sub](\ref irept::dt::sub)
124 of size 2 (for the two arguments of minus).
125 
126 Recall that every \ref irept has one piece of data of its own, i.e. its
127 [id()](\ref irept::id()), and all other information is in its
128 [named_sub](\ref irept::dt::named_sub), [comments](\ref irept::dt::comments)
129 or [sub](\ref irept::dt::sub). For [exprt](\ref exprt)s, the
130 [id()](\ref irept::id()) is used to specify why kind of \ref exprt this is,
131 so a \ref minus_exprt has `ID_minus` as its [id()](\ref irept::id()). This
132 means that a \ref minus_exprt can be passed wherever an \ref exprt is
133 expected, and if you want to check if the expression you are looking at is
134 a minus expression then you have to check its [id()](\ref irept::id()) (or use
135 [can_cast_expr](\ref can_cast_expr)`<minus_exprt>`).
136 
137 \subsection codet_section codet
138 
139 See documentation at \ref codet.
140 
141 \ref exprt represents expressions and \ref codet represents statements.
142 \ref codet inherits from \ref exprt, so all [codet](\ref codet)s are
143 [exprt](\ref exprt)s, with [id()](\ref irept::id()) `ID_code`.
144 Many different kinds of statements inherit from \ref codet, and they are
145 distinguished by their [statement](\ref codet::get_statement()). For example,
146 a while loop would be represented by a \ref code_whilet, which has
147 [statement](\ref codet::get_statement()) `ID_while`. \ref code_whilet has
148 two operands in its [sub](\ref irept::dt::sub), and helper functions to make
149 it easier to use: [cond()](\ref code_whilet::cond()) returns the first
150 subexpression, which is the condition for the while loop, and
151 [body()](\ref code_whilet::body()) returns the second subexpression, which
152 is the body of the while loop - this has to be a \ref codet, because the body
153 of a while loop is a statement.
154 
155 \subsection symbolt_section symbolt, symbol_tablet, and namespacet
156 
157 A symbol table is a mapping from symbol names to \ref symbolt objects, which
158 store a symbol's name, attributes, type and perhaps value. They are used to
159 describe local and global variables, type definitions and function prototypes
160 and definitions.
161 
162 All symbols store a type (an instance of \ref typet). For function or method
163 symbols these are \ref code_typet instances.
164 
165 Global variable symbols may have a value (an \ref exprt), in which case it is
166 used to initialise the global.
167 
168 Method or function symbols may also have a value, in which case it is a
169 \ref codet and gives the function definition. A method or function symbol
170 without a value is a prototype (for example, it might be an `extern` declaration
171 in C). A function symbol that has been converted to a GOTO function *may* be
172 replaced with a special "compiled" value, but this varies from driver program to
173 program -- at the time of writing, only \ref goto-cc does this.
174 
175 Local variables' symbol values are always ignored;
176 any initialiser must be explicitly assigned after they are instantiated by a
177 declaration (\ref code_declt).
178 
179 Symbol expressions (\ref symbol_exprt) and types (\ref symbol_typet) refer to
180 symbols stored in a symbol table. Symbol expressions can be thought of as
181 referring to the table for more detail about a symbol (for example, is it a
182 local or a global variable, or perhaps a function?), and have a type which must
183 match the type given in the symbol table. Symbol types can be thought of as
184 shorthands or aliases for a type given in more detail in the symbol table, for
185 example permitting a shorthand for a large structure type, as well as permitting
186 the construction of expressions referring to that type before its full
187 definition is known.
188 
189 Note the implementation of \ref symbol_tablet is split into a base interface
190 (\ref symbol_table_baset) and an implementation (\ref symbol_tablet). There is
191 one alternate implementation (\ref journalling_symbol_tablet) which additionally
192 maintains a log or journal of symbol creation, modification and deletions.
193 
194 Namespaces (\ref namespacet) provide a read-only view on one or more symbol
195 tables, and provide helper functions that aid accessing the table. A namespace
196 may layer one or more symbol tables, in which case any lookup operation checks
197 the 'top' symbol table before moving down the layers towards the 'bottom' symbol
198 table, looking up the target symbol name in each successive table until one is
199 found. Note class \ref multi_namespacet can layer arbitrary numbers of symbol
200 tables, while for historical reasons \ref namespacet can layer up to two.
201 
202 The namespace wrapper class also provides the \ref namespacet::follow
203 operation, which dereferences a `symbol_typet` to retrieve the type it refers
204 to, including following a symbol_typet which refers to another symbol which
205 eventually refers to a 'real' type.
206 
207 \subsubsection symbolt_lifetime Symbol lifetimes
208 
209 Symbols with \ref symbolt::is_static_lifetime set are initialised before a
210 program's entry-point is called and live until it ends. Such long-lived
211 variables are used to implement globals, but also C's procedure-local static
212 variables, which have restricted visiblity but the lifetime of a global.
213 They may be marked dead using a \ref code_deadt instruction, but this does not
214 render the variable inaccessible, it only indicates that the variable's current
215 value will not be read without an intervening write.
216 
217 Non-type, non-global symbols (those with \ref symbolt::is_static_lifetime and
218 \ref symbolt::is_type not set) are local variables, and their lifespan
219 description varies depending on context.
220 
221 In symbol table function definitions (the values of function-typed symbols),
222 local variables are created using a \ref code_declt instruction, and live until
223 their enclosing \ref code_blockt ends (similar to the C idiom
224 `{ int x; ... } // x lifespan ends`). By contrast in GOTO programs locals are
225 declared using a DECL instruction and live until a DEAD instruction referring
226 to the same symbol. Explicit DEAD instructions are always used rather than
227 killing implicitly by exiting a function.
228 
229 Multiple instances of the same local may be live at the same time by virtue of
230 recursive function calls; a dead instruction or scope end always targets the
231 most recently allocated instance.
232 
233 Scoping rules are particular to source languages and are not enforced by
234 CPROVER. For example, in standard C it is not possible to refer to a local
235 variable across functions without using a pointer, but in some possible source
236 languages this is permitted.
237 
238 \subsubsection symbolt_details Symbol details
239 
240 Symbols have:
241 * A mode, which indicates the source language frontend responsible for creating
242  them. This is mainly used in pretty-printing the symbol table, to indicate
243  the appropriate language frontend to use rendering the symbol's value and/or
244  type. For example, mode == ID_C == "C" indicates that \ref ansi_c_languaget,
245  the C front-end, should be used to pretty-print, which in turn delegates to
246  \ref expr2ct.
247 * A base-name and pretty-name, which are a short and user-friendly version of
248  the symbol's definitive name respectively.
249 * Several flags (see \ref symbolt for full details), including
250  \ref symbolt::is_static_lifetime (is this a global variable symbol?),
251  \ref symbolt::is_type (is this a type definition),
252  \ref symbolt::is_thread_local (of a variable, are there implicitly instances
253  of this variable per-thread?).
254 
255 \subsection cmdlinet
256 
257 See \ref cmdlinet.
258 
259 \dot
260 digraph G {
261  node [shape=box];
262  rankdir="LR";
263  1 [shape=none, label=""];
264  2 [label="command line parsing"];
265  3 [shape=none, label=""];
266  1 -> 2 [label="C files or goto-binaries"];
267  2 -> 3 [label="Command line options, file names"];
268 }
269 \enddot
270 
271 \subsection ast-examples-section Examples: how to represent the AST of statements, in C and in java
272 
273 \subsubsection ast-example-1-section x = y + 123
274 
275 To be documented..
276 
277 \subsubsection ast-example-2-section if (x > 10) { y = 2 } else { v[3] = 4 }
278 
279 To be documented.
280 
281 \subsubsection ast-example-3-section Java arrays: struct Array { int length, int *data };
282 
283 To be documented.
284 
285 \subsection section-goto-typecheck Goto Model Typecheck
286 
287 Class `typecheckt`.
288 
289 \subsection irep-serialization `irept` Serialization
290 
291 The module provides serialisation and deserialisation of
292 integer numbers, strings, and `irept` instances.
293 
294 This is implemented in C++ modules:
295  - `irep_serialization.h`
296  - `irep_serialization.cpp`
297  - `irep_hash_container.h` (applies only to `irept` instances)
298  - `irep_hash_container.cpp` (applies only to `irept` instances)
299 
300 \subsubsection irep-serialization-numbers Serialization of Numbers
301 
302 A number is serialiased in 7-bit encoding. For example, given a 2-byte
303 number in base 2, like `10101010 01010101`, then it is saves in 3 bytes,
304 where each byte takes only 7 bits from the number, reading from the
305 left. The 8th bit in each output byte is set to 1 except in the last
306 byte, where the bit is 0. That 0 bit indicates the end of the
307 encoding of the number. So, the output bytes look like this:
308 `11010101 11010100 00000010`.
309 
310 This is implmented in the function `::write_gb_word`.
311 
312 The deserialisation does the oposite process and it is implemented in
313 `irep_serializationt::read_gb_word`.
314 
315 \subsubsection irep-serialization-strings Serialization of Strings
316 
317 A string is encoded as classic 0-terminated C string. However,
318 characters `0` and `\\` are escaped by writing additional `\\`
319 before them.
320 
321 This is implmented in the function `::write_gb_string` and the
322 deserialisation is implemented in `irep_serializationt::read_gb_string`.
323 
324 Each string which is stored inside an `::irept` instance is saved (meaining
325 its characters) into the ouptut stream, only in the first serialisation
326 query of the string. In that case, before the string there is also saved
327 a computed integer hash code of the string. Then, all subsequent
328 serialisation queries save only that integer hash code of the string.
329 
330 \subsubsection irep-serialization-ireps Serialization of `irept` Instances
331 
332 <br>
333 \subsection CProver-output CProver output - printing.
334 
335 CProver output can be in plain text, json or xml format.
336 All of CProver output should use the built-in messaging infrastructure,
337 which filters messages by 'verbosity'. Default verbosity filtering is set
338 to the maximum level (10), i.e. all messages are printed. Error messages
339 have the lowest level of verbosity (level 1) while debug messages have
340 the highest level (level 10). Intermediate levels (in order of
341 increasing 'verbosity') are for warnings, results, status/phase information,
342 statistical information and progress information
343 (more information on these verbosity levels can be found in
344 \ref messaget in the `message.h` header).
345 
346 Key classes related to the messaging infrastructure can be found in
347 the `message.h` header. \ref messaget provides messages (with a built-in
348 verbosity level), which are then processed by subclasses
349 of \ref message_handlert. These filter messages according to verbosity
350 level and direct output to an appropriate location.
351 An important group of subclasses is \ref stream_message_handlert and
352 its subtypes, which direct messages to an output stream
353 (`std::cout` & `std::cerr`, `std::clog` is not used). In particular,
354 messages of verbosity level less than 3 are directed to `std::cerr`,
355 while others go to `std::cout` (this may change, but you should be
356 aware that not all messages are necessarily directed to only
357 `std::cout` or `std::cerr`).
358 Another key \ref message_handlert subclass is
359 ui_message_handlert - which provides output in plain text,
360 json or xml formats.
361 
362 \subsubsection CProver-legacy-output CProver legacy output
363 Although all output should use the messaging interface - there are a
364 few locations where this is not yet implemented. These should
365 _not_ be extended - but it may be helpful to be aware of where this happens.
366 
367 * Output from invariants / exceptions
368 
369  Invariants output to `std::cerr` - and provide a backtrace and optionally
370  some diagnostic information.
371  For more information on invariants, see `invariant.h`
372 
373  Exceptions have a standard `what()` interface.
374  Best current practice for exceptions is for the output of `what()` to be
375  routed via a message with verbosity level 1 (as returned by
376  `messaget::error()`). The message is then processed by a message_handler.
377  Where plain text output (versus json or xml output) is
378  chosen, exceptions print (indirectly) to `std::cerr`. Json and xml
379  output always goes to `std::cout`. There are still a few locations where
380  exceptions print directly to std::cerr. These should _not_ be extended.
381  More information on exceptions can be found in `exception_utils.h`.
382 
383 * Direct output via `std::cout` & `std::cerr`
384 
385  These are in the process of being removed - no new output should go
386  via `std::cout` or `std::cerr`, but should instead use the
387  \ref messaget and \ref message_handlert infrastructure.