We applied the general ideas for guaranteed satisfiable and unsatisfiable formula generation from Sections 5.2 and 5.3 to the theories of bitvectors and floating point [132] in SMT-LIB [28]. This section describes key implementation details and the challenges we encountered during this process, starting with the application to the theory of bitvectors.
5.4.1
Application to the Theory of Bitvectors
For guaranteed satisfiable formula generation, our implementation follows a structure very similar to that of Figure 5.4. For the most part, we simply extend the provided rules with the additional operations available in the actual theory of bitvectors. The one point worth discussing is how exactly we implemented operations like boolAnd, boolOr, add, and lessThan, which, as discussed previously, have multiple avenues for implementation. This was a difficult decision to make, and we ended up trying all three approaches discussed in Section 5.2. A discussion of our experiences with these approaches follows.
We originally went with a pure Prolog approach, wherein bitvectors were represented as lists of Boolean values. While this was very simple and did not rely on auxiliary constraint solvers, we found it to be impractically slow in many cases, even for relatively small inputs. This was especially true for constraints involving negation, as our naive implementation was often forced to perform a brute-force search of the state space under these conditions. For this reason, the approach was unfit for extension to the generation of guaranteed unsatisfiable formulas, as these inherently involve many negated constraints (e.g., an expression should evaluate to any value except for the expected value, hence negation). Moreover, the performance problems meant that it was practically necessary to put a timeout on generation in order to ensure progress was being made. As such, while the pure Prolog approach was simple overall, it is not an optimal choice when it comes to generation.
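The following Python sketch illustrates the list-of-Booleans representation and why negated constraints are costly under it. It is an illustration under stated assumptions, not the Prolog implementation described above: the least-significant-bit-first ordering and the names `to_bits`, `from_bits`, `bv_add`, and `solutions_not_equal` are all hypothetical.

```python
from typing import List

Bits = List[bool]  # assumed convention: least-significant bit first


def to_bits(value: int, width: int) -> Bits:
    """Encode a non-negative integer as a fixed-width list of Booleans."""
    return [bool((value >> i) & 1) for i in range(width)]


def from_bits(bits: Bits) -> int:
    """Decode a list of Booleans back into an integer."""
    return sum(1 << i for i, b in enumerate(bits) if b)


def bv_add(xs: Bits, ys: Bits) -> Bits:
    """Ripple-carry addition; the final carry is dropped (modular add)."""
    assert len(xs) == len(ys)
    out, carry = [], False
    for x, y in zip(xs, ys):
        out.append(x ^ y ^ carry)
        carry = (x and y) or (carry and (x ^ y))
    return out


def solutions_not_equal(y: int, unexpected: int, width: int) -> List[int]:
    """All x with x + y != unexpected. With no propagation available,
    the negated constraint is answered by enumerating all 2**width values."""
    ys = to_bits(y, width)
    return [x for x in range(1 << width)
            if from_bits(bv_add(to_bits(x, width), ys)) != unexpected]
```

With three bits, `solutions_not_equal(3, 5, 3)` visits all eight candidates to rule out a single one, which hints at how the cost grows as the number of bits and negated constraints increases.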
From here, we modified the generator to use the arithmetic constraint solver available in SWI-PL [67, 146], the CLP engine we chose for generation. The idea here was to represent bitvector operations in terms of integer arithmetic operations. For example, bvadd’s modular arithmetic can be represented using standard arithmetic addition (specifically constraints like X #= Y + Z), along with some additional constraints describing what happens on overflow. The biggest challenge here was determining how to encode each operation using integer constraints, which was still relatively straightforward.
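One plausible way to express bvadd with integer constraints is sketched below in Python. The exact encoding used by the generator is not given in the text, so the overflow indicator `k` and the function names are assumptions made for illustration: the result `x` must satisfy `x = y + z - k * 2^width` with `k` in {0, 1} and all values within the bitvector range.

```python
def bvadd_constraint_holds(x: int, y: int, z: int, k: int, width: int) -> bool:
    """Check the integer encoding of x = bvadd(y, z) for a given width.
    k is a hypothetical 0/1 overflow indicator."""
    modulus = 1 << width
    in_range = all(0 <= v < modulus for v in (x, y, z))
    return in_range and k in (0, 1) and x == y + z - k * modulus


def bvadd(y: int, z: int, width: int) -> int:
    """Solve the constraint for x by trying both overflow cases."""
    modulus = 1 << width
    for k in (0, 1):
        x = y + z - k * modulus
        if bvadd_constraint_holds(x, y, z, k, width):
            return x
    raise ValueError("operands out of range for the given width")
```

For in-range operands exactly one value of `k` yields an in-range result, so the encoding determines `x` uniquely and agrees with addition modulo `2^width`.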
While this approach based on built-in constraint solvers works in theory, we found it to be fraught with problems in practice, to the point of unusability. For one, while the CLP library supports non-linear operations like multiplication (which greatly simplifies multiplication over bitvectors), we found that the engine was prone to hanging when asked to find solutions for these sorts of constraints. This was true even in contrived situations where a simple brute-force search could find the answer within milliseconds. This required us to manually specify how non-linear operations worked, which was complex and error-prone. Moreover, we found the library to be extremely buggy, to the point where it was more likely for the engine to crash than to actually produce an input. Given that our testing technique relies on a relatively bug-free constraint solver, this quickly made the library in SWI-PL untenable for our purposes.
For these reasons, we ultimately went with the approach of using Z3 [47] to solve these sorts of integer constraints, using custom bindings for SWI-PL [147]. This sped up generation tremendously, and the generated inputs were observed to generally utilize more semantic rules. Z3’s robust support for non-linear constraints was also a big win here. While this approach assumes that Z3 is ultimately correct when it comes to solving integer constraints, we found its handling of these constraints to be extremely reliable in practice. In fact, we spent nearly a month of CPU time fuzzing Z3’s theory of integers using traditional syntactic techniques, which failed to find any bugs. As such, we have high confidence in the correctness of Z3’s theory of integers, at least for the sort of queries we were issuing.
5.4.2
Application to the Theory of Floating Point
The complexity of the theory of floating point [132], along with its youth, required us to take a different approach than what we used for the theory of bitvectors. For the theory of floating point, merely getting the semantics correct was a non-trivial problem, which was exacerbated by the fact that no robust solvers yet exist for the theory. As such, when a semantic question arose, we could not simply ask a solver what the correct solution was; the solver could very well be wrong, particularly with the sort of edge cases we were most interested in. We therefore split the work into two stages.
The first stage involved writing a naive solver in Scala [118], a functional language. This allowed us to set aside concerns specific to generation, which are relevant only in a CLP-based setting. This solver works simply by brute-forcing the entire state space, going through all the possible floating point values within the imposed bounds. For small inputs, this is computationally feasible, and in certain cases it can, ironically, be orders of magnitude faster than the solvers under test. Additionally, this still gives us a sound and complete solver, as the state space for the theory of floating point is finite in general; all variables in the theory have a fixed, albeit generally intractable, number of possible values.
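The brute-force idea can be sketched as follows, in Python rather than Scala and only as an illustration under stated assumptions. The sketch assumes a 5-bit format with 2 exponent bits and 3 significand bits (counting the hidden bit), an IEEE-style bit layout, and hypothetical names `decode` and `brute_force`. For brevity it also collapses the two signed zeros into one exact value, which a faithful solver would not do.

```python
from fractions import Fraction
from itertools import product

EB, SB = 2, 3          # exponent bits; significand bits including the hidden bit
WIDTH = EB + SB        # 1 sign bit + EB exponent bits + (SB - 1) fraction bits


def decode(bits: int):
    """Map a bit pattern to 'nan', '+inf', '-inf', or an exact Fraction."""
    frac_bits = SB - 1
    frac = bits & ((1 << frac_bits) - 1)
    exp = (bits >> frac_bits) & ((1 << EB) - 1)
    sign = -1 if (bits >> (WIDTH - 1)) & 1 else 1
    bias = (1 << (EB - 1)) - 1
    if exp == (1 << EB) - 1:                      # all-ones exponent
        if frac:
            return "nan"
        return "+inf" if sign > 0 else "-inf"
    if exp == 0:                                  # subnormal (or zero)
        mag = Fraction(frac, 1 << frac_bits) * Fraction(2) ** (1 - bias)
    else:                                         # normal
        mag = (1 + Fraction(frac, 1 << frac_bits)) * Fraction(2) ** (exp - bias)
    return sign * mag


def brute_force(num_vars: int, predicate):
    """Return the first satisfying assignment of bit patterns, or None
    if the whole (finite) state space has been exhausted."""
    for assignment in product(range(1 << WIDTH), repeat=num_vars):
        if predicate(*assignment):
            return assignment
    return None
```

Because every variable ranges over only `2^WIDTH` patterns, returning `None` is a genuine proof of unsatisfiability, which is what makes such a solver sound and complete despite its simplicity.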
Central to our testing technique is the requirement that these semantics be correct. An incorrect semantics leads to inputs which are incorrectly marked as buggy, which requires human intervention in order to fix. Because inputs can be large, this can require a significant amount of effort. To try to catch as many semantic flaws as possible in our implementation ahead of time, we used a syntactic fuzzing approach to test it against preexisting hardware floating point arithmetic implementations. This ensured that our implementation was at least consistent with existing implementations. In general, this does not guarantee anything: the hardware implementations themselves may be buggy, and there are subtle, intentional differences between the actual IEEE-754 standard [133] and the SMT theory of floating point [132]. In particular, hardware implementations generally can reason only about 32-bit and 64-bit floating point values, whereas the theory of floating point allows an arbitrary (but fixed) number of bits. Additionally, the semantics of edge cases like min(+0.0, -0.0) can differ significantly. Even so, this was an effective technique for finding bugs in our own implementation before moving on to using our implementation to find bugs.
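A much-reduced version of this consistency check is sketched below in Python. It compares only the decoding of 32-bit patterns, not arithmetic, and uses the platform's native binary32 handling (reached through the standard `struct` module) as the reference; the function names and the choice of random bit patterns as inputs are assumptions made for illustration.

```python
import math
import random
import struct


def soft_decode32(bits: int) -> float:
    """Software decoding of an IEEE-754 binary32 bit pattern."""
    frac = bits & 0x7FFFFF
    exp = (bits >> 23) & 0xFF
    sign = -1.0 if bits >> 31 else 1.0
    if exp == 0xFF:
        return math.nan if frac else sign * math.inf
    if exp == 0:                                   # subnormal or zero
        return sign * math.ldexp(frac, -149)
    return sign * math.ldexp(frac | 0x800000, exp - 150)


def hard_decode32(bits: int) -> float:
    """Reference decoding via the platform's native float conversion."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def agree(a: float, b: float) -> bool:
    """Equal values with matching zero signs, or both NaN."""
    if math.isnan(a) or math.isnan(b):
        return math.isnan(a) and math.isnan(b)
    return a == b and math.copysign(1.0, a) == math.copysign(1.0, b)


def differential_test(trials: int, seed: int = 0) -> list:
    """Return the bit patterns on which the two decoders disagree."""
    rng = random.Random(seed)
    mismatches = []
    for _ in range(trials):
        bits = rng.getrandbits(32)
        if not agree(soft_decode32(bits), hard_decode32(bits)):
            mismatches.append(bits)
    return mismatches
```

The `agree` helper distinguishes +0.0 from -0.0 explicitly, since plain equality would hide exactly the kind of signed-zero edge case mentioned above.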
Once we were reasonably sure that our Scala-based implementation was correct, we ported it to CLP. This process was usually straightforward, though complexities arose due to the nondeterministic semantics of CLP. For example, the Scala implementation could always assume that only one particular value was in play for a variable at any given point in time, which was the main advantage of going with a brute-force approach. In CLP, it is possible that a variable’s value is completely uninstantiated, meaning it nondeterministically holds all possible values at once. As such, in order to assume something in the CLP context, it is necessary to instantiate the variable in such a way that its value reflects whatever assumption is being made. This can get tricky, particularly when it comes to optimizing generation code. Indeed, we found that in practice, if a bug came up in our implementation, it was most likely specific to a generation concern in the CLP implementation, as opposed to a fundamental semantic issue in the Scala implementation.
As for representing bits, we chose the pure Prolog representation of using lists of Boolean values, without the help of any sort of external constraint solver. The reason for this strategy was simplicity: the theory is so complex to begin with that we did not want to introduce any more complexity by adding in auxiliary constraint solvers.
5.5
Evaluation
This section discusses how we applied the formula generators in Section 5.4 to testing a series of SMT solvers. This includes both our overall testing methodology and the actual bug-finding results.
5.5.1
Generators Evaluated
The state space of formula generators is significantly larger than just syntactic, guaranteed satisfiable, and guaranteed unsatisfiable. For example, this categorization says nothing of the size of the formulas generated, the number of variables they contain, and so on. We informally tried some different combinations which were intuitively likely to find bugs, and we found the number of bits involved (that is, the number of bits in a bitvector or floating point value) to be significant. We say this was “informal” because we do not have complete evaluation information for all possible parameters; the whole space balloons to over 1,000 unique configurations, which would necessitate approximately 14 months of CPU time to fully evaluate.
All the test case generators we implemented and evaluated are named and described in Table 5.1. This table is notably missing guaranteed unsatisfiable formula generation for the theory of floating point. Unfortunately, the choice of the pure Prolog approach of representing bits for this theory (discussed in Section 5.4.2) made it impractically slow when used for guaranteed unsatisfiable formula generation. In this context, “impractically slow” was on the order of several programs per hour. None of these programs found any bugs, so we simply removed these generators from our evaluation entirely.

| Theory | # Bits | Type | Generator Name | Solvers Tested |
|---|---|---|---|---|
| Bitvectors | 3 | sat | bv_few_sat | Z3 [47], CVC4 [143], MathSAT5 [144], Boolector [145] |
| Bitvectors | 3 | unsat | bv_few_unsat | Z3 [47], CVC4 [143], MathSAT5 [144], Boolector [145] |
| Bitvectors | 3 | syntactic | bv_few_syntactic | Z3 [47], CVC4 [143], MathSAT5 [144], Boolector [145] |
| Bitvectors | 8 | sat | bv_many_sat | Z3 [47], CVC4 [143], MathSAT5 [144], Boolector [145] |
| Bitvectors | 8 | unsat | bv_many_unsat | Z3 [47], CVC4 [143], MathSAT5 [144], Boolector [145] |
| Bitvectors | 8 | syntactic | bv_many_syntactic | Z3 [47], CVC4 [143], MathSAT5 [144], Boolector [145] |
| Floating Point | 5 (2e + 3m) | sat | fp_few_sat | Z3 [47] |
| Floating Point | 5 (2e + 3m) | syntactic | fp_few_syntactic | Z3 [47] |
| Floating Point | 32 (8e + 24m) | syntactic | fp_many_syntactic | Z3 [47], MathSAT5 [144] |
| Floating Point | 32 (8e + 24m) | sat | fp_many_sat | Z3 [47], MathSAT5 [144] |

Table 5.1: Implemented test case generators, along with the SMT solvers they were used to test. “sat” means the generator produces guaranteed satisfiable formulas using the technique described in Section 5.2. “unsat” means the generator produces guaranteed unsatisfiable formulas using the technique described in Section 5.3. “syntactic” means the generator produces formulas with unknown satisfiability results, using traditional syntactic fuzzing techniques and differential testing [21]. For the floating point configurations, the number of bits (# Bits) is broken down into exponent bits (“e”) and mantissa bits (“m”).

Table 5.1 also lists the SMT solvers we test against for each test case generator configuration. Solvers were chosen based on popularity, capabilities (e.g., support for the theory of floating point), and performance. CVC4 [143] and Boolector [145] lack support for the theory of floating point, so we do not test them with our floating point generators. Additionally, while MathSAT5 [144] has support for the theory of floating point, discussion with the authors [148] revealed that the primary focus is on values comprising many bits; bugs found involving few bits were put at lower priority. As such, we only tested MathSAT5 against inputs comprising many bits.
5.5.2
Testing Process and Infrastructure
At a high level, testing with these generators follows a four-step process:
1. Generate a test case
2. Run the test case on a system under test
3. Classify the result from the system under test as being normal or indicative of a bug
4. Report any bugs found along with representative inputs to the appropriate parties

The first two steps were executed in a massively parallel fashion, with a series of identical test case generators producing inputs for a pool of systems under test.
A naive way of performing the second step above is to run the solver under test once per input. However, this is extremely inefficient, with nearly all testing time spent on process startup and teardown. As such, we modified the frontends of each solver slightly so that they can incrementally accept whole new inputs without process teardown, similar to the technique employed in Chapter 4, Section 4.5.1. This improved the testing throughput by up to 10×.
As for the third step, we found that we would often repeatedly hit the same bug, leading to many redundant inputs which would consume progressively more disk space. To help alleviate this situation, we implemented a parallel online version of the sort of fuzzer taming techniques discussed in Chen et al. [119]. We found that while this was extremely effective for taming assertion violations, it did not apply well to correctness bugs. As such, for correctness bugs, we implemented a custom similarity metric which considered two non-identical formulas to be equivalent as long as the differences lay solely in the leaves of their ASTs; this reduced the space of inputs considerably, down to tens of thousands of inputs as opposed to millions of inputs.
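The leaf-insensitive similarity metric can be sketched in Python as follows. The AST representation (nested tuples whose first element is the operator, with any non-tuple treated as a leaf) and the names `shape` and `dedupe` are assumptions made for illustration, not the representation used in our infrastructure.

```python
def shape(ast):
    """Replace every leaf (variable or constant) with a placeholder,
    keeping only operators and tree structure."""
    if isinstance(ast, tuple):
        op, *args = ast
        return (op, *(shape(arg) for arg in args))
    return "_"


def dedupe(formulas):
    """Keep the first formula seen for each distinct shape."""
    representatives = {}
    for formula in formulas:
        representatives.setdefault(shape(formula), formula)
    return list(representatives.values())
```

For example, `("bvadd", "x", ("bvmul", "y", 3))` and `("bvadd", "z", ("bvmul", "w", 7))` differ only in their leaves, so only the first would be retained, while `("bvsub", "x", "y")` has a different shape and is kept.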
As for the reporting of bug-triggering inputs in step four, while most generated inputs contain no more than a dozen formulas, in some cases these formulas are extremely complex. Based on developer feedback [48], we implemented a delta debugger [75], using the same technique as described in Brummayer et al. 2009 [17]. This significantly cut down on the complexity of reported bug-triggering inputs.
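To give a flavor of input reduction, the Python sketch below shrinks a list of formulas while a caller-supplied oracle keeps reporting the bug. It is a generic, list-level reducer in the style of delta debugging and not the specific algorithm of [17], which also simplifies within formulas; the names `ddmin` and `triggers_bug` are hypothetical.

```python
def ddmin(items, triggers_bug):
    """Greedily remove chunks of items while the bug still reproduces.
    The result is 1-minimal: removing any single remaining item
    no longer triggers the bug."""
    assert triggers_bug(items), "the initial input must trigger the bug"
    granularity = 2
    while len(items) >= 2:
        chunk = max(1, len(items) // granularity)
        reduced = False
        for start in range(0, len(items), chunk):
            candidate = items[:start] + items[start + chunk:]
            if candidate and triggers_bug(candidate):
                items = candidate                       # keep the smaller input
                granularity = max(granularity - 1, 2)
                reduced = True
                break
        if not reduced:
            if chunk == 1:
                break                                   # nothing more to remove
            granularity = min(granularity * 2, len(items))
    return items
```

In practice `triggers_bug` would run the solver under test on the candidate input, so the number of oracle calls dominates the cost; removing large chunks first keeps that number low when most of the input is irrelevant.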
5.5.3
Evaluation Methodology
Ultimately, we want to measure the total number of unique bugs found for each test case generator, particularly the total number of unique correctness bugs found. This is somewhat difficult to do because of the large numbers of inputs involved; even after fuzzer taming, we are still left with tens of thousands of inputs, most of which trigger the same bugs. As such, it was necessary to implement some sort of automation in this space.
To achieve this automation, we took a two-phase strategy. At a high level, the purpose of the first phase is to discover as many previously unknown bugs as possible, with a significant amount of manual intervention. The goal of the second phase is to automatically rediscover the bugs found in the first phase. Further description of each of these phases follows.
In the first phase, we tested applicable systems using all the test input generators over the course of several months. This was done by incrementally testing and reporting any bugs found. When a bug was found, testing was immediately halted, the bug was reported upstream, the revision of the solver was recorded, and the type of bug (either correctness or otherwise) was recorded. Once the bug was fixed upstream, we would record which revision of the solver fixed the bug, and update our local solver to this latest revision. From here, testing was restarted. This process was repeated until no new bugs were found.
In the second phase, we tested each of the revisions recorded in the first phase, which includes both the revisions where bugs were originally identified and revisions which fixed the identified bugs. Under the assumption that each revision fixes at most one bug, it is then possible to uniquely identify the bugs found. For example, consider an ordered series of revisions $R_1, R_2, \ldots, R_n$. Under input $I_1$, revision $R_1$ indicates a bug. However, under the same input $I_1$, revision $R_2$ does not indicate a bug. From this, we can deduce that $R_2$ fixes the bug without any manual intervention. Most importantly, if we discover another input $I_2$ under which $R_1$ is buggy and $R_2$ is not buggy, then we know that $I_2$ is merely another input that triggers the same bug fixed in $R_2$. As such, we know that only one bug has been uniquely discovered in this example, even though we have found two bug-triggering inputs. Moreover, because we recorded what kind of bug was fixed by $R_2$ in phase one, we know whether or not this is a correctness bug.
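The deduction above can be sketched in Python as follows. The data layout (an ordered list of revision names, an `is_buggy(input, revision)` oracle, and a mapping from fixing revision to the bug kind recorded in phase one) and all names are assumptions made for illustration.

```python
def group_by_fixing_revision(revisions, inputs, is_buggy):
    """Map each fixing revision to the inputs whose bug it fixes.
    An input is attributed to the first revision that is not buggy
    on it while the immediately preceding revision is."""
    fixed_by = {}
    for inp in inputs:
        for older, newer in zip(revisions, revisions[1:]):
            if is_buggy(inp, older) and not is_buggy(inp, newer):
                fixed_by.setdefault(newer, []).append(inp)
                break
    return fixed_by


def count_unique_bugs(revisions, inputs, is_buggy, bug_kind):
    """Return (total unique bugs, unique correctness bugs), assuming
    each revision fixes at most one bug."""
    fixed_by = group_by_fixing_revision(revisions, inputs, is_buggy)
    total = len(fixed_by)
    correctness = sum(1 for rev in fixed_by if bug_kind.get(rev) == "correctness")
    return total, correctness
```

With revisions `["R1", "R2", "R3"]`, two inputs that fail only on `R1` are attributed to the single bug fixed in `R2`, while an input that still fails on `R2` but passes on `R3` counts as a second, distinct bug.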
Crucially, the second phase requires no user intervention. As such, it is easy to run each test case generator under its own independent second phase, using a fixed time budget (in our case, ten hours). From this, we can derive exactly both the number of unique bugs and the number of unique correctness bugs found by each test case generator.
5.5.4
Results
The results of fuzzing using the above evaluation methodology are shown in Table 5.2. A full discussion of these results follows in Section 5.6. A breakdown of the bugs found on a per-solver basis is shown in Table 5.3. Links to filed bug reports have been included for Z3 and CVC4 in Table 5.3, as these have publicly accessible issue trackers.
| Generator Name | Total Unique Bugs Found | Total Unique Correctness Bugs Found |
|---|---|---|
| bv_few_syntactic | 1 | 1 |
| bv_few_sat | 2 | 1 |
| bv_few_unsat | 0 | 0 |
| bv_many_syntactic | 4 | 2 |
| bv_many_sat | 1 | 1 |
| bv_many_unsat | 0 | 0 |
| fp_few_syntactic | 7 | 4 |
| fp_few_sat | 5 | 3 |
| fp_many_syntactic | 5 | 3 |
| fp_many_sat | 3 | 3 |

Table 5.2: The number and types of bugs found in each solver by each generator.

| Solver | Unique Correctness Bugs | Unique Total Bugs | Bug Report Links |
|---|---|---|---|
| Boolector [145] | 1 | 1 | N/A |
| MathSAT5 [144] | 5 | 5 | N/A |
| CVC4 [143] | 1 | 4 | [149, 150, 151, 152] |
| Z3 [47] | 5 | 13 | [153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165] |
| Total | 12 | 23 | N/A |
Table 5.3: The number and types of bugs found on a per-solver basis. Links to filed bug reports are provided for Z3 and CVC4, which are the only solvers tested with publicly accessible issue trackers.