In the following, we survey the test case generation methods for DNNs that have not been covered in Section 5.1 and that do not employ the existing adversarial attack algorithms.
5.2.1 Input Mutation
Given a set of inputs, input mutation generates new inputs (as test cases) by changing the existing inputs according to some predefined transformation rules or algorithms. For example, [Wicker et al., 2018] systematically mutates input dimensions with the goal of enumerating all hyper-rectangles in the input space. Moreover, aiming at testing the fairness (i.e., freedom from unintended bias) of DNNs, AEQUITAS [Udeshi et al., 2018] essentially employs an input mutation technique: it first randomly samples a set of inputs and then explores the neighbourhood of the sampled inputs by changing a subset of input dimensions. However, it has not been applied to DNN models.
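The sample-then-explore pattern described above can be sketched as follows. This is only a minimal illustration, not the AEQUITAS implementation: the function names, the step size and the toy seed inputs are all assumptions made for the example.

```python
import random

def mutate_neighbourhood(seed, dims, step, rng):
    """Copy `seed` and shift each dimension in `dims` by -step, 0 or +step."""
    mutant = list(seed)
    for d in dims:
        mutant[d] += rng.choice((-step, 0, step))
    return mutant

def generate_tests(seeds, dims, step=1, per_seed=5, rng_seed=0):
    """Explore the neighbourhood of every seed by mutating a subset of dimensions."""
    rng = random.Random(rng_seed)
    tests = []
    for seed in seeds:
        for _ in range(per_seed):
            tests.append(mutate_neighbourhood(seed, dims, step, rng))
    return tests

# Hypothetical seed inputs with three features; only features 0 and 2 are mutated.
seeds = [[25, 1, 40], [52, 0, 35]]
tests = generate_tests(seeds, dims=[0, 2])
```

Restricting the mutation to `dims` keeps the remaining dimensions fixed, so every generated test stays in a small neighbourhood of the seed it was derived from.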
5.2.2 Fuzzing
Fuzzing, or fuzz testing, is an automated software testing technique that efficiently generates a massive amount of random input data (possibly invalid or unexpected) to a program, which is then monitored for exceptions and failures. A fuzzer can be mutation-based, i.e., it generates new inputs by modifying existing input data. Depending on its level of awareness of the program structure, a fuzzer can be white-, grey- or black-box. Several recent works adapt fuzz testing to deep neural networks.
TensorFuzz [Odena and Goodfellow, 2018] is a coverage-guided fuzzing method for DNNs. It randomly mutates the inputs, guided by a coverage metric, towards the goal of satisfying user-specified constraints. The coverage is measured by a fast approximate nearest neighbour algorithm. TensorFuzz is validated in finding numerical errors, generating disagreements between DNNs and their quantized versions, and surfacing undesirable behaviour in DNNs. Similar to TensorFuzz, DeepHunter [Xie et al., 2018] is another coverage-guided grey-box DNN fuzzer, which utilises the extensions of neuron coverage from [Ma et al., 2018a]. Moreover, DLFuzz [Guo et al., 2018] is a differential fuzzing testing framework. It mutates the input to maximise the neuron coverage and the prediction difference between the original input and the mutated input.
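A coverage-guided fuzzing loop of this kind can be sketched as below. The sketch rests on several assumptions that are not taken from the cited tools: the "DNN" is a hypothetical single ReLU layer with hand-picked weights, coverage is decided by an exact (rather than approximate) nearest neighbour check on activation vectors, and the objective `too_large` is a made-up stand-in for a user-specified constraint.

```python
import math
import random

# Hypothetical 2-input, 3-neuron ReLU layer standing in for the DNN under test.
W = [[1.0, -1.0], [-0.5, 2.0], [1.5, 0.5]]
B = [0.0, -1.0, 0.5]

def activations(x):
    """Concrete execution: ReLU activation vector for input x."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W, B)]

def is_new_coverage(act, seen, threshold):
    """New coverage if `act` is farther than `threshold` from every seen vector."""
    return all(math.dist(act, s) > threshold for s in seen)

def fuzz(seeds, objective, iterations=200, sigma=0.3, threshold=0.5, rng_seed=0):
    rng = random.Random(rng_seed)
    corpus = [list(x) for x in seeds]
    seen = [activations(x) for x in corpus]
    failures = []
    for _ in range(iterations):
        parent = rng.choice(corpus)                        # pick an input to mutate
        child = [xi + rng.gauss(0.0, sigma) for xi in parent]
        act = activations(child)
        if objective(act):                                 # constraint satisfied
            failures.append(child)
        if is_new_coverage(act, seen, threshold):          # keep mutants that add coverage
            corpus.append(child)
            seen.append(act)
    return corpus, seen, failures

def too_large(act):
    """Hypothetical objective: some activation exceeds 3.0."""
    return any(a > 3.0 for a in act)

corpus, seen, failures = fuzz([[0.5, 0.5]], too_large)
```

Only mutants that reach a new region of activation space are added to the corpus, which is what steers the otherwise random mutation towards unexplored behaviour.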
5.2.3 Symbolic Execution and Testing
Though input mutation and fuzzing are good at generating a large amount of random data, there is no guarantee that certain test objectives will be satisfied. Symbolic execution (also symbolic evaluation) is a means of analysing a program to determine what inputs cause each part of a program to execute. It assumes symbolic values for inputs rather than obtaining actual inputs as normal execution of the program would, and thus arrives at expressions in terms of those symbols for expressions and variables in the program, and constraints in terms of those symbols for the possible outcomes of each conditional branch.
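To make the idea concrete for a network rather than a conventional program, each ReLU can be read as a conditional branch. The following sketch, under the assumption of a made-up network with a single scalar input and two hidden ReLU neurons, enumerates every activation pattern as a path, collects the path constraint as an interval over the symbolic input x, and derives the output as a symbolic expression a*x + b. Strict and non-strict inequalities are not distinguished, and all weights are assumed to be non-zero.

```python
import math
from itertools import product

# Hypothetical network: z0 = x - 1, z1 = -x + 2, output = relu(z0) + relu(z1).
HIDDEN = [(1.0, -1.0), (-1.0, 2.0)]   # (weight, bias) per hidden neuron
OUT = [1.0, 1.0]                      # output weights

def symbolic_paths():
    """Return (pattern, (lo, hi), (a, b)) for every feasible path.

    On a path, x lies in the interval (lo, hi) and the output equals a*x + b.
    """
    paths = []
    for pattern in product((True, False), repeat=len(HIDDEN)):
        lo, hi = -math.inf, math.inf
        a = b = 0.0
        for active, (w, bias), v in zip(pattern, HIDDEN, OUT):
            bound = -bias / w                 # point where w*x + bias changes sign
            if active == (w > 0):
                lo = max(lo, bound)           # branch condition requires x above bound
            else:
                hi = min(hi, bound)           # branch condition requires x below bound
            if active:                        # active neuron contributes v*(w*x + bias)
                a += v * w
                b += v * bias
        if lo < hi:                           # keep only satisfiable path constraints
            paths.append((pattern, (lo, hi), (a, b)))
    return paths

def concrete(x):
    """Normal (concrete) execution, for comparison with the symbolic result."""
    return sum(v * max(0.0, w * x + bias) for (w, bias), v in zip(HIDDEN, OUT))

paths = symbolic_paths()
```

Of the four activation patterns, the one with both neurons inactive has an unsatisfiable constraint (x below 1 and above 2), so only three paths are reported; any input drawn from a path's interval is guaranteed to exercise that path.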
Concolic testing is a hybrid software testing technique that alternates between concrete execution, i.e., testing on particular inputs, and symbolic execution. This idea still holds for deep neural networks. In DeepConcolic [Sun et al., 2018c, Sun et al., 2018d], coverage criteria for DNNs that have been studied in the literature are first formulated using Quantified Linear Arithmetic over Rationals, and then a coherent method for performing concolic testing to increase test coverage is provided. The concolic procedure starts by executing the DNN on concrete inputs. Then, the test objectives that have not yet been satisfied are ranked according to some heuristic. Subsequently, a top-ranked pair of test objective and corresponding concrete input is selected, and symbolic analysis is applied to find a new test input. The experimental results show the effectiveness of the concolic testing approach in both achieving high coverage and finding adversarial examples.
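The alternation between concrete and symbolic steps can be sketched as follows. This is a deliberately simplified illustration, not the DeepConcolic algorithm: it assumes a hypothetical single-layer network, takes "neuron j is activated" as the test objective, ranks objectives by the distance from an existing input to the neuron's activation boundary, and replaces the constraint solver with a closed-form solution that is only available because the pre-activation is linear in the input.

```python
import math

# Hypothetical 2-input layer; test objective j = "neuron j has positive pre-activation".
W = [[1.0, -1.0], [-0.5, 2.0], [-1.0, -1.0]]
B = [0.0, -1.0, -0.5]

def pre_activations(x):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W, B)]

def covered_by(x):
    """Concrete execution: objectives satisfied by input x."""
    return {j for j, z in enumerate(pre_activations(x)) if z > 0}

def solve_for_neuron(x, j, margin=1e-3):
    """Symbolic step: closest input (in L2) to x with pre-activation j equal to margin."""
    z = pre_activations(x)[j]
    norm_sq = sum(w * w for w in W[j])
    t = (margin - z) / norm_sq
    return [xi + t * w for xi, w in zip(x, W[j])]

def concolic(seeds):
    tests = [list(s) for s in seeds]
    covered = set()
    for t in tests:
        covered |= covered_by(t)
    while len(covered) < len(W):
        # Rank (objective, input) pairs by distance to the activation boundary.
        candidates = [(-pre_activations(t)[j] / math.sqrt(sum(w * w for w in W[j])), j, t)
                      for j in range(len(W)) if j not in covered
                      for t in tests]
        _, j, t = min(candidates, key=lambda c: c[0])
        new_input = solve_for_neuron(t, j)      # new test input for objective j
        tests.append(new_input)
        covered |= covered_by(new_input)
    return tests, covered

tests, covered = concolic([[1.0, 0.0]])
```

Each iteration satisfies at least the selected objective, so the loop terminates after at most one round per neuron; with deeper networks the symbolic step would instead require a solver call and may fail to find an input.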
The idea in [Gopinath et al., 2018] is to translate a DNN into an imperative program, thereby enabling program analysis to assist with DNN validation. It introduces novel techniques for lightweight symbolic analysis of DNNs and applies them in the context of image classification to address two challenging problems, i.e., identification of important pixels (for attribution and adversarial generation), and creation of 1-pixel and 2-pixel attacks. In [Agarwal et al., 2018], local explanations obtained in a black-box manner are first used to build a decision tree, to which symbolic execution is then applied to detect individual discrimination in a DNN. Such discrimination exists when two inputs, differing only in the values of some specified attributes (e.g., gender or race), get different decisions from the neural network.
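The definition of individual discrimination given above translates directly into a check. The sketch below only tests the definition on one input by enumerating the values of the protected attributes; it does not implement the decision tree construction or the symbolic execution of [Agarwal et al., 2018]. The two toy models and the feature layout (feature 0 an income-like score, feature 1 a binary protected attribute) are assumptions made for the example.

```python
from itertools import product

def find_discrimination_witness(model, x, protected):
    """Return a variant of x that differs only in protected attributes and
    receives a different decision, or None if no such variant exists.

    `protected` maps a feature index to the list of values it may take.
    """
    base = model(x)
    indices = sorted(protected)
    for values in product(*(protected[i] for i in indices)):
        variant = list(x)
        for i, v in zip(indices, values):
            variant[i] = v
        if model(variant) != base:
            return variant
    return None

# Hypothetical classifiers standing in for a DNN's decision function.
def biased_model(x):
    return int(0.1 * x[0] + 2 * x[1] > 4)   # decision depends on the protected feature

def fair_model(x):
    return int(0.1 * x[0] > 4)              # decision ignores the protected feature

witness = find_discrimination_witness(biased_model, [30, 0], {1: [0, 1]})
no_witness = find_discrimination_witness(fair_model, [30, 0], {1: [0, 1]})
```

A returned witness together with the original input forms a pair of inputs that exhibits individual discrimination in the sense defined above.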
5.2.4 Testing using Generative Adversarial Networks
Generative adversarial networks (GANs) are a class of AI algorithms used in unsupervised machine learning. They are implemented by a system of two neural networks contesting with each other in a zero-sum game framework. DeepRoad [Zhang et al., 2018] automatically generates large amounts of accurate driving scenes to test the consistency of DNN-based autonomous driving systems across different scenes. In particular, it synthesises driving scenes with various weather conditions (including rather extreme ones) by applying GANs along with the corresponding real-world weather scenes.
5.2.5 Differential Analysis
We have already seen differential analysis techniques in [Pei et al., 2017a] and [Guo et al., 2018] that analyse the differences between multiple DNNs to maximise the neuron coverage. Differential analysis of a single DNN's internal states has also been applied to debug the neural network model by [Ma et al., 2018d], in which a DNN is said to be buggy when its test accuracy for a specific output label is lower than the ideal accuracy. Given a buggy output label, the differential analysis in [Ma et al., 2018d] builds two heat maps corresponding to its correct and wrong classifications. Intuitively, a heat map is an image whose size equals the number of neurons and in which each pixel value represents the importance of a neuron (for the output). Subsequently, the difference between these two maps can be used to highlight the faulty neurons that are responsible for the output bug. Then, new inputs are generated (e.g., using a GAN) to re-train the DNN so as to reduce the influence of the detected faulty neurons and of the buggy output.
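The heat map comparison can be illustrated with a small sketch. It is not the method of [Ma et al., 2018d]: neuron importance is approximated here by the mean activation, which is an assumption made for simplicity, and the activation values are invented.

```python
def heat_map(activation_rows):
    """Per-neuron importance, approximated by the mean activation over inputs."""
    n = len(activation_rows)
    return [sum(column) / n for column in zip(*activation_rows)]

def faulty_neurons(correct_rows, wrong_rows, top_k=2):
    """Rank neurons by the absolute difference between the two heat maps."""
    correct_map = heat_map(correct_rows)
    wrong_map = heat_map(wrong_rows)
    diff = [w - c for c, w in zip(correct_map, wrong_map)]
    ranked = sorted(range(len(diff)), key=lambda j: abs(diff[j]), reverse=True)
    return ranked[:top_k], diff

# Hypothetical activations of four neurons for inputs of one (buggy) output label.
correct = [[0.9, 0.1, 0.5, 0.0], [0.7, 0.3, 0.5, 0.2]]   # correctly classified inputs
wrong = [[0.1, 0.2, 0.5, 0.9], [0.3, 0.2, 0.5, 0.7]]     # misclassified inputs

suspects, diff = faulty_neurons(correct, wrong)
```

In this toy data, neurons 3 and 0 behave very differently on misclassified inputs and are therefore flagged, whereas neurons 1 and 2 have nearly identical importance in both maps.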