Although an adequate test suite is difficult to obtain, preparing a test oracle that can automatically recognize each execution as either passing or failing is even more difficult. In some situations, test case generation can be relatively easy. For example, one can simply feed random strings to a program that consumes string inputs. However, these test cases are hardly useful until we know
f T t f T t p T u
T
p T(a) Scheme with Labelled Cases Only
f T t f T t p T u
T
p T(b) Scheme with Both Labelled and Unlabelled Cases
Figure 3.10: Two Schemes to Work with Partial-labelled Test Suite
0 2 4 6 8 10 12 14 16 18 20 0 10 20 30 40 50 60 70 80
T−Score: Percentage of Examined Code (%)
Percentage of Located Faults (%)
Percentage of Located Faults w.r.t. Examined Code
CT Liblit05 SOBER SOBER, 0.1% SOBER_Full (a) m = 1 0 2 4 6 8 10 12 14 16 18 20 0 10 20 30 40 50 60 70 80
T−Score: Percentage of Examined Code (%)
Percentage of Located Faults (%)
Percentage of Located Faults w.r.t. Examined Code
CT Liblit05 SOBER SOBER_Full (b) m = 3 0 2 4 6 8 10 12 14 16 18 20 0 10 20 30 40 50 60 70 80
T−Score: Percentage of Examined Code (%)
Percentage of Located Faults (%)
Percentage of Located Faults w.r.t. Examined Code
CT Liblit05 SOBER SOBER_Full (c) m = 5 0 2 4 6 8 10 12 14 16 18 20 0 10 20 30 40 50 60 70 80
T−Score: Percentage of Examined Code (%)
Percentage of Located Faults (%)
Percentage of Located Faults w.r.t. Examined Code
CT Liblit05 SOBER SOBER_Full
(d) m = 10
Figure 3.11: Quality Comparison w.r.t. the Number (m) of Labelled Failing Cases
what the expected outputs are.
In practice, except for programs that can be described by a program model, the expected outputs are usually prepared by human testers, either manually or assisted by tools. It is usually unrealistic for a tester to examine thousands of executions and label them. Instead, a tester will likely stop testing and return the faulty program to developers for patches when a small number of failing cases are encountered. At that time, the examined cases are labelled and the rest are unlabelled. This describes a typical scenario which exemplifies how partially labelled test suites arise in practice. In this subsection, we examine how well Sober helps developers locate the underlying faults, when the test suite is partially labelled.
Formally, given a suite T , suppose a tester has examined and labelled a subset suite Te(Te⊆ T ).
Because manual labelling is usually expensive, it is common for |Te| ¿ |T |. Let Tp and Tf denote
the set of passing and failing runs identified by the tester, then Te= Tp∪ Tf and Tp∩ Tf = φ. We
only if Tu 6= φ. The set relationship is further depicted in Fig. 3.10(a). The outer ellipse represents
the entire test suite T . The vertical line divides T into the full failing set Tt
f on the left and the
full passing set Tt
p on the right. Certainly, Tf ⊆ Tft and Tp ⊆ Tpt. As one can see from Fig. 3.9(a),
the best localization is achieved by Sober when Tf = Tt
f and Tp = Tpt, i.e., when the given test
suite T is fully labelled.
Now given the partially labelled test suite T , the most straightforward scheme for Sober is to analyze labelled test cases Te only. Because Te is fully labelled, Sober can be immediately applied
with Te. In fact, this scheme is equivalent to running Sober on a β-sampled test suite, where
β ≈ |Te|
|T |. Because commonly |Te| ¿ |T |, β is usually quite small. As a conservative estimation, β
can be around 1%. In considering the Siemens programs, β = 1% means that the tester examines tens, among thousands, of test cases, and identifies about five failing runs on the average. In our opinion, this can be a reasonable workload for the tester.
This scheme, although straightforward, does render reasonable localization results. As shown in Fig. 3.9(a), “SOBER 1% Sampled” is clearly better than Ct. But it is also seen that a considerable gap exists between “SOBER 0.1% Sampled” and “SOBER 100% Sampled”. For concise reference, we use Sober Full to denote “SOBER 100% Sampled” in the following. Although the same quality as Sober Full is not (unrealistically) expected when T is partially labelled, we nevertheless believe that Tu can be utilized for better quality than with Te only.
The above straightforward scheme apparently overlooks the information contained in Tu. Al- though Tu does not provide any labelling information, their runtime statistics, if used properly, can assist Sober in fault analysis. In this study, we restrict our discussion to reasonably developed programs that pass all but a few test cases. One can judge whether this assumption holds by examining the percentage of failing cases in Te. If, however, a program fails most cases in Te, the
fault could be quite easy to find. For reasonably developed programs, we can choose to label all the unexamined test cases Tu as passing, and apply Sober to the regarded failing and passing set
Tf0 and Tp0, where Tf0 = Tf and Tp0 = Tp∪ Tu. The difference between the two schemes is visualized
in Fig. 3.10(b).
Let Tm represent the set of unexamined failing cases, i.e., Tm= Tt
f−Tf, then all the cases in Tm
impurity into Tp0, the effect it has on Sober is minimal: the θpcalculated with Tp0deviates negligibly
from that with Tt
p because Tp0 = Tp∪ Tu= Tpt∪ Tm and |Tm| ≤ |Tf| ¿ |Tpt|.
On the other hand, by mislabelling Tm, we utilize the runtime statistics of the cases in Tpt− Tp,
which are otherwise disregarded. In this way, θp can be estimated more accurately with Tp0 than
with Tp only. This could, in consequence, bring better localization quality. Therefore, this is
essentially a tradeoff between grabbing more passing runs and (unavoidably) mislabelling some failing runs. In our belief, the gain from including more passing executions should surpass the loss from mislabelling. As will be shown shortly, this scheme achieves much better results than the straightforward scheme, and sometimes it even obtains comparable results to Sober Full.
We simulate partially labelled test suites using the Siemens programs. For each faulty version, we randomly select m failing cases as Tf (i.e., the set of failing cases identified by the tester). According to the above scheme, all the rest cases are regarded as passing, i.e., T0
p. We then run
both Sober and Liblit05 with the same T0
p and Tf0 (recall that Tf0 = Tf) for each of the 130
faulty versions. We experiment with m equal to 1, 3, 5 and 10 respectively, and this represents the increasing effort that the tester puts into test evaluation. If a faulty version has less than m failing cases, we take all the available failing cases. In the Siemens suite, there are 0, 4, 14, and 19 versions that have less than 1, 3, 5, and 10 failing cases. These versions were not excluded because they do represent real situations.
Fig. 3.11 plots the localization quality for both Sober and Liblit05 with m equal to 1, 3, 5 and 10 respectively. Curves for Ct and Sober Full are also plotted as the baseline and ceiling quality in each subfigure. Among the four subfigures, Fig. 3.11(a) represents the toughest situation, where only one failing case is identified in each faulty version. This simulates a typical scenario where a developer starts debugging once a faulty execution is encountered. As expected, the quality of Sober degrades considerably from Sober Full, but it is still better than Ct.
For Fig. 3.9(a), we note that the m = 1 situation is at least as harsh as the situation with 0.1%-sampled test suites. Nevertheless, at least one failing run is in every 0.1%-sampled test suite. In order to demonstrate the effect of treating Tu as passing, we re-plot the curve of Sober with
β = 0.1% in Fig. 3.11(a) with a dashed line. The remarkable gap between “SOBER” and “SOBER, 0.1%” suggests the benefit of treating unlabelled cases as passing.
The four subfigures of Fig. 3.11, viewed in a sequence, show that the quality of Sober gradually improves as additional failing cases are explicitly labelled. Intuitively, the more failing cases that are identified, the more accurately the statistic Y (Eq. 3.4) approaches to the true faulty behavior of predicate P , and hence the higher quality of the final predicate ranking list. Liblit05 also improves because of a similar reason.