In Figure 20, we juxtapose the results of F007 with those of other fault localization techniques on the Siemens suite. These techniques are Frequent Pattern Mining (FP) (Di Fatta et al., 2006) and Tarantula (Jones and Harrold, 2005) on function coverage, with results taken from the work of Di Fatta et al. (2006). Tarantula was originally proposed for statement coverage by Jones and Harrold (2005). In Figure 20, the Y-axis (axis 1) for FP and Tarantula is measured in the percentage of versions. In the Siemens suite, each version corresponds to one fault and contains many passing and failing traces. FP and Tarantula discover the faulty function containing a fault by using the passing and failing traces pertaining to that fault (or version).
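As an illustration of how Tarantula uses both passing and failing traces, the following minimal sketch applies the standard Tarantula suspiciousness formula to function coverage. The names `tarantula_scores` and `rank`, and the representation of each trace as a set of covered function names, are assumptions made for this example; they are not taken from Jones and Harrold (2005) or Di Fatta et al. (2006).

```python
from typing import Dict, List, Set


def tarantula_scores(passing: List[Set[str]],
                     failing: List[Set[str]]) -> Dict[str, float]:
    """Score each covered function with the Tarantula suspiciousness formula.

    suspiciousness(f) = fail_ratio(f) / (pass_ratio(f) + fail_ratio(f)),
    where each ratio is the fraction of passing (or failing) traces covering f.
    Each trace is modelled (an assumption) as the set of functions it covers.
    """
    total_pass, total_fail = len(passing), len(failing)
    functions = set().union(*passing, *failing)
    scores: Dict[str, float] = {}
    for fn in functions:
        pass_ratio = sum(fn in t for t in passing) / total_pass if total_pass else 0.0
        fail_ratio = sum(fn in t for t in failing) / total_fail if total_fail else 0.0
        denom = pass_ratio + fail_ratio
        scores[fn] = fail_ratio / denom if denom else 0.0
    return scores


def rank(scores: Dict[str, float]) -> List[str]:
    """Order functions from most to least suspicious (ties broken by name)."""
    return sorted(scores, key=lambda fn: (-scores[fn], fn))


if __name__ == "__main__":
    # Hypothetical traces for one faulty version.
    passing = [{"main", "parse"}, {"main", "parse", "emit"}]
    failing = [{"main", "parse", "scan"}, {"main", "scan"}]
    print(rank(tarantula_scores(passing, failing)))
```

Note that the score is only informative when both passing and failing traces for the same fault are available, which is the in-house setting discussed in this section.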
In Figure 20, we also show the performance of F007 on the Siemens suite; however, the Y-axis (axis 2) for F007 is calibrated in the number of failed traces across all versions (faults) of the Siemens suite. This means that F007 can discover the faulty function for a single trace by using a previously collected set of (only) “failed” traces for the same or different faults in the same function. Unlike FP and Tarantula, F007 does not require a collection of both “passing” and “failing” traces related to the same fault in order to discover the faulty function. However, F007 still requires an initial collection of labeled traces with known faulty functions (i.e., knowledge of at least one fault for the function) to discover the faulty functions in new traces. (See Section 2.3, where we describe how an initial set of traces can be built from in-house traces and subsequently evolved from field traces, and how F007 can be trained on the evolved set of traces.)
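The idea of predicting the faulty function of a new failed trace from previously labeled failed traces can be sketched as follows. This is not the F007 implementation: F007 uses a decision tree classifier, whereas this example swaps in a 1-nearest-neighbour rule with cosine similarity so that it stays dependency-free. The function names, the trace representation (a list of called function names), and the sample data are all hypothetical.

```python
from collections import Counter
from math import sqrt
from typing import List, Sequence, Tuple

LabeledTrace = Tuple[Sequence[str], str]  # (failed trace, known faulty function)


def call_counts(trace: Sequence[str]) -> Counter:
    """Represent a trace by the frequency of each single function-call."""
    return Counter(trace)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two function-call frequency vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def predict_faulty_function(labeled: List[LabeledTrace],
                            new_trace: Sequence[str]) -> str:
    """Return the label of the most similar previously seen failed trace.

    Stand-in for a trained classifier; only failed traces are needed.
    """
    query = call_counts(new_trace)
    best_label, _ = max(
        ((label, cosine(call_counts(trace), query)) for trace, label in labeled),
        key=lambda pair: pair[1],
    )
    return best_label


if __name__ == "__main__":
    # Hypothetical failed traces, each labeled with its known faulty function.
    history = [
        (["main", "parse", "scan", "scan"], "scan"),
        (["main", "emit", "flush"], "flush"),
    ]
    print(predict_faulty_function(history, ["main", "scan", "parse", "scan", "scan"]))
```

The sketch needs no passing traces, but it cannot name a function for which no labeled failed trace exists, which mirrors the requirement of an initial labeled collection stated above.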
Figure 20: Comparing Frequent Pattern Mining (FP) using function sequences and Tarantula on function coverage against F007.
Thus, F007 is useful for deployed software where a large number of faults are rediscoveries originating from a small percentage of code. It is also useful when it is not feasible to collect many passing and failing traces for a fault from the field, or when only the failed traces are gathered for economic reasons. FP and Tarantula are suited primarily for in-house testing, where pass-fail traces are readily accessible for a fault; they are not suitable when only a limited number of failing traces is available from the field. Thus, while F007 is related to FP and Tarantula, it is not directly comparable with them, because F007 is suited for field testing whereas FP and Tarantula are suited for in-house testing. Similarly, the other techniques mentioned in Section 2.2.1 (e.g., discovering faulty statements using statement coverage (Jones and Harrold, 2005; Wong et al., 2006; Wong et al., 2007; Zhang et al., 2009) and statistical debugging (Chilimbi et al., 2009; Liu and Han, 2006; Liu et al., 2005; Zheng et al., 2004)) have the same major differences from F007 as do FP and Tarantula in Figure 20.
In Figure 21, a similar comparison is made, in terms of effort in statements, between F007 and the statement-level techniques: effective fault localization using code coverage (EFL) (Wong et al., 2007) and Tarantula on statement coverage (Jones and Harrold, 2005). Again, the same differences exist between F007 and both EFL and Tarantula, and the results are not directly comparable, for the same reasons given for Figure 20 (FP and Tarantula on function coverage). The statement effort shown for F007 could only improve, because it is computed with the pessimistic approach (see Section 2.6.2), whereas the statement effort shown for EFL and Tarantula in Figure 21 would not improve further, because it is already the best case. In Section 2.2 and Table 2, we characterized F007 and the other closely related techniques similar to Tarantula and EFL. There is no direct comparison of F007 against other fault discovery techniques focusing on in-house testing.
Figure 21: Comparing Effective Fault Localization and Tarantula on statement coverage against the statement-effort of F007.
Table 10: Comparison of related techniques focusing on function-call pattern analysis.
Reference                 Pattern Length   Pattern Type                    Method       Output
Di Fatta et al. (2006)    2+               Serial                          Heuristic    Function
Dallmeier et al. (2005)   2+               Serial                          Heuristic    Class
Elbaum et al. (2007)      5                Serial                          Heuristic    Pass/fail
Yuan et al. (2006)        1                Serial                          Classifier   Config. Cause
F007                      1                Serial, Parallel, and Hybrid    Classifier   Function
In Section 2.6.1, we showed that single function-calls (episodes of length 1) are sufficient to discover faulty functions in failed traces. In Table 10, we compare our findings with those of the related techniques that focus on the use of patterns in fault discovery. Table 10 shows: (a) the references of the related techniques focusing on the use of patterns; (b) the length of function-call patterns that the researchers found effective in improving accuracy; (c) the empirical method employed by the researchers (a machine learning classifier or other comparison heuristics); and (d) the output of each technique. The techniques in Table 10 are explained as follows:
• The technique (FP) of Di Fatta et al. (2006) for detecting faulty functions, and the use of object-specific sequences by Dallmeier et al. (2005) for detecting faulty classes, were primarily focused on testing. Both compared patterns of function-calls extracted from passing traces against patterns extracted from failing traces, in order to detect faulty functions (Di Fatta et al., 2006) or faulty Java classes (Dallmeier et al., 2005). Both found that patterns of length greater than two function-calls discover faults with 15% to 20% better accuracy than single function-calls (length 1). Di Fatta et al. (2006) experimented on the Siemens suite, and Dallmeier et al. (2005) experimented on NanoXML (4334-7646 LOC and 16-23 classes).
• Elbaum et al. (2007) found that patterns of length up to five are useful in deciding when to start collecting the traces of field failures. They found that function-call patterns of length up to five improve accuracy by 10% over single function-calls (length 1), but that the accuracy does not improve beyond length five. Elbaum et al. (2007) used heuristics, such as the identification of exceptional function sequences and exceptional frequency ranges, to achieve their task on the Pine program (157,245-186,366 LOC and 1558-1785 functions).
• Yuan et al. (2006) used support vector machines (a classification algorithm) to identify root causes of configuration problems in a Windows XP based system. Due to the large size of Windows XP, their traces contained about 100,000 system function-calls. Yuan et al. (2006), like F007, found that longer patterns of function-calls do not yield any better accuracy than single function-calls.
• Finally, we evaluated F007 on small to large commercial programs (see Table 4 and Table 8) and found that, when using the decision tree classifier, longer patterns of function-calls do not improve accuracy. Our findings are similar to what Yuan et al. (2006) found when using another classifier. However, Yuan et al. (2006), like the other researchers in Table 10, extracted only serial patterns (of length equal to the window width), whereas we extracted serial, parallel, and hybrid patterns (see Section 2.3.1) of different window widths and lengths, so our experiments cover a wide range of patterns. We have also validated our results by conducting statistical tests on many different programs (see Section 2.6.1.1); the other works in Table 10 lack such validation. Another novel contribution of this paper is the finding that using only function “entry” or only function “exit” events is sufficient to discover the fault origin (see Section 2.6.4). This finding helps halve the size and overhead of function-call traces; e.g., for the large program used in our study, some traces are about 4GB (44 million function-calls), and such traces can be reduced to half. Also, in the case of the large program (see Section 2.7.5), we removed those functions which had low variances because they occur in few traces or occur in all the traces. Yuan et al. (2006) performed similar filtering by setting a threshold to remove function-calls that occur rarely in some traces. Interestingly, both in Yuan et al. (2006) and in our case, the accuracy remains the same after removing such function-calls. This shows that, for large programs, the sizes of function-call traces can be reduced by more than half if function-calls that occur rarely or have low variances are discarded along with function “entry” or “exit” events.
In summary, the novel attributes of this paper are: (a) faulty functions in future releases or the same release can be identified by using the traces of at least one fault of the same faulty functions from a previous or the same release; (b) different faults in the same function occur with similar function-calls; (c) patterns of function-calls (i.e., serial, parallel, and hybrid) do not improve the accuracy of identification of the fault origin, as single function-calls are sufficient; (d) only function “entry” or only function “exit” events are sufficient to discover the fault origin; and (e) in the large program, the removal of function-calls with similar frequencies does not decrease the accuracy of identification of faulty functions.
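The two trace-reduction steps discussed above (keeping only function “entry” events and discarding function-calls with low variance across traces) can be sketched as follows. This is a minimal illustration under stated assumptions: the event format `(function, kind)`, the helper names, the use of population variance of per-trace call counts, and the threshold value are all hypothetical and are not taken from the paper's implementation.

```python
from collections import Counter
from statistics import pvariance
from typing import List, Set, Tuple

Event = Tuple[str, str]  # (function name, "entry" or "exit"); assumed format


def entries_only(trace: List[Event]) -> List[str]:
    """Keep only function 'entry' events, which halves a balanced trace."""
    return [fn for fn, kind in trace if kind == "entry"]


def low_variance_functions(traces: List[List[str]], threshold: float) -> Set[str]:
    """Find functions whose per-trace call count barely varies across traces.

    The variance measure and threshold are assumptions for illustration.
    """
    counts = [Counter(t) for t in traces]
    functions = set().union(*counts)  # iterating a Counter yields its keys
    return {
        fn for fn in functions
        if pvariance([c[fn] for c in counts]) <= threshold
    }


def reduce_traces(raw_traces: List[List[Event]], threshold: float) -> List[List[str]]:
    """Apply both reductions: entry-only events, then low-variance filtering."""
    entry_traces = [entries_only(t) for t in raw_traces]
    drop = low_variance_functions(entry_traces, threshold)
    return [[fn for fn in t if fn not in drop] for t in entry_traces]


if __name__ == "__main__":
    # Hypothetical traces: 'main' is called once in every trace (zero variance).
    t1 = [("main", "entry"), ("scan", "entry"), ("scan", "exit"), ("main", "exit")]
    t2 = [("main", "entry")] + [("scan", "entry"), ("scan", "exit")] * 3 + [("main", "exit")]
    print(reduce_traces([t1, t2], threshold=0.0))
```

In this example, `main` is called exactly once in every trace, so it carries no information for distinguishing traces and is dropped, while `scan` is retained.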