In this section, we describe certain threats to the validity of the research results. We classify threats into four groups: conclusion validity, internal validity, construct validity, and external validity.

2.11.1 Conclusion Validity

Conclusion validity is concerned with our ability to draw the correct conclusion about the relations between treatment and outcome of an experiment (Wohlin et al., 2000).

A threat to conclusion validity lies in the number of faults whose traces we used to draw our conclusions. In the large software application (Table 8), we observed 82% rediscoveries of faults in the database, but we were able to collect failed traces for only some of the faults. Similarly, for the UNIX utilities, the failed traces of some faults were not used because we applied the criterion of using only faults with less than 20% of failed test cases (Do et al., 2005). The sample of failed traces that we collected therefore did not represent all the faults that occurred in the releases of the software applications. This threat is mitigated by the fact that the results on the large software application were similar to the results on the Siemens suite (Do et al., 2005; Hutchins et al., 1994), Space (Do et al., 2005), and the UNIX utilities (Do et al., 2005). In fact, the accuracy across releases would likely be higher if the failed traces of all the faults were used, because the decision tree would then have had more complete knowledge of cross-release faulty functions.

The threat to conclusion validity is low because we have evaluated F007 on twelve medium to large programs with several releases, and a large legacy program with three releases. There is thus sufficient evidence for valid conclusions.

2.11.2 Internal Validity

Internal validity is concerned with whether the relationship between treatment and outcome is causal, and not due to any confounding factors (Wohlin et al., 2000).

A threat to internal validity exists in the implementation of the algorithms and of this technique, since it involved a considerable amount of programming. Human errors (e.g., logical errors) are possible in such an implementation. Although it was not possible to manually verify the output on all the traces, we mitigated this threat by manually inspecting the outputs on different example traces. For example, in the case of the MINEPI algorithm (Mannila et al., 1997), we manually verified the outputs on different examples.

2.11.3 Construct Validity

Construct validity refers to the extent to which the experiment settings actually reflect the construct under study (Wohlin et al., 2000).

A threat exists in measuring the programmer’s effort in discovering faulty functions. Recall from Sections 2.3, 2.5, and 2.6.2 that F007 generates a list of faulty functions for a new failed trace, and that the programmer’s effort is measured by counting the functions (or statements) examined (see Equation 1 and Equation 2). In a ranking-based technique such as F007, two or more functions can be listed at the same rank. In such cases, the best case is that the first function examined at that rank is faulty, and the worst case is that the last function examined at that rank is faulty. For example, suppose there is one function listed at rank 1 and five functions listed at rank 2, and one of the five functions at rank 2 is faulty. The best case is that the faulty function is the second to be examined (i.e., one at rank 1 and one at rank 2), whereas the worst case is that it is the sixth to be examined. This implies that an incompetent technique will have high best case accuracy (e.g., 90-100% accuracy on examining 1-10% of the program) and low worst case accuracy (e.g., 90-100% accuracy only on examining 90-100% of the program), because it will list all the functions as faulty at the same rank.
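The best and worst case counts in the example above can be reproduced with a short sketch. This is only an illustration of the tie-handling arithmetic, not the implementation used for F007; the function name `examined_bounds` and the function labels `f_a` to `f_f` are hypothetical.

```python
from typing import Dict, Tuple


def examined_bounds(ranks: Dict[str, int], faulty: str) -> Tuple[int, int]:
    """Return (best, worst) counts of functions examined up to and
    including the faulty one, when ties share the same rank.

    `ranks` maps a function name to its rank (1 is examined first).
    Names and ranks here are illustrative assumptions.
    """
    faulty_rank = ranks[faulty]
    # Functions ranked strictly ahead of the faulty one are always examined.
    ahead = sum(1 for r in ranks.values() if r < faulty_rank)
    # Functions tied with the faulty one (this count includes the faulty one).
    tied = sum(1 for r in ranks.values() if r == faulty_rank)
    best = ahead + 1      # faulty function is the first of its tie group
    worst = ahead + tied  # faulty function is the last of its tie group
    return best, worst


# One function at rank 1, five functions at rank 2, one of which is faulty.
ranks = {"f_a": 1, "f_b": 2, "f_c": 2, "f_d": 2, "f_e": 2, "f_f": 2}
best, worst = examined_bounds(ranks, "f_d")
print(best, worst)  # 2 6

# A degenerate ranking that puts every function at the same rank.
flat = {f"g{i}": 1 for i in range(10)}
print(examined_bounds(flat, "g3"))  # (1, 10)
```

The second call illustrates why reporting only the best case can flatter a technique that ranks everything equally: its best case is one function examined, while its worst case is the whole list.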

Figure 22: Best and worst case accuracies using F007 for the UNIX utilities.

In our case, the worst and the best case mostly resulted in the same accuracy; in a few cases, there were only minor differences between them. For example, Figure 22 shows the differences between the worst and the best case accuracies obtained using F007 on the UNIX utilities. Thus, in all our results we have shown the best case accuracies because: (a) there was hardly any difference between the worst and the best case; (b) this avoided cluttering the graphs; and (c) related techniques report their best cases, so a valid comparison could only be made by comparing their best cases with our best case.

Another threat to construct validity exists in measuring the code reviewed by the programmer to identify faulty functions. This measure of the programmer’s effort depended on the failed traces and their correct mapping to faulty functions. In the case of the large software application, as mentioned in Section 2.7, no record of a direct mapping between faulty functions and failed traces was kept. We collected the required data with the help of different developers, tools, and scripts. The process was complex, and it could have introduced discrepancies in the mappings of traces to faulty functions in some cases. This threat was mitigated by the fact that the results on the large software application were similar to the results on the other software studied. It was also mitigated by using a sufficient number of traces for the large software application.

A threat to construct validity exists in the use of failed field traces for fault discovery by F007. Consider automated failure reporting, such as in Mozilla, NetBeans, and Visual Studio. Such failure reporting facilitates fault localization by providing contextual information, traces, etc. to the developers. A large collection of such traces may contain passing traces. In such cases, pass-fail classification techniques (Bowring et al., 2004; Haran et al., 2007) or a technique that collects only the function-calls related to the fault (Elbaum et al., 2007), which are complementary to our work, can be used to classify a trace as passing or failing. However, if a trace is captured at the time of a fault (as it was in the case of the large program; see Section 2.7), then F007 will identify faulty functions in that trace. This is because a trace captured at the time of a fault would encompass the sequences of function-calls leading to the faulty functions, even if it doesn’t contain specific fault conditions (e.g., an exception thrown). Our results on the large program (Section 2.7), which do not differ from those on the other subject programs, indicate that F007 can identify faulty functions in such field traces.

2.11.4 External Validity

External validity refers to the ability to generalize results of an experiment to industrial practice (Wohlin et al., 2000).

If a commercial software application is restructured, then a threat exists when predicting faulty functions across the releases. When a software application is restructured, new functions are added, existing functions are modified, and functions are re-grouped into different namespaces or files. A restructuring is therefore just like a new release of the software, and the same method can be applied as discussed in Section 2.8. For example, in the “Sed” program (of the UNIX utilities), the move from release 3.02 to release 4.0.6 took 5 years, almost every major function changed, new developers worked on the project, and the changes were significant (Do et al., 2005). Our results on the Sed program are shown in Section 2.6.
