
Figure 16 shows the results of F007 on three releases, using 25% of the traces as the training set and 75% as the test set. On average, for each release, F007 successfully identifies the faulty functions in 70% of the failed traces after a review of less than 1% of the code (functions). In the remaining 30% of the cases, some of the faulty functions occurred in only one trace, so F007 could not identify them at all from the sample of traces we used. Although 82% of the faults in the database were rediscoveries, the traces were not kept for long in the repository of this commercial software because of their large size. This is why a few faulty functions were found in only one trace. F007 stores traces in its database in the form of common functions (episodes), which reduces the storage overhead compared with storing traces in raw form. Thus, with F007 the actual traces can also be preserved for a long time.
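The accuracy measure used in these figures (the share of failed traces whose faulty function is found within a given code-review budget) can be sketched as follows. This is a minimal illustration and not F007's implementation: the function name `identified_fraction`, the ranked-suspect input format, and the assumption of one known faulty function per trace are all hypothetical.

```python
import math
from typing import Dict, List


def identified_fraction(
    ranked_suspects: Dict[str, List[str]],  # trace id -> suspected functions, best first (assumed format)
    true_faulty: Dict[str, str],            # trace id -> the known faulty function (assumed: one per trace)
    total_functions: int,                   # number of functions that make up 100% of the program
    review_percent: float,                  # review budget, e.g. 1 for "1% of the code"
) -> float:
    """Return the fraction of failed traces whose faulty function appears
    within the first `review_percent` percent of functions reviewed."""
    if not ranked_suspects:
        return 0.0
    # Convert the percentage budget into a number of functions to review.
    max_reviewed = math.floor(total_functions * review_percent / 100)
    hits = 0
    for trace_id, suspects in ranked_suspects.items():
        # A trace counts as identified only if the faulty function is
        # among the suspects reviewed before the budget runs out.
        if true_faulty[trace_id] in suspects[:max_reviewed]:
            hits += 1
    return hits / len(ranked_suspects)
```

A faulty function that never appears in the training traces cannot show up in any suspect list, so under this measure such a trace always counts as a miss, whatever the budget.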

Figure 16: F007 on three releases of a large commercial application.

In Figure 17, we show the results of identifying the faulty functions in release 2 using release 1 as the training set for F007. Using the traces of release 1 alone, we could identify the faulty functions in only 35% of the failed traces of release 2 after a review of 3% or less of the code. This is because not all of the faulty functions found in release 2 were present in the training set of release 1. However, when 10% of the traces of release 2 were used together with the traces of release 1, approximately 80% of the faulty functions were successfully identified. Similarly, in Figure 18, we used the traces of both release 1 and release 2 to identify the faulty functions in the traces of release 3. Figure 18 shows that the faulty functions in about 60% of the traces were identified using only the traces of release 1 and release 2.
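The cross-release setup described here (all traces of the earlier release plus a small share of the current release for training, the rest of the current release for testing) can be sketched as below. The function name, the random sampling, and the fixed seed are assumptions made for illustration; the text does not state how the 10% sample was drawn.

```python
import random
from typing import List, Sequence, Tuple


def split_for_cross_release(
    previous_traces: Sequence[str],   # failed traces of the earlier release(s)
    current_traces: Sequence[str],    # failed traces of the release under test
    current_fraction: float = 0.10,   # share of the current release added to training
    seed: int = 0,                    # assumption: random sampling with a fixed seed
) -> Tuple[List[str], List[str]]:
    """Build a training set from the earlier release plus a sample of the
    current release; the unsampled current traces form the test set."""
    rng = random.Random(seed)
    shuffled = list(current_traces)
    rng.shuffle(shuffled)
    k = round(current_fraction * len(shuffled))
    train = list(previous_traces) + shuffled[:k]
    test = shuffled[k:]  # never overlaps with the sampled training traces
    return train, test
```

Setting `current_fraction=0.0` gives the "release 1 only" configuration, in which every trace of the current release is held out for testing.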

In our experiments in Section 2.6.2 and in this section (Figure 17), we observed, interestingly, that the first few releases have fewer common faulty functions than the subsequent releases (e.g., Figure 18). This could be because the sample of data we used for the experiments did not contain common faulty functions in the failed traces. It could also mean that, as the software becomes stable through releases, the faulty functions become more similar across releases. Nonetheless, if 50-90% of the field failures are rediscoveries of the same fault, then by using just 10% of the failed traces of the current release we can still identify the majority of the faulty functions. Also, for earlier releases, the accuracy of F007 can be improved by using in-house failed traces, because we have observed that a fault in the same function occurs with similar function-call sequences, and that the origins of in-house and field faults overlap, according to our own study on a very large software system (Gittens et al., 2005).

Figure 17: Identifying the faulty functions across releases.

In a large commercial software application, it is also worthwhile to point out faults at a higher level of granularity, such as components. A large system contains so many components that the component level is a useful abstraction for maintainers locating bugs in functions, files, and statements. This could help maintainers diagnose the fault origin correctly. For example, maintainers can use their experience to decide which combination of faulty functions and faulty components from F007’s predicted list would lead them to the correct fault origin.

Figure 18: Identifying faulty functions across releases.

In Figure 19, we show the accuracy of F007 in identifying the faulty components in the field traces. Here, we took 300 as the total number of components in order to measure the code review in terms of components. We first used release 1 to identify the faulty components in release 2. The series “using release 1 for release 2” in Figure 19 shows that the faulty components in approximately 50% of the failed traces were diagnosed correctly after a review of 8% of the program (i.e., components in this case). Similarly, the second series, “using release 1 and 10% of release 2 for release 2”, shows that the faulty components in approximately 90% of the failed traces of release 2 can be correctly identified after a review of approximately 8% of the code. This identification was done using only 10% of the failed traces of release 2 and the traces of release 1 as the training set, and the remaining 90% of the traces of release 2 as the test set. Finally, following a similar approach, Figure 19 also shows that the faulty components in 90% of the failed traces of release 3 were identified using only the failed traces of release 1 and release 2, again after a review of approximately 8% of the code (components).

In Figure 19, we have not shown the results for release 3 obtained from the combined traces of release 3, release 2, and release 1. This is because the faulty components in 90% of the failed traces of release 3 were already identified using the traces of release 1 and release 2, after a review of approximately 8% or less of the code (components).

Figure 19: Identifying faulty components across releases (a total of 300 components make 100% program in this figure).
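To make the component-level review budget concrete, the sketch below converts a percentage into a number of components (8% of 300 components is 24) and rolls a ranked list of suspected functions up to components. The text does not say how F007 derives its component predictions, so `rollup_to_components` and the function-to-component mapping are hypothetical, shown only as one simple possibility.

```python
import math
from typing import Dict, List


def review_budget(total_units: int, percent: float) -> int:
    """Number of units (functions or components) covered by a review of
    `percent` percent of the program."""
    return math.floor(total_units * percent / 100)


def rollup_to_components(
    ranked_functions: List[str],   # suspected functions, best first
    component_of: Dict[str, str],  # hypothetical mapping: function -> component
) -> List[str]:
    """Map ranked functions to their components, keeping the first
    occurrence of each component so the ranking order is preserved."""
    seen = set()
    components = []
    for function in ranked_functions:
        component = component_of.get(function)
        # Skip functions with no known component and components already listed.
        if component is not None and component not in seen:
            seen.add(component)
            components.append(component)
    return components
```

With 300 components making up 100% of the program, `review_budget(300, 8)` evaluates to 24, so a faulty component counts as identified at the 8% level only if it is among the first 24 components reviewed.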

In this section, we have reported results only in terms of the number of functions or components reviewed (program). For this commercial software, we could not get access to the actual source code to count the number of statements in the functions or components. Doing so would allow a finer-grained evaluation in terms of the number of statements reviewed for each function or component (similar to what we have shown in Section 2.6.2.3). However, as we mentioned in the earlier section (Section 2.6.2.3), maintainers do not review all the statements of every component or function to identify a fault; they use their experience and context to focus on only the few relevant statements. Thus, any finer-grained evaluation would rest on subjective judgment.

2.8 Executing F007 across Releases: Revisiting Example
