I I SE EXPIDE CONSTANCIA DE ALINEAMIENTO CON LAS SIGUIENTES CARACTERÍSTICAS
3/3CONSTANCIA DE ALINEAMIENTO Y/O NÚMERO OFICIAL
We now present the results and answers to our two research questions.
RQ1: How does the accuracy of the traceability links recovered by Trustrace compare with that of approaches based on JSM and VSM alone?
Figure 6.3 shows the precision and recall graphs of JSM, Trustrace, and VSM. Trustrace provides better precision and recall values than the two IR techniques by themselves. Table 6.1 shows the average precision and recall values calculated by comparing the differences between the JSM, Trustrace, and VSM approaches. Trustrace with DynWing has a better precision and recall than the other weighting techniques. The recall value for Pooka improves on average but without statistical significance when compared to VSM results.
In the case of SIP with only the Histracebugs expert, recall values decrease with statis-
tical significance when compared to VSM values, as discussed in Section 6.4.1. There is no statistically-significant decrease in recall when compared to JSM results. We explored the reason for the recall values to decrease in the case of SIP with Histracebugs. We found that
only 4% of SIP SVN commits are linked to bug reports. Therefore, we did not find much evidence for many links and this lack of evidence yielded many links from the baseline set to be removed, following our constraints in Equation 5.1, and, consequently, a lower recall.
We performed the statistical tests described in Section 6.2.6 to verify whether or not the average improvements in precision and recall are statistical significant. We have statistically- significant evidence to reject H01and H02. Table 6.1 shows that the p-values for the precision
values are below the standard significant value, α = 0.05. The reported figures show that, for most values of precision and recall, we can reject H03 and H04 in all but three cases (in
bold in Table 6.1). Thus, we cannot claim to always reject H03 and H04: in two cases for
VSM and one for JSM, recall values do not improve.
We use Cliff’s d as introduced in Section 6.2.6 to measure the effect of Trustrace over single IR techniques. Table 6.1 shows that Trustrace has a large effect on the improvements in precision and recall values in 66%, medium in 19%, small in 9% of the improvements. Only in the case of SIP, i.e., 6%, with only Histracebugs, Trustrace recall values decreased
with one large and one small effect size. Overall, the obtained effect size values indicate a practical improvement with Trustrace.
jEdit Pooka Rhino SIP V SM 0 20 40 60 80 100 0 20 40 60 80 100 VSM Trustrace 0 20 40 60 80 100 0 20 40 60 80 100 VSM Trustrace 0 20 40 60 80 100 0 20 40 60 80 100 VSM Trustrace 0 20 40 60 80 100 0 20 40 60 80 100 VSM Trustrace J SM 0 20 40 60 80 100 0 20 40 60 80 100 JSM Trustrace 0 20 40 60 80 100 0 20 40 60 80 100 JSM Trustrace 0 20 40 60 80 100 0 20 40 60 80 100 JSM Trustrace 0 20 40 60 80 100 0 20 40 60 80 100 JSM Trustrace
Figure 6.3 Precision and recall values of JSM, Trustrace, and VSM, with the threshold t varying from 0.01 to 1 by step of 0.01. The X axis shows recall and Y axis shows precision.
Table 6.1 Precision and recall values for jEdit, Pooka, Rhino, and SIP, Mann-whitney test results, and Cliff’s d results
Precision Recall
VSM T rustraceV SM p-value Effect Size VSM T rustraceV SM p-value Effect Size
jEdit 59.55 69.14 <0.01 0.59 5.81 8.31 <0.01 0.67
Pooka 42.28 52.46 <0.01 0.37 11.14 12.52 0.99 0.24
Rhino (Histracecommits only) 71.79 92.76 <0.01 0.92 6.82 9.52 <0.01 0.92
Rhino (Histracebugs only) 71.79 93.73 <0.01 0.92 6.82 9.20 <0.01 0.92
Rhino 71.79 94.49 <0.01 0.84 6.82 12.33 <0.01 0.96
SIP, (Histracecommits only) 15.84 25.97 <0.01 0.61 15.61 15.79 <0.04 0.55
SIP, (Histracebugs only) 15.84 42.97 <0.01 0.37 15.61 11.07 <0.01 0.63
SIP 15.84 24.28 <0.01 0.34 15.61 21.60 <0.01 0.48
JSM T rustraceJ SM p-value Effect Size JSM T rustraceJ SM p-value Effect Size
jEdit 52.82 71.12 <0.01 0.82 13.62 15.91 <0.01 0.88
Pooka 33.11 45.48 <0.01 0.47 13.33 16.45 <0.01 0.19
Rhino (Histracecommits only) 77.37 87.16 <0.01 0.15 15.56 16.18 <0.01 0.77
Rhino (Histracebugs only) 77.37 90.88 <0.01 0.47 15.56 17.33 <0.01 0.59
Rhino 77.37 91.79 <0.01 0.47 15.56 18.26 <0.01 0.59
SIP (Histracecommits only) 15.83 21.29 <0.01 0.46 19.34 21.44 <0.01 0.60
SIP (Histracebugs only) 15.83 37.94 <0.01 0.33 19.34 14.24 <0.56 0.25
We answer RQ1: How does the accuracy of the traceability links recovered by Trustrace compare with that of approaches based on JSM and VSM alone? as follow: Trustrace helps to recover more correct links than IR techniques alone. When two experts are available, Trustrace is always better. Only in one case, and with just a single expert, due to a lack of external source of information, recall went down.
RQ2: How does the accuracy of the traceability links recovered using DynWing compare to that using PCA?
When comparing DynWing with a PCA-based weighting technique, Figure 6.4 shows that DynWing provides better results than PCA. PCA tends to provide higher precision than recall in some cases whereas DynWing tends to find a better balance between precision and recall in all cases.
We performed a Mann-whitney test to analyse if DynWing statistically provides better F1 scores or not. Table 6.2 shows that p-values are below the standard significant value
α = 0.05 for T rustraceV SM. Thus, we reject H05. In the case of JSM, for SIP, using two
experts does not yield any difference between DynWing and PCA-based weighting. Thus, we cannot reject H06 because in one case DynWing and PCA-based weighting provide the
same results.
Thus, we answer RQ2: How does the accuracy of the traceability links recovered using DynWing compare to that using PCA? as follow: DynWing provides better weights for different experts than a PCA-based weighting technique. However, it is possible that in some cases PCA-based weighting provides the same (but not better) results as DynWing.