
One of the crosscutting concerns throughout this dissertation is the praise for the benefits of the experimental validation of claims made in the context of Software Engineering proposals. To provide a preliminary validation of the proposed process model, we next report a case study conducted at Universidade Nova de Lisboa (UNL) with graduate students pursuing an MSc degree in Informatics.

Problem statement

The process model described in this chapter was designed to conform with the state of the art practice in Experimental Software Engineering, while being accessible to practitioners engaged in experimentation. Although we have been using this process in our work, as will be illustrated throughout the whole dissertation, we would like to assess how hard it is for new experimenters to follow this process.

Among several other desirable quality attributes in a process model such as the one presented in this chapter (e.g. effectiveness, predictability) we would like it to have a balanced learnability and understandability.

In this case study, we analyze the results of a series of experiments conducted by graduate students, following the process model presented in this chapter, in order to identify the sub-processes within our model that pose a harder challenge for practitioners. This information can be used to improve the presented process model in the future.

Research objectives

In this case study, our goal (G1) is to:

analyze software engineering experiments,

for the purpose of their evaluation,

with respect to the quality of their outcome,

from the viewpoint of a course instructor,

in the context of experimental work carried out during a course for graduate students on the quality of software products and processes.

Context

This case study was conducted in the context of the Product and Process Quality course¹⁰, carried out in the Spring semester of 2008 as part of the MSc program in Informatics, at UNL. This is a 6 ECTS¹¹ course where students were asked to conduct an ESE project, following the process presented in this chapter, as part of their evaluation during the semester.

3.4.2 Related work

To the best of our knowledge, this was the first application of the process presented in this chapter in experimental work which was performed by subjects who were not members of the proponent’s research group. Other examples of the application of the proposed experimental process can be found throughout this dissertation. A more thorough discussion on related work concerning the model under scrutiny can be found in section 3.5.

¹⁰ The details concerning this course, including course materials, are publicly available at http://moodle.fct.unl.pt/course/view.php?id=1713

3.4.3 Experimental planning

In this case study, we will try to achieve goal G1, described earlier in this section, by following the plan presented in this sub-section.

Experimental units, material and tasks

As stated in the previous section, this case study was carried out in the context of the 2008 Product and Process Quality course at UNL. The participants were graduate students who chose to take this elective course. Typically, these students are in the second semester of Bologna’s second cycle, that is, a year away from finishing their MSc degree. The students were grouped in teams of two members. The students had access to the course materials, publicly available on the course’s web site, which include not only an earlier description [Goulão 07a] of the process presented here, but also the course’s slides, as well as references to several relevant publications about Experimental Software Engineering. Furthermore, the students had access to the International Software Benchmarking Standards Group (ISBSG)¹² repository. A description of this repository can be found in [ISBSG 07a, ISBSG 07b]. Each group was asked to perform a particular observational study using data stored in ISBSG’s repository. This task was performed as part of the evaluation process of the course, off-line.

Hypotheses

The available time schedule and resources conditioned the extent to which this form of validation could be performed. In the best interest of the students taking the course, it was not feasible, for instance, to create a control group that would perform similar experiments in an ad-hoc way, so that we could compare the outcomes and use them to assess the benefits of following this process, as opposed to not following it. This constrains the kind of hypotheses we can validate here, as all students followed the same process. We can only discuss qualitatively whether or not the process helped students to successfully complete their tasks (and will do so, in the discussion of this case study).

We can, however, test whether or not the process description made available to the students [Goulão 07a], along with the course training and course materials, led to a well-balanced outcome of the experimental processes. In other words, were students able to follow the process with a consistent degree of success in each of the sub-tasks, or were there sub-tasks that would clearly benefit from improvements in the documentation and training made available to students?

More formally, we can express this concern as hypothesis H1, which will be sub-divided into the null hypothesis (H1₀) and its alternative (H1₁):

H1₀: The process was followed with a relatively uniform success.

H1₁: The process was followed with significantly (and consistently) different levels of success in different tasks.

Independent variables

The independent variable is nominal and represents group membership. It corresponds to the group’s id (GroupID).

Dependent variables

The dependent variables of this study are the detailed classifications of each group. The specific weight given by the course tutor to each of these partial classifications is not relevant for our analysis. Therefore, we will represent the grades as a percentage of the achieved success. The overall grade of the group in this project is a weighted sum of these partial grades, but its value is not relevant for the hypothesis being tested. The considered dependent variables are represented in the following list by a (code), followed by a short description. All their values are represented as a percentage:

• (W1.1) Problem statement
• (W1.2) Context definition
• (W1.3) Objectives definition
• (W2.1) Context parameters
• (W2.2) Hypothesis formulation
• (W2.3) Variables selection
• (W2.4) Subjects selection
• (W2.5) Experiment design
• (W2.6) Collection process
• (W2.7) Analysis techniques
• (W2.8) Instrumentation
• (W3.1) Collection clearance
• (W3.2) Motivation of participants
• (W3.3) Data collection
• (W3.4) Data validation
• (W3.5) Problem reporting
• (W4.1) Data description
• (W4.2) Data set reduction
• (W4.3) Hypothesis testing
• (W5.1) Results interpretation
• (W5.2) Validity threats identification
• (W5.3) Inference (generalization)
• (W5.4) Learned lessons

Design

This case study can be described as a within groups, post-test only design. In Trochim’s notation [Trochim 06], this can be described as follows:

X O11 X O12 ... X O53 X O54

In other words, each subject in our group receives exactly the same treatment, and its performance is then observed with each of the dependent variables (denoted as Oij; for instance, O41 stands for Data description). The rationale is to look for significant differences among the observations (which can be considered simultaneous) that are consistently observed in our subjects.

Procedure

The groups carry out their project in two phases. First, they hand in an early version of their project report, after 4 weeks. This report serves as an early check on who is really following the course and acts as a milestone that students have to pass in order to successfully complete the course. That said, the deliverable presented at this point is not addressed in this observation. Then, 6 weeks after the early version, they deliver their final project report. Only the latter is evaluated, by granting grades corresponding to each of our dependent variables.

Analysis procedure

The data analysis presented here comprises the following steps:

• Descriptive statistics: the mean, standard deviation, minimum and maximum values of all the variables are presented and discussed.

• Data set reduction: if necessary, outliers and extreme values are removed from the analysis.

• Normality tests: these tests are crucial for deciding which are the adequate statistics for our hypothesis, given the characteristics of the distribution in our sample.

• Hypothesis test: depending on the sample’s distribution, a parametric (for normal distribution) or a non-parametric (for other distributions) test is performed to check for statistically significant differences among our observations.

3.4.4 Execution

Sample

14 out of the 17 groups that signed up for the experiment finished the task. Therefore, the mortality of subjects (considering the groups as subjects) is 17,6%.
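The mortality figure follows directly from the two group counts. A minimal check, using only the numbers stated above (the variable names are ours and purely illustrative):

```python
# Group counts reported in the Sample paragraph.
signed_up = 17
finished = 14

# Mortality: share of the groups that signed up but did not finish the task.
dropped = signed_up - finished
mortality = dropped / signed_up

print(f"{dropped} of {signed_up} groups dropped out: {mortality:.1%}")
```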

Preparation

Before and during the conduction of their experimental work, the participants received training on the several tasks they were to perform in their project.

Data collection performed

The students performed their experiments as part of their normal work within the course. This project accounted for 30% of their final grade, so the incentive to perform well in it was considerable. The validation effort of our proposal did not interfere directly in the outcome of their projects, as this case study’s data is based on the evaluation of their reports. From the participants’ point of view, this was a normal project in a course. The classification of the experiment reports was carried out by the course instructor¹³. The data analysis that follows was based on the detailed classification report we had access to.

¹³ The course instructor was Prof. Fernando Brito e Abreu, the supervisor of this dissertation’s propo-

3.4.5 Analysis

Descriptive statistics

Table 3.1 presents the descriptive statistics for the collected variables.

        Mean    Std. Deviation   Minimum   Maximum
W1.1    ,6250   ,16261           ,50       1,00
W1.2    ,6964   ,20045           ,25       1,00
W1.3    ,7500   ,24019           ,50       1,00
W2.1    ,7500   ,21926           ,50       1,00
W2.2    ,7857   ,21611           ,50       1,00
W2.3    ,6786   ,20636           ,25       1,00
W2.4    ,6250   ,27298           ,25       1,00
W2.5    ,6786   ,28468           ,00       1,00
W2.6    ,4821   ,22922           ,25       1,00
W2.7    ,6250   ,32150           ,00       1,00
W2.8    ,6250   ,25476           ,00       1,00
W3.1    ,7679   ,26790           ,00       1,00
W3.2    ,6250   ,33613           ,00       1,00
W3.3    ,5179   ,26790           ,00       ,75
W3.4    ,3750   ,25476           ,00       ,75
W3.5    ,3036   ,29708           ,00       1,00
W4.1    ,6964   ,24374           ,25       1,00
W4.2    ,6071   ,21291           ,25       1,00
W4.3    ,6964   ,24374           ,25       1,00
W5.1    ,6607   ,23220           ,25       1,00
W5.2    ,5000   ,24019           ,00       ,75
W5.3    ,6071   ,30562           ,00       1,00
W5.4    ,6250   ,25476           ,00       1,00

Table 3.1: Descriptive statistics
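As an illustration of how each row of table 3.1 is obtained, the sketch below computes the four descriptive statistics for a single sub-process. The grade vector is an assumption: it is not the original data, but one vector on a 0,25-step grading scale that happens to reproduce the W1.1 row (mean ,6250; standard deviation ,16261; minimum ,50; maximum 1,00). The sample standard deviation (dividing by n − 1) is assumed, since that is the variant that matches the reported figure.

```python
import statistics

# Illustrative grades for one sub-process across the 14 groups.
# ASSUMPTION: this is NOT the original data; the vector was chosen so that
# it reproduces the W1.1 row of table 3.1 on a 0.25-step grading scale.
grades = [0.50] * 8 + [0.75] * 5 + [1.00]

summary = {
    "mean": statistics.mean(grades),
    "std_dev": statistics.stdev(grades),  # sample standard deviation (n - 1)
    "minimum": min(grades),
    "maximum": max(grades),
}

for name, value in summary.items():
    print(f"{name:8s} {value:.5f}")
```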

Table 3.2 presents the normality tests for our dependent variables. The null hypothesis for the normality tests (the Kolmogorov-Smirnov and the Shapiro-Wilk tests) is that there is no statistically significant difference between the observed cumulative distribution and that of the theoretical distribution being tested (the normal one). Several of the variables have a non-normal distribution according to at least one of the tests. Considering a confidence level of 95% in both tests, the normality hypothesis should be rejected if the significance of the test is less than 0,05. In other words, if the variable’s normality test has a significance (p-value) greater than 0,05, we cannot reject the hypothesis that the variable’s distribution is normal, at a confidence level of 95%. The non-normal variables (according to at least one of the normality tests) are highlighted in bold, in table 3.2, as is the test significance that points to the data’s non-normality.

Data set reduction

        Kolmogorov-Smirnov(a)          Shapiro-Wilk
        Statistic   df   Sig.          Statistic   df   Sig.
W1.1    ,350        14   ,000          ,731        14   ,001
W1.2    ,320        14   ,000          ,850        14   ,022
W1.3    ,280        14   ,004          ,730        14   ,001
W2.1    ,230        14   ,043          ,792        14   ,004
W2.2    ,268        14   ,007          ,786        14   ,003
W2.3    ,421        14   ,000          ,697        14   ,000
W2.4    ,176        14   ,200(*)       ,888        14   ,075
W2.5    ,313        14   ,001          ,842        14   ,017
W2.6    ,255        14   ,014          ,843        14   ,018
W2.7    ,164        14   ,200(*)       ,906        14   ,140
W2.8    ,331        14   ,000          ,814        14   ,007
W3.1    ,331        14   ,000          ,736        14   ,001
W3.2    ,225        14   ,053          ,867        14   ,038
W3.3    ,259        14   ,012          ,792        14   ,004
W3.4    ,260        14   ,011          ,876        14   ,052
W3.5    ,204        14   ,119          ,844        14   ,019
W4.1    ,218        14   ,069          ,875        14   ,049
W4.2    ,249        14   ,019          ,883        14   ,065
W4.3    ,218        14   ,069          ,875        14   ,049
W5.1    ,256        14   ,014          ,874        14   ,049
W5.2    ,214        14   ,081          ,861        14   ,032
W5.3    ,180        14   ,200(*)       ,923        14   ,241
W5.4    ,331        14   ,000          ,814        14   ,007

Table 3.2: Normality tests for the dependent variables. The values marked with (*) are lower bounds for the true significance of the Kolmogorov-Smirnov test. (a) stands for Lilliefors significance correction. We cannot assume a normal distribution of the variables in bold. The significance of tests is highlighted in bold for tests with p < 0,05 and italic bold for tests with p < 0,01.
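The decision rule applied to table 3.2 can be sketched as follows. The significance values are copied from the table (decimal commas written as decimal points, and the lower bound ,200(*) entered as 0.200); the function and variable names are ours and only illustrative.

```python
ALPHA = 0.05  # significance threshold used in the text

# (Kolmogorov-Smirnov Sig., Shapiro-Wilk Sig.) per variable, from table 3.2.
significance = {
    "W1.1": (0.000, 0.001), "W1.2": (0.000, 0.022), "W1.3": (0.004, 0.001),
    "W2.1": (0.043, 0.004), "W2.2": (0.007, 0.003), "W2.3": (0.000, 0.000),
    "W2.4": (0.200, 0.075), "W2.5": (0.001, 0.017), "W2.6": (0.014, 0.018),
    "W2.7": (0.200, 0.140), "W2.8": (0.000, 0.007), "W3.1": (0.000, 0.001),
    "W3.2": (0.053, 0.038), "W3.3": (0.012, 0.004), "W3.4": (0.011, 0.052),
    "W3.5": (0.119, 0.019), "W4.1": (0.069, 0.049), "W4.2": (0.019, 0.065),
    "W4.3": (0.069, 0.049), "W5.1": (0.014, 0.049), "W5.2": (0.081, 0.032),
    "W5.3": (0.200, 0.241), "W5.4": (0.000, 0.007),
}


def is_non_normal(ks_sig, sw_sig, alpha=ALPHA):
    """Reject normality if at least one test has a significance below alpha."""
    return ks_sig < alpha or sw_sig < alpha


non_normal = [name for name, sigs in significance.items() if is_non_normal(*sigs)]
not_rejected = [name for name in significance if name not in non_normal]

print("normality rejected by at least one test:", len(non_normal))
print("normality not rejected by either test:", not_rejected)
```

Under this rule, normality is rejected for 20 of the 23 variables; only W2.4, W2.7 and W5.3 pass both tests.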

Hypotheses testing

As we have seen, the data does not have a normal distribution. As such, we have to use non-parametric tests. The non-parametric tests that we will perform to validate hypothesis H1 rely on the ranks of the values in the sample, rather than on the values themselves. The rationale for using ranks is to avoid the assumption of normality in analysis of variance. All the grades (ranging from W1.1 to W5.4) are put into a large sample, and ranked, for each group, from the lowest to the highest value. Table 3.3 presents the mean rank, for each of the variables, considering all groups.

Sub-process   W1.1   W1.2   W1.3
Mean Rank     11,6   13,6   15,2

Sub-process   W2.1   W2.2   W2.3   W2.4   W2.5   W2.6   W2.7   W2.8
Mean Rank     15,4   15,8   13,4   12,0   13,8   8,0    12,3   12,4

Sub-process   W3.1   W3.2   W3.3   W3.4   W3.5
Mean Rank     16,2   12,3   9,9    6,0    4,6

Sub-process   W4.1   W4.2   W4.3
Mean Rank     13,5   11,4   13,1

Sub-process   W5.1   W5.2   W5.3   W5.4
Mean Rank     12,5   8,9    12,2   12,0
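The ranking step behind these mean ranks can be sketched as below: the grades of each group are ranked from lowest to highest, tied grades share the average of the ranks they occupy, and the ranks are then averaged per sub-process. The average-rank treatment of ties is an assumption (it is the usual convention for rank-based tests), and the toy grades are hypothetical, not the case study data.

```python
def average_ranks(values):
    """Rank values from lowest (rank 1) to highest; ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def mean_ranks(table):
    """Rank each row (group) separately, then average the ranks per column."""
    ranked = [average_ranks(row) for row in table]
    return [sum(column) / len(ranked) for column in zip(*ranked)]


# HYPOTHETICAL grades: 3 groups (rows) x 4 sub-processes (columns).
toy_grades = [
    [0.50, 0.75, 0.25, 0.75],
    [1.00, 0.50, 0.25, 0.75],
    [0.75, 0.75, 0.00, 0.50],
]

print(average_ranks(toy_grades[0]))
print(mean_ranks(toy_grades))
```

With k sub-processes, the mean ranks always add up to k(k + 1)/2, which gives a quick sanity check: the 23 values in table 3.3 add up to 276, up to rounding.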

The problem, then, is to find out whether or not any of these mean ranks differs significantly from the remaining ones. We will use two non-parametric tests to do so.

Table 3.4 presents the results of the Friedman test, a non-parametric test designed to detect differences in treatments across multiple test attempts [Friedman 37]. This test is commonly used as a non-parametric alternative to the Analysis of Variance test. Recall that our null hypothesis states that “the process was followed with a relatively uniform success”. There is a significant difference among the results of the treatments, with a chi-square of 63,980 (df=22, N=14), and p=,000<,01. Therefore, we can reject the null hypothesis¹⁴. In other words, at least one of the sub-processes led to an outcome significantly different from the remaining ones.

N             14
Chi-Square    63,980
df            22
Asymp. Sig.   ,000

Table 3.4: Friedman test for hypothesis H1.
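For reference, the Friedman statistic without any correction for ties can be written in terms of the mean ranks alone: chi-square = 12N / (k(k+1)) multiplied by the sum, over the k sub-processes, of (mean rank − (k+1)/2)². The sketch below applies this textbook formula to the rounded mean ranks of table 3.3. It yields roughly 56,9, which is lower than the 63,980 reported in table 3.4. We assume, since the source does not state it, that the gap is due to the correction for ties applied by the statistics tool (the grades take few distinct values, so ties are frequent), plus a small rounding effect.

```python
N_GROUPS = 14  # groups that finished the task

# Mean ranks per sub-process, copied from table 3.3 (decimal points).
mean_rank = {
    "W1.1": 11.6, "W1.2": 13.6, "W1.3": 15.2,
    "W2.1": 15.4, "W2.2": 15.8, "W2.3": 13.4, "W2.4": 12.0,
    "W2.5": 13.8, "W2.6": 8.0, "W2.7": 12.3, "W2.8": 12.4,
    "W3.1": 16.2, "W3.2": 12.3, "W3.3": 9.9, "W3.4": 6.0, "W3.5": 4.6,
    "W4.1": 13.5, "W4.2": 11.4, "W4.3": 13.1,
    "W5.1": 12.5, "W5.2": 8.9, "W5.3": 12.2, "W5.4": 12.0,
}


def friedman_chi_square_no_ties(mean_ranks, n):
    """Friedman statistic from mean ranks, WITHOUT any correction for ties."""
    k = len(mean_ranks)
    expected = (k + 1) / 2  # mean rank of every treatment under the null hypothesis
    spread = sum((rank - expected) ** 2 for rank in mean_ranks)
    return 12 * n / (k * (k + 1)) * spread


chi2 = friedman_chi_square_no_ties(list(mean_rank.values()), N_GROUPS)
k = len(mean_rank)
print(f"k = {k}, df = {k - 1}, chi-square (no tie correction) = {chi2:.1f}")
```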

We can further explore this by using Kendall’s W test, which is a normalization of Friedman’s test and is used to assess the level of concordance between raters. A strong agreement is signaled by a Kendall statistic value close to 1, while a strong disagreement presents a value close to 0. The 0,208 value in table 3.5 indicates a low but significant agreement level, with a chi-square of 63,980 (df=22, N=14), and p=,000<,01.

N                14
Kendall’s W(a)   ,208
Chi-Square       63,980
df               22
Asymp. Sig.      ,000

Table 3.5: Kendall’s W test for hypothesis H1. (a) stands for Kendall’s Coefficient of Concordance.
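The normalization mentioned above is the standard relation W = chi-square / (N (k − 1)), where N is the number of groups and k the number of sub-processes being ranked. A minimal sketch, with an illustrative function name of our own, recovers the reported coefficient from the Friedman chi-square:

```python
def kendalls_w(chi_square, n_raters, n_items):
    """Kendall's W from the Friedman chi-square: W = chi2 / (N * (k - 1))."""
    return chi_square / (n_raters * (n_items - 1))


# Values reported in table 3.5: N = 14 groups, df = 22, hence k = 23 sub-processes.
w_all = kendalls_w(63.980, n_raters=14, n_items=23)
print(f"W = {w_all:.3f}")
```

The same relation also holds for the reduced samples reported in table 3.6 (chi-square 32,972 with k = 21, and 24,288 with k = 20), which give ,118 and ,091 respectively.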

The most likely candidates for the existing agreement and, likewise, for the significant differences found while evaluating the reports, can be identified using a boxplot representation of the distribution of the average grades for each of the sub-processes (left side of figure 3.20). Sub-processes W3.5 and W3.4 have an extreme and an outlier mean classification, respectively. If we remove these sub-processes and rerun Friedman’s and Kendall’s tests, the statistics are still significant (p = 0,034 < 0,05) (left side of table 3.6). The new boxplot reveals that, in the absence of W3.5 and W3.4, three sub-processes emerge as outliers (W2.6, W5.2, and W3.3). If we remove W2.6 from the sample (the one which is furthest away from the mean value), both Friedman’s and Kendall’s tests no longer report statistically significant differences between the different assessments of the sub-processes under scrutiny (right side of table 3.6).

¹⁴ The “traditional” way of interpreting a chi-square test is to use the chi-square table. If the calculated chi-square value is greater than the critical value in the table, for a given significance and number of degrees of freedom, we can reject the null hypothesis. However, modern statistics tools, such as SPSS, compute the significance level directly, to save users the burden of consulting those tables. Since the asymptotic significance presented by the statistics tool has a value lower than 0,01 (in fact, lower than 0,0005), we can reject the null hypothesis.
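The table lookup described in footnote 14 can be sketched as follows for the three chi-square values reported in tables 3.4 and 3.6. The critical values are the standard upper-tail entries of the chi-square table for the given degrees of freedom and significance level; the names used in the code are illustrative.

```python
# Upper critical values of the chi-square distribution, keyed by
# (degrees of freedom, significance level). Standard table entries.
CRITICAL = {
    (22, 0.01): 40.289,
    (20, 0.05): 31.410,
    (19, 0.05): 30.144,
}


def rejects_null(chi_square, df, alpha):
    """Reject the null hypothesis if the statistic exceeds the critical value."""
    return chi_square > CRITICAL[(df, alpha)]


# (sample, chi-square, df, alpha), as reported in tables 3.4 and 3.6.
reported = [
    ("all sub-processes", 63.980, 22, 0.01),
    ("without W3.5 and W3.4", 32.972, 20, 0.05),
    ("without W3.5, W3.4 and W2.6", 24.288, 19, 0.05),
]

decisions = {}
for sample, chi_square, df, alpha in reported:
    decisions[sample] = rejects_null(chi_square, df, alpha)
    verdict = "reject" if decisions[sample] else "cannot reject"
    print(f"{sample}: chi-square {chi_square} vs {CRITICAL[(df, alpha)]} -> {verdict} H1_0")
```

The lookup agrees with the reported significance levels: the null hypothesis is rejected for the full sample and for the sample without W3.5 and W3.4, but not once W2.6 is also removed.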

Figure 3.20: The boxplot on the left presents the distribution of the classifications, including all the sub-processes. Sub-process W3.5 is marked as an extreme. Sub-process W3.4 is marked as an outlier. The boxplot on the right side presents the distribution of the classifications, if we exclude the extreme W3.5 and outlier W3.4. Note that, in the absence of these two sub-processes, three other sub-processes (W2.6, W5.2, and W3.3) are now considered outliers in the remaining sample.
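The classification shown in the left boxplot can be approximated with the usual boxplot rule, sketched below on the mean classifications of table 3.1. We assume the common convention in which the box is delimited by Tukey’s hinges, values more than 1,5 box lengths away from the box are outliers, and values more than 3 box lengths away are extremes; the source does not state which convention the statistics tool used. Only the left boxplot is covered here.

```python
import statistics

# Mean classification per sub-process, from table 3.1 (decimal points).
mean_grade = {
    "W1.1": 0.6250, "W1.2": 0.6964, "W1.3": 0.7500,
    "W2.1": 0.7500, "W2.2": 0.7857, "W2.3": 0.6786, "W2.4": 0.6250,
    "W2.5": 0.6786, "W2.6": 0.4821, "W2.7": 0.6250, "W2.8": 0.6250,
    "W3.1": 0.7679, "W3.2": 0.6250, "W3.3": 0.5179, "W3.4": 0.3750,
    "W3.5": 0.3036, "W4.1": 0.6964, "W4.2": 0.6071, "W4.3": 0.6964,
    "W5.1": 0.6607, "W5.2": 0.5000, "W5.3": 0.6071, "W5.4": 0.6250,
}


def tukey_hinges(values):
    """Medians of the lower and upper halves of the sorted values; both
    halves include the overall median when the number of values is odd."""
    ordered = sorted(values)
    half = (len(ordered) + 1) // 2
    return statistics.median(ordered[:half]), statistics.median(ordered[-half:])


def classify(grades):
    """Return (outliers, extremes) under the ASSUMED 1.5 / 3 box-length rule."""
    lower, upper = tukey_hinges(list(grades.values()))
    box_length = upper - lower
    outliers, extremes = [], []
    for name, value in grades.items():
        distance = max(lower - value, value - upper)  # positive only outside the box
        if distance > 3.0 * box_length:
            extremes.append(name)
        elif distance > 1.5 * box_length:
            outliers.append(name)
    return outliers, extremes


outliers, extremes = classify(mean_grade)
print("outliers:", outliers)
print("extremes:", extremes)
```

Under these assumptions the sketch flags W3.4 as an outlier and W3.5 as an extreme, in line with the figure caption.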

3.4.6 Interpretation

In an ideal process, practitioners should be able to carry out all the sub-processes with a consistently high proficiency. The Friedman test, complemented by Kendall’s coefficient of concordance, leads us to think that there are some parts of the process that were handled with a significantly different success by the participants in this case study, when compared to the others. The fairly low concordance coefficient also points to the fact that students roughly achieved the same success level in the majority of the sub-processes. In an ideal process, Kendall’s coefficient of concordance should be close to 0. This would mean that we were not able to significantly rank the success of the different sub-processes. Of course, the evaluations should also be high, as there is not much point in having practitioners performing consistently badly in all sub-processes.

While for most of the process the results are quite encouraging and balanced, considering the lack of experience of the subjects in conducting experimental work, we should check what happened with the sub-processes where the success was significantly different (and, in this case, lower).

                   Sample without       Sample without
                   W3.5 and W3.4        W3.5, W3.4 and W2.6

Friedman Test Statistics
N                  14                   14
Chi-Square         32,972               24,288
df                 20                   19
Asymp. Sig.        ,034                 ,185

Kendall’s Coefficient of Concordance
N                  14                   14
Kendall’s W(a)     ,118                 ,091
Chi-Square         32,972               24,288
df                 20                   19
Asymp. Sig.        ,034                 ,185

Table 3.6: Friedman test statistics and Kendall’s Coefficient of Concordance, when removing the extreme and outliers that caused the statistically significant differences in the classification of the sub-processes.

By using the extremes and outliers detection, we were able to single out the sub-processes responsible for the concordance that does exist. The three identified sub-processes are the ones with the lowest mean classifications. W3.5, W3.4, and W2.6 correspond, respectively, to problem reporting, data validation, and collection process. Why did our subjects perform poorly in these tasks?

There are at least two plausible explanations for this. The first one is that they may have found these sub-processes’ descriptions less clear. But it may also be the case that the experimental tasks they were performing had a role to play in these difficulties. Unlike what usually happens in experimental work, the data used in these projects was collected a priori in the ISBSG repository. The three sub-processes relate to data collection, to some extent, something that our subjects did not perform, in practice.

Problem reporting focuses on deviations from the experiment execution plan. In this case, the data to be used was readily available from a repository, and the participants showed difficulties in critically assessing, based on the existing information, the problems that may have occurred.

Likewise, data validation was also one of the weakest sub-processes, again due to the challenges of detecting, from the information available in the repository, potential problems concerning data validity. We believe these difficulties may have been increased by the fact that our subjects were novice experimenters.

Finally, providing a detailed description of the data collection process was also challenging for our subjects. Again, this may result from difficulties in extracting the relevant information from the repository, on the one hand, and in acknowledging those difficulties, on the other. This is also a typical problem with novice experimenters that we have observed in other fora, namely while serving as reviewers for program committees in conferences and workshops.

Evaluation of results and implications

The significant differences found in the sample point us to the parts of the process in which the case study participants performed significantly worse. This information can be used for guiding improvements in the process model, as well as in future editions of the course. With respect to the process model, these improvements can be achieved through a clarification of the description of the process. To a certain extent, we have already done so while writing this chapter of the dissertation. The degree of detail provided here is greater than the one used in [Goulão 07a], not only due to a less constrained size, which led to the inclusion of more details in each of these topics, but also as a result of the feedback collected since the publication of the process, both from our peers and the students participating in this case study.