

The greatest surprise to me in this mapping study process has been the incredible amount of work it took. My initial estimate was on the order of three or four months; it took three and a half years. The literature searches themselves, producing a total of 2056 recorded publications, not counting duplicates, took almost a year of calendar time (precisely 294 days), though a lot of that is accounted for by my teaching duties interfering with this work, and some by vacations. All in all, on average I seem to have recorded about 7 publications each day (including teaching days and vacations); I am likely to have processed a lot more. The process of going through all the 2056 recorded publications to a final selection decision took almost two years (629 days), meaning an average of 3 decisions every day, including teaching and vacation days. These speeds likely reflect the difficulty of deciding where the line between inclusion and exclusion really lies, based on my definitions of the concepts. A more focused study is likely to be able to attain much higher speeds.
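The averages quoted above follow directly from the publication count and the two durations. As a minimal sketch of that arithmetic (the variable and function names are my own, chosen for illustration):

```python
# Figures quoted in the text above.
recorded_publications = 2056
search_days = 294      # calendar days spent on the literature searches
selection_days = 629   # calendar days spent reaching final selection decisions


def per_day(count: int, days: int) -> float:
    """Average number of items handled per calendar day."""
    return count / days


search_rate = per_day(recorded_publications, search_days)
selection_rate = per_day(recorded_publications, selection_days)

print(f"publications recorded per day: {search_rate:.2f}")
print(f"selection decisions per day:   {selection_rate:.2f}")
```

Rounded to whole numbers, the two rates come out at 7 and 3 per day, matching the figures given in the text.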

One particular source of trouble was the low general usefulness of abstracts in programming language research. They rarely described what empirical methods, if any, were used to evaluate the work, nor did they usually reveal the results of any such evaluation. As a result, Phase II (based exclusively on online metadata such as abstracts) excluded only about 30 % of the publications. In software engineering, similar problems have been noticed as well, and the use of structured abstracts (that is, abstracts with standard explicit subheadings) has been proposed and evaluated with some success (see e.g. Kitchenham, Brereton, Owen, et al. 2008; Budgen, Kitchenham, et al. 2008). I have adopted this practice in the abstract of this study.

I would caution any other research student not to attempt a systematic secondary study alone. An ideal team size is, in my estimate, about six: as recommended by guidelines, each publication should be looked at by at least two researchers independently in each phase of the study, to allow for the estimation of the reliability of decisions; having three teams of two researchers allows significant parallelization of the work. A workable minimum is, I think, three, working together in pairs with a third opinion available for the difficult cases.
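To illustrate what estimating the reliability of decisions can look like when two researchers judge the same publications independently, here is a minimal sketch using Cohen's kappa. The choice of kappa is an assumption made for illustration; the text does not name a statistic, and the decision lists below are invented.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must judge the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the two raters made the same decision.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented include/exclude decisions for eight publications (illustration only).
first_researcher = ["include", "include", "exclude", "exclude",
                    "include", "exclude", "exclude", "exclude"]
second_researcher = ["include", "exclude", "exclude", "exclude",
                     "include", "exclude", "include", "exclude"]

kappa = cohens_kappa(first_researcher, second_researcher)
print(f"kappa = {kappa:.3f}")
```

A single researcher working alone has no second list of decisions to compare against, which is why this kind of estimate requires at least a pair.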

In retrospect, the literature search arrangement could have been much more efficient. The problem was that of a bootstrap: I could likely design a more efficient search strategy for this study now, but to get here I had to conduct the inefficient searches. The quasi-gold standard method proposed by Zhang, Ali Babar, and Tell (2011) seems very promising, and I second the recommendation of Kitchenham and Brereton (2013, p. 2068) to incorporate it in future guidelines.

Initially, I had a lot of trouble defining the demarcation of evidence. My original plan was to simply take the research method list compiled by Vessey, Ramesh, et al. (2005) as a guide, but it quickly turned out to be unworkable, as they neither define what they mean by the names of the methods nor cite sources for any clear definitions. In the pilot extraction exercise described in Subsection 4.3.1, Professor Kärkkäinen and I had significant trouble interpreting the method list. Particular problems for us were the categories DA, data analysis, and LS, laboratory experiment (software).

We debated the question of whether a study that collected existing programs from various sources, ran static analyses and computed metrics on them, and then statistically analyzed the resulting data, could be considered as being “based on secondary or existing data” (Vessey, Ramesh, et al. 2005, p. 252) and thus a DA study. Professor Kärkkäinen offered the opinion that all programs are data and thus existing programs are existing data; at the time, I advocated the position that programs in such studies are analogous to human participants and that the metrics derived from them are primary data in each such study. In my later thematic synthesis code book, these studies were allocated the primary method code of CorpusAnalysis or ProgramPairAnalysis, depending on the details of the study.

Similarly, it took some time for us to understand the LS category. Vessey et al. only offered the following comment about it: “We also added [...] Laboratory Experiment (Software) to assist in characterizing computer science/software engineering work.” (Vessey, Ramesh, et al. 2005, p. 252). Presumably, it was intended to be an analogy to LH, Laboratory Experiment (Human Subjects). A laboratory experiment, according to Alavi and Carlson (1992) (whom Vessey et al. cited), “controls for intervening variables”. Typically this means assigning some participants to the trial intervention and other participants to a control intervention, but how does one do that when the participants are pieces of software? Eventually we agreed that, for software experiments, control of intervening variables is often implicit, as the effect of the control intervention is known a priori, and otherwise typically easily instituted by resetting the software before changing interventions (which cannot be done, ethically at least, to humans). This was one of the main motivations for my later definition of an experiment, which differs considerably from the concept of a “true experiment” commonly defined by behavioral researchers; in my taxonomy true experiments would be called randomized controlled experiments. However, in practice, I ended up using non-experimental codes like ProgramRewrite and BenchmarkPrograms for studies of this type.

A problem revealed itself in the Google Scholar search performed on September 7, 2011. It turned out that Google Scholar refuses to display more than one thousand hits. The reported hit count was 2050, and thus the particular search had to be abandoned before the halfway mark was reached. Google (2011) indicates that there is no direct way to overcome this limitation. To try to find the same hits, I conducted the same search with year restrictions, together covering all years, on September 12 and 13, 2011. The combined reported hit count for the piecemeal re-search was 1744, which is 85 % of the reported count of the original abandoned search. A similar tactic for staying under the 1,000-hit display limit was adopted on subsequent Google Scholar searches.
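The year-restriction tactic can be sketched as a recursive split of the publication-year range until each piece reports no more than the display limit. This is only an illustration under stated assumptions: the text does not say how the year ranges were actually chosen, `count_hits` is a hypothetical stand-in for reading the reported hit count off a search page (no real Google Scholar interface is used), and the per-year counts are invented.

```python
from typing import Callable, List, Tuple

MAX_VISIBLE_HITS = 1000  # display limit described in the text


def split_by_year(
    first_year: int,
    last_year: int,
    count_hits: Callable[[int, int], int],
    cap: int = MAX_VISIBLE_HITS,
) -> List[Tuple[int, int]]:
    """Return inclusive year ranges whose reported hit counts are each <= cap.

    A range is halved until it fits under the cap. A single year that still
    exceeds the cap is returned unchanged, because it cannot be split further
    by year alone.
    """
    if first_year > last_year:
        return []
    if first_year == last_year or count_hits(first_year, last_year) <= cap:
        return [(first_year, last_year)]
    middle = (first_year + last_year) // 2
    return (split_by_year(first_year, middle, count_hits, cap)
            + split_by_year(middle + 1, last_year, count_hits, cap))


# Hypothetical per-year hit counts, invented purely for illustration.
fake_hits = {2005: 300, 2006: 400, 2007: 500, 2008: 450, 2009: 200}


def fake_count(start: int, end: int) -> int:
    """Stand-in for the hit count a search restricted to start..end reports."""
    return sum(n for year, n in fake_hits.items() if start <= year <= end)


ranges = split_by_year(2005, 2009, fake_count)
print(ranges)

# Share of the original reported hit count recovered by the piecemeal search.
coverage = 1744 / 2050
print(f"coverage: {coverage:.0%}")
```

Because the reported counts are themselves unreliable, as the gap between 1744 and 2050 shows, the sum over the year ranges need not equal the count reported for the unrestricted search.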
