
The presented template-based crawling framework allows the user to collect data from arbitrary OSS Web repositories and store it according to the data model derived from the Mediabase model (cf. Subsection 5.2.3). Before the collected data can be analyzed, two more steps remain: (1) defining the data sampling and (2) developing an approach for data filtering. OSS projects evolve over time (accumulation of a critical mass of developers, explosion of the end-user community, community shrinkage, project death). To analyze the evolution of OSS repositories, some researchers divide the data into periods of fixed size, e.g. [RGB06], while others use the time points of releases as the cutting criterion, e.g. [WHC09]. The choice of data sampling can significantly influence the analysis results or even lead to different ‘anomalies’.

5.4.1 Anomalies of Different Data Sampling Approaches

Non-existing Connection The OSS SNs are generated based on the data collected from the project repositories. The edges in the SNs reflect collaborative activities or shared content (cf. Subsection 5.2.3). For example, two OSS project participants can be defined as connected, if they have performed at least one commit on the same piece of code.

Consider the following example. There are two project participants A and B. They have both modified some artifact y within the project code base. However, A and B participated in the project in different, non-overlapping time periods $lifetime(A) = (t_0, t_x)$ and $lifetime(B) = (t_{x+\epsilon}, t_{current})$ with $\epsilon > 0$. While generating the corresponding SN, an edge e between node A and node B will be created, although these two persons have never really collaborated within the given project. In this way, connections are forged that would otherwise never have been established. The situation is often predictable insofar as the main code trunk contains the same backbone of files over the whole project life. A similar situation can also occur whenever there are long-lived threads in the communication repositories.
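The anomaly described above can be illustrated with a minimal sketch. It assumes a hypothetical list of commit records and approximates each participant's lifetime by the span between the first and last commit. The names `shared_artifact_edges` and `filtered_edges` are illustrative only, and the lifetime-overlap check is one possible filter, not the method used in this work.

```python
from itertools import combinations

# Hypothetical commit records: (participant, artifact, time in days since project start)
commits = [
    ("A", "y", 10),
    ("A", "y", 40),
    ("B", "y", 200),
    ("B", "z", 230),
    ("C", "y", 30),
]

def lifetimes(commits):
    """Approximate each participant's lifetime as (first commit, last commit)."""
    spans = {}
    for person, _artifact, t in commits:
        first, last = spans.get(person, (t, t))
        spans[person] = (min(first, t), max(last, t))
    return spans

def shared_artifact_edges(commits):
    """Naive edges: two participants touched the same artifact at any time."""
    by_artifact = {}
    for person, artifact, _t in commits:
        by_artifact.setdefault(artifact, set()).add(person)
    edges = set()
    for people in by_artifact.values():
        edges.update(combinations(sorted(people), 2))
    return edges

def overlapping(span_a, span_b):
    """True if the two (start, end) spans share at least one point in time."""
    return span_a[0] <= span_b[1] and span_b[0] <= span_a[1]

def filtered_edges(commits):
    """Keep only edges between participants whose lifetimes overlap."""
    spans = lifetimes(commits)
    return {
        (a, b)
        for a, b in shared_artifact_edges(commits)
        if overlapping(spans[a], spans[b])
    }

naive = shared_artifact_edges(commits)
filtered = filtered_edges(commits)
print(sorted(naive - filtered))  # edges that exist only because of shared artifacts
```

In this toy data set, the naive construction connects A and B through artifact y even though their activity periods never overlap. The filtered variant drops that edge.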

Damping of social importance Many SNA methods are based on the degree value of a node, which in short is the number of connections the node has. Due to the evolving nature of the social communities in general and of the OSS projects in particular, continuous change in the number of participants can be expected.

Consider another example. One of the co-founders of an OSS project, A (with $lifetime(A) = (t_0, t_x)$), left the community shortly before the project experienced its breakthrough at $t_{x+\epsilon}$ with $\epsilon > 0$. The importance of this person for the project is indisputable. At the same time, the degree value of project member A is limited by the maximum number of project participants during the period $(t_0, t_x)$. If the number of community members explodes exponentially after $t_x$, the importance of A is understated by any measure based on the degree value. Additionally, the duration of participation in an OSS project has a direct influence on the maximum number of possible connections.
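The bound on the degree value can be made concrete with a small sketch. It assumes hypothetical lifetimes given in project years and treats them as closed intervals for simplicity. `max_possible_degree` is an illustrative helper, not part of the framework described here.

```python
def max_possible_degree(spans):
    """Upper bound on each participant's degree: the number of other
    participants whose lifetime overlaps with their own."""
    bound = {}
    for p, (start, end) in spans.items():
        bound[p] = sum(
            1
            for q, (other_start, other_end) in spans.items()
            if q != p and start <= other_end and other_start <= end
        )
    return bound

# Hypothetical lifetimes in project years; A is a co-founder who leaves
# at year 3, shortly before the community starts to grow.
spans = {
    "A": (0, 3),
    "B": (1, 3),
    "C": (2, 10),
    "D": (5, 10),
    "E": (6, 10),
    "F": (6, 10),
    "G": (7, 10),
}

bound = max_possible_degree(spans)
print(bound["A"], bound["G"], bound["C"])
```

Under these assumptions, the co-founder A can reach a degree of at most 2, whereas the late joiner G can reach 4 and the long-term member C can reach 6, regardless of how much each of them actually contributed.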

Figure 5.13 displays the degree value of the BioJava co-founder Keith D. James in comparison to the minimal degree value of the project participants identified as core. The core is calculated using the hierarchical clustering in the “Core-Cut” case (cf. Algorithm 1), whereas “80%-Cut” contains the project participants who together have created 80% of the project communication (cf. Subsection 5.2.2). The analysis is applied to the project history of the period 2000/01/01−2011/01/01 (the complete data set used within this work). In neither “Core-Cut” nor “80%-Cut” is Keith assigned to the core. However, if we take a closer look at the participation history of Keith D. James in the project in Figure 5.14, we can easily recognize that until the release 1.3.1, Keith was one of the project leaders. The comparison of Figure 5.14 to Figure 5.15 visualizes two different sampling schemata: the former which is based on release dates vs. the latter based on time intervals (here year based). The social role of Keith D. James varies depending on the sampling schema. Whereas the release-based sampling reflects the real input of Keith to each release, the time-based sampling generalizes the information over the given time period.

Historical Clustering In several studies, clustering is applied to the SNs of the OSS projects in order to detect different sub-communities bound by shared interests. Many clustering algorithms are organized in such a way that the dense sub-groups are separated from sparsely bound elements (cf. Subsubsection 3.2.2.1). Intuitively, project members whose participation times completely overlap are connected with a higher probability than those whose participation times only partly overlap. Thus, when clustering is applied to the OSS SNs, the network may get clustered along the time axis rather than according to shared interests.

Non-structural Evolution Measures Network evolution is not the only noteworthy subject of OSS research. Other changing parameters include role transition, project member fluctuation, development progress, etc. These measurements can also be distorted depending on the selected step size. For example, in the OSS projects, a high number of commits is expected around the release dates. If we define development progress as the number of commits, time-based sampling can distort the real situation. Figure 5.15 shows that there were two releases in 2003 in the BioJava project, whereas no releases were recorded in 2004.

[Plot: activity score of Keith D. James compared with the Core-Cut and 80%-Cut series]

Figure 5.13: Summarized Activity of Keith D. James in BioJava.

[Plot: activity score per release interval (1.1-1.2 through 1.7-1.8) for Keith D. James, Core-Cut and 80%-Cut]

Figure 5.14: BioJava Activity of Keith D. James per Release.

Rebirth of a Person A person can enter or leave an OSS community. Besides these two processes, there is ‘rebirth’ - a process whereby a person first leaves the community for a longer time period and later re-enters it. If we analyze the entire project lifetime as a single period, a rebirth can easily be overlooked. For example, Brad C. in the Biopython project was only active in the first three years of the project. Then he left the project without making any contribution to it for about five years. Today, he is once again an active project member. Without an appropriate approach for data sampling, it may falsely appear as if he had never left the project and had stayed with it for more than ten years, which does not reflect the truth.
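A rebirth of this kind can be detected by looking for long gaps in a participant's activity. The following is a minimal sketch under stated assumptions: the activity dates are invented for illustration (they are not the Biopython data), and the two-year threshold `max_gap` is an arbitrary choice.

```python
from datetime import date, timedelta

def activity_spells(dates, max_gap=timedelta(days=730)):
    """Split activity dates into spells; a new spell starts whenever the
    gap to the previous activity exceeds max_gap."""
    spells = []
    for d in sorted(dates):
        if spells and d - spells[-1][1] <= max_gap:
            spells[-1][1] = d          # extend the current spell
        else:
            spells.append([d, d])      # open a new spell
    return [(start, end) for start, end in spells]

# Hypothetical activity dates of one participant
activity = [
    date(2000, 3, 1),
    date(2001, 1, 15),
    date(2002, 6, 1),
    date(2008, 2, 1),
    date(2008, 9, 1),
]

spells = activity_spells(activity)
reborn = len(spells) > 1
print(spells, reborn)
```

Treating the whole project history as one period would report a single lifetime from 2000 to 2008. Splitting at long gaps instead yields two separate spells, which makes the absence visible.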

To define data sampling, it is necessary to consider the physics of OSS projects: their rhythms and iterations. Figure 5.16 displays an OSS development cycle. Changes are planned, implemented and tested continuously. At some point in time, the current branch is frozen for the next release. From that point on, only bug fixes are allowed. In the next step, the code is released. Afterwards, only hot-fixes - small code updates which address specific problems in the last release - are allowed. Hence, the period $(t_j, t_{j+1})$ between two releases $j$ and $j + 1$ is a logical step for the OSS analysis. This approach also conforms to the metrics and laws of software evolution summarized in [LRW+97].
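Release-based sampling can be sketched as follows. The release and commit dates are hypothetical, and the sketch assumes half-open periods $[t_j, t_{j+1})$ so that every commit falls into exactly one period. `release_period` is an illustrative helper name.

```python
from bisect import bisect_right
from collections import Counter
from datetime import date

# Hypothetical release dates t_0 < t_1 < t_2 (must be sorted)
releases = [date(2003, 2, 1), date(2003, 10, 1), date(2005, 3, 1)]

def release_period(t, releases):
    """Index j of the period [t_j, t_{j+1}) that contains t;
    -1 for activity before the first release."""
    return bisect_right(releases, t) - 1

# Hypothetical commit dates, clustered shortly before each release
commit_dates = [
    date(2003, 1, 20),
    date(2003, 1, 28),
    date(2003, 9, 15),
    date(2003, 9, 29),
    date(2004, 6, 1),
    date(2005, 2, 20),
]

per_year = Counter(d.year for d in commit_dates)
per_period = Counter(release_period(d, releases) for d in commit_dates)
print(dict(per_year))
print(dict(per_period))
```

With these invented dates, yearly sampling attributes four commits to 2003 and only one each to 2004 and 2005, whereas release-based sampling assigns two commits to each release cycle. The same events thus give a different picture of development progress depending on the sampling schema.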