Defensa Eslava [01 1 ] - ++Usted Juega

0.2 0.4 0.6 0.8 1 1.2 0 0.5 1 1.5 2 2.5 3 Clustering Coe ﬃ cient Average Strength

Figure 5.20: Clustering coefficient over average strength. Each

data point represents one project

5.3 Metrics for different project-categories

Category Count Average Degree Average Out-Strength

all 228 13.8456 1.4385 big-data 27 15.7087 1.4860 build-management 17 16.9502 1.1942 cloud 10 22.4263 1.5872 content 14 9.8305 1.5362 database 22 23.6451 1.5136 graphics 5 21.0555 1.6763 http 14 29.7259 1.7260 httpd-module 4 24.2663 1.2070 javaee 9 19.0324 1.8808 library 82 10.0269 1.3700 network-client 18 13.9835 1.5277 network-server 35 16.6735 1.5966 retired 9 9.4677 1.5785 testing 4 13.8425 1.2042 web-framework 25 14.8778 1.5341 xml 28 13.7537 1.5609

Table 5.1: Average Degree and Strength for different project cat-

Category Count Clustering Closeness Betweenness all 228 0.6949 0.0659 39.6982 big-data 27 0.6794 0.0472 54.8562 build-management 17 0.6522 0.1318 49.1389 cloud 10 0.6261 0.0487 156.5869 content 14 0.6493 0.0515 21.3229 database 22 0.7433 0.0379 57.1528 graphics 5 0.7929 0.0419 37.8886 http 14 0.6872 0.0288 89.6320 httpd-module 4 0.7726 0.1703 46.5317 javaee 9 0.6850 0.0304 63.0450 library 82 0.6851 0.0578 24.0122 network-client 18 0.6627 0.0443 38.3626 network-server 35 0.6707 0.0494 42.5121 retired 9 0.7191 0.0575 16.3592 testing 4 0.7400 0.0829 53.5187 web-framework 25 0.7338 0.0734 33.7100 xml 28 0.7430 0.0705 26.8053

Table 5.2: Average centrality metrics and clustering coefficients

for different project categories

In Table 5.1 and Table 5.2 we can see that there are no clear correlation between the category of a project and the different network metrics

Chapter 6 Discussion

6.1 Centrality

Both centrality indices has a clear correlation to the number of committers in a project, with their correlation coefficients (0.96 for betweenness centrality and -0.96 for closeness centrality) hinting at a very high probability of linear correlation. It’s an odd coincidence that the correlation coefficients are as strong for both centrality indices as they have quite different definitions, but keep in mind that the coefficient for the closeness centrality is calculated after logarithmic transforms. This might be explained by the fact that projects often have a small core of developers that are responsible for most of the commits in a project. In fact, in the projects analyzed in this thesis, on average the 20.3% most active developers contributed more than 75% of the commits. These core developers will have made changes to almost every file in the project while the non-core developers that only make commits occasionally will only have touched a small amount of files. This creates a network structure with a small amount of developers in the centre and the rest in the periphery. The effect of this is that almost all the shortest paths in the developer network passes through a small amount of vertices (developers) giving these vertices a very high betweenness centrality. If we have a network with N committers and a new committer is added to the developer network then there will be N new shortest paths since there is a shortest path from all the ”old” vertices to the ”new” vertex. Almost all of these shortest paths will go through the core developers increasing their betweenness centrality greatly while the ”new” vertex will have a very low betweenness. Thus several vertices will have an increased betweenness centrality, increasing the average for the whole network. The same reasoning can also be used to explain the decrease in closeness centrality with an increase in number of committers. The number of developers in the periphery compared to the number of developers in the core increases with the total number of developers giving the network a lower average closeness centrality overall.

6.1.1 Distribution of centrality

While looking at the distributions of betweenness and closeness centrality for projects with different amount of developers it can be concluded that a very large amount of the developers have a betweenness index relatively small. It is only a very small amount of developers that have a high betweenness centrality but on the other hand their betweenness centrality are in some cases extremely large. This further implies a network structure where a few developers in the core are responsible for a large number of commits while most developers in the periphery of the developer networks only does relatively few commits. Closeness centrality has a similar tendency where the majority of the developers are in the lower half of the observed interval.

6.2 Clustering

No correlation was found between the clustering coefficient and any other metric, such as project age, lines of code, number of committers, average degree or average strength. The clustering coefficients found in the Apache projects also vary widely, from below 0.3 to above 0.9, making it hard to draw any conclusions.

What can be observed in the graphs is that some extreme values for the clustering coefficient can be found in smaller projects (projects with a small amount of committers) where the clustering coefficient is either 1, 0 or very close to them. This can not be seen in projects with more than 20 committers.

6.3 Evaluation

The structure used by Apache where there are several different roles that a person can have might influence the results. It can not be guaranteed that the committer that performs a commit has actually made the changes in the commit. The initiator of a change in the code could be a developer that does not have write access to the code repository, this on the other hand requires that the developer proposes the changes to a committer which then approves if it is considered a valuable improvement to the project. So there is a possibility that a committer has performed several commits that he/she is not the author of. On the other hand, the committer has to critically review the changes and probably discuss with the developer the purpose of the changes. This will make the committer really understand why the changes are needed and because of that he/she will be able to explain to other developers why the changes were made and what purpose they have.

6.4 Average Degree

As can be seen in Figure 5.3 the average degree is very high in many projects. That means the networks are not really sparse and does not really fit into the small-world phenomenon. It could be argued that the method used to create the edges between is too generous in creating connections between developers. It might have been better to limit the connections

6.4 Average Degree

(edges) between developers based on e.g. the time between the developers commits or the amount of commits that has been submitted in between.

Chapter 7 Conclusion

This thesis work investigated the developer collaboration in open-source networks with the goal to find common collaboration metrics among the projects. Some common trends were found in the data. The results show that the average betweenness centrality and average closeness centrality is correlated to the number of committers, which might be attributed to the structure of the developer networks where the core consists of a few developers that are responsible for a large proportion of the total number of commits. The distribution of centrality indices in the projects also seems to support this network structure.

It was not possible to draw any conclusions regarding the clustering coefficient. This may be related to the generous amount of edges created, which might create clusters of developers that never really worked much together. Very low clustering coefficients were rare however.

The results of this study has shown that it is possible to find common collaboration metrics in open-source projects, and some of those findings might be useful for benchmarks, such as the distribution of centrality indices within the projects. A large project that have a very low average betweenness centrality compared to the Apache-projects may have an ineffective development process because a low betweenness centrality indicates that most developers seem to do changes on almost all parts of the projects. It seems like the most effective way of developing good quality software, at least in open-source projects, is to have the majority of the developers specialized on a specific part of the project and only a small amount of the developers binding these parts together to the final product.

7.1 External Validity

The projects analyzed in this thesis has many similarities to both open and closed-source development thanks to the Apache Software Foundation’s well organized development and thanks to contributions from various companies share developers with companies’ closed-source projects, and consists of projects of many types and sizes and the data gathered should therefore be a fairly good representation of the software development industry. Still, they all follow the guidelines set by Apache, and it might have been advantageous to include projects from other sources in order to verify the results external validity.

7.2 Future Work

During this thesis several realizations have been made that could be considered for future studies. The networks created in this study did not take the difference in time between commits on the same files into account when creating the edges. Changes that are close in time are probably more relevant than changes that are several years apart. So to only create edges between developers that have made changes to the same files close in time could be a consideration for future studies.

It could also be interesting to study how the metrics change during a projects lifetime. The collaboration may be different in a new project compared to a project that has been worked on for several years.

There were an intention to include project size in terms of lines of code as a metric in this study. But it was concluded that the method used for extracting lines of code for each project gave questionable results, so the decision was made to skip lines of code as a metric. Future studies in the subject may want to use lines of code as a metric since it is a good representation of how large and complex a project is.

Bibliography

[1] About Git. March 2015. url: http://git-scm.com/about.

[2] Apache Hadoop. May 2015. url: https://hadoop.apache.org/. [3] Apache HTTP Server. May 2015. url: http://httpd.apache.org/. [4] Apache Incubator. April 2015. url: http://incubator.apache.org/.

[5] Apache Incubator Graduation Requirements. April 2015. url: http://incubator. apache.org/incubation/Incubation_Policy.html#Graduating+ from+the+Incubator.

[6] Apache license. March 2015. url: http://www.apache.org/licenses/ LICENSE-2.0.

[7] Apache OpenOffice. May 2015. url: https://www.openoffice.org/. [8] Apache software foundation. February 2015. url: http://www.apache.org/

foundation/.

[9] Apache Software Foundation index. February 2015. url: http://projects. apache.org/indexes/alpha.html.

[10] Apache Subversion Features. April 2015. url: https://subversion.apache. org/features.html.

[11] A. Barrat, M. Barthélemy, R. Pastor-Satorras, and A. Vespignani. “The architec-

ture of complex weighted networks”. In: Proceedings of the National Academy

of Sciences of the United States of America 101.11 (2004), pp. 3747–3752. doi: 10.1073/pnas.0400087101.

[12] Mathieu Bastian, Sebastien Heymann, and Mathieu Jacomy. “Gephi: An Open Source

Software for Exploring and Manipulating Networks”. In: International AAAI Con-

ference on Weblogs and Social Media(2009), pp. 361–362.

[13] Christian Bird, David Pattison, Raissa D’Souza, Vladimir Filkov, and Premkumar

Devanbu. “Latent social structure in open source projects”. In: Proceedings of the

16th ACM SIGSOFT International Symposium on Foundations of software engi- neering. ACM. 2008, pp. 24–35.

[14] Michael Bloch, Sven Blumberg, and Jurgen Laartz. “Delivering large-scale IT projects

[15] Ulrik Brandes. “A faster algorithm for betweenness centrality”. In: Journal of Math-

ematical Sociology25.2 (2001), pp. 163–177.

[16] Kevin Crowston. “The social structure of open source software development teams”.

PhD thesis. Syracuse University, 2003.

[17] Curl. March 2015. url: http://curl.haxx.se/.

[18] Boru Douthwaite, Andrea Carvajal, Sophie Alvarez, Elías Claros, and LA Hernán-

dez. “Building farmers’ capacities for networking (Part I): Strengthening rural groups in Colombia through network analysis”. In: KM4D Journal 2.2 (2006), pp. 4–18. [19] Eclipse Community Survey 2014. February 2015. url: https://ianskerrett.

wordpress . com / 2014 / 06 / 23 / eclipse - community - survey - 2014-results/.

[20] Linton C. Freeman. “A set of measures of centrality based on betweenness”. In:

Sociometry(1977), pp. 35–41.

[21] Linton C. Freeman. “Centrality in social networks conceptual clarification”. In: So-

cial networks1.3 (1979), pp. 215–239.

[22] gnuplot. May 2015. url: http://www.gnuplot.info/.

[23] Michael W. Godfrey and Qiang Tu. “Evolution in open source software: A case

study”. In: Proceedings of the International Conference on Software Maintenance,

2000. IEEE. 2000, pp. 131–142.

[24] Umberto Griffo. Weighted Cluster Coefficient - Gephi Marketplace. March 2015.

url: https://marketplace.gephi.org/plugin/weighted-cluster- coefficient/.

[25] Robert A. Hanneman and Mark Riddle. Introduction to social network methods.

University of California Riverside, 2005.

[26] How the ASF works. March 2015. url: https://www.apache.org/foundation/ how-it-works.html.

[27] Matthew O. Jackson. An Overview of Social Networks and Economic Applications.

2009.

[28] Alden S Klovdahl. “Social networks and the spread of infectious diseases: the AIDS

example”. In: Social science & medicine 21.11 (1985), pp. 1203–1216.

[29] Vito Latora and Massimo Marchiori. “Economic small-world behavior in weighted

networks”. In: The European Physical Journal B-Condensed Matter and Complex

Systems32.2 (2003), pp. 249–263.

[30] Josh Lerner and Jean Triole. The Simple Economics of Open Source. Working Paper

7600. National Bureau of Economic Research, Mar. 2000.

[31] L. Lopez-Fernández, G. Robles, J. Gonzalez-Barahona, and I. Herraiz. “Applying

Social Network Analysis Techniques to Community-Driven Libre Software Projects”. In: International Journal of Information Technology and Web Engineering 1 (2008), pp. 28–50.

[32] Gregory Madey, Vincent Freeh, and Renee Tynan. “The open source software devel- opment phenomenon: An analysis based on social network theory”. In: Proceedings

of the American Conference of Information Systems. 2002, p. 247.

[33] Stanley Milgram. “The small world problem”. In: Psychology today 2.1 (1967),

pp. 60–67.

[34] Audris Mockus, Roy T. Fielding, and James Herbsleb. “A case study of open source

software development: the Apache server”. In: Proceedings of the 22nd interna-

tional conference on Software engineering. ACM. 2000, pp. 263–272.

[35] Netcraft survey. April 2015. url: http://news.netcraft.com/archives/ category/web-server-survey/.

[36] Mark EJ. Newman. “Scientific collaboration networks. II. Shortest paths, weighted

networks, and centrality”. In: Physical review E 64.1 (2001), p. 016132.

[37] Christopher Oezbek, Lutz Prechelt, and Florian Thiel. “The onion has cancer: Some

social network analysis visualizations of open source project communication”. In:

Proceedings of the 3rd International Workshop on Emerging Trends in Free/Li- bre/Open Source Software Research and Development. ACM. 2010, pp. 5–10.

[38] Tore Opsahl and Pietro Panzarasa. “Clustering in weighted networks”. In: Social

networks31.2 (2009), pp. 155–163.

[39] Alma Orucevic-Alagic and Martin Höst. “Network analysis of a large scale open

source project”. In: Proceedings of the 40th EUROMICRO Conference on Software

Engineering and Advanced Applications. IEEE. 2014, pp. 25–29. [40] R project. April 2015. url: http://www.r-project.org/.

[41] Eric Raymond. “The cathedral and the bazaar”. In: Knowledge, Technology & Policy

12.3 (1999), pp. 23–49.

[42] Charles Severance. “The Apache Software Foundation: Brian Behlendorf”. In: Com-

puter45.10 (2012), pp. 8–9.

[43] Tnet. April 2015. url: http://toreopsahl.com/tnet/software/.

[44] Duncan J. Watts and Steven H. Strogatz. “Collective dynamics of ‘small-world’networks”.

Appendix A

Script for calculating metrics with tnet

l i b r a r y ( " t n e t " ) f i l e s <− l i s t . f i l e s ( " g r a p h s " , f u l l . names=TRUE ) c a l c _ m e t r i c s <− f u n c t i o n ( f i l e ) { p r i n t ( f i l e ) d a t a <− r e a d . t a b l e ( f i l e ) d e g r e e = d e g r e e _w( d a t a ) b e t w e e n n e s s = mean ( b e t w e e n n e s s _w( d a t a ) [ , ’ b e t w e e n n e s s ’ ] ) c l o s e n e s s = mean ( c l o s e n e s s _w( data , d i r e c t e d =TRUE ) [ , ’ c l o s e n e s s ’ ] ) c l u s t e r i n g = c l u s t e r i n g _w( d a t a ) r e t u r n ( t ( c ( basename ( f i l e ) , mean ( d e g r e e [ , ’ d e g r e e ’ ] ) , mean ( d e g r e e [ , ’ o u t p u t ’ ] ) , b e t w e e n n e s s , c l o s e n e s s , c l u s t e r i n g ) ) ) } i f ( f i l e . e x i s t s ( " o u t p u t " ) ) { f i l e . remove ( " o u t p u t " ) } c o l n a m e s <− c ( " Name " , " D e g r e e " , " S t r e n g t h " , " A v e r a g e ␣ B e t w e e n n e s s ␣ C e n t r a l i t y ␣ I n d e x " , " A v e r a g e ␣ C l o s e n e s s ␣ C e n t r a l i t y ␣ I n d e x " , " C l u s t e r i n g ␣ C o e f f i c i e n t " ) m e t r i c s <− l a p p l y ( f i l e s , c a l c _ m e t r i c s ) y <− do . c a l l ( rbind , m e t r i c s )

w r i t e . t a b l e ( y , " o u t p u t " , s e p = ’ \ t ’ ,

c o l . names= colnames , q u o t e =FALSE , row . names=FALSE )

A.1 Excluded Projects

We were unable to retrieve the commit logs for the following projects: • Apache Lucene.Net • Apache Camel • Apache Wink • Apache Struts • Apache log4cxx • Apache Tuscany • Apache Olingo • Apache Empire-db • Apache Taverna • Apache Marmotta • Apache Spark • Apache Syncope • Apache Cordova • Apache Axiom • Apache Jena

• Apache Spatial Information System • Apache ActiveMQ

• Apache Allura

The following projects were too small to create any meaningful networks: • Apache .net Ant Library

• Apache Compress Ant Library • Apache ORO

By applying social network analysis on the collaboration of open-source develo-

pers for a wide variety of projects a few observations can be made that can give

some valuable insight in the development process of open-source projects.

Motivation

In software development, open source projects are projects where the code is freely available for anyone to read and copy. These projects are developed through a decen- tralized structure, where anyone capable is able to sub- mit changes and improvements to the codebase. This development process differs heavily from the way most companies do their development, yet it is still many ti- mes very successful and can produce products of consi- derable quality. As open source software is commonly used in many closed projects, they also get contributions from many companies. This makes them an interesting target for analysis of the development process as there’s a large amount of data openly available, contrary to projects developed in closed environments.

This led us to perform a study on a large amount of open-source projects hosted by a well-known open- source community called Apache Software Foundation. The foundation hosts many projects with a wide variety of sizes, including several high profile projects such as OpenOffice.

Social network analysis

To analyze developer collaboration we have used the concept of social network analysis. This is done by stu- dying networks representing developers and their collaboration. Performing a social network analysis means applying different metrics to the networks, where each metric looks at a specific property of the networks. Ex- amples of metrics are the number of other developers a specific developer has collaborated with and centrality

measures that shows the influence developers have over one another. The clustering coefficient is another metric that calculates if there exists subgroups in the developer networks. A subgroup is a set of vertices that are well connected to each other. Well connected means that each vertex in the set has an edge to a majority of all the other vertices in the set.

Result

A few conclusions can be drawn from the data gathered from the various projects. The more developers a projects has, the higher the average centrality is. The average centrality follows a line very closely, suggesting that the networks have a similar structure. The majority of the developers have a small centrality index while only a few developers are very central in the projects. On the other hand, we found no real correlation between the clustering coefficient and any other data gathered th- roughout the study.

EXAMENSARBETE Social Network Analysis of Open Source Projects STUDENTER Nicklas Johansson, Christian Tenggren

HANDLEDARE Alma Orucevic-Alagic (LTH) EXAMINATOR Martin Höst (LTH)

Social network analysis of open source projects

POPULÄRVETENSKAPLIG SAMMANFATTNING Nicklas Johansson, Christian Tenggren

In document ++Usted Juega - Zenon Franco (página 45-49)