expensive, we could not provide updates on the dataset in a frequent and automatic manner. This results in a number of threats to the availability of models in the dataset, such as models that are no longer available or accessible, and mid-flight projects that were missed because UML models were introduced after the time of analysis.
6.3.2 Data Curation
Having collected the dataset, we moved on to more in-depth studies of the use of UML models in open source projects. As these studies require a set of projects and models with specific characteristics, the dataset had to be curated. For example, when studying the practices and perception of UML use, we were interested in projects where we could observe long-term use of UML models and collaboration between contributors [165]. To obtain these projects, we applied filters on the number of contributors, the number of commits and the active time-span. In doing so, we were willing to accept false rejections (e.g. ‘serious’ projects that use UML might be rejected) in favour of no, or very few, false positives.
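The filtering step described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used for the dataset: the `Project` record, its field names, and all threshold values are hypothetical, since the text does not state them.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Project:
    """Hypothetical per-project record; field names are illustrative."""
    name: str
    contributors: int
    commits: int
    first_commit: date
    last_commit: date

    @property
    def active_days(self) -> int:
        # Active time-span between the first and the last commit.
        return (self.last_commit - self.first_commit).days


def curate(projects, min_contributors=2, min_commits=100, min_active_days=180):
    """Keep only the projects that pass every filter.

    The thresholds are placeholders (assumptions). Strict values accept
    false rejections in exchange for very few false positives.
    """
    return [
        p for p in projects
        if p.contributors >= min_contributors
        and p.commits >= min_commits
        and p.active_days >= min_active_days
    ]


projects = [
    Project("long-lived", 12, 2400, date(2012, 3, 1), date(2016, 9, 30)),
    Project("course-assignment", 1, 15, date(2015, 2, 1), date(2015, 2, 20)),
]
selected = curate(projects)
```

Because the three conditions are combined with a logical AND, raising any single threshold can only shrink the selected set, which matches the stated preference for false rejections over false positives.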
Curation can also be achieved by adding extra knowledge to the existing dataset. In particular, we performed two classifications: a) a manual classification of the types of UML models, and b) an automatic classification of diagrams into reverse engineering diagrams and forward design diagrams [166]. The classification results were then added to the dataset as annotations.
6.3.3 Sharing the Lindholmen dataset
The availability of the Lindholmen dataset has attracted researchers in the field to use and study it. For example, El Ahmar et al. used more than 3500 diagrams from the dataset to study the use of visual variables (such as size, brightness, texture/grain, etc.) in UML models in open source projects [61]. Schulze et al. used 50 sequence diagrams from the Lindholmen dataset for evaluating their automatic layout and label management [64]. Unfortunately, the results of these studies have never been integrated into the Lindholmen dataset as annotations, for two reasons. Firstly, these investigations were conducted on small subsets of the dataset, making it hard to generalize their results to the whole dataset. Secondly, there has been no systematic and convenient way for researchers to integrate their findings into the dataset.
6.4 Challenges for Big-data Driven Empirical Studies in Software Architecture
In this section, we discuss the challenges (C) for conducting big-data driven empirical studies on software architectures. The discussion reflects our observations on research in the field as well as our experience in building the Lindholmen dataset.
C1: Finding a common representation for software architectures. Source code is always represented as some type of text file that conforms to a formal grammar. For example, object-oriented source code consists of classes, methods and interfaces. Indeed, the aim of being ‘compilable’ enforces that the source code conforms to a formal grammar. Notwithstanding the existence of standards for software architecture and UML, there is a very high diversity in the representation of software architecture across different projects. Software architecture documents may be represented in formats as diverse as Word (doc(x)), PDF, HTML and PowerPoint (ppt), among others. The content of software architecture documents is a mix of natural language, images, and sometimes tables and diagrams. It combines descriptions of the system architecture with, in some cases, design principles, design rationale, and even source code examples. This complicates the definition of a common representation (data-model) that specifies which information to represent for each architecture.
C2: Capturing relevant context information. The main purpose of source code is to represent the implementation in a manner that is compilable and executable by a computer. Software architecture, on the other hand, serves different purposes for different consumers over time: in the early stages of a project, architecture documentation is typically used to create a shared understanding among architects. Later on, documentation is produced after (or in concert with) the implementation, and serves to align architecture and implementation. Moreover, the documentation serves as a reference for developers to record which parts of the system have been implemented and stabilized. Also, testers of the software draw on information from the software architecture, e.g. to understand quality objectives as well as scoping decisions. In open source repositories, we can observe the production of architecture, but not its use, purpose or aims. The way an architecture is used is key to analysing the benefits that can be harvested from it. This includes the processes and practices of the project (such as quality assurance, processes for monitoring conformance of the implementation, or the way in which architecture is used in producing the implementation).
Indeed, the representation, completeness and level of abstraction of the description of an architecture depend on the stage of the project it is used in: at the start of a project, the architecture may not yet have crystallized, hence little of the system is represented by an explicit architecture representation. For mature projects, architecture documentation usually focuses on high-level views of the system (so as to be able to provide one overview of the system), especially in large software projects. As a consequence, the representation will need to leave out many details. In summary, when we want to understand the role of architecture in a project, we also need to consider various contextual factors, such as the stage of development, project size and geographical distribution of the development team. Figure 6.1 generalizes the complex nested contexts that influence the goals of architecture and thereby the various processes, practices and tools used (generalized from [84]). This figure illustrates the empirical finding that there is a hierarchy of contexts that influence how software practices are used. There are organizational and project factors that include the goals of the stakeholders. For example, these may prioritize delivery date over quality of the software. Such priorities in turn
affect the ways in which architecting is done. In particular, they will affect the goals of architecting and, via these, also the processes, practices and tools used for architecting. Indeed, for a true understanding of the value of architecture practices, all these context factors would need to be understood. However, this contextual data is typically not obtainable via ‘artefact mining’ approaches.
Figure 6.1: Impacts of contexts to software modeling approach
[Diagram not reproduced. Labels in the figure: Organization Context; Project Context; Stakeholder has Goal; Project has Stage; SE-Goals; Modeling Goals; SE Approach; SE-Process; SE-Practices; SE-Tools; Approach to architecting; Modeling Process; Modeling Practices; Modeling Tools; Approach to Implementation; Approach to Documentation. The elements are connected by ‘has’ and ‘drives’ relations.]
C3: High effort for crawling big-data. Empirical research into software architecture requires a non-trivial amount of (empirical) software architecture data in order to draw representative findings and conclusions. However, collecting or building such a dataset is challenging for the following two reasons.
Firstly, due to the vast variety in the representation and use of software architecture, identifying such SADs is a huge challenge per se. This becomes even more challenging when searching for SADs in big data sources such as GitHub or SourceForge. For example, when building the Lindholmen dataset, it was impossible to manually scan through all of GitHub to look for UML models. We had to apply heuristic searches and develop automated methods to identify UML models in different file formats, including images. Building up such technology was a challenging and time-consuming task in itself [117].
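A first heuristic pass of this kind can be sketched as a filter over repository file paths. The sketch below is an assumption-laden illustration, not the method of [117]: the extension lists, the file-name hints and the function `classify_candidate` are all hypothetical, and image candidates would still need to be verified, manually or by an automated method.

```python
from pathlib import PurePosixPath

# Assumed extension lists; the text does not enumerate the formats searched.
MODEL_EXTENSIONS = {".uml", ".xmi"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".svg", ".bmp"}
# Assumed file-name hints used to narrow down image candidates.
NAME_HINTS = ("uml", "diagram", "sequence", "model")


def classify_candidate(path: str) -> str:
    """Return 'model', 'image' or 'skip' for a repository file path.

    This is only a cheap pre-filter: an 'image' result marks a candidate
    that needs further checking, not a confirmed UML diagram.
    """
    p = PurePosixPath(path.lower())
    if p.suffix in MODEL_EXTENSIONS:
        return "model"
    if p.suffix in IMAGE_EXTENSIONS and any(hint in p.name for hint in NAME_HINTS):
        return "image"
    return "skip"


paths = ["docs/Design.UML", "docs/img/class_diagram.png", "src/logo.png", "README.md"]
labels = {path: classify_candidate(path) for path in paths}
```

A filter like this reduces the number of files that the more expensive identification step has to inspect, at the cost of missing diagrams stored under uninformative names.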
Secondly, in addition to the unavailability of automatic identification methods, limited human and machine resources can strongly affect the amount and the quality of the software architecture data that can be collected. In particular, studies that involve the identification of SADs often target a small number of SADs because of limited human resources within the research team (for identifying, verifying and maintaining the data), thus running the risk that the data is not representative. Moreover, for many studies that use the GitHub API (such as [167]), the limit of 5000 requests per hour is a technical challenge that restricts the speed and scope of the search for SADs.
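The effect of the request limit on crawl time can be estimated with simple arithmetic. The following sketch assumes a single API credential and uses hypothetical workload figures (the number of repositories and the requests per repository are not taken from the text); only the limit of 5000 requests per hour comes from the discussion above.

```python
HOURLY_LIMIT = 5000  # requests per hour, as stated in the text


def rate_limit_windows(total_requests: int, hourly_limit: int = HOURLY_LIMIT) -> int:
    """Number of one-hour rate-limit windows needed to issue all requests.

    This is an optimistic estimate: it ignores network latency, retries
    and any secondary limits, so a real crawl would take at least as long.
    """
    if total_requests < 0 or hourly_limit < 1:
        raise ValueError("total_requests must be >= 0 and hourly_limit >= 1")
    # Integer ceiling division avoids floating-point rounding.
    return (total_requests + hourly_limit - 1) // hourly_limit


# Hypothetical workload: 100,000 repositories, 10 API requests per repository.
windows = rate_limit_windows(100_000 * 10)
days = windows / 24
```

Under these assumed figures the crawl needs 200 hourly windows, i.e. more than eight days of continuous requests, which illustrates why the limit constrains both the speed and the scope of a search.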
C4: High effort for curation. Collecting software architecture artefacts from open source requires a lot of curation. Firstly, public repositories are frequently very ‘noisy’: they do not only contain software development projects, but also, e.g., student projects and course material [127]. Secondly, as argued in the previous section, SADs exist in a very wide variety. Studies that aim to employ ‘big data’ or machine learning techniques must realize that there are “many different animals in the SAD-zoo” that share very little commonality. One way to understand the zoo of SADs is to enable community or crowd-sourced curation, e.g., through annotation and classification. We elaborate on the need for curation in the next section. Another recommendation is to set up mechanisms as early as possible to monitor and improve the quality of the dataset. Given that large volumes of data are typically involved, this must be automated as much as possible. This is complicated by the fact that each entry (datapoint) for one software architecture is rich in many different types of attributes and context factors.
C5: Collaborating in empirical software architecture research. Collaboration has become a common practice in big-data driven empirical research. This is because such research often requires a huge amount of effort, and a single researcher might not be able to cover all parts on their own. For example, building the Lindholmen dataset required collaboration between researchers specialised in specific fields: some were more specialised in mining big data from GitHub, while others were responsible for developing techniques for detecting UML content in arbitrary files. Prior to forming the working team, it was important to establish the research intent and to look for potential collaborators via the researchers’ own networks. When analysing data, the researchers needed to communicate with each other about the steps and progress of the data analysis as well as the preliminary results. Team effort was also needed in developing a community around the research. This included creating a website, communicating with relevant research groups at various conferences and workshops, and responding to (extra-feature) requests from the research community. However, the level of (tool) support for these collaborative empirical research activities was far from sufficient. The authors of the Lindholmen dataset were not aware of any tool that supports teamwork across all the above-mentioned activities.