Ranking components of scientific software using spectral methods

In this thesis we investigate the centrality ranking of functions in call graphs of scientific software using spectral methods. Credit for much of the work described in this thesis goes to my supervisor, Dr. In this thesis, we study software systems specifically designed for problems arising in scientific and engineering applications [15].

The architecture of a software system involves the specification of the elements of which it is composed and the pattern of interactions between those elements. In this thesis, we study the architecture of scientific software represented as a component dependency network. The central research question that we address is: which components of the network are important?

MOTIVATION AND BACKGROUND

SCIENTIFIC COMPUTING SOFTWARE

ORGANIZATION OF THE THESIS

CONTRIBUTION OF THE THESIS

To the best of our knowledge, the work presented in this thesis is the first application of spectral analysis to the scientific software domain. Part of the work presented in this thesis has been accepted for publication as a proceedings paper in the 17th International Dependency And Structural Modeling Conference, DSM 2015, Fort Worth, Texas, USA. The complexity of product design has been a critical topic for both researchers and managers for many years.

A DSM is a tool used to compactly represent the different types of dependencies that exist between different types of components. It is a powerful modeling tool that enables the visualization of a system's architecture and the interactions among its elements. Here it is used to capture and analyze patterns of interdependencies between functions.

DEPENDENCY EXTRACTION

A tolerant heavyweight extractor provides complete static call graphs. The tool OINK is a heavy-duty extractor. For example, looking at Figure 2.3, we can see that the file symtable.cc depends on the files token.cc and token.h, since there is an edge from symtable.cc to token.h and to token.cc. Similarly, looking at the usage dependency graph, we can see that token.cc uses token.h 5 times.

An edge from token.cc to token.h with an integer value of 5 indicates that token.cc uses 5 entities or objects from token.h. Dependency Browser information is stored in a comma-separated value (csv) file called usesdb.csv. In Figure 2.5, we can see that there is only one edge from token.cc to token.h with an integer value of 2.
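The step from a stored edge list to a weighted dependency matrix can be sketched as follows. This is a minimal illustration only: the column layout of usesdb.csv is not given in the text, so the `source,target,count` header, the helper names `load_weighted_dsm` and `to_unweighted`, and the counts on the two symtable.cc rows are all assumptions made for the example. Only the token.cc to token.h count of 5 comes from the text.

```python
import csv
import io

# Hypothetical stand-in for usesdb.csv. The real column layout is not
# described in the text, so "source,target,count" is an assumption, and
# the counts on the symtable.cc rows are placeholder values.
SAMPLE_CSV = """source,target,count
token.cc,token.h,5
symtable.cc,token.h,1
symtable.cc,token.cc,1
"""

def load_weighted_dsm(csv_text):
    """Return (names, matrix) where matrix[i][j] is the usage count i -> j."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    names = sorted({r["source"] for r in rows} | {r["target"] for r in rows})
    index = {name: k for k, name in enumerate(names)}
    matrix = [[0] * len(names) for _ in names]
    for r in rows:
        matrix[index[r["source"]]][index[r["target"]]] += int(r["count"])
    return names, matrix

def to_unweighted(matrix):
    """Collapse usage counts to 0/1 to obtain the plain dependency matrix."""
    return [[1 if w > 0 else 0 for w in row] for row in matrix]

names, dsm = load_weighted_dsm(SAMPLE_CSV)
binary_dsm = to_unweighted(dsm)
```

Keeping the counts in the matrix preserves the edge weights, while `to_unweighted` gives the corresponding unweighted graph on the same node ordering.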

MATRICES AND GRAPHS

Graphs are more intuitive than DSMs, but become difficult to understand as the number of nodes and edges increases. On the other hand, large and complex graphs can be represented very efficiently by a DSM. Technically, the differences between matrix and graphical representations are minimal because the same data can be stored and presented in different ways [21].

Common product representation models can generally be classified as a matrix-based or a graph-based model. The HDDSM modeling process reduces the effort required of researchers and allows the model to be created over many people or completed at different times. A hierarchical systems approach is used, where the system is divided into groups at the beginning of the model creation process.

Then for each cluster, the HDDSM can be created separately and merged back into the system model. After merging the HDDSM array into the HDDSM system, only a subset of interactions between array elements and other system elements needs to be reviewed [28]. The original intention of CFG was to formally represent concepts derived from a function structure.

If the overall task can be adequately defined, then it is possible to integrate the inputs/outputs of all the quantities involved in an overall function. In this thesis our objective is to define the centrality of the components of scientific software. The idea of centrality has been applied by researchers in many fields, such as communication networks and the analysis of organizational and technological structures.

In complex networks it is common to focus on the pattern of interactions between individual components in a system, and therefore the idea of centrality has been used in large networks.

EIGENVALUES AND EIGENVECTORS

HYPERTEXT INDUCED TOPICS SEARCH

In the $k$th iteration, node $i$ is assigned a new authority weight $x_i^{(k)}$ equal to the sum of $y_j^{(k-1)}$, where the sum runs over every node $j$ that points to node $i$. The new hub weight $y_i^{(k)}$ is the sum of $x_j^{(k)}$, where the sum runs over the nodes $j$ that node $i$ points to. The hub weights are thus calculated from the current authority weights, which were in turn calculated from the previous hub weights.

Similarly, if a node is pointed to by many nodes with large $y$ values, it receives a large $x$ value [20].
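The iteration described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the code used in the thesis: the function name `hits`, the fixed iteration count, the Euclidean normalization at each step, and the three-node example graph are all choices made for this example.

```python
import math

def hits(adj, iterations=50):
    """Return (authority, hub) weights for a directed graph.

    adj[i][j] != 0 means node i points to node j (0-based indices).
    """
    n = len(adj)
    x = [1.0] * n  # authority weights
    y = [1.0] * n  # hub weights
    for _ in range(iterations):
        # Authority update: sum the hub weights of the nodes pointing to i.
        x = [sum(adj[j][i] * y[j] for j in range(n)) for i in range(n)]
        # Hub update: sum the new authority weights of the nodes i points to.
        y = [sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
        # Normalize so the weights stay bounded (assumed: Euclidean norm).
        norm_x = math.sqrt(sum(v * v for v in x)) or 1.0
        norm_y = math.sqrt(sum(v * v for v in y)) or 1.0
        x = [v / norm_x for v in x]
        y = [v / norm_y for v in y]
    return x, y

# Illustrative graph: nodes 0 and 1 both point to node 2.
adj = [
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 0],
]
authority, hub = hits(adj)
```

In this small example node 2 collects all the authority weight, while nodes 0 and 1 share the hub weight equally. Because the update multiplies by the matrix entries, the same function also accepts a weighted adjacency matrix.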

Just by looking at the graph and the connectivity of the nodes, we can see that in terms of hubs there is a connection between nodes 1, 2 and 3.

BENZI’S METHOD

According to Freeman, the degree of a node is an indicator of the centrality or importance of that node in a graph. However, the degree of a node represents its connection to other nodes only in its immediate neighborhood. Since then, some researchers have expanded the notion to good connectivity, i.e. whether many other nodes can be reached from the node in question, giving a more accurate measure of node centrality.
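A small sketch may help show why degree alone is a purely local measure. The helper names `out_degrees` and `in_degrees` and the four-node chain are assumptions made for this illustration.

```python
def out_degrees(adj):
    """Number of outgoing edges per node; adj[i][j] != 0 means i -> j."""
    return [sum(1 for w in row if w) for row in adj]

def in_degrees(adj):
    """Number of incoming edges per node."""
    n = len(adj)
    return [sum(1 for i in range(n) if adj[i][j]) for j in range(n)]

# Illustrative directed chain 0 -> 1 -> 2 -> 3.
chain = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
out_deg = out_degrees(chain)  # [1, 1, 1, 0]
in_deg = in_degrees(chain)    # [0, 1, 1, 1]
```

Nodes 0, 1 and 2 all have out-degree 1, even though node 0 can reach three other nodes and node 2 only one. Degree cannot separate them because it never looks past direct neighbors.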

The diagonal entry $A^l(i,i)$ gives the number of closed walks of length $l$ that start and end at node $i$. The identity matrix $A^0 = I$ can be interpreted as meaning that every node in the graph is connected to itself by a walk of length zero. Thus, the concept of node degree can be extended to the concept of well-connectedness or centrality by counting the number of walks of every length passing through a node.
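Counting closed walks with matrix powers can be checked on a small case. The sketch below is illustrative only: the helper names `matmul` and `closed_walks` and the triangle graph are assumptions for this example.

```python
def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def closed_walks(adj, length):
    """Diagonal of adj**length: closed walks of that length at each node."""
    n = len(adj)
    power = [[1 if i == j else 0 for j in range(n)] for i in range(n)]  # A^0 = I
    for _ in range(length):
        power = matmul(power, adj)
    return [power[i][i] for i in range(n)]

# Illustrative undirected triangle on nodes 0, 1, 2.
triangle = [
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
]
walks0 = closed_walks(triangle, 0)  # [1, 1, 1]: the zero-length walk
walks2 = closed_walks(triangle, 2)  # [2, 2, 2]: out and back, equals the degree
walks3 = closed_walks(triangle, 3)  # [2, 2, 2]: the triangle in both directions
```

The length-2 count reproduces the degree of each node, which is one way to see walk counting as a generalization of degree centrality.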

One of the many scaling factors used by researchers is the factorial of the walk length, which gives rise to the infinite series of the matrix exponential [2]. In the undirected case, that is, the network A, each node has only one role to play in the network: any information that enters the node along an edge can also leave along that edge. On the other hand, the directed network, B, has two roles for each node: hub and authority.
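Weighting walks of length $l$ by $1/l!$ and summing gives $e^{A} = \sum_{l \ge 0} A^l / l!$, whose diagonal entries can serve as centrality scores. The sketch below evaluates a truncated version of this series directly. It is an illustration under stated assumptions: the names `matmul` and `expm_series`, the cutoff of 30 terms, and the three-node path graph are choices made for the example, and a production code would normally call a library routine for the matrix exponential instead.

```python
import math

def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm_series(a, terms=30):
    """Truncated series sum_{l < terms} a**l / l! (fine for small matrices)."""
    n = len(a)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]  # holds a**l / l!, starting at l = 0
    for l in range(1, terms):
        term = [[v / l for v in row] for row in matmul(term, a)]
        for i in range(n):
            for j in range(n):
                result[i][j] += term[i][j]
    return result

# Illustrative undirected path 0 - 1 - 2.
path = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
exp_path = expm_series(path)
centrality = [exp_path[i][i] for i in range(3)]
```

The middle node obtains the largest diagonal entry, and the two end nodes obtain equal scores, which matches the intuition that the middle node takes part in more walks.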

A node with a high hub ranking is unlikely to also have a high authority ranking, but each node can still be seen as acting in both of these roles. Using the matrix exponential for the ranking of hubs and authorities amounts to using the entries of all eigenvectors of B, weighted by the exponential of the corresponding eigenvalues [2]. As discussed in Chapter 3, we attempt to determine the centrality of the components of scientific software.

The matrix exponential method for computing hubs and authorities is compared with the HITS algorithm on both small examples and three open-source scientific software packages.
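One way to obtain both scores from a single matrix exponential is sketched below. It assumes the bipartite construction from the literature on ranking hubs and authorities with matrix functions: for a directed graph with $n \times n$ adjacency matrix $A$, form the symmetric $2n \times 2n$ block matrix with $A$ in the upper-right block and $A^T$ in the lower-left block. The first $n$ diagonal entries of its exponential are then read as hub scores and the last $n$ as authority scores. The function names, the truncated series, and the example graph are assumptions made for this illustration, not the code used in the thesis.

```python
import math

def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm_series(a, terms=30):
    """Truncated series sum_{l < terms} a**l / l! (fine for small matrices)."""
    n = len(a)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for l in range(1, terms):
        term = [[v / l for v in row] for row in matmul(term, a)]
        for i in range(n):
            for j in range(n):
                result[i][j] += term[i][j]
    return result

def hub_authority_scores(adj):
    """Return (hub, authority) from the exponential of the bipartite matrix."""
    n = len(adj)
    big = [[0.0] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            big[i][n + j] = float(adj[i][j])  # upper-right block: A
            big[n + j][i] = float(adj[i][j])  # lower-left block: A transposed
    e = expm_series(big)
    hub = [e[i][i] for i in range(n)]
    authority = [e[n + i][n + i] for i in range(n)]
    return hub, authority

# Illustrative graph: nodes 0 and 1 both call node 2.
adj = [
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 0],
]
hub, authority = hub_authority_scores(adj)
```

Here nodes 0 and 1 tie as the best hubs and node 2 is the top authority. A node with no outgoing or incoming edges keeps the baseline score of 1 contributed by the zero-length walk. Passing a weighted adjacency matrix uses the edge weights in the same way.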

EXPERIMENTAL RESULTS

While it is intuitive that node 2 should receive a high score for both hub and authority, looking at the degrees alone cannot distinguish the remaining nodes.

Table 4.2: Ranking using HITS algorithm

Table 4.3: Degree centrality score

Looking at Table 4.4 and Figure 4.6, we see that the hub rankings produced by the HITS algorithm and by Benzi's method are the same. When $e^{B}$ is used to calculate the hub and authority scores, node 1 is given a higher authority ranking than all other nodes.

Table 4.4: Ranking using HITS algorithm

A plot of the eigenvalues of the bipartite matrix of ADOL-C can be found in Figure 4.12. Comparing Benzi's method on the unweighted call graph with the weighted call graph in Figure 4.15, we can see that for the unweighted call graph, in terms of authority score, the two functions free_loc() and next_loc() both rank at number three. On the other hand, for the weighted call graph, the ranking is much clearer: the free_loc() function is ranked number three and the next_loc() function is ranked number two.

In this case, the matrix exponential-based method applied to the weighted call graph produced a clearer and more distinct estimate of authority relative to the rankings of the ob-. If we compare the results of unweighted and weighted call graphs (Benzi's method), we see that in terms of nodes, the cs_malloc() function ranks first among non-. But for the weighted call graph, we can see that all five functions have different ranks, making the ranking unambiguous.

Hub: The two functions call() and operator[]*() are ranked number one and number two, respectively, for the unweighted call graph. For the weighted call graph, the function operator[]*() and the call() function are ranked number one and number two, respectively. Authority: For the unweighted call graph, if we look at Figure 4.25, we can see some ambiguity for the hes_fg_map() and jac_g_map() functions.

The DSM of the call graph provides a convenient tool so that linear algebraic techniques can be applied to identify important callers and callees through the computation of matrix exponentials. HITS and the method of Benzi et al. were applied to both the unweighted call graph and the weighted call graph. Our experiments also indicate that the ranking obtained from the method of Benzi et al. applied to the unweighted call graph differs from the ranking obtained from the weighted call graph.

As we have seen in Chapter 4, there is some ambiguity in the ranking obtained from the unweighted call graph, but this ambiguity is resolved when the method is applied to the weighted call graph, where all scores are distinct. This is to be expected since the weighted call graph contains more information than the unweighted call graph.

SUMMARY AND FUTURE WORK

Dependency models as a basis for software product platform modularity analysis: a case study in strategic software design rationalization.