Self-organizing maps (SOMs) are unsupervised algorithms for clustering different types of data (Kohonen and Somervuo 1998). Unsupervised algorithms are suited to grouping data into clusters that are not predetermined. SOMs are therefore not the best choice when we are interested in labelling data with predetermined categories, which makes them inappropriate for the supervised classification tasks researched in this thesis.
2.2.2.4. Network Measures
Network measures are quantitative methods for measuring and analyzing nodes and edges in networks. Nodes are also known as vertices and edges are also known as links. There are several different types of network measures that are applicable in different situations, depending on the network and what we are interested in analyzing. In some situations, it can be appropriate to find the shortest path between two nodes. In other situations, it could be appropriate to search for the longest path. In many situations, such as in our financial news research, we are interested in measuring how different nodes are connected to each other, as well as how information propagates through the different links in the networks at the same time. In these situations, the shortest path and longest path algorithms are not suitable.
Borgatti (2005) showed that not all types of centrality measures are suitable for all types of networks. His work has since been extended to test more measures (Amrit and ter Maat 2016). Several algorithms can be used to measure information flow between nodes; the ones we examine are degree centrality, closeness centrality, betweenness centrality, and eigenvector centrality. Next, we review these four types of network measures in more detail.
2.2.2.4.1. Degree Centrality
Degree centrality is simply the number of links a node in the network has, which makes it the simplest centrality measure. In directed networks, there are two degree measures for each node: in-degree is the number of edges coming in to a node and out-degree is the number of edges going out from a node. In our research, degree centrality would simply provide us with an absolute order of media attention, which we are not especially interested in. Degree centrality for a node in a network is mathematically represented as in (23), where $v$ is the node in question and $C_D(v)$ is the degree centrality value (Friedkin 1991):

$$C_D(v) = \deg(v) \tag{23}$$
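As a minimal sketch of the degree measures described above, the following Python counts in-degree, out-degree, and total degree from an edge list. The function name `degree_centrality` and the toy edge list are hypothetical and serve only as an illustration; they are not taken from the thesis experiments.

```python
from collections import defaultdict

def degree_centrality(edges, directed=False):
    """Count links per node from a list of (source, target) pairs."""
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for source, target in edges:
        out_deg[source] += 1
        in_deg[target] += 1
    nodes = set(in_deg) | set(out_deg)
    if directed:
        # Directed network: report (in-degree, out-degree) for each node.
        return {v: (in_deg[v], out_deg[v]) for v in nodes}
    # Undirected network: every edge counts once for each of its endpoints.
    return {v: in_deg[v] + out_deg[v] for v in nodes}

# Hypothetical toy network with four nodes.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
total_degree = degree_centrality(edges)
in_out_degree = degree_centrality(edges, directed=True)
```

In this toy network node C has the highest total degree (3), and in the directed reading it has in-degree 2 and out-degree 1.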
2.2.2.4.2. Eigenvector Centrality
Eigenvector centrality measures the influence of nodes in a network and is a more complex version of degree centrality. Like degree centrality, it is based on the links (edges) between nodes, but edges to higher scoring nodes are given greater influence than edges to lower scoring nodes. The relative eigenvector value for a node can be calculated using equation (24), for a graph $G := (V, E)$ with $|V|$ vertices and the adjacency matrix $A = (a_{v,t})$, where $M(v)$ is the set of neighbors of node $v$ and $\lambda$ is a constant (Bonacich 2007):

$$x_v = \frac{1}{\lambda} \sum_{t \in M(v)} x_t \tag{24}$$
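Equation (24) defines each score in terms of the neighbours' scores, so one common way to solve it is power iteration: apply the update repeatedly and rescale until the scores stop changing. The sketch below is an illustration under stated assumptions, not the method used in the thesis. It assumes an undirected, connected, non-bipartite graph with at least one edge (otherwise plain power iteration may fail to converge), and the names `eigenvector_centrality` and `neighbors` as well as the toy graph are hypothetical.

```python
import math

def eigenvector_centrality(neighbors, iterations=200, tol=1e-10):
    """Power iteration for x_v = (1/lambda) * sum of x_t over neighbours t of v."""
    x = {v: 1.0 for v in neighbors}
    lam = 0.0
    for _ in range(iterations):
        # Unnormalized update: sum the current scores of each node's neighbours.
        raw = {v: sum(x[t] for t in neighbors[v]) for v in neighbors}
        # The Euclidean norm of the update is the running estimate of lambda.
        lam = math.sqrt(sum(s * s for s in raw.values()))
        new_x = {v: s / lam for v, s in raw.items()}
        converged = max(abs(new_x[v] - x[v]) for v in neighbors) < tol
        x = new_x
        if converged:
            break
    return x, lam

# Hypothetical toy graph: a triangle A-B-C with a pendant node D attached to C.
neighbors = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}
scores, lam = eigenvector_centrality(neighbors)
```

Node C scores highest here because it is linked to every other node, while D scores lowest even though its single neighbour is the most central node.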
PageRank is related to eigenvector centrality, but has an added scaling factor $\alpha$. PageRank is the original algorithm behind Google's search engine and is calculated as in (25), where $i$ and $j$ are nodes in the network, $N$ is the number of nodes, and $L(j) = \sum_i a_{ji}$ is the number of neighbours of node $j$ (Page et al. 1999):

$$x_i = \alpha \sum_j a_{ji} \frac{x_j}{L(j)} + \frac{1 - \alpha}{N} \tag{25}$$

2.2.2.4.3. Closeness Centrality
Closeness centrality measures the distance between nodes. Between any two connected nodes in a network there is a shortest path, and closeness centrality is based on the average shortest path length from one node to all other nodes in that network: the shorter these paths, the higher the closeness. The assumption behind closeness centrality is that information is transferred only along the shortest path (Brandes and Fleischer 2005). This also disqualifies closeness centrality from being used in our research, as we are interested in measuring information spreading in all directions at the same time. Mathematically, we can represent closeness centrality for a node as in equation (26), where $d(j, i)$ is the distance between two nodes $j$ and $i$ in the network (Brandes and Fleischer 2005):

$$C(i) = \frac{1}{\sum_j d(j, i)} \tag{26}$$
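For an unweighted network, the distances in equation (26) can be found with a breadth-first search from the node in question. The following is a minimal sketch assuming an undirected, connected graph (unreachable nodes would simply be left out of the sum); the function name `closeness_centrality` and the toy graph are hypothetical.

```python
from collections import deque

def closeness_centrality(neighbors, source):
    """Return 1 / (sum of shortest-path distances from source to all other nodes)."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in distance:
                # First visit in a breadth-first search gives the shortest distance.
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    total = sum(distance.values())
    return 1.0 / total if total > 0 else 0.0

# Hypothetical toy graph: a triangle A-B-C with a pendant node D attached to C.
neighbors = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}
closeness = {v: closeness_centrality(neighbors, v) for v in neighbors}
```

Node C is one step from every other node, so its distance sum is 3 and its closeness is 1/3; node D has a distance sum of 5 and a closeness of 0.2.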
Information centrality is another closeness measure, defined by Stephenson and Zelen (1989). Information centrality calculates the harmonic mean of edges instead of the average shortest path. This allows information to flow through each node in a network simultaneously. Information centrality is thus better suited to modelling flow through multiple paths in a network than the standard closeness measure. We discuss the method in more detail in section 4.3.5. Information centrality for a node in a network is calculated as in equation (27), where the pseudo-adjacency matrix $A$ is defined as in (28), $S(i)$ is the strength of node $i$, $w_{ij}$ is the weight of the edge between nodes $i$ and $j$, $n$ is the number of nodes, and $B$ is the matrix whose inverse gives $A$ (Stephenson and Zelen 1989):

$$C(i) = \frac{n}{n A_{ii} + \sum_{j=1}^{n} A_{jj} - 2 \sum_{j=1}^{n} A_{ij}} \tag{27}$$

$$A = B^{-1}, \qquad B_{ij} = \begin{cases} 1 + S(i), & \text{if } i = j \\ 1 - w_{ij}, & \text{else} \end{cases} \tag{28}$$

2.2.2.4.4. Betweenness Centrality
Betweenness centrality quantifies the number of times a node is found on the shortest paths between other nodes. Betweenness has been used in studying human communication in social networks. Nodes that are often found on the shortest path between other nodes are given a higher betweenness value. Betweenness centrality has the same limitation as closeness centrality: it does not model multiple paths simultaneously (Brandes and Fleischer 2005). The formula for calculating betweenness centrality for a node is given in equation (29), where $\sigma_{st}$ is the total number of shortest paths from node $s$ to node $t$ and $\sigma_{st}(v)$ is the number of those paths that pass through the node $v$ (Brandes and Fleischer 2005):

$$C_B(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}} \tag{29}$$

2.2.2.4.5. RiskRank
RiskRank is a network measure for measuring risk. The model has some similarities to information centrality in the sense that it also accounts for multiple flows through a network. The RiskRank measure $RR$ combines the Choquet integral and the Shapley index $v(c_i)$ by Tarashev et al. (2010), and is defined in equation (30), where $v(c) x_c$ is the individual node risk. Its limitation is that interlinkages $I(c_i, c_j)$ are restricted to pairs of nodes. We use RiskRank in our last experiment and discuss the method in more detail in section 3.3.2.1 (Mezei and Sarlin 2017; Tarashev et al. 2010).

$$RR(x_1, \ldots, x_n, x_c) = v(c) x_c + \sum_{i=1}^{n} \Big( v(c_i) - \frac{1}{2} \sum_{j \neq i} I(c_i, c_j) \Big) x_i + \sum_{i}^{n} \sum_{j \neq i}^{n} I(c_i, c_j) \prod (x_i, x_j) \tag{30}$$

2.2.2.5. Summary of Methods
In our automatic classification research, we begin by comparing different pre-processing approaches using the NB machine learning algorithm (section 2.2.2.1.1), mainly because the algorithm is fast to train and test, and because we are interested in defining a baseline performance against which the relative performance of the approaches can be compared. By using NB, we save time on training models and can test many approaches before we start optimizing the classifications. We then extend our methods and compare the performance of the individual classification algorithms DT, SVM, ANN, and k-NN (sections 2.2.2.1.2 – 2.2.2.1.5). The last mathematical method we work with in classification is the majority voting ensemble (section 2.2.2.1.6). While ensemble classification increases the computational requirements for both training and prediction, it can in some cases increase performance.
Human cognitive abilities are limited, and there is a limit to the amount of information that we can process without becoming overloaded and losing focus. Centrality measures can be used to reduce noise by pointing us to the most important nodes in a network. In our financial news research, we start by using information centrality (section 2.2.2.4.3) to analyze the networks quantitatively and qualitatively, as it allows information to flow through multiple paths. In the last part of the financial research, we switch to RiskRank (section 2.2.2.4.5) as our evaluation approach, which allows us to statistically compare different risk thresholds.
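To make the information centrality calculation from section 2.2.2.4.3 more concrete, the sketch below builds the matrix $B$ from a symmetric weight matrix, inverts it to obtain $A$, and evaluates the centrality of each node. It is an illustration under stated assumptions, not the implementation used in the publications: the graph is assumed to be undirected, connected, and free of self-loops, and the helper names `invert` and `information_centrality` as well as the toy weights are hypothetical.

```python
def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (pure Python, small matrices)."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [value / scale for value in aug[col]]
        for row in range(n):
            if row != col:
                factor = aug[row][col]
                aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]
    return [row[n:] for row in aug]

def information_centrality(weights):
    """weights: symmetric n x n list of edge weights w_ij, zero where no edge exists."""
    n = len(weights)
    # S(i): strength of node i, the total weight of its edges (zero diagonal assumed).
    strength = [sum(row) for row in weights]
    # B has 1 + S(i) on the diagonal and 1 - w_ij elsewhere.
    b = [[1 + strength[i] if i == j else 1 - weights[i][j] for j in range(n)]
         for i in range(n)]
    a = invert(b)  # A = B^-1
    trace = sum(a[j][j] for j in range(n))
    # C(i) = n / (n * A_ii + sum_j A_jj - 2 * sum_j A_ij)
    return [n / (n * a[i][i] + trace - 2 * sum(a[i])) for i in range(n)]

# Hypothetical toy graph with unit weights: triangle 0-1-2 plus pendant node 3 on node 2.
weights = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = information_centrality(weights)
```

In this toy graph node 2 obtains the highest score and the pendant node 3 the lowest, because the measure takes every path between two nodes into account rather than only the shortest one.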
Table 2 shows an overview of the mathematical methods that were used in the different publications.
| Paper | Method used | Result |
|---|---|---|
| 1 | Key word extraction based on TF-IDF weighting, NER | No quantitative data output, evaluation done through survey |
| 2 | Machine learning algorithm: naïve Bayes | The classification models offer improved performance over the baseline |
| 3 | Machine learning algorithms: naïve Bayes | The classification model is compared to previous models and shows improvements |
| 4 | Information centrality measure used to rank companies | Quantitative measures in the form of ranked information centrality news flow, no baseline comparison available |
| 5 | Individual and aggregated risk measures through RiskRank | Individual and aggregated risks evaluated against benchmarks |
| 6 | Machine learning algorithms: SVM, NN, DT, k-NN, ensemble voting | The classification models are compared to the previous results and show improvements |