To test the potential effectiveness of the proposed system in distinguishing between observations generated from normal traffic and those generated from different types of attacks, we simulated the execution of the distributed k-means-based clustering algorithm described above on an artificially generated yet realistic set of SNMP observations.
We first generated the dataset by setting up a network with a “victim” server, monitored by a separate host and connected to a network including machines that simulated both regular clients and attackers. We gathered SNMP data from the monitored server at regular intervals in five different sessions. During the first session, we emulated regular network behavior by generating HTTP requests from the clients. In the following four sessions, we added malicious traffic from the attackers by placing the same server under a different kind of network attack each time. The attacks used in the respective sessions were:
1. Denial of Service,
2. Distributed Denial of Service,
3. Denial of Service on SSH,
4. Brute force on SSH.
We produced a total of 5,655 observations divided into five classes according to the session in which each one was generated: one class corresponds to the absence of attacks, while each of the remaining four corresponds to one of the listed network attacks. After building this dataset, we reduced its initial set of hundreds of features to only 14 representative ones, using the selection algorithm mentioned above.
A.5. Simulation setup
This dataset was tested in multiple simulations of the clustering algorithm. Each simulation sets up a virtual network of nodes, assigns a training set and a test set to each node, runs the algorithm with each node using its training set, and finally measures accuracy indicators on all nodes using their respective test sets. To assign data to nodes, we used specific data distribution algorithms that extract from the whole dataset a training subset and a test subset for each node. Simulations differ from one another in three main respects: the topology of the network on which the algorithm is run, the distribution of data across the nodes, and the parameters specific to the clustering algorithm. For the distributed k-means variant presented above, we mainly varied the number k of clusters.
Five different topologies with 64 nodes were tested: a scale-free network, a ring, a ring with 16 random additional links, a torus and a fully connected mesh. Regarding the distribution of data across nodes, two general strategies were tested to provide a training set and a test set to each node: (A) picking n classes and distributing different observations of those same classes to the two sets, or (B) picking n classes independently for each of the two sets and providing all observations of the picked classes. In all cases, data is distributed independently to each node, and the distributed classes always include that of regular traffic, as it is expected to be observed much more often than the others in a real deployment.

For each tested combination of parameters, 20 random distributions of data are drawn according to the chosen distribution strategy; for each of these, the distributed k-means algorithm is run 50 times with different random initial positions of the centroids. Of these 50 runs, only the one with the best results is considered, on the assumption that existing methods for optimal centroid initialization would be applied in practice. The results of the 20 “best-case” simulations with differently distributed data are then averaged, as the distribution of data cannot be controlled in a real deployment.
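As an illustration of the setup just described, the following Python sketch builds four of the five 64-node topologies and applies the best-of-50, average-of-20 aggregation. It is not the simulator used for the experiments: all function names are hypothetical, the torus is assumed to be an 8 x 8 grid with wrap-around, the 16 extra ring links are assumed to join distinct non-adjacent node pairs chosen uniformly at random, and the scale-free network is omitted because the text does not say how it was generated.

```python
import random


def ring(n):
    """Adjacency sets for an n-node ring: node i links to i-1 and i+1 (mod n)."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def ring_with_random_links(n, extra, rng):
    """Ring plus `extra` new links between random, not yet connected node pairs."""
    adj = ring(n)
    added = 0
    while added < extra:
        a, b = rng.sample(range(n), 2)  # two distinct nodes
        if b not in adj[a]:             # skip pairs that are already linked
            adj[a].add(b)
            adj[b].add(a)
            added += 1
    return adj


def torus(rows, cols):
    """2D grid with wrap-around: each node links to its four grid neighbours."""
    adj = {}
    for r in range(rows):
        for c in range(cols):
            adj[r * cols + c] = {
                ((r - 1) % rows) * cols + c,
                ((r + 1) % rows) * cols + c,
                r * cols + (c - 1) % cols,
                r * cols + (c + 1) % cols,
            }
    return adj


def full_mesh(n):
    """Every node is linked to every other node."""
    return {i: set(range(n)) - {i} for i in range(n)}


def edge_count(adj):
    """Number of undirected links in an adjacency-set graph."""
    return sum(len(neighbours) for neighbours in adj.values()) // 2


def best_case_average(scores):
    """scores[d][r] is the accuracy of run r on data distribution d.

    Keep the best run for each distribution, then average across distributions.
    """
    best = [max(runs) for runs in scores]
    return sum(best) / len(best)


if __name__ == "__main__":
    rng = random.Random(0)
    print(edge_count(ring(64)))
    print(edge_count(ring_with_random_links(64, 16, rng)))
    print(edge_count(torus(8, 8)))
    print(edge_count(full_mesh(64)))
```

In the protocol described above, `scores` would hold 20 lists of 50 accuracy values each, one list per random data distribution.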
For each single simulation, all accuracy measures are averaged across all nodes. The main accuracy measure, used to determine which runs are the best, is the ratio of test observations correctly classified under the best possible mapping from clusters to classes (assumed to be supervised), referred to as attack identification accuracy. In addition, attack detection accuracy is measured, which only considers the ability of the nodes to distinguish regular traffic from attacks, regardless of their class, so that reporting an attack of one type in response to an attack of a different type is not considered an error. To give an indication of the type of errors committed, the false positive rate (ratio of regular traffic observations misclassified as threatening) and the false negative rate (ratio of undetected attack observations) are also measured.

Appendix A. Network Security through Distributed Clustering

[Figure A.2: four panels, (a) Traffic class identification, (b) Attack detection, (c) False positive rate, (d) False negative rate. X axis: training classes for each node (2 to 5). Y axis: percentage. Series: Fully connected, Torus, Ring w/ random links, Ring, Scale free.]

Figure A.2 – Percentage accuracy measures for distributed k-means (on Y axis, in percentage) run on different 64-node topologies, using data distribution strategy A (train and test on different data of same classes) with a variable number of training classes (on X axis).
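The four accuracy measures defined above can be sketched for a single node as follows. This is a minimal illustration, not the evaluation code used for the experiments: it assumes that the "best possible mapping" is many-to-one and assigns each cluster to the most frequent true class among the test observations it contains, and that the regular-traffic class is labelled `0`. The names `REGULAR`, `best_mapping` and `accuracy_measures` are hypothetical.

```python
from collections import Counter, defaultdict

REGULAR = 0  # hypothetical label of the regular-traffic class


def best_mapping(clusters, labels):
    """Map each cluster to the most frequent true class among its members."""
    counts = defaultdict(Counter)
    for cluster, label in zip(clusters, labels):
        counts[cluster][label] += 1
    return {c: cnt.most_common(1)[0][0] for c, cnt in counts.items()}


def accuracy_measures(clusters, labels):
    """Compute the four measures for one node's test set.

    clusters[i] is the cluster assigned to test observation i,
    labels[i] is its true class.
    """
    mapping = best_mapping(clusters, labels)
    predicted = [mapping[c] for c in clusters]
    pairs = list(zip(predicted, labels))
    n = len(pairs)

    # Identification: the predicted class must match the true class exactly.
    identification = sum(p == y for p, y in pairs) / n
    # Detection: only the regular-versus-attack decision has to be right.
    detection = sum((p == REGULAR) == (y == REGULAR) for p, y in pairs) / n

    on_regular = [p for p, y in pairs if y == REGULAR]
    on_attacks = [p for p, y in pairs if y != REGULAR]
    # False positives: regular observations classified as some attack.
    fpr = sum(p != REGULAR for p in on_regular) / len(on_regular) if on_regular else 0.0
    # False negatives: attack observations classified as regular traffic.
    fnr = sum(p == REGULAR for p in on_attacks) / len(on_attacks) if on_attacks else 0.0

    return {
        "identification": identification,
        "detection": detection,
        "false_positive_rate": fpr,
        "false_negative_rate": fnr,
    }


if __name__ == "__main__":
    clusters = [0, 0, 0, 1, 1, 2, 2, 2]
    labels = [0, 0, 1, 1, 1, 2, 2, 1]
    print(accuracy_measures(clusters, labels))
```

In the toy example, the last observation is an attack of class 1 placed in a cluster mapped to class 2: it counts as an identification error but not as a detection error, which is the distinction the text draws between the two accuracy measures.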