scaling MDS

Before applying any clustering or classification algorithm to identify clusters or groups in network data, the data is analysed carefully. Multi-dimensional scaling (MDS) is an information visualization technique from statistics for exploring similarities or dissimilarities in data. It provides an aerial view of the natural clusters forming in the data by reducing multi-dimensional data to 2 or 3 dimensions for visualization. MDS thus provides a means to see how far apart, or close together, the data points are in two-dimensional space.
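As an illustration of the dimensionality reduction described above, the following is a minimal sketch of classical (Torgerson) MDS in pure Python. It is not the implementation used in this work; the function name classical_mds and its parameters are assumptions made for the example, and it assumes a symmetric, roughly Euclidean distance matrix.

```python
import math
import random

def classical_mds(dist, dims=2, iters=500, seed=0):
    """Embed n points in `dims` dimensions from a symmetric n x n distance matrix."""
    n = len(dist)
    # Double-centre the squared distances to obtain the Gram matrix B.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    def matvec(m, v):
        return [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Power iteration converges to the largest-magnitude eigenvector of B.
        v = [rng.random() for _ in range(n)]
        for _ in range(iters):
            w = matvec(b, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        vv = sum(x * x for x in v)
        eigval = sum(x * y for x, y in zip(v, matvec(b, v))) / vv
        if eigval <= 1e-12:
            continue  # non-positive eigenvalue: leave this axis at zero
        scale = math.sqrt(eigval / vv)
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate B so the next pass finds the next eigenvector.
        b = [[b[i][j] - eigval * v[i] * v[j] / vv for j in range(n)]
             for i in range(n)]
    return coords
```

Plotting the returned coordinate pairs gives a 2D scatter in which points with small pairwise distances appear close together.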

Figure 5.2: A 2D visualization of a small dataset of different types of packet data, using MDS. The natural clusters forming are highlighted, and the partitioning between different packet types is clearly visible.

Fifteen labelled packet samples are chosen from a small dataset comprising 66 labelled worm and benign packets (discussed in experiment 2 of the previous chapter). To determine how similar or dissimilar each packet is to the others, a pair-wise n×n similarity matrix of these 15 packets is created, using the minimum of NCD and Spamsum, min(NCD, Spamsum_rev), as the similarity metric. This similarity matrix is a matrix of scores that represents how close or far away in space each point in the dataset is to every other point. Applying MDS to this small matrix and visualizing the result produces the graphic shown in Figure 5.2. Clusters formed between similar packets are highlighted in this figure. The empty space between the different types of packets expresses the partitioning of this type of data. This analysis serves as a preface, indicating that clustering techniques from machine learning can be applied to such data to exploit the inherent grouping that exists within it.
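A pair-wise matrix of this kind can be sketched as follows. The example covers only the NCD component, using the standard NCD formula with zlib as the compressor; the choice of zlib is an assumption, Spamsum is omitted because it is not part of the Python standard library, and the helper names are hypothetical.

```python
import zlib

def compressed_size(data: bytes) -> int:
    # zlib at maximum compression stands in for the compressor C (assumption).
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def pairwise_matrix(samples, metric):
    """Build the n x n matrix of metric scores between all pairs of samples."""
    n = len(samples)
    return [[metric(samples[i], samples[j]) for j in range(n)] for i in range(n)]
```

With a real compressor the score of a sample against itself is close to, but not exactly, zero, and the matrix is only approximately symmetric.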

5.4.2 Experimental Objectives

Identify the best clustering algorithm. (Discussed in section 5.4.4)

Identify the optimal cut-off or threshold value. (Discussed in section 5.5.4)

Evaluate the correctness of the classifiers using confusion matrix and ROC curves. (Discussed in section 5.5.5)

Finally, based on the above results, identify the best similarity metric. (Discussed in section 5.5.5)

5.4.3 Preparing Test Datasets for experiments

For one of our experiments, ten small datasets were extracted from a large dataset. Each dataset contained 500 malicious or benign streams picked at random from the large dataset presented in chapter 3. These samples comprise the raw stream payloads, which are referred to as stream or network profiles. Each labelled network profile is a text file named in the format “c.a-the md5sum of the stream”, where ‘c’ is the class and ‘a’ is the subclass of the labelled stream. Five hundred such profiles were created using this methodology, one for each stream in the dataset.
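The naming format can be illustrated with a short sketch. The helper names and the example class and subclass labels are hypothetical; only the “c.a-md5sum” layout comes from the text, and the sketch assumes the class label itself contains no full stop.

```python
import hashlib

def profile_name(cls: str, subclass: str, payload: bytes) -> str:
    """Build a profile file name of the form 'c.a-<md5 of the stream payload>'."""
    return f"{cls}.{subclass}-{hashlib.md5(payload).hexdigest()}"

def parse_profile_name(name: str):
    """Recover (class, subclass, md5 digest) from a profile file name."""
    label, _, digest = name.rpartition("-")   # the digest follows the last hyphen
    cls, _, subclass = label.partition(".")   # class and subclass split on the first dot
    return cls, subclass, digest
```

Splitting on the last hyphen keeps the parser working even if a subclass label contains a hyphen itself.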

To measure the relative similarity between these packet profiles, the following metrics were used: NCD, Levenshtein distance, Hamming distance, Jaro distance, Spamsum and Hybrid. Edit distance metrics were applied to the raw profiles, as discussed in the previous chapter. Information-theoretic measures such as NCD and Spamsum with reverse string were also discussed in the previous chapter. For each metric, a pairwise similarity score was calculated across the dataset, resulting in 500×500 combinations per instance.
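As an example of one of the edit distance metrics, the sketch below computes the Levenshtein distance with the usual dynamic programming recurrence. Dividing by the length of the longer string to obtain a score in [0, 1] is an assumption made for the example, not necessarily the normalisation used in the experiments.

```python
def levenshtein(a, b) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def normalised_levenshtein(a, b) -> float:
    """Scale the edit distance into [0, 1] by the longer length (assumption)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

Each comparison costs time proportional to the product of the two profile lengths, which is one reason a full 500×500 score matrix is expensive to compute.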

Both the string metric calculations and fuzzy hashing of the payload are computationally expensive operations that severely affect the performance of the system, so pre-processing, automated analysis and classification are performed offline. For this experiment, the system is designed to start processing the buffer once it holds 500 streams or after waiting n seconds, whichever triggers first.
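The buffer trigger can be sketched as below. The class name StreamBuffer, its parameters and the default of 30 seconds are hypothetical, and the sketch assumes that the n-second wait is measured from the arrival of the first buffered stream, which the text does not specify.

```python
import time

class StreamBuffer:
    """Collect streams until a size limit or a time limit is reached."""

    def __init__(self, max_streams=500, max_wait=30.0, clock=time.monotonic):
        self.max_streams = max_streams
        self.max_wait = max_wait      # the 'n seconds' of the text (value assumed)
        self.clock = clock            # injectable clock, convenient for testing
        self.items = []
        self.started = None

    def add(self, stream):
        if not self.items:
            self.started = self.clock()  # the wait starts with the first stream
        self.items.append(stream)

    def ready(self) -> bool:
        """True once either trigger has fired; an empty buffer is never ready."""
        if not self.items:
            return False
        full = len(self.items) >= self.max_streams
        timed_out = self.clock() - self.started >= self.max_wait
        return full or timed_out

    def drain(self):
        """Hand the buffered batch over for offline processing and reset."""
        batch, self.items = self.items, []
        self.started = None
        return batch
```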

5.4.4 Experiment 1: Determination of the best Clustering and Classification algorithm

The similarity scores obtained via the similarity metrics are clustered in order to group similar items together. Various clustering techniques are available in the literature. We chose the following based on their relevance to our domain. These clustering techniques are then tested with different similarity metrics, and their performance is observed and compared. Some visualizations in this section use a small dataset of 100 packets extracted from malicious classes (such as MSSQL SLAMMER, CONFICKER, ZEUS, SADMIND and SMB ATTACK) and benign classes (such as POP3, SMTP, HTTP and SMB); later visualizations are created using 500 packets.

Minimum Quartet Tree Cost (MQTC)

To construct a tree from a pair-wise distance matrix of similarity scores between objects, we use the tools provided by the freely available CompLearn toolkit (Cilibrasi, 2003). This tool makes use of a heuristic to implement the quartet method. The heuristic is called the standardized benefit score S(t). The problem addressed by the quartet method is the Minimum Quartet Tree Cost problem, which is an NP-hard graph optimization problem. A visualization produced using the CompLearn toolkit on a dataset of 100 packets is shown in Figure 5.3.

Limitations: While effective, minimum quartet tree cost (MQTC)-based clustering (Cilibrasi and Vitányi, 2005) was observed to become highly inefficient (very slow to produce a result) as the dataset size increases (Abbasi and Harris, 2010). For a small test set of 200 samples, the MQTC algorithm took 26 hours to process the data and produce a result. In our analysis, other clustering methods were investigated and compared with MQTC.

Unweighted-Pair Group Method with Arithmetic Mean or (UPGMA)

UPGMA is an agglomerative hierarchical clustering method used heavily in bioinformatics and very well known for the creation of phenetic or phylogenetic trees. The motivation behind using UPGMA was to test its ability to create phenetic trees of labelled variations of malicious streams. Based on past observations (Abbasi and Harris, 2010), we set the threshold value to 0.55 and ran the system to cluster a relatively small dataset of 100 packets. The results were quite encouraging, as can be seen in Figure 5.4. The rectangular boxes show the clusters, the dark oval nodes represent similarity and dissimilarity percentages, and the light coloured flat oval nodes represent the packets. In a very short time, UPGMA was able to correctly cluster the dataset according to its labels.
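The agglomerative procedure can be sketched as follows. This is a naive illustration and not the implementation used in this work; in particular, stopping the merging once the smallest average distance exceeds the threshold is an assumption about how the threshold value is applied, and the function name is hypothetical.

```python
from itertools import combinations

def upgma_clusters(dist, threshold):
    """Average-linkage (UPGMA) clustering, cut at a distance threshold."""
    clusters = [[i] for i in range(len(dist))]  # start with singleton clusters

    def average_distance(a, b):
        # UPGMA: arithmetic mean of all pairwise distances between two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        x, y = min(combinations(range(len(clusters)), 2),
                   key=lambda p: average_distance(clusters[p[0]], clusters[p[1]]))
        if average_distance(clusters[x], clusters[y]) > threshold:
            break  # the closest remaining pair is too far apart to merge
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)
```

Recomputing every pairwise average on each pass keeps the sketch short but scales poorly, which is in line with the slowdown on larger datasets noted for this method.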

Figure 5.3: MQTC on a small mixed dataset of 100 benign and worm packets. The algorithm took a long time to calculate a stable tree even for such a small dataset. The clusters are highlighted in this figure.

Limitations: UPGMA is much faster than MQTC but slows down considerably as the dataset size increases. UPGMA took over 10 hours to process a dataset of 1000 samples.

k-Nearest Neighbour algorithm (k-NN)

k-NN is one of the simplest machine learning algorithms. It classifies an object by a majority vote of its k nearest neighbours. However, determining the value of k beforehand is a difficult and computationally expensive task. We tested our system with k = 1, 3 and 5.
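The majority vote can be sketched as follows, assuming the distances from the query to every labelled sample have already been computed with one of the similarity metrics. The function name and the label strings in the usage example are hypothetical.

```python
from collections import Counter

def knn_classify(distances, labels, k):
    """Label a query by majority vote among its k nearest labelled samples.

    distances[i] is the distance from the query to sample i,
    and labels[i] is the class label of sample i.
    """
    nearest = sorted(range(len(labels)), key=lambda i: distances[i])[:k]
    votes = Counter(labels[i] for i in nearest)
    # On a tied vote, the label encountered first (the nearer sample) wins.
    return votes.most_common(1)[0][0]

distances = [0.1, 0.2, 0.8, 0.9, 0.3]
labels = ["worm", "worm", "benign", "benign", "benign"]
for k in (1, 3, 5):
    print(k, knn_classify(distances, labels, k))
```

In this toy example the predicted label changes between k = 3 and k = 5, which illustrates why the choice of k matters.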

Limitations: k-NN is much faster than both MQTC and UPGMA. It was able to process 1000 samples in under 15 minutes. However, determining an optimal value for k is difficult, and the algorithm is prone to a high number of false positives.

Figure 5.4: UPGMA on a small dataset @ T = 0.55. The rectangular boxes show the clusters, the dark oval nodes represent similarity and dissimilarity percentages, and the light coloured flat oval nodes represent the packets. In a very short span of time (less than a minute), UPGMA was able to correctly cluster the dataset according to its labels. UPGMA results were visualized using a modified version of Ero Carrera's script.

Threshold-based nearest neighbour or (T-NN)

T-NN is a variant of the nearest neighbour algorithm proposed for our domain. It is based on a simple linear search. For a dataset of N data points, we want to find all points in the metric space whose distance to a reference point is below a threshold value, which is supplied as a parameter at the start of the algorithm. These points are considered the nearest or closest to the reference point. For any query point, the algorithm computes the distance to every other point in the dataset and adds those below the threshold to the same cluster. The threshold therefore defines the radius of the cluster. The optimal threshold value will create clusters comprising only those points that are similar to the reference point, resulting in a perfect classification in which each cluster contains data points from a single class. Determining the optimal threshold value is a challenge that directly affects the outcome of the classifier.
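The linear search described above can be sketched as follows, assuming a precomputed pair-wise distance matrix. The function name tnn_cluster is hypothetical.

```python
def tnn_cluster(dist, query, threshold):
    """Return the query index plus every point closer to it than the threshold.

    dist is an n x n pair-wise distance matrix; the threshold acts as the
    radius of the cluster around the query point.
    """
    cluster = [query]
    for other, d in enumerate(dist[query]):  # one linear pass over the dataset
        if other != query and d < threshold:
            cluster.append(other)
    return cluster
```

Each query costs a single pass over one row of the matrix. A threshold that is too large pulls points of other classes into the cluster, while one that is too small splits a class across several clusters.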

Limitations: Determining an optimal threshold value is a challenge.

While determining the threshold for T-NN and the value of k for k-NN might be difficult, they have shown some promising results.
