
Our TCP monitoring agent is installed at each machine's hypervisor or within individual VMs. It only needs to be installed on client machines that communicate with remote services within or across data centers. For example, if the remote service is storage, there is no need to run the agent on the storage servers. The agent collects TCP statistics for all connections seen on its monitored node. Because our implementation is based on Windows, we describe the agent using Windows terminology; the same statistics can also be collected on a Linux-based system.

The agent is implemented using Windows ETW events [91], a publish-subscribe messaging system in the Windows OS. A TCP ETW event is triggered every time a TCP-related event (e.g., the arrival of a duplicate ACK) occurs on any of the connections currently active in the OS. The agent collects and aggregates events at the granularity of epochs to minimize bandwidth and storage overhead during training. Within every epoch, it receives ETW events, extracts the relevant features, and stores them in a hash table keyed on the TCP 5-tuple. At the end of an epoch, the TCP metrics that depend on the transmission rate are normalized by the number of bytes posted by the application in that epoch. The normalized metrics are marked in Table 2.2. Each individual metric is then further aggregated by calculating its mean, standard deviation, min, max, and 10th, 50th, and 95th percentiles across all TCP connections going to the same destination IP/port.

| Metric | Statistics calculated | Abbreviation |
|---|---|---|
| Number of flows | R | NumFlows |
| Maximum congestion window in δ | S | MCWND |
| The change in congestion window in δ | S(*) | DCWND |
| The last congestion window observed in δ | S(*) | LCWND |
| The last advertised (remote) receive window observed in δ | S(*) | LRWND |
| The change in (remote) receive window observed in δ | S(*) | DRWND |
| Maximum smooth RTT estimate observed in δ | S(*) | MRTT |
| Sum of the smooth RTT estimates observed in δ | S(*) | SumRTT |
| Number of smooth RTT estimates observed in δ | S(*) | NumRTT |
| Duration in which connection has been open | S | Duration |
| Fraction of open connections | R | FracOpen |
| Fraction of connections closed | R | FracClosed |
| Fraction of connections newly opened | R | FracNew |
| Number of duplicate ACKs | S(*) | DupAcks |
| Number of triple duplicate ACKs | S(*) | TDupAcks |
| Number of timeouts | S(*) | Timeouts |
| Number of resets | S(*) | RSTs |
| Time spent in zero window probing | S | Probing |
| Error codes observed by the socket | R | Error Code |
| Number of bytes posted by the application | S | BPosted |
| Number of bytes sent by TCP | S | BSent |
| Number of bytes received by TCP | S(*) | BReceived |
| Number of bytes delivered to the application | S(*) | BDelivered |
| Ratio of the number of bytes posted by the application to the number of bytes sent | S | BPostedToBSent |
| Ratio of the number of bytes received by TCP to the number of bytes delivered | S | BReceivedToBDelivered |

Table 2.2: Features captured by the monitoring agent during each epoch. R indicates that the raw value of a feature is captured, and S indicates that we capture the statistics of that feature. (*) indicates normalized metrics.
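The per-epoch aggregation described above can be sketched as follows. This is a minimal illustration, not the agent's actual code: the input record layout, the `NORMALIZED` subset, the nearest-rank percentile definition, and the choice to skip normalized metrics for connections that posted zero bytes are all assumptions made for the example.

```python
import math
import statistics
from collections import defaultdict

# Assumption: an illustrative subset of the (*) metrics in Table 2.2.
NORMALIZED = {"DupAcks", "Timeouts"}


def percentile(sorted_vals, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty sorted list."""
    rank = max(1, math.ceil(q / 100 * len(sorted_vals)))
    return sorted_vals[rank - 1]


def summarize(values):
    """The seven summary statistics computed for each metric."""
    vals = sorted(values)
    return {
        "mean": statistics.fmean(vals),
        "std": statistics.pstdev(vals),
        "min": vals[0],
        "max": vals[-1],
        "p10": percentile(vals, 10),
        "p50": percentile(vals, 50),
        "p95": percentile(vals, 95),
    }


def aggregate_epoch(connections):
    """Aggregate per-connection metrics by destination (IP, port).

    Each connection is a dict with hypothetical keys:
    "dst" (ip, port), "BPosted" (bytes posted), "metrics" (name -> value).
    """
    per_dest = defaultdict(lambda: defaultdict(list))
    for conn in connections:
        posted = conn["BPosted"]
        for name, value in conn["metrics"].items():
            if name in NORMALIZED:
                if posted == 0:
                    continue  # assumption: the source does not say how this case is handled
                value = value / posted
            per_dest[conn["dst"]][name].append(value)
    return {
        dst: {name: summarize(vals) for name, vals in metrics.items()}
        for dst, metrics in per_dest.items()
    }
```

Because only the summary statistics per destination leave the machine, the amount of data sent per epoch does not grow with the number of connections.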

We assume that the failures occurring within a single epoch are identical: if a connection experiences failure A, then all other connections between the same endpoints in the same epoch either experience no failure or also experience failure A. The epoch duration therefore needs to be tuned carefully. Small epochs increase monitoring overhead, whereas large epochs run the risk that sporadic failures of different types occur within one epoch, which reduces the accuracy of the learning algorithm. We currently use an epoch of 30 s. Fine-tuning the epoch duration is part of our future work.

Table 2.2 shows the features maintained within an epoch by the monitoring agent. Our aggregation method reduces the bandwidth required on the machines in the training stage² and has the added benefit of hiding the client's exact transmission patterns. Furthermore, when applications change their transmission pattern across connections in reaction to failures, aggregation allows this change to be detected. At the other extreme, one may decide to use per-connection statistics, which incur more overhead but make it possible to diagnose separately why each individual connection has failed.

The agent imposes low runtime overhead. Based on our benchmarks, even in the absence of aggregation, when processing 500,000 events per second, each agent uses 4% CPU on an 8-core machine and less than 20 MB of memory.

2.4.2 Learning Agent

During the training phase, the learning agent takes as input TCP metrics gathered by monitoring agents on training nodes. At run time, it distributes the learned model to all clients to be used for diagnosis. The model has to quickly classify epochs with the appropriate labels to indicate whether it is a remote (Server), local (Client), or Network issue.

² Without aggregation, the client needs to transmit 31n features every epoch to the learning agent, where n is the number of connections during that epoch. With aggregation, this number is reduced to 130.

The learning agent uses decision trees as its classification model. In a decision tree, each internal node conducts a test on an attribute, each branch represents an outcome of the test, and the leaf nodes represent the class labels. The paths from root to leaf represent the classification rules.

In our setting, each internal node corresponds to one of the aggregated TCP metrics being monitored. The learning phase determines the structure of the decision tree, that is, the choice of attributes and the order in which they are tested along the path from the root to a label (this ordering is determined by the information gain of the features in the dataset). The specific nature of the test at each node, i.e., the inequality test, is also determined in this phase.

As noted by prior work [36, 6, 49], the structure of decision trees allows for further understanding of the attributes that identify each failure. For this reason, we found decision trees more attractive to use than other machine learning approaches. We will elaborate further on this in Section 2.5.

Fig. 2.2 shows an example of a decision tree that distinguishes packet reordering from normal data. Leaf colors in the figure represent the labels of the training data that ended up in those leaves. Most leaves are "pure", i.e., all the data in those leaves have the same label. Leaf 2 is an "impure" leaf that contains a mix of both labels. In such situations, the tree picks the majority label in the leaf as its diagnosis.
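To make the root-to-leaf walk and the majority rule for impure leaves concrete, here is a small sketch. The tree layout, feature names (`DupAcks_mean`, `MRTT_max`), thresholds, and label counts are hypothetical and are not taken from Fig. 2.2.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # Internal nodes set feature/threshold/left/right; leaves set label_counts.
    feature: Optional[str] = None
    threshold: float = 0.0
    left: Optional["Node"] = None    # followed when sample[feature] <= threshold
    right: Optional["Node"] = None   # followed otherwise
    label_counts: Optional[Counter] = None  # training labels that reached this leaf


def classify(node: Node, sample: dict) -> str:
    """Walk from the root to a leaf and return that leaf's majority label."""
    while node.label_counts is None:
        node = node.left if sample[node.feature] <= node.threshold else node.right
    return node.label_counts.most_common(1)[0][0]


# Hypothetical tree: names, thresholds, and counts are made up for illustration.
tree = Node(
    feature="DupAcks_mean", threshold=0.001,
    left=Node(label_counts=Counter({"Normal": 40})),                      # pure leaf
    right=Node(
        feature="MRTT_max", threshold=50.0,
        left=Node(label_counts=Counter({"Reordering": 7, "Normal": 3})),  # impure leaf
        right=Node(label_counts=Counter({"Normal": 12})),                 # pure leaf
    ),
)
```

A sample that reaches the impure leaf is diagnosed as reordering, because that label holds the majority among the training data stored there.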

Based on the concept of decision trees, our learning agent requires three enhancements for improved stability and accuracy:

Figure 2.2: Example tree. The white/black leaf colors illustrate the labels of the training data that end up in each leaf.

Random forests. Our learning agent uses an enhanced type of decision tree known as random forests [26]³. In random forests, multiple decision trees are generated from different subsets of the data, and the classification decision is majority-based, where the majority is defined by a cutoff fraction specified by the user. For example, a cutoff of (0.2, 0.8) indicates that for class 1 to be chosen as the label, at least 20% of the trees in the forest need to output 1 as the label. Random forests improve stability and accuracy in the face of differences in machine characteristics and outliers.
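The cutoff rule can be illustrated with a short sketch for the two-class case. The function name `forest_label` and its arguments are hypothetical; the sketch only covers the voting step and assumes the per-tree outputs are already available.

```python
def forest_label(tree_votes, positive, negative, cutoff):
    """Return `positive` if at least a `cutoff` fraction of trees voted for it.

    tree_votes: list with one predicted label per tree in the forest.
    cutoff: fraction in (0, 1]; 0.2 corresponds to the (0.2, 0.8) example.
    """
    fraction = sum(1 for vote in tree_votes if vote == positive) / len(tree_votes)
    return positive if fraction >= cutoff else negative


# Usage: 3 of 10 trees report a failure, which clears a 20% cutoff.
votes = ["Network"] * 3 + ["Normal"] * 7
label = forest_label(votes, positive="Network", negative="Normal", cutoff=0.2)
```

Lowering the cutoff for the failure class makes the forest more willing to report a failure on a minority of votes; raising it has the opposite effect.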

Multi-round classification. To improve accuracy, we classify in multiple rounds. First, a forest is trained to classify Network failures only; the Server and Client failures in the training set are labeled as non-faulty (Normal) in this phase. Next, the Network failure data is removed from the training set, and a new forest is trained to find Server failures, with Client failures labeled as Normal. Finally, the Server data is also removed and a forest is trained to identify Client failures. At run time, data is passed through the first forest; if it is classified as Network, the process terminates. If it is classified as Normal, it is passed through the second forest. Again, if it is classified as Server, the process terminates. If not, the data is passed through the third forest and is assigned a label of Normal or Client. In machine learning, such multi-round classifications are referred to as tournaments. In traditional tournaments, different decision trees are used in pair-wise competitions. Our tournament strategy is a modification of standard tournaments, which did not work well in our setting.
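The three rounds can be sketched as follows. Both function names and the representation of a forest as a plain callable returning a label are assumptions made for illustration; any classifier with that shape could be plugged in.

```python
def training_rounds(samples):
    """Yield (target, relabeled samples) for the Network, Server, and Client rounds.

    samples: list of (features, label) pairs with labels in
    {"Normal", "Network", "Server", "Client"}.
    """
    remaining = list(samples)
    for target in ("Network", "Server", "Client"):
        # Every label other than the current target is treated as Normal.
        yield target, [(x, y if y == target else "Normal") for x, y in remaining]
        # Failures handled in this round are removed before the next round.
        remaining = [(x, y) for x, y in remaining if y != target]


def diagnose(features, network_forest, server_forest, client_forest):
    """Run-time cascade: stop at the first forest that reports its failure class."""
    if network_forest(features) == "Network":
        return "Network"
    if server_forest(features) == "Server":
        return "Server"
    return client_forest(features)  # "Client" or "Normal"
```

Each forest only has to separate one failure class from Normal, which is a simpler problem than separating all four labels at once.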

Per-application training. Applications react to failures differently. One application may choose to open more TCP connections when its attempts on existing connections fail, while others may keep retrying on the ones currently open. Some form of normalization, such as the one we use in the monitoring agent, helps avoid dependence on the transmission rate of the client itself. However, it does not solve this particular problem, because the effects of application behavior go beyond the transmission rate and also influence the number of connections, their duration, etc. Indeed, these behaviors themselves improve NetPoirot's accuracy, as they provide more information about the failure. Hence, we advise training NetPoirot for each application separately. We argue that unless applications change drastically on a daily basis, there is sufficient time between major application code releases and deployments for the model to be updated.

Two-phase tree construction with cross-validation. Each forest is constructed in two phases. First, given the training set, we determine basic parameters of the forest, e.g., its cutoff value and a minimum number of data points required in a leaf node. The latter is required to bound the tree sizes and to avoid overfitting. Once these parameters are determined, the training set is used to create the actual forest.

One of the pitfalls of any machine learning algorithm is the danger of overfitting, where the trained model is tailored to explicitly fit the subsequent testing set. This leads to poor future predictive performance. To avoid overfitting, we apply a standard machine learning technique, namely a modified variant of cross-validation (CV). In a nutshell, the first phase is accomplished using N-fold [28] CV, which estimates error using subsets of the training data, while the second phase builds the model using all the training data.

In the classic form of N-fold CV, the training data is randomly divided into N subsets (folds). In each iteration, one fold i is omitted from the training set and the model is trained over the remaining N − 1 folds. The trained model is then tested using the i-th fold. The estimated errors of all iterations are averaged to provide an estimate of the model's accuracy. If used on our data set, however, N-fold cross-validation runs the danger of overestimating accuracy, because models learn specific machine/network characteristics when data from one machine leaks between folds. Therefore, we place the data from each machine into its own unique fold. We define CV error as the average cross-validation error when each fold contains data from a single machine.
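The per-machine fold construction can be sketched as follows. The sample layout (dicts with `machine` and `label` keys) and the `train`/`predict` callables are hypothetical placeholders for whatever model is being evaluated.

```python
from collections import defaultdict


def machine_folds(samples):
    """Yield (machine, train_indices, test_indices) with one fold per machine."""
    by_machine = defaultdict(list)
    for i, sample in enumerate(samples):
        by_machine[sample["machine"]].append(i)
    for machine, test_idx in by_machine.items():
        held_out = set(test_idx)
        train_idx = [i for i in range(len(samples)) if i not in held_out]
        yield machine, train_idx, test_idx


def cv_error(samples, train, predict):
    """Average per-fold error when each fold holds the data of a single machine.

    train(list_of_samples) -> model and predict(model, sample) -> label
    are supplied by the caller.
    """
    fold_errors = []
    for _, train_idx, test_idx in machine_folds(samples):
        model = train([samples[i] for i in train_idx])
        wrong = sum(predict(model, samples[i]) != samples[i]["label"] for i in test_idx)
        fold_errors.append(wrong / len(test_idx))
    return sum(fold_errors) / len(fold_errors)
```

Since no machine contributes data to both the training and the test side of a fold, the estimate reflects how the model behaves on machines it has never seen.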

To show why this is important, we tested cross-validation on our data set using both methods⁴. Using vanilla cross-validation, we observe an error of 1.5%. However, when we partition the data based on machine label, we get an error of 10.55%. This further indicates that if data from the same machine is used both for training and at run time, one may obtain much higher predictive accuracy than that reported in this paper.

Normalization. We normalize TCP statistics that depend on the data being sent. Namely, features marked with (*) in Table 2.2 are divided by the number of bytes posted by the application in that epoch in order to minimize dependency on the application's transmission pattern.
