
One of the best-known algorithms for handling concept drift is FLORA2 by Widmer and Kubat [21]. The classification rules in this algorithm are built with respect to a fixed-size window of the latest examples. These rules are updated on the arrival of new examples and on drift detection. Drift is detected by monitoring the prediction accuracy of the classifier and the rate at which new classification rules are formulated. Whenever a drift is detected, the size of the current window is decreased by deleting the oldest 20% of its examples. To address concept recurrence, an extended version of FLORA2 considers the possibility of reusing existing classification rules. The basic idea is to store stable classification rules for future use. Hence, whenever a change is suspected and the window is decreased in size, all stored classification rules are retrieved and compared. Previous classification rules are reused if they match the labelled examples in the current window more accurately than the current classification rules do.

Chapter 2 Literature Review 23

Other drift detection approaches proposed in the literature mainly depend on monitoring the performance of the classification model and on detecting changes in the data distribution. Examples of these approaches are presented below.

The drift detection method (DDM) proposed by Gama et al. [6] is one of the most widely used methods for change detection. The main idea behind this approach is that if the distribution of the incoming examples is stationary, the probability of making an error will decrease, or at least stabilize, as more examples become available (assuming the classifier is constantly updated with these new examples). A significant increase in this probability indicates that the distribution generating the examples has changed, signalling a drift. In particular, the error on each incoming example is assumed to be a random variable from Bernoulli trials, so that the number of misclassified examples, denoted as $E$, follows a Binomial distribution. Given this, the probability of making an error, denoted as $p_i$, and the associated standard deviation, denoted as $s_i$, are computed incrementally for each incoming example $i$ according to the following formulas: $p_i = E/i$ and $s_i = \sqrt{p_i(1-p_i)/i}$. The learner's performance at example $i$ corresponds to $p_i + s_i$, which is compared against two registers: a warning threshold computed as $p_{min} + 2 s_{min}$, and a drift threshold computed as $p_{min} + 3 s_{min}$. Here, $p_{min}$ and $s_{min}$ are the minimum error probability and minimum standard deviation, respectively, among the examples observed so far, obtained in the course of calculating $p_i$ and $s_i$ for the incoming examples. Based on this, if $p_i + s_i < p_{min} + 2 s_{min}$ the classifier is in the normal state, while if $p_i + s_i \geq p_{min} + 3 s_{min}$ the classifier is in the drift state. If $p_i + s_i$ lies between these two levels, a warning state is reached that requires more examples to either confirm the drift (the error rate increases to the drift level) or return to the normal state (the error rate decreases to the normal level). The ability of DDM to signal warning and drift states during model operation is an important property for data selection when adapting the classification model to changes: once the drift is confirmed, all the examples between the warning and drift states are used for adapting the model, which simplifies the data selection task. DDM is utilised in our analysis for the proposed hybrid learner, as will be explained in Chapter 6.
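The three DDM states can be illustrated with a minimal sketch. It assumes the error estimate $p_i = E/i$ with $s_i = \sqrt{p_i(1-p_i)/i}$, a warm-up of 30 examples before any test is made, and a synthetic error stream; the class name `DDM`, the `min_examples` parameter and the stream are illustrative choices, not taken from [6].

```python
import math


class DDM:
    """Sketch of DDM: compares p_i + s_i with the minimum recorded so far."""

    def __init__(self, min_examples=30):
        self.min_examples = min_examples  # assumed warm-up length
        self.i = 0                        # examples seen so far
        self.errors = 0                   # misclassified examples, E
        self.p_min = math.inf
        self.s_min = math.inf

    def update(self, error):
        """error is 1 for a misclassification, 0 otherwise; returns the state."""
        self.i += 1
        self.errors += error
        p = self.errors / self.i
        s = math.sqrt(p * (1 - p) / self.i)
        # Guard (an assumption of this sketch): no test during warm-up or
        # before the first error, when both p and s are still zero.
        if self.i < self.min_examples or self.errors == 0:
            return "normal"
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        if p + s >= self.p_min + 3 * self.s_min:
            return "drift"
        if p + s >= self.p_min + 2 * self.s_min:
            return "warning"
        return "normal"


# Synthetic error stream: 10% errors for 200 examples, then 50% errors.
stream = [int(i % 10 == 0) for i in range(1, 201)]
stream += [int(i % 2 == 0) for i in range(201, 401)]

detector = DDM()
states = []
for error in stream:
    states.append(detector.update(error))
    if states[-1] == "drift":
        break  # a learner would now retrain on the examples since the warning
```

On this stream the detector stays in the normal state during the stationary phase, passes through the warning state once the error rate rises, and then signals a drift.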

A similar approach, oriented towards gradual drift, is the early drift detection method (EDDM) presented by Garcia et al. [7]. Instead of monitoring the error rate of the classifier to detect drift, EDDM monitors the distance between two classification errors (i.e. the number of correctly classified examples between two consecutive classification errors). This is based on the idea that if the distribution is stationary, the predictions of the learning algorithm will improve during learning, and the distance between subsequent errors will increase. The average distance between two consecutive classification errors ($p'_i$) and its standard deviation ($s'_i$) are computed for each misclassified example and compared to the maximum distance and standard deviation ($p'_{max}$ and $s'_{max}$) registered so far during the learning process. The warning and drift states are defined (once at least 30 errors have occurred) in terms of the following conditions: $(p'_i + 2s'_i)/(p'_{max} + 2s'_{max}) < \alpha$ for the warning state, and $(p'_i + 2s'_i)/(p'_{max} + 2s'_{max}) < \beta$ for the drift state. Both $\alpha$ and $\beta$ are predefined thresholds.
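The EDDM ratio test can be sketched as follows. The sketch assumes thresholds of 0.95 for the warning level and 0.90 for the drift level (illustrative values, not given in the text), measures the distance as the number of examples since the previous error (correct examples plus one, which avoids a zero denominator), and uses a running mean and variance. All names are hypothetical.

```python
import math


class EDDM:
    """Sketch of EDDM: monitors the distance between consecutive errors."""

    def __init__(self, alpha=0.95, beta=0.90, min_errors=30):
        self.alpha, self.beta = alpha, beta  # assumed threshold values
        self.min_errors = min_errors
        self.i = 0               # examples seen so far
        self.last_error_at = 0   # index of the previous error
        self.n_errors = 0
        self.mean = 0.0          # running mean of the distances
        self.m2 = 0.0            # running sum of squared deviations
        self.best = 0.0          # maximum of mean + 2 * std so far
        self.state = "normal"

    def update(self, error):
        self.i += 1
        if not error:
            return self.state    # the statistics only change on errors
        distance = self.i - self.last_error_at
        self.last_error_at = self.i
        self.n_errors += 1
        delta = distance - self.mean
        self.mean += delta / self.n_errors
        self.m2 += delta * (distance - self.mean)
        level = self.mean + 2 * math.sqrt(self.m2 / self.n_errors)
        self.best = max(self.best, level)
        ratio = level / self.best
        if self.n_errors < self.min_errors:
            self.state = "normal"
        elif ratio < self.beta:
            self.state = "drift"
        elif ratio < self.alpha:
            self.state = "warning"
        else:
            self.state = "normal"
        return self.state


# Synthetic error stream: one error in 10 for 300 examples, then one in 2.
stream = [int(i % 10 == 0) for i in range(1, 301)]
stream += [int(i % 2 == 0) for i in range(301, 501)]

detector = EDDM()
states = []
for error in stream:
    states.append(detector.update(error))
    if states[-1] == "drift":
        break
```

Because the standard deviation rises when the errors first become more frequent, the monitored level initially increases before it falls below the two thresholds, so the detection is delayed by a few dozen errors in this example.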

Another accuracy-based approach is presented in [3], which depends on monitoring the number of misclassified examples for the currently active classifier. For this purpose, a binary variable is introduced, where 0 represents a correctly classified example and 1 represents an incorrect classification. The values of this variable are monitored over time with respect to a sliding window by applying Shannon's entropy [24]. The entropy measure is computed incrementally and compared to its maximum theoretically possible value (in this case 1). Whenever it reaches this value, a drift is signalled.
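Read literally, the entropy test above can be sketched as below. This is only an illustration of the paragraph, not the procedure of [3]: the window length, the tolerance `tol` and the function names are assumptions, and the entropy is recomputed from a running count of errors in the window.

```python
import math
from collections import deque


def binary_entropy(p):
    """Shannon entropy (in bits) of a binary variable with P(1) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def entropy_drift_points(errors, window=20, tol=1e-9):
    """Return the positions at which the windowed entropy reaches its maximum of 1."""
    recent = deque(maxlen=window)
    ones = 0      # running count of errors inside the window
    hits = []
    for t, e in enumerate(errors):
        if len(recent) == window:
            ones -= recent[0]   # the oldest value is about to be evicted
        recent.append(e)
        ones += e
        if len(recent) == window and binary_entropy(ones / window) >= 1.0 - tol:
            hits.append(t)
    return hits


# 40 correct predictions followed by alternating errors.
errors = [0] * 40 + [1, 0] * 20
hits = entropy_drift_points(errors)
```

The entropy of a binary variable is maximal when both outcomes are equally frequent, so under this reading a drift is signalled once half of the predictions in the window are wrong.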

Other approaches for change detection depend on comparing the accuracy of two classifiers with respect to two different windows: a sliding window with the most recent examples and a reference window with older examples. For instance, Bach and Maloof [68] propose a learning algorithm that combines two classifiers: one classifier (referred to as the stable learner) is continuously updated with the data arriving since the last drift detection point, and the second classifier (referred to as the reactive learner) is trained using a sliding window of a fixed size. The former is used to perform the classification task, while the latter is used to signal a change in the target concept. On arrival of each new example, both classifiers predict its class and are then compared in terms of the frequency with which the stable learner misclassifies examples that the reactive learner classifies correctly. If this frequency exceeds a predefined threshold, a drift is detected and the stable learner is replaced by the reactive learner for performing classification. Other examples utilising the accuracy of two classifiers to detect changes can be found in [82, 100].
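The interplay between the stable and reactive learners can be sketched with a deliberately trivial base learner. The sketch assumes that both learners simply predict the majority label of their training data, and that the disagreement frequency is measured over the last `w` examples; `w`, `theta` and all function names are hypothetical, not taken from [68].

```python
from collections import Counter, deque


def majority(labels, default=0):
    """Toy learner: predict the most frequent label seen so far."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def paired_learner(stream, w=10, theta=0.5):
    """Return the positions at which the stable learner is replaced."""
    stable = []                       # labels seen since the last drift
    reactive = deque(maxlen=w)        # labels in the sliding window
    disagreements = deque(maxlen=w)   # 1 when stable is wrong and reactive is right
    drift_points = []
    for t, y in enumerate(stream):
        s_pred, r_pred = majority(stable), majority(reactive)
        disagreements.append(int(s_pred != y and r_pred == y))
        if sum(disagreements) / w > theta:
            drift_points.append(t)
            stable = list(reactive)   # the reactive learner takes over
            disagreements.clear()
        stable.append(y)
        reactive.append(y)
    return drift_points


# The label changes abruptly from 0 to 1 after 50 examples.
stream = [0] * 50 + [1] * 50
drift_points = paired_learner(stream)
```

The reactive learner adapts within a few examples of the change because its window is short, while the stable learner keeps predicting the old label; the resulting disagreements accumulate until the threshold is crossed.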

Moreover, rather than monitoring the classifier accuracy, some detection methods rely on identifying changes in probability distributions. Various approaches are utilised to detect such changes. For example, in [100], the change point is detected by examining each time step for a possible change using the multivariate Hotelling $T^2$ test. This test compares the mean of each class sample before and after the potential change point (the examined time step), and the returned p-value of the test (which corresponds to the probability that no change has occurred) is used to compute the probability of a change at that point, assuming that the mean of each target class changes independently. Another commonly used test for concept change detection is the Page-Hinkley test (PHT) [81]. According to this test, a cumulative variable, denoted as $m_T$, is computed at each time step by taking the difference between the observed values and their mean. In particular, at time step $T$, $m_T$ is computed as follows:

$$m_T = \sum_{t=1}^{T} (x_t - \bar{x}_T - \sigma)$$

where $\bar{x}_T = \frac{1}{T} \sum_{t=1}^{T} x_t$, and $\sigma$ is the magnitude of allowed changes. The value of the test is computed as $PHT = m_T - M_T$, where $M_T = \min(m_t, t = 1 \ldots T)$. If this difference exceeds a predefined threshold, a change in the distribution is detected.
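The Page-Hinkley test can be sketched as follows. The sketch assumes the usual incremental form, in which each term of the cumulative sum subtracts the running mean of the values observed up to that step; the values chosen for the allowed change `sigma` and for `threshold` are illustrative only.

```python
def page_hinkley(xs, sigma=0.05, threshold=5.0):
    """Return the 1-based time step at which a change is detected, or None."""
    mean = 0.0              # running mean of the observations
    m = 0.0                 # cumulative variable m_T
    m_min = float("inf")    # M_T, the minimum of m_t so far
    for t, x in enumerate(xs, start=1):
        mean += (x - mean) / t
        m += x - mean - sigma
        m_min = min(m_min, m)
        if m - m_min > threshold:
            return t
    return None


# The observed value jumps from 0 to 1 after 100 steps.
xs = [0.0] * 100 + [1.0] * 100
change_at = page_hinkley(xs)
```

While the stream is stationary the cumulative variable keeps setting new minima, so the test value stays at zero; after the jump it grows by almost one unit per step until it crosses the threshold.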

In the adaptive windowing algorithm (ADWIN) proposed in [69], changes in the distribution are detected by comparing the means of the classification errors between all possible sub-windows of a fixed-size sliding window of recent examples. Whenever two sufficiently large sub-windows have sufficiently distinct means (as determined by the Hoeffding bound), a change is detected and the older sub-window is deleted from the window. A similar idea to ADWIN, but more expensive in terms of time and memory, is introduced in [70], where data is assumed to arrive in batches of examples. This window resizing approach depends on training the classifier on all possible window sizes, starting from the most recent batch of data and increasing the window size by one batch at a time, so that the largest possible window contains all available historical data. The optimal window size is chosen by estimating the generalization error (a classifier-specific leave-one-out estimator of the error rate) of each window on the last batch of data, so that the window size that achieves the minimum error is used to classify the next batch of incoming examples.
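The sub-window comparison in ADWIN can be sketched naively by testing every split point of the window. The sketch assumes the Hoeffding-style cut threshold $\epsilon = \sqrt{\ln(4n/\delta)/(2m)}$, where $n$ is the window length and $m$ is the harmonic mean of the two sub-window lengths; the confidence `delta`, the minimum sub-window length `min_sub` and the unbounded window are assumptions of this illustration, which also omits the compressed bucket structure that efficient implementations rely on.

```python
import math


def adwin_step(window, delta=0.002, min_sub=5):
    """Drop the older sub-window while some split shows distinct means.

    Modifies `window` (a list of 0/1 errors) in place; returns True if it shrank.
    """
    changed = False
    shrunk = True
    while shrunk and len(window) >= 2 * min_sub:
        shrunk = False
        n = len(window)
        total = sum(window)
        head = 0.0
        for split in range(1, n):
            head += window[split - 1]
            n0, n1 = split, n - split
            if n0 < min_sub or n1 < min_sub:
                continue
            m = 1.0 / (1.0 / n0 + 1.0 / n1)
            eps = math.sqrt(math.log(4.0 * n / delta) / (2.0 * m))
            if abs(head / n0 - (total - head) / n1) > eps:
                del window[:split]      # forget the older sub-window
                shrunk = changed = True
                break
    return changed


# The error rate jumps from 0 to 1 after 100 examples.
window, detections = [], []
for t, error in enumerate([0] * 100 + [1] * 100):
    window.append(error)
    if adwin_step(window):
        detections.append(t)
```

After the change, the examples from before the jump are discarded within a few steps, and the window then grows again with examples from the new distribution only.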

In some detection approaches, changes in the distributions are measured with respect to two different windows: a window with recent data examples and a window with older examples. To measure the difference, a number of methods are utilised, such as the Kullback-Leibler divergence [17] and entropy [83]. If the value of the chosen measure exceeds a predefined threshold, a drift is detected.
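A two-window comparison based on the Kullback-Leibler divergence can be sketched for discrete values as follows. The sketch assumes add-one smoothing so that values missing from one window do not produce a division by zero, and a threshold of 0.5; both choices, like the function name, are illustrative and not taken from [17].

```python
import math
from collections import Counter


def kl_divergence(recent, reference, smoothing=1.0):
    """KL divergence (in nats) of the recent window from the reference window."""
    support = sorted(set(recent) | set(reference))
    p_counts, q_counts = Counter(recent), Counter(reference)
    p_total = len(recent) + smoothing * len(support)
    q_total = len(reference) + smoothing * len(support)
    kl = 0.0
    for value in support:
        p = (p_counts[value] + smoothing) / p_total
        q = (q_counts[value] + smoothing) / q_total
        kl += p * math.log(p / q)
    return kl


reference = ["a"] * 90 + ["b"] * 10   # older examples
recent = ["a"] * 10 + ["b"] * 90      # most recent examples
threshold = 0.5                       # assumed drift threshold
drift = kl_divergence(recent, reference) > threshold
```

For continuous attributes the windows would first have to be discretised, for example into histogram bins, before such a comparison can be applied.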

The main advantage of the approaches described above is that they eliminate unnecessary reassessment of data relevance (e.g. continuously re-weighting data examples even when no drift has actually occurred). Moreover, identifying the change point, even approximately, provides a better indication of the most relevant data for adjusting the classifier. For example, if an abrupt change occurs, the only data that should be considered is that which comes after the change point, since all previous data (before the change) becomes irrelevant. However, these approaches mainly depend on measuring the accuracy of the current classifier to detect changes, and hence suffer from performance degradation. Moreover, as with continuous adaptation approaches, newer data is favoured in adapting to changes, while older data is eventually forgotten without accounting for recurrence in concepts. The contextual conditions under which the data observations were collected are also not considered, although these could change the importance of an example to the current concept. Current efforts towards incorporating such contextual information during the adaptation process are presented next.
