The experiments done in the thesis use several data sets for each of the following six problems, created according to the generator: circle, sine moving vertically, sine moving horizontally, line (moving hyperplane with d = 1), plane (moving hyperplane with d = 2) and Boolean. Eight irrelevant attributes and 10% class noise were introduced in the plane data sets.
Each data set contains one drift and different drifts were simulated by varying among three amounts of severity (as shown in table 4.2) and three speeds, thus generating nine different drifts for each problem. The feature severity affects the same areas of the input space as the class severity for all the problems but plane and Boolean. So, we will refer to class severity simply as severity in the experiments. For plane and Boolean, there is no feature change. The speed was modelled by the following linear degree of dominance functions:
\[
v_n(t) = \frac{t - N}{\text{drifting time}}, \quad N < t \le N + \text{drifting time},
\]
and
\[
v_o(t) = 1 - v_n(t), \quad N < t \le N + \text{drifting time},
\]
where vn(t) and vo(t) are the degrees of dominance of the new and old concepts, respectively; t is the current time step; N is the number of time steps before the drift started to occur; and drifting time varied among 1, 0.25N and 0.50N time steps.
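As a minimal sketch of the degree of dominance functions defined above, the linear schedule could be written in Python as follows. The function names `dominance_new` and `dominance_old` are hypothetical and not taken from the thesis. Returning 1 for the new concept after the drifting interval is an assumption that follows from the linear ramp reaching 1 at t = N + drifting time.

```python
def dominance_new(t, n, drifting_time):
    """Degree of dominance v_n(t) of the new concept at time step t."""
    if t <= n:
        # Before the drift only the old concept is active (v_o(t) = 1).
        return 0.0
    if t <= n + drifting_time:
        # Linear ramp during the drifting interval N < t <= N + drifting_time.
        return (t - n) / drifting_time
    # Assumption: the new concept is fully dominant once the drift has ended.
    return 1.0


def dominance_old(t, n, drifting_time):
    """Degree of dominance v_o(t) = 1 - v_n(t) of the old concept."""
    return 1.0 - dominance_new(t, n, drifting_time)


if __name__ == "__main__":
    n = 1000  # value used for circle, sineV, sineH and line
    # The three speeds: drifting time of 1, 0.25N and 0.50N time steps.
    for drifting_time in (1, 0.25 * n, 0.50 * n):
        t = n + 1  # first time step after the drift starts
        print(drifting_time, dominance_new(t, n, drifting_time))
```

With a drifting time of 1 the new concept takes over completely at the first step after N, whereas with 0.25N or 0.50N the dominance rises gradually over the interval.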
The data sets are composed of 2N examples. The first N examples belong to the old concept (vo(t) = 1, 1 ≤ t ≤ N), where N = 1000 for circle, sineV, sineH and line and N = 500 for plane and Boolean. The next drifting time examples (N < t ≤ N + drifting time) were generated according to the degree of dominance functions, vn(t) and vo(t). The remaining examples belong to the new concept.
The range of x or xi was [0, 1] for circle, line and plane; [0, 10] for sineV; and [0, 4π] for
sineH. The range of y was [0, 1] for circle and line, [−10, 10] for sineV, [0, 10] for sineH and [0, 5] for plane. For plane and Boolean, the input attributes are normally distributed through the whole input space. For the other problems, the number of instances belonging to class one and zero is always the same, having the effect of changing the unconditional pdf when the drift occurs.
4.3 Summary and Discussion
The contributions of this chapter are a new concept drift categorisation and new artificial data sets representing several different types of drift.
The literature lacks a clear and systematic categorisation of different drifts. Currently, very few and highly heterogeneous categories are used, separating drifts according to only two criteria (speed and recurrence). The criterion speed used in the literature represents at the same time more than one feature of drifts, so that very different drifts can be considered of the same type. Besides, the literature does not consider the existence of certain features of drifts, such as frequency and predictability.
The main contribution of this chapter is a new concept drift categorisation. The categorisation divides drifts into different types, according to several different criteria: severity, speed, predictability, frequency and recurrence. Each criterion allows consistent characterization of drifts according to a specific feature, instead of representing different features at the same time. The use of the term “intermediate” to refer to concepts is not allowed. In this way, mutually exclusive and non-vague categories are created.
The categorisation, and in particular its quantifications, are mainly targeted at artificial data sets, as we usually cannot know exactly what types of drift are present in real world data sets. As explained in section 4.2, it is important to use not only real world, but also artificial data sets in studies of drift, so that we can know the types of drift on which the approaches behave better or worse. The use of artificial data sets not only provides a better understanding of the approaches’ behaviour and when they are likely to perform well, but also aids the proposal of solutions to improve their performance.
Chapters 5 and 6 further illustrate the usefulness of the categorisation experimentally. For instance, in section 5.3, we can observe that the impact of severity on the test error is very different from the impact of speed. So, different strategies can be developed in order to handle drifts with different amounts of severity and speed. An approach that is accurate across different amounts of severity may behave very differently across different amounts of speed. In the literature, the criteria severity and speed are mixed, which prevents proper evaluation of approaches for this case.
Another example of how chapter 6 illustrates the usefulness of the categorisation is when the KDD network intrusion detection data set (Hettich and Bay; 1999) is used. In this case, the high frequency of intercalated recurrent drifts allows approaches which are not prepared for recurrent drifts to achieve a good behaviour. A lower frequency would require additional strategies to deal with the recurrence. So, it is important to use not only the criterion recurrence, but also frequency, which is considered only in the categorisation proposed in this chapter.
The literature does not consider the criterion predictability either. This criterion has been shown to be useful in the dynamic optimisation problems area (Branke; 2002; Branke et al.; 2005) and should also be considered in the concept drift area. An example of its use is the prediction of changes in the number of times that a particular topic appears in a stream of documents (Araujo and Merelo; 2007). If a change can be predicted, appropriate measures can be designed for a certain approach to recover more quickly and/or be less affected by the change. As predictable and non-predictable drifts are not distinguished in the literature, it would not be possible to correctly evaluate approaches for this case. The proposed categorisation, on the other hand, would allow a proper evaluation.
The benchmarks used in the literature, such as SEA (Street and Kim; 2001) and STAGGER (Schlimmer and Granger Jr.; 1986) concepts, do not contain enough variety of drifts to allow principled and detailed studies. Based on existing benchmarks, a data set generator is presented in section 4.2.1. It can be used to generate all the types of drift from the proposed categorisation. New data sets simulating drifts with low, medium and high severity and speed were created and are presented in section 4.2.2. They allow more detailed and principled analysis of approaches/strategies in the presence of concept drift and are used in chapters 5 and 6. Sequences of drift are studied directly through the use of real world or semi-real world data sets in the thesis. The thesis does not focus on studies of recurrent and predictable drifts, which are proposed as future work.
Chapter 5
A Diversity Study in the Presence of Drift
Ensembles of learning machines have been widely studied and successfully used in offline mode. As examples, we can cite approaches such as negative correlation learning (Liu and Yao; 1999a,b; Liu et al.; 1999; Chen and Yao; 2009), bagging (Breiman; 1996a) and boosting (Schapire; 1990; Drucker et al.; 1992; Freund; 1995; Freund and Schapire; 1996b,a). The success of ensembles in offline mode inspired their use for online learning (Blum; 1997; Lim and Harrison; 2003; Kotsiantis and Pintelas; 2004; Oza and Russell; 2001a,b, 2005; Fern and Givan; 2000, 2003; Lee and Clyde; 2004) and concept drift (Stanley; 2003; Kolter and Maloof; 2003, 2007; Wang et al.; 2003; Scholz and Klinkenberg; 2005, 2007b; Ramamurthy and Bhatnagar; 2007; He and Chen; 2008).
Although ensembles have been used to handle concept drift, the literature does not contain any in-depth study of why they can be helpful for that purpose and which of their features do or do not contribute to dealing with concept drift. A better understanding of the behaviour of ensembles in online changing environments can reveal whether their potential is being correctly used and would allow better exploitation of their features.
One of the features which may help in dealing with drifts is diversity. As explained in section 1.2.2, in offline mode, diversity among base learners is an issue that has been receiving much attention in the ensemble learning literature. Many authors believe that the success of ensemble algorithms depends on both the accuracy and the diversity among the base learners (Dietterich; 1997; Kuncheva and Whitaker; 2003). However, no study of diversity has ever been done in online changing environments.
Section 5.3 presents a diversity study of ensembles in the presence of different types of concept drift, providing a deeper understanding of when, why and how online ensemble learning can help to deal with drifts. This is an answer to the research question presented in section
1.2.2. Section 5.4 extends the diversity study and shows how to use information from the old concept in order to aid the learning of the new concept, providing an answer to the research question presented in section 1.2.3. The underlying online ensemble learning algorithm used in the study is explained in section 5.2 and the statistical method used in the analyses is explained in section 5.1.
5.1 Analysis of Variance (ANOVA)
The analyses presented in sections 5.2 and 5.3 use Analysis of Variance (ANOVA) (Montgomery; 2004). ANOVA is a set of statistical methods that can be used to test hypotheses about the effect of different factors on a response variable.
For understanding the general idea of ANOVA, let’s consider the simple case in which we have only one factor. For example, consider that we are interested in analysing the effect of the parameter A on a certain algorithm’s accuracy using a certain data set. The parameter A is a factor and the accuracy is the response. Consider also that we choose a different values (factor levels) for A and run the algorithm r times for each factor level. So, the total number of response observations used by ANOVA is N = a ∗ r.
ANOVA methods are based on partitioning the total variability of the response into several components, which are attributed to different sources of variation. The total variability (SST) is measured by the sum of squares over all the observations. In the example considered here, it is decomposed into the variability due to the factor choices (SSTreatments) and the variability due to the error (SSE):
\[
SS_{T} = SS_{Treatments} + SS_{E} .
\]
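For reference, the three terms can be spelled out in the standard one-factor notation. This notation is an assumption here, since the passage does not define the terms individually: $y_{ij}$ denotes the $j$-th response observed at factor level $i$, $\bar{y}_{i\cdot}$ the mean response at level $i$, and $\bar{y}_{\cdot\cdot}$ the mean over all $N = a \cdot r$ observations.

\[
SS_{T} = \sum_{i=1}^{a} \sum_{j=1}^{r} \left( y_{ij} - \bar{y}_{\cdot\cdot} \right)^{2}, \qquad
SS_{Treatments} = r \sum_{i=1}^{a} \left( \bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot} \right)^{2}, \qquad
SS_{E} = \sum_{i=1}^{a} \sum_{j=1}^{r} \left( y_{ij} - \bar{y}_{i\cdot} \right)^{2} .
\]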
There are different ways to estimate the variabilities. The most frequently used is the Type III (Marginal) Sum of Squares, which represents the additional variability explained by adding the factor of interest.
The statistical analysis can use the null hypothesis that there is no difference in treatment means (no difference in response when using different factor levels) and the alternative hypothesis that there is difference. For testing whether this null hypothesis is true, the F statistic of the test is calculated as:
\[
F = \frac{SS_{Treatments}/(a - 1)}{SS_{E}/(N - a)} .
\]
If the variability due to the factor choice is large in comparison to the variability due to error, F will be larger and the null hypothesis that there is no difference in treatment means will be rejected. On the other hand, if the variability due to error is large in comparison to the variability due to the factor choice, the null hypothesis will not be rejected.
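The one-factor computation described above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the analysis code used in the thesis: the function name `one_way_anova_f` and the example responses are hypothetical, and the design is assumed to be balanced (r runs for each of the a factor levels).

```python
def one_way_anova_f(groups):
    """Return (SS_treatments, SS_error, F) for a balanced one-factor design.

    `groups` holds one list of r response observations per factor level.
    """
    a = len(groups)            # number of factor levels
    r = len(groups[0])         # runs per factor level (balanced design assumed)
    n_total = a * r            # total number of observations, N = a * r

    grand_mean = sum(sum(g) for g in groups) / n_total
    level_means = [sum(g) / r for g in groups]

    # Variability attributed to the choice of factor level.
    ss_treatments = r * sum((m - grand_mean) ** 2 for m in level_means)
    # Variability left unexplained within each factor level (error).
    ss_error = sum(
        (y - m) ** 2 for g, m in zip(groups, level_means) for y in g
    )

    f_stat = (ss_treatments / (a - 1)) / (ss_error / (n_total - a))
    return ss_treatments, ss_error, f_stat


if __name__ == "__main__":
    # Hypothetical responses: three factor levels, three runs each.
    responses = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
    print(one_way_anova_f(responses))
```

In practice the resulting F value would be compared against the F distribution with (a − 1, N − a) degrees of freedom to obtain a p-value; that step is omitted here to keep the sketch free of external dependencies.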
When more than one factor is used, SST can be decomposed into the variability due to
each factor, due to interaction of factors and due to error. In this way, the effect of each factor and interaction on the response can be analysed.
Factors can be categorized as within-subject or between-subject (Lane et al.; 2008). Within-subject factors involve comparisons of the same subjects under different conditions (factor levels). Between-subject factors are factors in which a different group of subjects is used for each factor level. As the term subject may be difficult to understand in the computer science domain, the examples given by Lane et al. (2008) will be used to illustrate within and between-subject factors.
Consider a study of the treatment of a certain disease using drugs. The performance of each person with the disease (subject) was measured four times, once after being on each of four drug doses for a week. Therefore, each subject’s performance was measured at each of the four levels of the factor “dose”, which is a within-subject factor.
Now, consider an experiment conducted for comparing four methods of teaching vocabulary. If a different group of students (subjects) is used for each of the four teaching methods, then teaching method is a between-subject factor.
When more than one factor is used, it could happen that we have a split-plot (mixed) design, which involves both between-subject and within-subject factors. In this case, ANOVA has to be done in two parts: one for analysing the within-subject effects and the other one for analysing the between-subject effects. As explained in sections 5.2 and 5.3, this is the type of ANOVA used in the thesis.
The main assumption made by split-plot ANOVA is sphericity (Demšar; 2006). Consider the covariance matrix of the levels of a within-subject factor. A sufficient (but not necessary) condition for sphericity is that all the covariances are equal and all the variances are equal in the populations being sampled. This sufficient condition is frequently used to give an intuition of what sphericity is. However, the sphericity assumption is a bit less strict (Baguley; 2004). Consider the differences between the responses for each pair of factor levels. For example, for a factor A with three different levels, let A1(r), A2(r) and A3(r) be the responses obtained for each treatment (factor level) A1, A2 and A3 on the subject r. For each subject r, calculate the differences A1(r) − A2(r), A1(r) − A3(r) and A2(r) − A3(r). Then, calculate the variance of these differences across subjects for each pair of factor levels. The sphericity assumption considers that all the variances of the differences are equal.
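The pairwise-difference procedure described above can be illustrated with a short Python sketch. The function name `difference_variances` and the response values are hypothetical, chosen only to show the computation. The sketch uses the sample variance, and it only computes the quantities that the sphericity assumption compares; it is not a formal test of sphericity.

```python
from itertools import combinations
from statistics import variance


def difference_variances(responses):
    """Variance of the per-subject differences for each pair of factor levels.

    `responses` maps a factor level name to a list of responses,
    one per subject, with subjects in the same order for every level.
    """
    result = {}
    for (level_a, values_a), (level_b, values_b) in combinations(
        responses.items(), 2
    ):
        # Difference between the two levels, computed subject by subject.
        differences = [x - y for x, y in zip(values_a, values_b)]
        # Sample variance of those differences across subjects.
        result[(level_a, level_b)] = variance(differences)
    return result


if __name__ == "__main__":
    # Hypothetical responses of four subjects at three levels of factor A.
    responses = {
        "A1": [10, 12, 14, 16],
        "A2": [9, 12, 13, 17],
        "A3": [8, 9, 13, 13],
    }
    for pair, var in difference_variances(responses).items():
        print(pair, round(var, 3))
```

Sphericity would hold in the population if the three variances were equal; markedly different values in a sample, as for the pair (A2, A3) in this made-up example, suggest that the assumption may be violated.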
If the sphericity assumption is violated, the split-plot ANOVA can have an inflated type I error rate (rejecting the null hypothesis when it is true) (Demšar; 2006). Mauchly’s tests (Mauchly; 1940) can