Chapter 2. Sensors fabricated with PCB technology
2.4. Tests and discussion of the results
2.4.1. Static tests
We usually presuppose that the classification mapping is static and does not change over time (cf. Section ). In practice, however, this assumption often does not hold. In text classification in particular, the notion or semantics of a class often changes over time. Several categories may also be merged into one or, conversely, one category may be divided into several subcategories. This phenomenon is usually called concept drift and typically appears in applications of stream data classification.
The rcv1 dataset is well suited for analyzing the robustness of a learning algorithm, since it contains more than 800,000 chronologically ordered documents. Since no concrete concept drift is known for this dataset, we introduce it artificially. Following ( ), we shift the mapping of the instances $T = \langle (x_i, y_i)_{1 \le i \le m} \rangle$ as follows:

$$y'_i = \begin{cases} y_i, & \text{with probability } p = \frac{m-i}{m} \\ (y_{i,m}, y_{i,1}, \ldots, y_{i,m-1}), & \text{otherwise} \end{cases} \qquad (4.8)$$

resulting in the new drifted example set $T' = \langle (x_i, y'_i)_{1 \le i \le m} \rangle$.
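The drift simulation can be illustrated with a minimal sketch. It assumes that each label vector is a plain Python list of 0/1 entries and that the shift is a rotation by one position, as the shifted vector in Eq. 4.8 suggests; the function name `drift_labels` and the toy data are hypothetical and not taken from the original experiments.

```python
import random


def drift_labels(labels, seed=0):
    """Return a drifted copy of `labels`, a list of equal-length 0/1 label vectors.

    Example i (1-based) keeps its label vector with probability (m - i) / m.
    Otherwise the vector is rotated by one position, so that the last
    label moves to the front.
    """
    rng = random.Random(seed)
    m = len(labels)
    drifted = []
    for i, y in enumerate(labels, start=1):
        if rng.random() < (m - i) / m:
            drifted.append(list(y))                 # mapping unchanged
        else:
            drifted.append([y[-1]] + list(y[:-1]))  # cyclic shift
    return drifted


# Toy stream: 1000 examples that all carry the first of three labels.
original = [[1, 0, 0] for _ in range(1000)]
drifted = drift_labels(original)

# Early examples are rarely shifted, late examples almost always.
changed_early = sum(a != b for a, b in zip(original[:100], drifted[:100]))
changed_late = sum(a != b for a, b in zip(original[-100:], drifted[-100:]))
print(changed_early, changed_late)
```

Because the probability of keeping the original mapping falls linearly from almost 1 to 0, the last example of the stream is always shifted.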
( ) very recently proposed using ensembles of random decision trees (RDTs) for learning stream data with concept drift. The main idea is to generate k1 RDTs with random attribute tests at the inner nodes and a maximal depth of k2. Comparatively small values of k1 and k2, around 10 or 20 and at most 100, are sufficient in practice. During the extremely fast training, the leaves incrementally collect statistics about the label distributions y and the labelset sizes |y| of the instances that passed all tests on the way to the leaf. Hence, each RDT predicts an average distribution and cardinality, which are subsequently averaged over all trees.
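The following is a strongly simplified sketch of this idea, not the RDT library used in the experiments. It assumes numeric features scaled to [0, 1), binary threshold tests and trees whose tests are drawn lazily on first visit; the class names `RandomTree` and `RDTEnsemble` are hypothetical.

```python
import random


class RandomTree:
    """One random tree: random tests at inner nodes, label statistics in the leaves."""

    def __init__(self, n_features, n_labels, max_depth, rng):
        self.n_features = n_features
        self.n_labels = n_labels
        self.max_depth = max_depth
        self.rng = rng
        self.tests = {}   # node id -> (feature index, threshold), drawn lazily
        self.leaves = {}  # leaf id -> [count, per-label sums, sum of labelset sizes]

    def _leaf(self, x):
        node = 0
        for _ in range(self.max_depth):
            if node not in self.tests:
                # The test is chosen at random, independently of the data.
                self.tests[node] = (self.rng.randrange(self.n_features),
                                    self.rng.random())
            feature, threshold = self.tests[node]
            node = 2 * node + (1 if x[feature] <= threshold else 2)
        return node

    def update(self, x, y):
        stats = self.leaves.setdefault(self._leaf(x),
                                       [0, [0.0] * self.n_labels, 0.0])
        stats[0] += 1
        stats[1] = [s + yk for s, yk in zip(stats[1], y)]
        stats[2] += sum(y)

    def predict(self, x):
        stats = self.leaves.get(self._leaf(x))
        if stats is None:
            return [0.0] * self.n_labels, 0.0   # leaf has seen no example yet
        n = stats[0]
        return [s / n for s in stats[1]], stats[2] / n


class RDTEnsemble:
    """k1 random trees of depth k2; predictions are averaged over all trees."""

    def __init__(self, n_features, n_labels, k1=10, k2=10, seed=0):
        rng = random.Random(seed)
        self.n_labels = n_labels
        self.trees = [RandomTree(n_features, n_labels, k2, rng) for _ in range(k1)]

    def update(self, x, y):
        for tree in self.trees:
            tree.update(x, y)

    def predict(self, x):
        preds = [tree.predict(x) for tree in self.trees]
        k = len(preds)
        dist = [sum(p[0][j] for p in preds) / k for j in range(self.n_labels)]
        card = sum(p[1] for p in preds) / k
        return dist, card


ensemble = RDTEnsemble(n_features=1, n_labels=2)
for _ in range(20):
    ensemble.update([0.9], [1, 0])
    ensemble.update([0.1], [0, 1])
dist, card = ensemble.predict([0.9])
print(dist, card)
```

Training only increments counters in one leaf per tree, so the cost per example is bounded by k1·k2 tests regardless of the number of labels.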
RDTs are well suited to data with a high number of examples and labels, since the costs are bounded by the choice of k1 and k2. However, they are not appropriate for data with a high number of possibly sparse features, as we will also see in the following experiments. This is because RDTs can cover at most k1·k2 feature dimensions with their tests, and increasing k1 and particularly k2 quickly leads to an extreme slowdown. For small datasets such as yeast and scene, however, they obtain very good results, often outperforming SVMs while using only 10% to 1% of the training and testing time. An additional decision tree learner for stream data was proposed by ( ), but unfortunately the Hoeffding trees it uses are not suitable for concept drift. Recently, ( ) proposed a windowing mechanism over the positive and negative examples of the base learners of a BR ensemble, which is particularly suitable for nearest neighbor learners.
42 The ordering of the classes is not randomized in our version of rcv1, so it is possible that a certain bias was introduced towards drifting to hierarchically close labels or labels similar in size.
43 ( ) and our own preliminary experiments with yeast, scene, emotions (cf. Section ), not shown here.

For the particular case of concept drift in stream data, ( ) implemented a technique that successively decreases the weight of previous examples. More specifically, the weight of an example is halved after a predetermined number h of subsequent training examples. Integrating the half-life parameter is straightforward for perceptrons by changing the update of the weight vector (Eq. ) to

$$w_{i+1} = 2^{-\frac{1}{h}} w_i + \alpha_i x_i \quad \text{or} \quad w_{i+1} = 2^{-\frac{i-j}{h}} w_i + \alpha_i x_i \qquad (4.9)$$

respectively, if j was the last index with $\alpha_j \neq 0$. After m training examples, this results in $w_{m+1} = \sum_{i=1}^{m} 2^{-\frac{m-i}{h}} \alpha_i x_i$. It is easy to see that this corresponds to an increasing learning rate of $\eta_i = 2^{\frac{i}{h}}$ (cf. Section ).
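The equivalence of the two update rules and the closed form can be checked numerically. The sketch below assumes a zero initial weight vector and represents each training step as a pair (alpha_i, x_i); the function names and toy values are hypothetical.

```python
def eager_update(updates, h):
    """Apply w <- 2**(-1/h) * w + alpha * x at every step, starting from w = 0."""
    w = [0.0] * len(updates[0][1])
    decay = 2.0 ** (-1.0 / h)
    for alpha, x in updates:
        w = [decay * wk + alpha * xk for wk, xk in zip(w, x)]
    return w


def lazy_update(updates, h):
    """Touch w only when alpha != 0, decaying by 2**(-(i - j) / h) since the last update j."""
    w = [0.0] * len(updates[0][1])
    j = 0
    for i, (alpha, x) in enumerate(updates, start=1):
        if alpha == 0:
            continue
        decay = 2.0 ** (-(i - j) / h)
        w = [decay * wk + alpha * xk for wk, xk in zip(w, x)]
        j = i
    # Catch up on the decay of trailing steps without an update.
    tail = 2.0 ** (-(len(updates) - j) / h)
    return [tail * wk for wk in w]


def closed_form(updates, h):
    """Evaluate sum_i 2**(-(m - i) / h) * alpha_i * x_i directly."""
    m = len(updates)
    w = [0.0] * len(updates[0][1])
    for i, (alpha, x) in enumerate(updates, start=1):
        weight = 2.0 ** (-(m - i) / h)
        w = [wk + weight * alpha * xk for wk, xk in zip(w, x)]
    return w


# Toy sequence of (alpha_i, x_i); alpha_i = 0 means no perceptron update.
steps = [(1.0, [1.0, 2.0]), (0.0, [5.0, 5.0]), (-1.0, [0.5, 0.0]), (0.0, [1.0, 1.0])]
w_eager = eager_update(steps, h=2)
w_lazy = lazy_update(steps, h=2)
w_closed = closed_form(steps, h=2)
print(w_eager, w_lazy, w_closed)
```

The lazy variant is the one that matters for sparse perceptron updates, since it avoids rescaling the weight vector on examples that trigger no update. The final rescaling only changes the norm of the weight vector, not the ranking of the scores it produces.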
( ) reported a substantial improvement from assigning a half-life of 200 to the training examples. For rcv1 in particular, RANKLOSS improved from approx. 0.4 to 0.1-0.14. In our experiments, however, this parametrization substantially harmed the performance of CMLPP as well as RDTs, as can be seen in Figure . We adapted the RDT library of ( ) to support the half-life parameter and tried out different combinations of the number of features, the maximal depth of the trees and the size of the ensemble, also following the recommendations of both publications. The best combination for RDT on rcv1 was to use 2500 features and 20 trees with a depth of 100. Similarly to the setting of , RDTs and CMLPP were trained on the first 67034·j, 1 ≤ j ≤ 11, examples and tested on the following 67034, so that we obtained eleven points for each learning algorithm.

Changing the weights of the training examples clearly harms the performance of both algorithms: the faster the decrease, the more pronounced the increase in ranking loss. Figure shows only the CMLPP variants, similarly to Figure , with the average accumulated RANKLOSS. The last curve additionally shows the average of the previous 10000 ranking losses (obtained in the same way, by testing before training) for an infinite half-life, i.e. the default setting. Again, it can be clearly observed that a constant learning rate is the best option in this particular setting. It is of particular interest that the RANKLOSS curves of CMLPP are almost linear and show only a slight upward slope, although the mappings are completely shuffled in the end. This demonstrates the robustness of CMLPP in this particular setting.
Regarding the behavior of RDTs, and also MLPPs, which is contrary to that reported by ( ), we cannot exclude an error in the implementation, but we consider it unlikely, since the same effect appears for both algorithms. On the other hand, the observations in our experiments seem reasonable since, even though the drift is very radical, it evolves very slowly over more than 800,000 examples. Roughly speaking, with the proposed half-life of 200, the example base is expected to be wrong only 200/804413 ≈ 0.025% more often than 200 examples before, but the examples have already lost half of their weight. Intuitively, the drifting rate and the rate at which the example weights decrease should correspond. As the curves in Figure show, the half-life should even lie above 20000 for the proposed drifting setting. Nevertheless, this aspect should be investigated in further, more varied settings. Moreover, we simply adopted the simple technique also employed by ( ), but perceptrons have already been investigated with a focus on concept drift, e.g. by ( ) and their Forgetrons.
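The mismatch between drift rate and weight decay can be made concrete with a short calculation. It assumes the linear drift schedule of Eq. 4.8, under which the fraction of shifted mappings grows by n/m over n examples; the helper names are hypothetical.

```python
def drift_increase(n, m):
    """Additional fraction of shifted mappings after n more examples of an m-example stream."""
    return n / m


def retained_weight(n, h):
    """Weight left on an example after n subsequent examples with half-life h."""
    return 2.0 ** (-n / h)


m = 804413  # stream length quoted in the text
for h in (200, 20000):
    print(h, drift_increase(h, m), retained_weight(h, h))
```

For h = 200 the drift advances by only about 0.025 percentage points while the example weights are already halved, whereas for h = 20000 the drift advances by about 2.5 percentage points over one half-life.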