Most current supervised classification methods need labels and predict a single target value. Some extensions of existing learning techniques, such as the random forest extension of decision trees, create many trees, each using one attribute at a time. With multiple trees the model grows to a huge size, which makes it hard to maintain. The DC-tree algorithm (Kocev et al., 2007) and the PurestTree() algorithm (Algorithm 6.2) use multiple targets as labeled clusters, as in Figure 6.4; they need only a single tree and are easy to model.

6.3 Concept Learning

We investigate using the standard WEATHER dataset to learn the temporal variations, which are descriptive attributes. There are four basic types of learning in data mining applications (Tan et al., 2006):

• Classification learning

• Association learning

• Clustering

• Regression or numeric prediction

Regardless of the type of learning, what is learned constitutes the concept, and the output produced by a learning scheme that uses descriptive attributes is the concept description. As we are concerned with label-less learning (no class label), the last two types in the list above are the more relevant.

In practical data mining applications the following types of attributes are used:

• Categorical

• Enumerated or discrete

• Numeric or continuous

• Boolean

Machine learning can use a wide variety of information about the attributes (Witten & Frank, 2005). Some of these are:

• Number of dimensions

• Circular ordering

• Temporal context

• Partial ordering

These are often called metadata: data about data. We would like to learn more about the temporal aspects listed above using the WEATHER dataset. Detecting and retaining all features is a prime consideration in categorizing good-quality streams. From the example WEATHER dataset, we enumerate a few temporal attributes. The problem is how to classify a new day given its temporal values.

Numeric prediction (Witten & Frank, 2005) is a variant of classification learning in which the outcome is a numeric value rather than a category. Data cleaning of noisy streams is one example. Another, shown in the matrix representation in Figure 7.1, is a version of the stream data in which what is to be predicted is not the target label Y from the sensor data matrix but the QoD of the real-time stream. With numeric prediction problems, as in other machine learning situations, the predicted value for new instances is often of less interest than the hidden features that are learned, expressed in terms of which attributes are important and how they relate to the QoD-label outcome.

outlook = sunny?
    yes: cluster-1 [32,52.5]: 2
            sunny,no,34.0,50.0
            sunny,no,30.0,55.0
    no:  outlook = overcast?
            yes: cluster-2 [15.5,72.5]: 2
                    overcast,no,20.0,70.0
                    overcast,yes,11.0,75.0
            no:  outlook = rainy, windy?
                    windy = yes: cluster-3 [9,92.5]: 2
                            rainy,yes,10.0,95.0
                            rainy,yes,8.0,90.0
                    windy = no:  cluster-4 [19,91.5]: 2
                            rainy,no,20.0,88.0
                            rainy,no,18.0,95.0

Figure 6.4: Cluster tree with four leaves for the WEATHER dataset.
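The leaf centroids in Figure 6.4 can be reproduced from the eight listed instances. The sketch below is illustrative only. The `route` function hard-codes the splits drawn in the figure rather than learning them, the instance layout (outlook, windy, temperature, humidity) is inferred from the listing, and all names are hypothetical; none of this is the DC-Tree or PurestTree() implementation.

```python
from collections import defaultdict

# Instances as listed in Figure 6.4: (outlook, windy, temperature, humidity).
INSTANCES = [
    ("sunny", "no", 34.0, 50.0),
    ("sunny", "no", 30.0, 55.0),
    ("overcast", "no", 20.0, 70.0),
    ("overcast", "yes", 11.0, 75.0),
    ("rainy", "yes", 10.0, 95.0),
    ("rainy", "yes", 8.0, 90.0),
    ("rainy", "no", 20.0, 88.0),
    ("rainy", "no", 18.0, 95.0),
]


def route(outlook, windy):
    """Hypothetical routing that mirrors the splits drawn in Figure 6.4."""
    if outlook == "sunny":
        return "cluster-1"
    if outlook == "overcast":
        return "cluster-2"
    # The remaining case is outlook = rainy, which is split on windy.
    return "cluster-3" if windy == "yes" else "cluster-4"


def leaf_centroids(instances):
    """Group instances by leaf and average the two numeric attributes."""
    leaves = defaultdict(list)
    for outlook, windy, temperature, humidity in instances:
        leaves[route(outlook, windy)].append((temperature, humidity))
    return {
        leaf: tuple(sum(column) / len(column) for column in zip(*points))
        for leaf, points in leaves.items()
    }


centroids = leaf_centroids(INSTANCES)
for leaf in sorted(centroids):
    print(leaf, centroids[leaf])
```

Each centroid is the per-attribute mean of the numeric values in the leaf, which matches the bracketed pairs shown next to the cluster names in the figure.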

The WEATHER problem is a tiny dataset that we will use repeatedly to illustrate DC-Tree cleaning methods. In general, instances in a dataset are characterized by the values of features, or attributes. In this case there are four attributes: outlook, temperature, humidity, and windy. The descriptive attributes, which form the nodes of the DC-Tree, are shown in Figure 6.4.

• Outlook

• Windy

The remaining attributes are numeric and are used in clustering, as shown in Figure 6.4.

• Temperature

• Humidity

outlook = sunny?
    yes: cluster-1 [32,52.5]: 2
            sunny,no,34.0,50.0
            sunny,no,30.0,55.0
    no:  cluster-2 [14.5,85.5]: 6
            overcast,no,20.0,70.0
            overcast,yes,11.0,75.0
            rainy,no,20.0,88.0
            rainy,no,18.0,95.0
            rainy,yes,10.0,95.0
            rainy,yes,8.0,90.0

Figure 6.5: Cluster tree with pruning enabled.

Evaluating the error from 7.1 and substituting values for the variables in Equation 6.4, we obtain the generalization error (RMSE) before and after pruning.

Before pruning = 2.6751

After pruning = 6.7338
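The two values can be checked against the instances listed in Figures 6.4 and 6.5. Equation 6.4 is not reproduced in this section, so the sketch below rests on an assumption: the RMSE is taken over every numeric attribute value (temperature and humidity) measured against the centroid of its leaf. The function and variable names are hypothetical.

```python
import math

# Numeric attributes (temperature, humidity) of the instances in each leaf.
SUNNY = [(34.0, 50.0), (30.0, 55.0)]
OVERCAST = [(20.0, 70.0), (11.0, 75.0)]
RAINY_WINDY = [(10.0, 95.0), (8.0, 90.0)]
RAINY_CALM = [(20.0, 88.0), (18.0, 95.0)]

UNPRUNED = [SUNNY, OVERCAST, RAINY_WINDY, RAINY_CALM]  # leaves of Figure 6.4
PRUNED = [SUNNY, OVERCAST + RAINY_CALM + RAINY_WINDY]  # leaves of Figure 6.5


def centroid(points):
    """Per-attribute mean of the points in one leaf."""
    return [sum(column) / len(column) for column in zip(*points)]


def tree_rmse(leaves):
    """RMSE of every attribute value against its leaf centroid (assumed definition)."""
    squared_error = 0.0
    count = 0
    for points in leaves:
        center = centroid(points)
        for point in points:
            for value, mean in zip(point, center):
                squared_error += (value - mean) ** 2
                count += 1
    return math.sqrt(squared_error / count)


rmse_before = tree_rmse(UNPRUNED)
rmse_after = tree_rmse(PRUNED)
print(f"Before pruning = {rmse_before:.4f}")
print(f"After pruning = {rmse_after:.4f}")
```

Under this assumed definition the sketch agrees with the reported figures to four decimal places. The error grows after pruning because the merged leaf averages over overcast and rainy days whose humidity values lie far apart.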

In the data-cleaning metric, the classification error for the tree in Figure 6.4 is 2.6751. It has a higher QoD (Iyer et al., 2013) label than the pruned version with fewer nodes shown in Figure 6.5.

A machine learning framework that combines the outputs of many classifiers is known as ensemble learning. It is one of the most popular ways to improve the robustness and accuracy of a base classifier. Well-known methods that use ensemble learning include bagging and random forests. The framework also supports stacking, voting, and random subspaces, which combine multiple classifier outputs. Empirical evidence shows that combining classifiers improves accuracy if the classifiers are "independent", i.e., the learning function differs from one classifier to the next. In this chapter, we look into various ensemble methods.

6.4 Related Work

Many studies have been conducted to improve sampling techniques. Similarly, many studies use ensembles to improve classification accuracy. The most important assumption is that samples are independent and identically distributed (i.i.d.), and independence is likewise assumed for the accuracy of the individual classifiers. This implies that attribute partitioning methods perform better than data partitioning. Most current approaches emphasize data partitioning in terms of cross-validation and reduction in training error. The feature selection problem plays an important role in domain-targeted research. It can be viewed as a multi-criterion optimization problem that is solved with heuristic, search, or optimization techniques. Kohavi & John (1997) discuss a feature relevance model and show how various feature subset selection problems are solved. Their survey covers two models: (i) the wrapper model and (ii) the filter model. The wrapper model evaluates candidate feature subsets with the induction algorithm itself, so the selected subset is tailored to that algorithm. The filter model instead selects features before induction, using information content such as inter-class distance and statistical dependence, independently of any particular induction algorithm. Some of the research on the filter model is reviewed by Blum & Langley (1997).

6.4.1 Justification of Ensemble Method

Because of small training sets and data values affected by faulty sensor measurements, let us assume that each base classifier has an error rate of ε = 0.35. The ensemble classifier uses the methods stated below to combine the outputs of the individual classifiers. If the base classifiers all use the same training set and induction algorithm, the ensemble misclassifies exactly the test samples that the base classifier misclassifies, so its error rate remains 0.35. If instead the base classifiers are independent, meaning that their errors are uncorrelated, the ensemble makes an error on a test sample only if more than half of the base classifiers predict incorrectly. The probability that exactly r of n independent classifiers err, each with probability p, is the binomial probability given in Equation 6.11.

$$\binom{n}{r}\, p^{r} (1 - p)^{n - r} \tag{6.11}$$

Substituting the base error rate ε = 0.35 for p and summing over every majority of n = 25 base classifiers, the error rate of the ensemble classifier in our case is

$$e_{\text{ensemble}} = \sum_{i=13}^{25} \binom{25}{i}\, \varepsilon^{i} (1 - \varepsilon)^{25 - i} = 0.06 \tag{6.12}$$
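The sum in Equation 6.12 is easy to evaluate numerically. The following is a minimal sketch assuming independent base classifiers and a simple majority vote; the function name `ensemble_error` is hypothetical.

```python
from math import comb


def ensemble_error(n_classifiers, base_error):
    """Probability that a majority of independent base classifiers err."""
    majority = n_classifiers // 2 + 1  # 13 of 25 in the example above
    return sum(
        comb(n_classifiers, i)
        * base_error ** i
        * (1 - base_error) ** (n_classifiers - i)
        for i in range(majority, n_classifiers + 1)
    )


error = ensemble_error(25, 0.35)
print(f"ensemble error = {error:.2f}")
```

The same function shows why the base classifiers must do better than random guessing: with a base error rate of 0.5 the majority vote is wrong half of the time, so the ensemble gains nothing.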

We take a classification approach to learning data-cleaning algorithms in order to reduce errors in the estimation of sensor data. A general form of the training error for a data-cleaning algorithm is shown in Equation 6.13, where the estimate depends on the bias and variance of the observed data and on the noise present in the measurement domain.

$$\text{Training error} = \text{Bias}_{\text{observed}} + \text{Variance}_{\text{algorithm}} + \text{Noise}_{\text{measurement}} \tag{6.13}$$

6.4.2 Factors that Affect Bias

Generalization is an important aspect of the classification approach: if our model were too complicated, the decision boundary would be too. This would lead to a premature solution, because the rules must also perform well on samples that are yet to be seen. Another approach is to obtain a better estimate of the parameters measured by ground sensors by collecting more training samples. Because of constraints on power and sensor lifetime, it is not feasible to obtain large training samples in continuous time. A design choice would be to simplify the model and use a less complex boundary. This could lower the performance of the classifier on the training samples but would mean better performance on novel patterns.

6.4.3 Complexity and Algorithms

Generally there are two types of learning algorithms: supervised and unsupervised. A supervised learning algorithm works on any training instance that has attributes and a target label. The stronger the assumptions made about the attributes, the higher the bias. Likewise, the stronger the assumptions about the type of decision boundary, the higher the bias. For example, given the same training set, the nearest-neighbor classifier is more sensitive to the composition of that set than a decision tree classifier.

6.4.4 Bias and Variance for Continuous Values

Sensor data is normally continuous and can consist of many readings, which are high-precision floating-point values. In this case we have a continuous-valued sensor measurement with noise, which we estimate from n samples in a set D generated by F(x). The regression function estimate for a given training set D is g(x;D). Because of random variation in time-series data, for some data sets of window length L the approximation will be excellent, while for other data sets of the same size it will be poor.

$$\mathrm{Var}(X) = E_D\big[(X - \mu)^2\big] \tag{6.14}$$

$$\mathrm{Var}(X) = E_D\big[X^2 - 2X\,E_D[X] + (E_D[X])^2\big] \tag{6.15}$$

$$= E_D[X^2] - 2E_D[X]\,E_D[X] + (E_D[X])^2 \tag{6.16}$$

Substituting in the regression estimate gives

$$E_D\big[(g(x;D) - F(x))^2\big] = \underbrace{\big(E_D[g(x;D)] - F(x)\big)^2}_{\text{bias}} + \underbrace{E_D\big[(g(x;D) - E_D[g(x;D)])^2\big]}_{\text{variance}} \tag{6.18}$$

The equation consists of two additive terms. The first, squared expression is the bias; minimizing it gives a better estimate of the function F from D. The second term is the variance (Duda et al., 2000), a statistical property of the estimate; if it is kept low, it affects the root mean square (RMS) error only minimally when the bias term is low or zero.

It can be shown from the above equation that, as the number of training samples grows (n → ∞), the bias term is reduced and depends only on the noise factor, while the variance is reduced to zero or to a level matching the desired quality labels described in the next sections.
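The decomposition can be checked numerically on synthetic data. The sketch below is an illustration under stated assumptions, not the estimator used in this work: F(x) is a hypothetical linear signal, the noise is Gaussian, and g(x;D) is simply the mean of n noisy readings in a training set D. All function names are hypothetical.

```python
import random
import statistics


def true_function(x):
    """Hypothetical noise-free signal F(x), chosen only for illustration."""
    return 2.0 * x + 1.0


def estimate(x, n_samples, noise_sd, rng):
    """g(x; D): mean of n noisy readings of F(x) forming one training set D."""
    readings = [true_function(x) + rng.gauss(0.0, noise_sd) for _ in range(n_samples)]
    return sum(readings) / n_samples


def decompose(x, n_samples, noise_sd=1.0, n_sets=2000, seed=0):
    """Return (mse, squared bias, variance) of g(x; D) over many training sets."""
    rng = random.Random(seed)
    estimates = [estimate(x, n_samples, noise_sd, rng) for _ in range(n_sets)]
    target = true_function(x)
    mse = statistics.fmean((g - target) ** 2 for g in estimates)
    bias_squared = (statistics.fmean(estimates) - target) ** 2
    variance = statistics.pvariance(estimates)
    return mse, bias_squared, variance


for n in (5, 100):
    mse, bias_squared, variance = decompose(x=3.0, n_samples=n)
    print(f"n={n}: mse={mse:.5f} bias^2={bias_squared:.5f} variance={variance:.5f}")
```

For each n the mean squared error equals the squared bias plus the variance up to floating-point rounding, and the variance shrinks as n grows, in line with the limiting argument above. The sample mean is an unbiased estimator here, so the squared-bias term stays close to zero throughout.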
