In general, a classification problem is to predict the class $Y$ given a set of features $X_i$, $i = 1, 2, \ldots, k$. By applying Bayes' rule, the probability of class $Y$ is

\[
P(Y \mid X_1, X_2, \ldots, X_k) = \frac{P(X_1, X_2, \ldots, X_k \mid Y)\, P(Y)}{P(X_1, X_2, \ldots, X_k)} \tag{3.1}
\]

where $P(X_1, X_2, \ldots, X_k \mid Y)$ and $P(Y)$ can be calculated from the data during the training of the classifier. For example, suppose we want to build a classifier to determine whether a patient shows early signs of health deterioration based on some simple features such as gender (female or male), shortness of breath (yes or no) and whether the patient takes any medication (yes or no). From a given data set, we can calculate $P(Y = \text{health deterioration})$ by dividing the number of patients that show early signs of health deterioration by the total number of patients in the data set. We can then calculate the other probabilities that we are interested in from the data set, such as $P(\text{gender} = \text{female}, \text{shortness of breath} = \text{yes}, \text{medication} = \text{yes} \mid Y = \text{health deterioration})$ and $P(\text{gender} = \text{female}, \text{shortness of breath} = \text{yes}, \text{medication} = \text{yes})$. Following Equation 3.1, we can calculate the probability that a female patient who is on medication and suffers from shortness of breath shows early signs of health deterioration.
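As a minimal sketch of how Equation 3.1 can be evaluated from counts, the following Python fragment uses a small invented data set. The records, the tuple layout and the function name `posterior` are illustrative assumptions and do not come from real patient data.

```python
from fractions import Fraction

# Hypothetical toy records:
# (gender, shortness_of_breath, medication, shows_deterioration)
records = [
    ("female", "yes", "yes", True),
    ("female", "yes", "yes", True),
    ("female", "yes", "yes", False),
    ("male",   "yes", "no",  True),
    ("female", "no",  "yes", False),
    ("male",   "no",  "no",  False),
    ("male",   "yes", "yes", False),
    ("female", "no",  "no",  False),
]

def posterior(records, features):
    """P(Y = deterioration | features) via Bayes' rule with counted probabilities."""
    total = len(records)
    positives = [r for r in records if r[3]]
    # P(Y): fraction of patients that show deterioration
    prior = Fraction(len(positives), total)
    # P(X1..Xk | Y): fraction of deteriorating patients with exactly these features
    likelihood = Fraction(sum(r[:3] == features for r in positives), len(positives))
    # P(X1..Xk): fraction of all patients with exactly these features
    evidence = Fraction(sum(r[:3] == features for r in records), total)
    # Raises ZeroDivisionError if the feature combination never occurs in the data
    return likelihood * prior / evidence

result = posterior(records, ("female", "yes", "yes"))
print(result)
```

In this toy data the three counted quantities are 2/3, 3/8 and 3/8, so the posterior is 2/3. A feature combination that never occurs in the data set has zero evidence and cannot be scored this way.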
The computation in a Bayes classifier is simple and straightforward. However, as the number of features increases, it becomes impractical to compute the probabilities [95]. Let us consider building a boolean-class Bayes classifier with $n$ boolean features. In this case, for each class there are $2^n$ possible combinations of feature values, each with its own probability. Since, for a particular class, these probabilities must sum to one, only $2^n - 1$ of them need to be calculated. Given a boolean class, we therefore need to calculate a total of $2(2^n - 1)$ probabilities, one for each distinct combination of feature values in each class. If $n$ is large, e.g., 20 boolean features, then we will need to compute more than 2 million probabilities, and this number roughly doubles with every additional feature, which is impractical in real-world implementations.
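The growth of this count can be checked with a few lines of Python. The function name `full_bayes_parameters` is a hypothetical helper introduced only for this illustration; it evaluates the expression $2(2^n - 1)$ for a boolean class and $n$ boolean features.

```python
def full_bayes_parameters(n_features: int, n_classes: int = 2) -> int:
    """Probabilities needed when no independence assumption is made.

    Each class has 2**n_features feature combinations, and one of them
    is fixed by the requirement that the probabilities sum to one.
    """
    return n_classes * (2 ** n_features - 1)

for n in (5, 10, 20, 30):
    print(n, full_bayes_parameters(n))
```

For $n = 20$ the expression gives 2,097,150, and for $n = 30$ it already exceeds two billion.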
One solution to this problem is to assume all features $(X_1, X_2, \ldots, X_k)$ to be independent of each other, given the class $Y$. The resulting model is called the naïve Bayes classifier. The naïveté in the classifier is that all the features are assumed to be independent of each other, and so the computation of $P(X_1, X_2, \ldots, X_k \mid Y_i)$ can be simplified as $\prod_{j=1}^{k} P(X_j \mid Y_i)$. The prediction of class $Y$, given a set of features $X_1, X_2, \ldots, X_k$, is therefore to select the class that maximises the posterior probability, i.e.,

\[
Y = \arg\max_{Y_i} \; P(Y_i) \prod_{j=1}^{k} P(X_j \mid Y_i)
\]
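The decision rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a production implementation: the sensor order (toaster, microwave, kettle), the two behaviour labels and all training examples are invented, and no smoothing is applied, so an unseen feature value gives a class a score of zero.

```python
from collections import Counter, defaultdict
from math import prod

def train(examples):
    """examples: list of (feature_tuple, label) pairs.

    Returns the class priors P(Y_i) and a function giving P(X_j = value | Y_i),
    both estimated by counting.
    """
    class_counts = Counter(label for _, label in examples)
    # value_counts[label][j][value]: examples of `label` whose feature j equals `value`
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for features, label in examples:
        for j, value in enumerate(features):
            value_counts[label][j][value] += 1
    priors = {c: n / len(examples) for c, n in class_counts.items()}

    def conditional(j, value, label):
        return value_counts[label][j][value] / class_counts[label]

    return priors, conditional

def predict(priors, conditional, features):
    """Return the class maximising P(Y_i) * prod_j P(X_j | Y_i), plus all scores."""
    scores = {
        c: priors[c] * prod(conditional(j, v, c) for j, v in enumerate(features))
        for c in priors
    }
    return max(scores, key=scores.get), scores

# Hypothetical sensor readings: (toaster, microwave, kettle), 1 = activated
examples = [
    ((1, 1, 0), "preparing meal"),
    ((1, 0, 0), "preparing meal"),
    ((0, 1, 0), "preparing meal"),
    ((1, 1, 1), "preparing meal"),
    ((0, 0, 1), "making tea"),
    ((0, 0, 1), "making tea"),
    ((1, 0, 1), "making tea"),
    ((0, 1, 1), "making tea"),
]

priors, conditional = train(examples)
label, scores = predict(priors, conditional, (1, 1, 0))
print(label, scores)
```

With these invented counts, the reading (1, 1, 0) scores 0.5 × 0.75 × 0.75 × 0.75 for 'preparing meal' and zero for 'making tea', because the kettle is active in every 'making tea' example. Only $k$ conditional probabilities per class are estimated, rather than one for every combination of feature values.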
Figure 3.2 shows an example of a generic naïve Bayes classifier as a graphical model. The class node represents the behaviour, which is the parent to its features (e.g., sensors that are attached to the toaster, television, etc.).
Figure 3.2: A graphical representation of a naïve Bayes classifier. The features are the sensors attached to the toaster, microwave oven, burner, etc., and are independent given the class ‘preparing meal’ behaviour.
In comparison to the decision tree described in Section 3.2.1, the naïve Bayes classifier does not need to perform an explicit search through the training data to test each sensor at every tree node, but instead counts the frequency of sensor combinations within the training examples.
Among the earlier work that used the naïve Bayes classifier to recognise behaviours of the inhabitant is the work of Tapia, Intille and Larson [133]. They extend the naïve Bayes classifier to incorporate temporal relationships of the sensor activations, such as whether token ‘a’ is activated within some time interval and whether token ‘a’ is activated before token ‘b’. In order to recognise behaviours, they used a set of feature windows, each representing one behaviour, with the length of each feature window based on the average time that the inhabitant took to carry out the activity. The probability for the current activity is calculated by shifting the feature window over the sensor sequences. The probability reaches its maximum when the window aligns with the duration of the activity in the sensor readings.
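To make the two kinds of temporal feature concrete, the sketch below extracts them from one window of timestamped sensor events. It is an illustration only and is not the implementation used in [133]: the function name `window_features`, the event format and the choice to compare first activations within the window are all assumptions.

```python
def window_features(events, start, length, sensors):
    """Extract temporal features from one feature window.

    events:  list of (timestamp, sensor_id) pairs, assumed sorted by time
    start:   timestamp at which the window begins
    length:  window length (e.g. the average duration of one behaviour)
    sensors: sensor identifiers to build features for
    """
    # First activation time of each sensor that fires inside the window
    first_seen = {}
    for timestamp, sensor in events:
        if start <= timestamp < start + length:
            first_seen.setdefault(sensor, timestamp)

    # Feature type 1: was the sensor activated within the window?
    fired = {s: s in first_seen for s in sensors}
    # Feature type 2: was sensor a first activated before sensor b?
    before = {
        (a, b): a in first_seen and b in first_seen and first_seen[a] < first_seen[b]
        for a in sensors for b in sensors if a != b
    }
    return fired, before

# Hypothetical event stream (timestamps in seconds)
events = [(0, "tap"), (5, "kettle"), (40, "toaster")]
sensors = ["tap", "kettle", "toaster"]
fired, before = window_features(events, start=0, length=30, sensors=sensors)
print(fired)
print(before[("tap", "kettle")])
```

Shifting the window corresponds to calling the function with increasing values of `start`; each resulting feature set could then be scored by a classifier trained for the behaviour that the window represents.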
Since some behaviours are rare, the distribution of the sensors is not consistent among the behaviour classes. In order to learn the behaviour classes accurately, the work of Sarkar, Lee and Lee [57] incorporates a smoothing technique that discounts the probabilities of frequently seen sensors and gives some probability to unseen sensors when training the naïve Bayes classifier. Long, Yin and Aarts [81] use principal components analysis (PCA) to first reduce the number of features that affect the performance of the classifier before training it. The work of Yang, Lee and Choi [150] introduced a penalty function, where the trained classifier is penalised according to the mismatch of time and the mismatch between the learned model and the observed actions of an activity.
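The effect of smoothing can be illustrated with additive (Laplace) smoothing, which is a simpler, standard technique and is not the discounting scheme of [57]. The function name `smoothed_probability` and the counts below are hypothetical.

```python
from fractions import Fraction

def smoothed_probability(count, class_total, n_values=2, alpha=1):
    """Additive estimate of P(sensor = value | class).

    count:       training examples of the class with this sensor value
    class_total: training examples of the class
    n_values:    number of values the sensor can take (2 for on/off)
    alpha:       pseudo-count added to every value (assumed to be an integer here)
    """
    return Fraction(count + alpha, class_total + alpha * n_values)

# A sensor that never fired in 4 examples of a rare behaviour
unseen = smoothed_probability(0, 4)
# The complementary value, observed in all 4 examples
always = smoothed_probability(4, 4)
print(unseen, always)
```

The unseen value receives probability 1/6 instead of 0, and the value seen in every example is reduced from 1 to 5/6, so a single missing observation no longer forces the product of conditional probabilities to zero.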
Although the strong independence assumption in the naïve Bayes classifier makes it a tractable approach for learning (i.e., by considering each token in each behaviour separately), this assumption may not hold in real-world applications, since correlations among features are common. The smart home is one example where some sensors are correlated. For example, the sensors attached to the kettle and the kitchen tap are correlated, since making tea involves both objects. The correlations among sensors introduce dependencies and, as such, reduce the influence of other features, which can then affect the overall performance of the algorithm [74, 118].
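A small numerical sketch shows why correlated sensors distort the result. It assumes two classes with equal priors and an extreme case in which a second sensor always fires together with the first, so it carries no new information. The probabilities and the function name `naive_posterior` are invented for the illustration.

```python
from fractions import Fraction

def naive_posterior(prior_a, likelihoods_a, likelihoods_b):
    """Posterior of class A against class B under the naive independence assumption."""
    score_a = prior_a
    for p in likelihoods_a:
        score_a *= p
    score_b = 1 - prior_a
    for p in likelihoods_b:
        score_b *= p
    return score_a / (score_a + score_b)

half = Fraction(1, 2)
# Kettle sensor only: P(kettle on | A) = 4/5, P(kettle on | B) = 2/5
kettle_only = naive_posterior(half, [Fraction(4, 5)], [Fraction(2, 5)])
# Add a tap sensor that is perfectly correlated with the kettle sensor
kettle_and_tap = naive_posterior(
    half, [Fraction(4, 5), Fraction(4, 5)], [Fraction(2, 5), Fraction(2, 5)]
)
print(kettle_only, kettle_and_tap)
```

The posterior rises from 2/3 to 4/5 although the second sensor adds no evidence: the same observation is counted twice, which gives the correlated pair more weight relative to the remaining features.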
One problem with the naïve Bayes classifier is that it does not take into account the sequential ordering of sensor readings. Thus, to achieve good recognition performance, temporal information needs to be encoded in the classifier, as in the work of Tapia, Intille and Larson [133]. A method that instead builds the temporal information directly into the model is the hidden Markov model (HMM), which is described next.