6. Theoretical Framework
6.1. Education
6.1.1 Conceptualization
Clustering using silhouette or cross-validation accuracy is time-consuming because it requires every clustering to be assessed. A more efficient method would be to stop the clustering at some estimated optimum number of clusters, removing the need to examine the clusterings beyond that point.
The Minimum Description Length (MDL) framework can be used to cluster time series [142], for dimensionality reduction [92], and for stopping in semi-supervised clustering [16].
Clustering by MDL is described in detail in [142]. The technique used is aimed at clustering time-series subsequences from a streaming time series. Their algorithm inspects each pair of subsequences and stores the bit save made by clustering the pair together. A bit save is the saving made by reducing the number of bits necessary to store the data. If two subsequences are similar, it should be possible to make a saving, judged in terms of MDL, by storing the centroid and the difference vectors of the two series (in fact, one series in every cluster need not be stored, as it can be recovered from the centroid and the other difference vectors). If the subsequences in the cluster are all similar to the centroid (as they should be in a good cluster) then the difference vectors will be close to straight lines with gradient 0. This is very cheap to store in terms of MDL, and hence represents a substantial saving over storing the subsequences separately.
The method proposed in [142] is for unsupervised clustering of time-series subsequences from a streaming time series. It is not designed for clustering shapelets, which have an associated class, and must visually resemble one another to be considered a good match. We make a number of modifications to Rakthanmanon et al.'s method to optimise the approach for clustering shapelets (see Sections 5.6.3 to 5.6.5).
The first stage in using MDL for clustering shapelets is to discretise the shapelets to six-bit precision. Six bits is an arbitrary level of precision; experiments performed by Hu et al. [92] on the UCR datasets showed that transforming to six-bit precision from double precision made no substantial difference to classification accuracy. Once the shapelets are clustered, we use the original shapelets, not the discretised shapelets, further minimising any impact the discretisation might have.
We perform the discretisation using the procedure outlined in [92]. The minimum and maximum values are found by examining every shapelet. The following formula is used to discretise the series:
\[ \mathrm{Discretisation}_b(T) = \mathrm{round}\!\left( \frac{T - \min}{\max - \min} \cdot (2^b - 1) \right) - 2^{b-1}. \tag{5.6.2} \]
Here, max and min are found over the entire set of shapelets, rather than separately for each individual shapelet; b is the number of bits to which we discretise the series, in this case six.
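The discretisation step can be sketched in a few lines of Python. This is a minimal illustration of Equation 5.6.2, not the implementation used in [92] or [142]; the function name `discretise` and the list-of-lists input format are assumptions made for the example. Note that Python's built-in `round` rounds exact halves to the nearest even integer, which may differ from the rounding rule used in the original work.

```python
def discretise(shapelets, bits=6):
    """Map every shapelet to signed `bits`-bit integers (Equation 5.6.2).

    `shapelets` is assumed to be a list of lists of floats (hypothetical format).
    The minimum and maximum are taken over the whole set, not per shapelet.
    """
    lo = min(min(s) for s in shapelets)
    hi = max(max(s) for s in shapelets)
    if hi == lo:
        raise ValueError("all values are identical, so the range is zero")
    scale = 2 ** bits - 1        # 63 for six bits
    shift = 2 ** (bits - 1)      # 32 for six bits
    # Built-in round() rounds exact halves to even; this is an assumption
    # about the rounding rule, which the formula leaves unspecified.
    return [[round((x - lo) / (hi - lo) * scale) - shift for x in s]
            for s in shapelets]


if __name__ == "__main__":
    example = [[0.0, 1.2, 0.4], [0.5, 2.0]]
    print(discretise(example))
```

With six bits, the global minimum maps to -32 and the global maximum to 31, so every discretised value lies in that range.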
Once the shapelets are discretised, we can define the description length of a shapelet. The entropy of a time series is a lower bound on the average code length from any encoding of that series [142]. Hence, we can use the entropy of the shapelet as its description length.
To calculate the entropy of a discretised shapelet, S, we create the set of unique values that occur in that shapelet, V_S = {v : v is the value of a point in S}. Using V_S and S, we can define a probability, P(v), for each value v in V_S. P(v) is the probability that a point, s, in the shapelet S takes the value v; it is calculated as follows:
\[ P(v) = \frac{|\{s : s \in S \wedge s = v\}|}{|S|}. \tag{5.6.3} \]
For each value v in the set V_S, we calculate P(v). This allows us to calculate the entropy, H(S), of the discretised shapelet S:
\[ H(S) = -\sum_{v \in V_S} P(v) \log_2 P(v). \tag{5.6.4} \]
We define the description length of a shapelet S of length m as follows [142]:
\[ DL(S) = m \cdot H(S). \tag{5.6.5} \]
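The entropy and description length of a discretised shapelet can be computed directly from the value counts. The sketch below assumes the shapelet is a list of integers and uses base-2 logarithms so that the result is in bits; the function names `entropy` and `description_length` are illustrative, not taken from [142].

```python
import math
from collections import Counter


def entropy(shapelet):
    """Shannon entropy in bits of the values in a discretised shapelet."""
    n = len(shapelet)
    counts = Counter(shapelet)  # occurrences of each unique value v in V_S
    # P(v) = count(v) / |S|; entropy is the negated sum of P(v) * log2 P(v)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def description_length(shapelet):
    """DL(S) = m * H(S), where m is the length of the shapelet."""
    return len(shapelet) * entropy(shapelet)


if __name__ == "__main__":
    print(description_length([1, 1, 2, 2]))  # two values, each with P = 0.5
    print(description_length([0, 0, 0, 0]))  # a constant series costs no bits
```

A constant series has zero entropy, which is why difference vectors that are close to flat lines are cheap to store.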
In order to cluster shapelets using MDL, we must define the description length of a cluster. To do this, we calculate the description length of the centroid of the cluster using the formula above. For each member of the cluster, we create a difference vector, which is equal to the point-by-point difference between the member and the centroid. The total description length of the cluster is equal to the description length of the centroid plus the sum of the description lengths of the difference vectors of the members, minus the largest description length among those difference vectors (as the corresponding member can be recovered from the centroid and the other members). For a cluster, C, with a centroid, C_cent, we denote the difference vector between a member of the cluster, c, and the centroid as C_cent−c. The description length is calculated as follows:
\[ DL(C) = DL(C_{\mathrm{cent}}) + \left( \sum_{c \in C} DL(C_{\mathrm{cent}-c}) \right) - \max_{c \in C} DL(C_{\mathrm{cent}-c}). \tag{5.6.6} \]
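Equation 5.6.6 can be illustrated with a short Python sketch. It makes two simplifying assumptions that are not part of the text: all members have the same length and are already aligned, and the centroid is the point-by-point mean rounded to the nearest integer. The function names are hypothetical.

```python
import math
from collections import Counter


def description_length(series):
    """DL = m * H, with H the entropy in bits of the discretised values."""
    n = len(series)
    counts = Counter(series)
    return -n * sum((c / n) * math.log2(c / n) for c in counts.values())


def cluster_description_length(members):
    """DL(C) for equal-length, pre-aligned members (an assumption)."""
    # Assumed centroid: rounded point-by-point mean of the members.
    centroid = [round(sum(col) / len(members)) for col in zip(*members)]
    # One difference vector per member: centroid minus member.
    diffs = [[c - x for c, x in zip(centroid, m)] for m in members]
    diff_dls = [description_length(d) for d in diffs]
    # The most expensive difference vector is not stored, because that
    # member can be recovered from the centroid and the other vectors.
    return description_length(centroid) + sum(diff_dls) - max(diff_dls)


if __name__ == "__main__":
    cluster = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 5]]
    print(cluster_description_length(cluster))
```

When the members are identical, every difference vector is constant at zero and the cluster costs no more than its centroid.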
The shapelets we cluster may be of different lengths, which means that the centroid cannot be a simple point-by-point average. In [142], the authors allow any possible offset between the centroid and the members of the cluster. Figure 5.13 shows an example of this, using different offsets of shapelets from the GunPoint dataset. The approach in [142] makes sense for unsupervised clustering of subsequences from a streaming time series; it is possible for the algorithm to identify two separate parts of a longer series. For shapelets, however, the approach makes less sense. In Figure 5.13, only the centroid at offset = 0 closely resembles the members of the cluster. Given that matching shapelets are, by definition, visually similar, we should not allow such centroids to be created. Hence, we restrict the length of the centroid to the length of the longest member, and select the offset by sliding the shorter shapelet along the longer shapelet.
When two shapelets are clustered (or a shapelet is added to an existing cluster, or two clusters are merged), all allowable offsets of the shapelets (or centroids for existing clusters) are tested, and the offset giving the smallest total description length is selected. For two shapelets, S_i and S_j, of lengths L_i and L_j respectively, where L_i ≥ L_j, the possible offsets range from 0 to L_i − L_j, where 0 indicates that the first indexes of the shapelets are aligned, and a positive integer indicates the index of S_i aligned with the first point in S_j.
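The offset search for two shapelets can be sketched as follows. The text does not specify how the centroid is formed where only the longer shapelet has values, so this example assumes the centroid copies the longer shapelet outside the overlap and takes the rounded mean inside it, and that the difference vector of the shorter shapelet covers only the overlap. The name `best_offset` and these choices are assumptions for illustration.

```python
import math
from collections import Counter


def description_length(series):
    """DL = m * H, with H the entropy in bits of the discretised values."""
    n = len(series)
    counts = Counter(series)
    return -n * sum((c / n) * math.log2(c / n) for c in counts.values())


def best_offset(s_i, s_j):
    """Slide s_j along s_i and return (offset, total DL) of the cheapest fit.

    s_i must be at least as long as s_j; offsets run from 0 to L_i - L_j.
    """
    if len(s_i) < len(s_j):
        raise ValueError("s_i must be at least as long as s_j")
    best = None
    for offset in range(len(s_i) - len(s_j) + 1):
        # Assumed centroid: s_i outside the overlap, rounded mean inside it.
        centroid = list(s_i)
        for k, x in enumerate(s_j):
            centroid[offset + k] = round((s_i[offset + k] + x) / 2)
        diff_i = [c - x for c, x in zip(centroid, s_i)]
        diff_j = [centroid[offset + k] - x for k, x in enumerate(s_j)]
        dls = [description_length(diff_i), description_length(diff_j)]
        total = description_length(centroid) + sum(dls) - max(dls)
        if best is None or total < best[1]:
            best = (offset, total)
    return best


if __name__ == "__main__":
    longer = [0, 0, 5, 9, 5, 0, 0]
    shorter = [5, 9, 5]
    print(best_offset(longer, shorter))
```

In the usage example the shorter shapelet matches the longer one exactly at index 2, so that offset leaves both difference vectors flat and gives the smallest total description length.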
[Figure 5.13 panels: Offset = -26, Offset = -13, Offset = 0, Offset = 13, Offset = 27]
Figure 5.13: The effect of different offset values on the centroid formed by clustering two shapelets from the GunPoint dataset. The shapelets (blue and yellow) and the centroid (green) are offset on the z-axis for ease of presentation. Because the shapelets represent two instances that belong to the same cluster, the centroid is very similar to the members of the cluster at offset=0. As the offset moves away from zero in either direction, the centroid becomes less like the members of the cluster. This is accompanied by an increase in the description length of the cluster, allowing the algorithm to select the appropriate offset.
hence very similar to the centroid. If all members of a cluster are very similar to the centroid of that cluster, the difference vectors will approximate straight lines of 0 gradient. In this situation, the total description length of the cluster will be small, as the difference vectors will have very small description lengths.
A bit save is achieved when storing shapelets in a cluster results in a smaller total description length than storing the shapelets individually.
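As a small worked example of a bit save, consider two identical discretised shapelets, S_1 = S_2 = (1, 2, 3, 4). The values and the choice of a point-by-point mean as the centroid are assumptions made purely for illustration; the arithmetic follows Equations 5.6.5 and 5.6.6.

```latex
\begin{align*}
H(S_1) &= -4 \cdot \tfrac{1}{4} \log_2 \tfrac{1}{4} = 2 \text{ bits},
  & DL(S_1) &= DL(S_2) = 4 \cdot 2 = 8 \text{ bits}, \\
C_{\mathrm{cent}} &= (1, 2, 3, 4),
  & DL(C_{\mathrm{cent}}) &= 8 \text{ bits}, \\
C_{\mathrm{cent}-S_1} &= C_{\mathrm{cent}-S_2} = (0, 0, 0, 0),
  & DL(C_{\mathrm{cent}-S_1}) &= DL(C_{\mathrm{cent}-S_2}) = 0 \text{ bits}, \\
DL(C) &= 8 + (0 + 0) - 0 = 8 \text{ bits},
  & \text{bit save} &= (8 + 8) - 8 = 8 \text{ bits}.
\end{align*}
```

Storing the two shapelets separately costs 16 bits, while storing them as a cluster costs 8 bits, so clustering them yields a bit save of 8 bits.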
We adopt the basic MDL framework, and use it to create two novel shapelet clustering methods.