
Most research into shapelets has concentrated on time-series classification with the shapelet tree of [180, 181]. There are other methods that make use of the shapelet intuition; they are addressed in this section.

3.4.1 Logical Shapelets

Mueen et al. [130] propose a method to improve the descriptive power of shapelets by using conjunctions and disjunctions of shapelets. Effectively, the recursive search at each node of the tree is the same as that of [180, 181], but performed multiple times until one class in the data is partitioned wholly into one side of the binary split. Logical Shapelets improves the power of the shapelet tree on three datasets.
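To illustrate how a conjunction or disjunction of shapelets can act as a single split rule, the following is a minimal sketch and not the algorithm of [130]. It assumes plain Euclidean distance without normalisation; the names `min_distance` and `logical_split` and the `(shapelet, threshold)` pair representation are hypothetical.

```python
import math


def min_distance(shapelet, series):
    """Smallest Euclidean distance between the shapelet and any
    equal-length subsequence of the series (assumed distance measure)."""
    m = len(shapelet)
    best = float("inf")
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        d = math.sqrt(sum((x - y) ** 2 for x, y in zip(window, shapelet)))
        best = min(best, d)
    return best


def logical_split(series, terms, mode="and"):
    """Evaluate a combined rule over several (shapelet, threshold) terms.

    mode="and": every shapelet must occur within its threshold.
    mode="or":  at least one shapelet must occur within its threshold.
    """
    tests = [min_distance(s, series) <= thr for s, thr in terms]
    return all(tests) if mode == "and" else any(tests)
```

A series is sent to one side of the binary split when the combined rule holds and to the other side otherwise; how the terms are chosen is the search problem addressed in [130] and is not shown here.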

3.4.2 Early classification

Xing et al. [175] use shapelets for early classification, where predictions are made before the whole series is examined. Problems that benefit from early classification include the early detection of sepsis in infants from ECG data [77] and the classification of the traffic flow of a TCP connection from only its first five packets [17].

Xing et al. use shapelets because early classification must be interpretable if professionals are to trust the system, and because shapelets capture local similarity, which is necessary for early classification. Earliest match length (EML) is a measure of the utility of a shapelet for early classification:

\[ \mathrm{EML}(f, t) = \min_{\mathrm{len}(s) \le i \le \mathrm{len}(t)} \mathrm{dist}(t[i - \mathrm{len}(s) + 1, i], s) \le \delta, \tag{3.4.1} \]

where f = (s, δ, c), s is a shapelet, δ is the distance threshold associated with the shapelet, and c is the class label associated with that shapelet. EML is the smallest index i at which the subsequence of t ending at i is within the threshold δ of s, that is, the earliest point in t at which f matches. This measure could be incorporated easily into our shapelet transform, creating an early classifier.
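The definition in (3.4.1) can be read as a left-to-right scan over the end positions of candidate subsequences. The sketch below assumes unnormalised Euclidean distance and 1-based indices as in the equation; the function names are hypothetical, and returning `None` when nothing matches is a convention chosen here and not taken from [175].

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def earliest_match_length(shapelet, delta, series):
    """Smallest 1-based end index i such that
    dist(series[i - len(s) + 1 .. i], s) <= delta, or None if no match."""
    m = len(shapelet)
    for i in range(m, len(series) + 1):   # i plays the role of i in (3.4.1)
        window = series[i - m:i]          # 0-based slice for t[i-m+1 .. i]
        if euclidean(window, shapelet) <= delta:
            return i
    return None
```

A small EML means the shapelet can trigger a prediction after observing only a short prefix of the series.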

In [71], Ghalwash et al. use multivariate shapelets for early classification of medical time series. They focus on producing results that can be used by medical professionals, who may be uncomfortable with uninterpretable, black-box methods. They present the task of finding multivariate shapelets as one of optimisation, from a starting point of having extracted all shapelets from each dimension.

3.4.3 Shapelets for clustering

Zakaria et al. [183] perform clustering using shapelets, and show that it can be more effective than clustering using the whole series. An unsupervised shapelet (u-shapelet) is a subsequence of a time series for which the distances between the subsequence and one group of time series are much smaller than those to another group of time series. The algorithm searches for a u-shapelet that can separate a subset of the data, which is then removed for the next iteration, until no data remain to be separated. The u-shapelets are found by a greedy search aimed at maximising the gap between two subsets of the data:

\[ \mathrm{gap} = \bar{x}_B - s_B - (\bar{x}_A + s_A), \tag{3.4.2} \]

where $\bar{x}_X$ is the mean distance between the u-shapelet and the members of subset X, and $s_X$ is the standard deviation of the distances between the members of subset X and the u-shapelet. All subsequences of the time series are candidates, and have their distance vectors computed. The vector represents an orderline, which is searched to find the optimum split point that maximises the gap function. $D_A$ is the set of points to the left of the split, $D_B$ the set of points to the right. Pathological splits (e.g. splits with all cases except a single outlier in one group) are prevented by checking that the ratio of $D_A$ to $D_B$ is within the range

\[ \frac{1}{k} < \frac{|D_A|}{|D_B|} < 1 - \frac{1}{k}, \tag{3.4.3} \]

where k is the total number of clusters. The optimum split point for an orderline is the point that minimises the distance to $D_A$ while maximising the distance to $D_B$. The u-shapelet candidate giving the split point with the best gap value is selected as the u-shapelet.
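The orderline search for a single candidate can be sketched as follows. This is an illustration under stated assumptions and not the implementation of [183]: the population standard deviation is used, the threshold is placed midway between the two neighbouring distances, and the names `gap_score` and `best_split` are hypothetical.

```python
import statistics


def gap_score(d_a, d_b):
    """gap = mean(D_B) - std(D_B) - (mean(D_A) + std(D_A)), as in (3.4.2).
    Population standard deviation is an assumption of this sketch."""
    return (statistics.mean(d_b) - statistics.pstdev(d_b)) - (
        statistics.mean(d_a) + statistics.pstdev(d_a))


def best_split(distances, k):
    """Scan every cut of the sorted distance vector (the orderline) and
    return (threshold, gap) for the admissible cut with the largest gap.
    Returns (None, -inf) when no cut satisfies the ratio bound (3.4.3)."""
    line = sorted(distances)
    best_threshold, best_gap = None, float("-inf")
    for cut in range(1, len(line)):
        d_a, d_b = line[:cut], line[cut:]      # left and right of the split
        ratio = len(d_a) / len(d_b)
        if not (1 / k < ratio < 1 - 1 / k):    # reject pathological splits
            continue
        g = gap_score(d_a, d_b)
        if g > best_gap:
            best_threshold = (d_a[-1] + d_b[0]) / 2  # midpoint: an assumption
            best_gap = g
    return best_threshold, best_gap
```

Running `best_split` for every candidate subsequence and keeping the candidate with the largest returned gap corresponds to the selection step described above. Read literally, the bound in (3.4.3) admits no split when k = 2, so the sketch is only meaningful for larger k.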

After a u-shapelet is selected, time series containing similar subsequences to the u-shapelet are removed from the dataset. A time series is considered to contain a similar subsequence if the distance between the subsequence and the u-shapelet is less than the mean distance to $D_A$ plus one standard deviation. Zakaria et al. show that, for clustering of rock spectral signatures, synthetic data, electrical devices, and ECG data, u-shapelets generally find better clusters than if whole series are used.
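The removal rule can be expressed in a few lines. The sketch assumes that each series is summarised by its distance to the selected u-shapelet and that the population standard deviation is intended; `remaining_after_removal` and its arguments are hypothetical names.

```python
import statistics


def remaining_after_removal(series_ids, distances, d_a):
    """Drop every series whose distance to the u-shapelet is below
    mean(D_A) + one standard deviation of D_A; return the ids that stay."""
    cutoff = statistics.mean(d_a) + statistics.pstdev(d_a)
    return [sid for sid, d in zip(series_ids, distances) if d >= cutoff]
```

The ids returned by this function form the dataset searched in the next iteration.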

3.4.4 Alternative quality measures

Lines and Bagnall [116] propose two quality measures as alternatives to using Information Gain for the shapelet-tree classifier. They show that their measures offer equivalent accuracy and increased speed. More measures are tested in [88]; the F-stat measure is found to be faster to calculate and more accurate than the other shapelet quality measures.
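As an indication of why an F-statistic is cheap to compute, the sketch below scores a shapelet from its distances to the training series and their class labels, with no sorting or split search. It uses the standard one-way analysis-of-variance F-statistic, which is an assumption here; the exact form used in [88] may differ, and `f_statistic` is a hypothetical name.

```python
def f_statistic(distances, labels):
    """One-way ANOVA F-statistic of shapelet-to-series distances grouped
    by class label: between-class variance over within-class variance."""
    groups = {}
    for d, y in zip(distances, labels):
        groups.setdefault(y, []).append(d)
    n, c = len(distances), len(groups)
    if c < 2 or n <= c:
        raise ValueError("need at least two classes and more series than classes")
    grand_mean = sum(distances) / n
    means = {y: sum(g) / len(g) for y, g in groups.items()}
    between = sum(len(g) * (means[y] - grand_mean) ** 2
                  for y, g in groups.items()) / (c - 1)
    within = sum((d - means[y]) ** 2
                 for y, g in groups.items() for d in g) / (n - c)
    return between / within if within > 0 else float("inf")
```

A larger value indicates that the distances to the shapelet separate the classes more clearly.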

3.4.5 Shapelet-like approaches

The following papers present work that is intuitively similar to the shapelet approach, though not close enough to be considered shapelet research.

In [113], the authors apply a shapelet-like approach to the problem of detecting friendship relationships in time-series GPS data. For this problem, local similarity is more important than global similarity; proximity on a weekday is less predictive than proximity on a Saturday. The position in the time series is the most important factor. Hence, instead of searching for local shapes that are discriminative, they search for the most discriminative time interval, almost the converse of the shapelet approach.

Di Fatta et al. [49] aim to detect faults in software by mining tree structures that represent successful or failed executions. Their methodology is similar to the shapelet approach, but uses a different representation. Rather than search a set of 1D series for the most discriminative subsequence, they search a set of trees for the most discriminative subtrees. Their method is interpretable: the subtree that best discriminates failed executions is likely to contain subroutines that cause failures.

In [102], Ko et al. use time-series classification for context recognition, where data from multiple sensors are fused into a representation of a situation. Unlike time-series classification with shapelets, they use full series, and employ DTW distance to accommodate noise in indexing. Class prototypes are used in the same way as shapelets; new examples are classified by assigning the class of the closest prototype.

Hartman et al. [82] apply a shapelet-like approach to gesture recognition. There are two main differences between their algorithm and the algorithm in [180]. First, they use DTW, rather than Euclidean distance. Second, they use their own quality measures, rather than Information Gain, to evaluate prototypes (see Section 3.3.2). These differences are minor, and [82] might be considered part of the shapelet literature; we examine another paper from the same research group [81] in Section 3.3.
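Classification by the closest prototype under DTW can be sketched as below. This is a generic illustration, not the method of [102] or [82]: it uses unconstrained DTW with squared pointwise cost, and the names `dtw_distance` and `classify_by_prototype` and the `(label, series)` prototype representation are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1D series, using
    squared pointwise cost and no warping-window constraint."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = step + min(cost[i - 1][j],      # repeat b[j-1]
                                    cost[i][j - 1],      # repeat a[i-1]
                                    cost[i - 1][j - 1])  # advance both
    return cost[n][m] ** 0.5


def classify_by_prototype(series, prototypes):
    """Return the label of the prototype with the smallest DTW distance.
    prototypes is a list of (label, series) pairs."""
    label, _ = min(prototypes, key=lambda p: dtw_distance(series, p[1]))
    return label
```

Because DTW allows points to be repeated during alignment, a prototype can match a series of a different length or speed, which Euclidean distance cannot do directly.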
