Chapter 3: Socio-spatial perceptions
3.1 Perceptions in places
2.6 Data Mining on Time Series
With a similarity measure defined on complex objects like time series, the two most basic queries underlying more sophisticated data mining tasks are the ε-range query and the k-nearest neighbor query. Together with efficient index structures, dimensionality reduction techniques, and suitable filter-refinement strategies, these query types lay the foundation for data mining techniques on large real-world datasets.
In the following sections we formally define these queries and give a brief overview of the many data mining approaches.
2.6.1 Query Types on Temporal Data
The first query type is the epsilon-range query.
Definition 2.6 (Epsilon-Range Queries).
Let $D$ be the domain of complex objects. Let $d: D \times D \to \mathbb{R}_0^+$ be a distance function. The ε-range query consists of a query object $q \in D$ and a distance parameter $\varepsilon \in \mathbb{R}_0^+$. The ε-range query retrieves the set $Q^{\mathrm{range}}_{\varepsilon}(q) \subseteq D$ such that
$$\forall x \in Q^{\mathrm{range}}_{\varepsilon}(q) : d(q, x) \le \varepsilon$$
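As an illustration of Definition 2.6, the following is a minimal sketch of an ε-range query answered by a linear scan. It assumes a finite in-memory database instead of the full domain, and the Euclidean distance as the distance function; the names `euclidean`, `range_query`, and `series` are hypothetical and not taken from the cited literature.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences (assumed choice of d)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def range_query(database, q, eps, dist=euclidean):
    """Return every object x of the database with dist(q, x) <= eps (linear scan)."""
    return [x for x in database if dist(q, x) <= eps]


# Toy database of three short "time series" of length 2 (illustrative data only).
series = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]
result = range_query(series, (0.0, 0.0), 1.0)
```

A linear scan computes the distance to every object, which is exactly the cost that index structures and filter-refinement strategies aim to avoid on large datasets.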
The ε-range query can be used to answer a kNN query, which is defined as follows.
Definition 2.7 (k-Nearest-Neighbor Queries).
Let $D$ be the domain of complex objects. Let $d: D \times D \to \mathbb{R}_0^+$ be a distance function. The k-nearest neighbor query (kNN query) consists of a query object $q \in D$ and a parameter $k \in \mathbb{N}^+$. The kNN query yields the smallest set $Q^{\mathrm{NN}}_{k}(q) \subseteq D$ that contains at least $k$ elements such that
$$\forall x \in Q^{\mathrm{NN}}_{k}(q),\ \forall y \in D \setminus Q^{\mathrm{NN}}_{k}(q) : d(q, x) < d(q, y)$$
Pseudocode for both query types can be found in the GEMINI paper [FRM94]. As mentioned above, an optimal kNN algorithm was introduced in [SHP98].
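To make the tie handling in Definition 2.7 concrete, here is a minimal sketch of a kNN query as a plain linear scan. It is neither the GEMINI pseudocode of [FRM94] nor the optimal algorithm of [SHP98]; the function names and the Euclidean distance are assumptions made for illustration. Because every object in the result must be strictly closer than every object outside it, all objects tied at the k-th distance are returned, so the result may contain more than k elements.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences (assumed choice of d)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_query(database, q, k, dist=euclidean):
    """Smallest set with at least k objects in which every member is strictly
    closer to q than every non-member; ties at the k-th distance are included."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    ranked = sorted(database, key=lambda x: dist(q, x))
    if len(ranked) <= k:
        return ranked
    kth_distance = dist(q, ranked[k - 1])
    return [x for x in ranked if dist(q, x) <= kth_distance]
```

For example, with two objects at the same distance from the query, a 1-NN query under this definition returns both of them.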
elements of the same cluster are more similar to each other than to elements of other clusters. So, the similarity is high within a cluster, and low across clusters. Following [Ber02], three of the main categories for clustering methods are hierarchical clustering, partitioning clustering, and density-based clustering. A further overview is given in [HK01].
Hierarchical clustering computes a cluster hierarchy, often represented as a dendrogram, in which clusters may contain further child clusters. A hierarchical clustering structure is usually obtained either by iteratively splitting the dataset or by merging smaller clusters into a parent cluster. The first approach is called divisive and starts with only one cluster containing the complete data. This cluster is recursively split until some stop criterion is fulfilled. Agglomerative approaches start with clusters consisting of only one object. In each iteration, clusters are merged, until all the data is contained in the root cluster. Prominent examples of hierarchical clustering methods include Single Link [Sib73], CURE [GRS98], and BIRCH [ZRL96].
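The agglomerative scheme can be sketched as follows. This is a naive single-link variant with cubic running time, written only to illustrate the merge loop; it is not the algorithm of [Sib73], and the function name `single_link` and the representation of the dendrogram as a list of merges are assumptions.

```python
def single_link(points, dist):
    """Naive agglomerative clustering with the single-link criterion.

    Returns the merge history as (left cluster, right cluster, distance)
    tuples, which is one simple way to encode a dendrogram.
    """
    clusters = [[p] for p in points]  # start with one object per cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: distance of the closest pair across two clusters.
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges
```

The last entry of the returned list corresponds to the root cluster that contains all objects.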
Partitioning clustering splits the available data into disjoint clusters. One of the first clustering algorithms of this category was the k-means algorithm [Mac67]. Further examples are the k-medoid-based approaches PAM and CLARA [KR90], and CLARANS [NH94].
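A minimal sketch of the k-means idea is given below. It uses the common batch iteration (assign every object to its nearest centroid, then recompute the centroids), which may differ in detail from the formulation in [Mac67]. Passing the initial centroids explicitly is an assumption made here to keep the example deterministic; the name `k_means` is hypothetical.

```python
def k_means(points, centroids, max_iter=100):
    """Batch k-means on tuples of numbers, starting from the given centroids."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters
```

Since each point is assigned to its nearest centroid, the resulting clusters are separated by hyperplanes, which is why the clusters found this way are convex.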
Density-based clustering groups objects into clusters according to a density criterion. While approaches like k-means are often restricted to the creation of convex clusters, density-based approaches usually detect clusters of any shape. DBSCAN [EKSX96] and its hierarchical variant OPTICS [ABKS99] as well as DENCLUE [HK98] are prominent examples in this category.
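The density criterion can be illustrated with a simplified DBSCAN-style sketch. It assumes the usual two parameters, a radius `eps` and a minimum neighborhood size `min_pts` (counting the object itself), and uses linear scans for the neighborhood queries; it is an illustration of the idea rather than the exact procedure of [EKSX96].

```python
def dbscan(points, eps, min_pts, dist):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    NOISE = -1
    labels = [None] * len(points)  # None marks an unvisited point

    def neighbors(i):
        # The eps-neighborhood is an eps-range query around points[i].
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels
```

Because clusters grow along chains of dense neighborhoods, their shape is not restricted to convex regions.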
Classification
Classification is a supervised data mining task, i.e. a set of labeled training data is available based on which a model can be learned. This model (the classifier) can afterwards be used to predict the class of a newly discovered object from the same domain as the training data. Important categories of existing classification techniques include decision trees, statistical methods, instance-based learners, and Support Vector Machines [Kot07].
Decision trees consist of nodes representing features of the instances to be classified. Each value the feature can assume is represented by a branch leading to the next node or to a leaf in case the object has successfully been classified. The most well-known decision tree algorithm is the C4.5 algorithm [Qui93].
Statistical methods assign probabilities of class membership to an object rather than a single class label. Naive Bayesian networks are relatively simple classifiers with independence assumptions concerning the values of the different features of an object. However, they were shown to be quite competitive in [DP97]. The more general Bayesian networks or belief networks are able to model probability relationships between a set of features. However, they are quite difficult to compute [Kot07].
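The independence assumption of the naive Bayesian approach can be sketched for categorical features as follows. The add-one smoothing, the function names, and the toy weather data in the usage note are assumptions made for illustration and are not taken from [DP97] or [Kot07].

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    values_per_feature = defaultdict(set)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            values_per_feature[i].add(value)
    return class_counts, value_counts, values_per_feature


def predict_proba(model, row):
    """Return a probability for every class instead of a single label."""
    class_counts, value_counts, values_per_feature = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # class prior
        for i, value in enumerate(row):
            # Independence assumption: multiply per-feature likelihoods,
            # with add-one smoothing to avoid zero probabilities.
            seen = value_counts[(i, c)][value]
            score *= (seen + 1) / (n_c + len(values_per_feature[i]))
        scores[c] = score
    norm = sum(scores.values())
    return {c: s / norm for c, s in scores.items()}
```

Trained on a handful of labeled rows such as `("sunny", "hot")` with labels `"no"` and `"yes"`, `predict_proba` returns one probability per class, and the values sum to 1.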
Instance-based learners are also called lazy learners, as they do not derive an explicit model like a decision tree or a Bayesian network for a given training set. They rather use the training set each time a classification task is to be performed. The most well-known instance-based learner is the k-nearest neighbor classifier [CH67] with its many variants.
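A minimal sketch of a k-nearest neighbor classifier shows the lazy behavior: nothing is learned in advance, and the complete training set is scanned for each classification. The majority vote, the Euclidean distance, and the name `knn_classify` are assumptions chosen for illustration; the many variants mentioned above differ in exactly these choices.

```python
from collections import Counter
from math import sqrt


def knn_classify(training, query, k):
    """Predict the label of `query` by majority vote among its k nearest
    training instances; `training` is a list of (vector, label) pairs."""

    def distance(pair):
        vector, _ = pair
        return sqrt(sum((a - b) ** 2 for a, b in zip(vector, query)))

    ranked = sorted(training, key=distance)  # the whole training set is used
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

The cost of each prediction is dominated by the nearest neighbor search, so efficient kNN query processing directly speeds up this classifier.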
Support Vector Machines are one of the newest classification approaches and were introduced in [Vap95]. SVMs try to separate two classes with a hyperplane that maximizes the so-called margin, i.e. the distance to both
equals the dot product in H, it is not necessary to explicitly map all training instances to H. This is known as the kernel trick [SBS99].
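The kernel trick can be checked numerically on a small case. The sketch below assumes the homogeneous quadratic kernel on two-dimensional inputs, for which an explicit feature map into a three-dimensional space H is known; the kernel choice and the names `feature_map`, `dot`, and `quadratic_kernel` are illustrative assumptions and are not taken from [SBS99].

```python
from math import isclose, sqrt


def dot(a, b):
    """Standard dot product of two equal-length sequences."""
    return sum(u * v for u, v in zip(a, b))


def feature_map(x):
    """Explicit map of a 2-D input into H = R^3 for the quadratic kernel."""
    x1, x2 = x
    return (x1 * x1, sqrt(2) * x1 * x2, x2 * x2)


def quadratic_kernel(x, y):
    """Kernel evaluated in the input space only: k(x, y) = (x . y)^2."""
    return dot(x, y) ** 2


x, y = (1.0, 2.0), (3.0, -1.0)
explicit = dot(feature_map(x), feature_map(y))  # dot product computed in H
implicit = quadratic_kernel(x, y)               # same value without mapping to H
agree = isclose(explicit, implicit)
```

Both computations yield the same value up to floating-point rounding, but the kernel evaluation never constructs the mapped vectors, which matters when H is high-dimensional.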