
Chapter 3: Socio-spatial perceptions

3.1 Perceptions in places

2.6 Data Mining on Time Series

Having defined a similarity measure on complex objects like time series, the two most basic queries underlying any more sophisticated data mining task are the ε-range query and the k-nearest neighbor query. Together with efficient index structures, dimensionality reduction techniques, and suitable filter-refinement strategies, these query types lay the foundation for data mining techniques on large real-world datasets.

In the following sections we formally define these queries and give a brief overview of the many data mining approaches.

2.6.1 Query Types on Temporal Data

The first query type is the epsilon-range query.

Definition 2.6 (Epsilon-Range Queries).

Let $D$ be the domain of complex objects. Let $d: D \times D \to \mathbb{R}_0^+$ be a distance function. The ε-range query consists of a query object $q \in D$ and a distance parameter $\varepsilon \in \mathbb{R}_0^+$. The ε-range query retrieves the set $Q^{range}_{\varepsilon}(q) \subseteq D$ such that

$$\forall x \in Q^{range}_{\varepsilon}(q): d(q, x) \le \varepsilon$$
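As an illustration of Definition 2.6, the following minimal sketch answers an ε-range query by a linear scan. It reads the definition as "all stored objects within distance ε of q". The names `euclidean`, `range_query`, and `db`, as well as the choice of the Euclidean distance for d, are assumptions made for this example and do not come from the text.

```python
from typing import Callable, List, Sequence

Series = Sequence[float]


def euclidean(a: Series, b: Series) -> float:
    """Euclidean distance between two equal-length series (illustrative choice of d)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def range_query(db: List[Series], q: Series, eps: float,
                dist: Callable[[Series, Series], float] = euclidean) -> List[Series]:
    """Return every object x in db with dist(q, x) <= eps, using a linear scan."""
    return [x for x in db if dist(q, x) <= eps]


# Hypothetical toy database of three short series.
db = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]
result = range_query(db, (0.0, 0.0), 1.0)
```

A linear scan touches every object. The index structures and filter-refinement strategies mentioned earlier exist to avoid exactly that cost on large datasets.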

The ε-range query can be used to answer a kNN query, which is defined as follows.

Definition 2.7 (k-Nearest-Neighbor Queries).

Let $D$ be the domain of complex objects. Let $d: D \times D \to \mathbb{R}_0^+$ be a distance function. The k-nearest neighbor query (kNN query) consists of a query object $q \in D$ and a parameter $k \in \mathbb{N}^+$. The kNN query yields the smallest set $Q^{NN}_k(q) \subseteq D$ that contains at least $k$ elements such that

$$\forall x \in Q^{NN}_k(q),\ \forall y \in D \setminus Q^{NN}_k(q): d(q, x) < d(q, y)$$

Pseudocode for both query types can be found in the GEMINI paper [FRM94]. As mentioned above, an optimal kNN algorithm was introduced in [SHP98].
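The kNN definition can be sketched with a simple sort-based scan. This is not the pseudocode of [FRM94] nor the optimal algorithm of [SHP98]; it is a naive illustration. Because the definition requires a strict inequality towards all excluded objects, objects tied with the k-th neighbor must be included, so the result may contain more than k elements. The names `knn_query` and `abs_dist`, the toy data, and the handling of a database with fewer than k objects are assumptions for this example.

```python
from typing import Callable, List


def abs_dist(a: float, b: float) -> float:
    """Absolute difference, an illustrative distance on 1-D objects."""
    return abs(a - b)


def knn_query(db: List[float], q: float, k: int,
              dist: Callable[[float, float], float] = abs_dist) -> List[float]:
    """Smallest set with at least k elements that is strictly closer than the rest."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    ranked = sorted(db, key=lambda x: dist(q, x))
    if len(ranked) <= k:
        return ranked  # assumption: return everything if fewer than k objects exist
    kth_distance = dist(q, ranked[k - 1])
    # Keep ties with the k-th neighbor so that the strict inequality holds.
    return [x for x in ranked if dist(q, x) <= kth_distance]


db = [1.0, 2.0, 4.0, 6.0]
one_nn = knn_query(db, 3.0, 1)    # 2.0 and 4.0 are both at distance 1
three_nn = knn_query(db, 3.0, 3)
```

In this sketch the result coincides with an ε-range query whose ε is the distance of the k-th neighbor, which is one way to read the remark that a range query can answer a kNN query.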

elements of the same cluster are more similar to each other than to elements of other clusters. So, the similarity is high within a cluster, and low across clusters. Following [Ber02], three of the main categories for clustering methods are hierarchical clustering, partitioning clustering, and density-based clustering. A further overview is given in [HK01].

Hierarchical clustering computes a cluster hierarchy, often represented as a dendrogram, in which clusters may contain further child clusters. A hierarchical clustering structure is usually obtained either by iteratively splitting the dataset, or by merging smaller clusters into a parent cluster. The first approach is called divisive and starts with only one cluster containing the complete data. This cluster is recursively split until some stop criterion is fulfilled. Agglomerative approaches start with clusters consisting of only one object. In each iteration, clusters are merged, until all the data is contained in the root cluster. Prominent examples of hierarchical clustering methods include Single Link [Sib73], CURE [GRS98], and BIRCH [ZRL96].
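The agglomerative scheme can be sketched as follows. This is a naive cubic-time illustration of single-link merging, not the optimized algorithm cited above. The function name `single_link`, the stop parameter `num_clusters`, and the toy data are assumptions for this example.

```python
from typing import Callable, List, Tuple


def single_link(points: List[float], num_clusters: int,
                dist: Callable[[float, float], float] = lambda a, b: abs(a - b)
                ) -> Tuple[List[List[float]], List[float]]:
    """Merge the two closest clusters until num_clusters remain.

    Returns the clusters and the list of merge distances (the dendrogram heights).
    """
    clusters = [[p] for p in points]  # start with one object per cluster
    heights = []
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: distance of the closest pair across two clusters.
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        heights.append(d)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, heights


clusters, heights = single_link([1.0, 2.0, 6.0, 7.0, 20.0], num_clusters=2)
```

Running the loop down to `num_clusters=1` yields the root cluster, and the recorded heights describe the full dendrogram.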

Partitioning clustering splits the available data into disjoint clusters. One of the first clustering algorithms of this category was the k-means algorithm [Mac67]. Further examples are the k-medoid-based approaches PAM and CLARA [KR90], and CLARANS [NH94].
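A minimal sketch of a k-means style iteration is given below, assuming squared Euclidean distance and initial centroids supplied by the caller. Practical implementations choose the initial centroids differently, and the names `kmeans` and `sq_dist` are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Point], centroids: List[Point],
           max_iter: int = 100) -> Tuple[List[Point], List[List[Point]]]:
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    clusters: List[List[Point]] = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members.
        updated = [
            tuple(sum(coord) / len(members) for coord in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 0.0)])
```

Every point ends up in exactly one cluster, which is the disjointness property that characterizes partitioning methods.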

Density-based clustering groups objects into clusters according to a density criterion. While approaches like k-means are often restricted to the creation of convex clusters, density-based approaches can usually detect clusters of arbitrary shape. DBSCAN [EKSX96] and its hierarchical variant OPTICS [ABKS99], as well as DENCLUE [HK98], are prominent examples in this category.
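The density criterion can be illustrated with a compact DBSCAN-style sketch. It is built on ε-range queries: a point whose ε-neighborhood contains at least `min_pts` objects (counting itself) starts or extends a cluster. The function name, the label convention (-1 for noise), and the toy data are assumptions for this example, and the neighborhood search is a plain linear scan.

```python
from typing import Callable, List, Optional

NOISE = -1


def dbscan(points: List[float], eps: float, min_pts: int,
           dist: Callable[[float, float], float] = lambda a, b: abs(a - b)
           ) -> List[int]:
    """Return one cluster label per point, with NOISE for points in sparse regions."""
    labels: List[Optional[int]] = [None] * len(points)

    def neighbors(i: int) -> List[int]:
        # epsilon-range query around points[i], including the point itself
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point, keep expanding
                queue.extend(reachable)
        cluster += 1
    return labels


labels = dbscan([1.0, 1.5, 2.0, 8.0, 8.4, 8.8, 20.0], eps=0.6, min_pts=2)
```

Clusters grow by chaining neighborhoods, so their shape is not tied to a centroid, and the isolated point at 20.0 is reported as noise rather than forced into a cluster.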

Classification

Classification is a supervised data mining task, i.e. a set of labeled training data is available based on which a model can be learned. This model (the classifier) can afterwards be used to predict the class of a newly discovered object from the same domain as the training data. Important categories of existing classification techniques include decision trees, statistical methods, instance-based learners, and Support Vector Machines [Kot07].

Decision trees consist of nodes representing features of the instances to be classified. Each value the feature can assume is represented by a branch leading to the next node, or to a leaf in case the object has been successfully classified. The most well-known decision tree algorithm is the C4.5 algorithm [Qui93].
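To make the node, branch, and leaf structure concrete, the sketch below classifies an instance by walking a small hand-built tree. It does not learn the tree as C4.5 does. The tree, its feature names, and its class labels are made up for this example.

```python
# Inner nodes are dicts naming a feature and one branch per feature value.
# Leaves are plain class labels. All names and values are hypothetical.
tree = {
    "feature": "outlook",
    "branches": {
        "sunny": {
            "feature": "windy",
            "branches": {True: "stay in", False: "play"},
        },
        "rainy": "stay in",
    },
}


def classify(node, instance: dict) -> str:
    """Follow the branch matching the instance's feature value until a leaf is reached."""
    while isinstance(node, dict):
        value = instance[node["feature"]]
        node = node["branches"][value]
    return node


label = classify(tree, {"outlook": "sunny", "windy": False})
```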

Statistical methods assign each class a probability of being the correct one, rather than a single class label. Naive Bayesian networks are relatively simple classifiers with independence assumptions concerning the values of the different features of an object. However, they were shown to be quite competitive in [DP97]. The more general Bayesian networks, or belief networks, are able to model probability relationships between a set of features. However, they are quite difficult to compute [Kot07].
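The independence assumption of the naive Bayesian classifier can be sketched for categorical features as follows. The posterior of a class is taken to be proportional to its prior times the product of per-feature likelihoods. The add-one (Laplace) smoothing, the function names, and the toy data are assumptions of this example and are not taken from the text.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def train_nb(rows: List[Tuple[str, ...]], labels: List[str]):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    values = defaultdict(set)              # feature index -> observed values
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            feature_counts[(y, i)][v] += 1
            values[i].add(v)
    return class_counts, feature_counts, values


def predict_proba(model, row: Tuple[str, ...]) -> Dict[str, float]:
    """Return a probability per class instead of a single label."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for y, n in class_counts.items():
        p = n / total  # class prior
        for i, v in enumerate(row):
            # Features are treated as independent given the class (add-one smoothing).
            p *= (feature_counts[(y, i)][v] + 1) / (n + len(values[i]))
        scores[y] = p
    norm = sum(scores.values())
    return {y: s / norm for y, s in scores.items()}


rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
proba = predict_proba(train_nb(rows, labels), ("sunny", "hot"))
```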

Instance-based learners are also called lazy learners, as they do not derive an explicit model like a decision tree or a Bayesian network for a given training set. They rather use the training set each time a classification task is to be performed. The most well-known instance-based learner is the k-nearest neighbor classifier [CH67] with its many variants.
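A minimal sketch of a k-nearest neighbor classifier shows the lazy behavior: nothing is learned up front, and the training set is scanned at query time. The majority vote, the Euclidean distance, and all names and data below are assumptions for this example.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance (illustrative choice)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def knn_classify(train: List[Tuple[Sequence[float], str]],
                 q: Sequence[float], k: int) -> str:
    """Predict the majority label among the k training objects closest to q."""
    nearest = sorted(train, key=lambda item: euclidean(q, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((1.1, 1.3), "a"),
    ((5.0, 5.0), "b"), ((5.5, 4.5), "b"),
]
prediction = knn_classify(train, (1.0, 1.2), k=3)
```

Each prediction costs a full kNN query over the training set, which is why efficient kNN processing matters for this classifier.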

Support Vector Machines are one of the newest classification approaches and were introduced in [Vap95]. SVMs try to separate two classes with a hyperplane that maximizes the so-called margin, i.e. the distance to both

equals the dot product in H, it is not necessary to explicitly map all training instances to H. This is known as the kernel trick [SBS99].
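The kernel trick can be checked numerically on a small case. The sketch below assumes a homogeneous polynomial kernel of degree 2 on two-dimensional inputs, for which an explicit feature map into H (here three-dimensional) is known. The kernel choice and the names `poly_kernel` and `phi` are illustrative and are not taken from the text.

```python
import math
from typing import Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Standard dot product."""
    return sum(x * y for x, y in zip(a, b))


def poly_kernel(x: Sequence[float], y: Sequence[float]) -> float:
    """Kernel evaluated on the original inputs: K(x, y) = (x . y)^2."""
    return dot(x, y) ** 2


def phi(x: Sequence[float]) -> Tuple[float, float, float]:
    """Explicit map into the feature space H for 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


x, y = (1.0, 2.0), (3.0, 4.0)
implicit = poly_kernel(x, y)     # computed without ever visiting H
explicit = dot(phi(x), phi(y))   # computed after mapping both inputs to H
```

Both values agree up to floating-point rounding, so a method that only needs dot products in H can work with the kernel on the original inputs.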
