The size of modern databases and the complexity of the similarity searching task make efficiency an important issue for any similarity search application. In this section, we will present two techniques to speed up the query processing in similarity search applications. The two techniques, the use of index structures, and the use of a multi-step query processing architecture, are not meant to be mutually exclusive. Instead, they can both be applied in parallel or at different stages of the query processing.
2.3.1 Index Structures
The use of index structures is a standard technique to improve query processing times in database systems. Numerous index structures have been proposed for many different data types and applications. For similarity search in structured data, two types of structures are important: structures for high-dimensional vector spaces and structures for metric spaces. The first category is useful whenever the feature vector approach is used as the similarity model, but we will see in the second part of the thesis that it can also be applied to speed up certain subtasks when using the distance-based similarity model.
Metric index structures, on the other hand, can be applied if the distance-based similarity model is chosen, provided that the similarity measure fulfills the metric properties. Speeding up the query processing is especially important for the distance-based similarity model, where the similarity measure is often complex.
In the following, we will present the principles of important index structures for vector spaces as well as metric spaces.
Indexing Vector Spaces
The two main paradigms for index structures are hashing and tree structures. While there exist hashing approaches for vector spaces [NHS84, KS86], the vast majority of index structures for vector spaces are hierarchical data organizing structures. The idea behind those structures is to organize the vector data in a tree-like directory to ensure logarithmic time complexity of index updates and search accesses. To achieve a tree structure for the index, the data vectors are grouped into pages, each of which is described by a page region covering the entire subspace occupied by the data vectors on the page. The data pages are grouped into directory pages in the same manner until this recursive process yields a single root page. The many index structures following this approach differ in the shape and size of the page regions, the strategy for splitting pages and the insertion strategies. Examples of index structures following this paradigm are, among many others, the members of the R-tree family [Gut84, BKSS90], the X-tree variants [BKK96, Sch99] and the IQ-tree [BBJ+00].
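To make the notion of a page region concrete, the following is a minimal sketch assuming axis-parallel rectangular page regions, as used in the R-tree family, and the Euclidean distance. The function names bounding_box and min_dist are hypothetical and chosen for illustration only. A page can be skipped during a range query whenever the smallest possible distance between the query and its region exceeds the query range.

```python
import math

def bounding_box(points):
    """Return the (lower, upper) corners of the smallest axis-parallel
    rectangle that covers all points of a page."""
    dims = range(len(points[0]))
    lower = tuple(min(p[i] for p in points) for i in dims)
    upper = tuple(max(p[i] for p in points) for i in dims)
    return lower, upper

def min_dist(query, box):
    """Smallest possible Euclidean distance between the query point and
    any point inside the box (0.0 if the query lies inside the box)."""
    lower, upper = box
    total = 0.0
    for q, lo, hi in zip(query, lower, upper):
        if q < lo:
            total += (lo - q) ** 2
        elif q > hi:
            total += (q - hi) ** 2
    return math.sqrt(total)

# Illustrative data page with three 2-dimensional vectors.
page = [(1.0, 1.0), (2.0, 3.0), (4.0, 2.0)]
box = bounding_box(page)

# A range query around (7.0, 7.0) with range 4.0 cannot contain any
# vector of this page, so the page need not be read.
can_skip = min_dist((7.0, 7.0), box) > 4.0
```

Directory pages can be described in the same way by computing the bounding box over the corners of their child regions.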
Indexing Metric Spaces
Index structures for metric spaces are more general than structures for vector spaces in the sense that they can also be applied to vector spaces, since every vector space equipped with a metric distance function is also a metric space. Like structures for vector spaces, index structures for metric spaces group the data objects into data pages. But since only a distance measure between pairs of objects is given, arbitrarily shaped page regions are not possible. Being restricted to a distance measure results in ball-shaped or ring-shaped page regions. To describe a page region, one or more representatives have to be chosen from the data objects together with a radius. The many index structures for metric spaces mainly differ in the way those representatives are chosen. Examples of index structures for metric spaces are GNAT [Bri95] and the family of vantage-point trees [Uhl, Yia93, BÖ97]. Chávez et al. give an overview of existing approaches for indexing metric spaces in [CNBYM01].
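As an illustration of how a ball-shaped page region can be used for pruning, the following is a minimal sketch assuming a single representative per page and a distance function that fulfills the triangle inequality. The names hamming, covering_radius and ball_can_be_pruned are hypothetical, and the Hamming distance on equal-length strings merely serves as a simple example of a metric on non-vector data.

```python
def hamming(a, b):
    """Hamming distance between two strings of equal length (a metric)."""
    return sum(x != y for x, y in zip(a, b))

def covering_radius(dist, representative, members):
    """Radius of the smallest ball around the representative that
    contains all objects of the page."""
    return max(dist(representative, o) for o in members)

def ball_can_be_pruned(dist, query, representative, radius, query_range):
    """By the triangle inequality, every object o in the ball satisfies
    dist(query, o) >= dist(query, representative) - radius.
    If this bound exceeds the query range, the page holds no result."""
    return dist(query, representative) - radius > query_range

# Illustrative page: one representative and the objects assigned to it.
representative = "karolin"
members = ["karolin", "kathrin", "karolyn"]
radius = covering_radius(hamming, representative, members)

# A distant query with a small range allows the page to be skipped,
# while a query close to the representative does not.
skip_far = ball_can_be_pruned(hamming, "zzzzzzz", representative, radius, 2)
skip_near = ball_can_be_pruned(hamming, "karolin", representative, radius, 0)
```

Note that only one distance calculation per page is needed for the pruning decision, which is what makes this kind of region attractive for expensive similarity measures.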
Since regular updates of the database are common even in data mining applications, dynamic index structures for metric spaces are the most important variants for our similarity search applications. The M-tree [CPZ97] and its variant, the Slim-tree [TTSF00], are specifically designed to allow dynamic updates. Furthermore, those structures are also designed to reduce the number of similarity distance calculations, which is especially important for complex similarity measures such as those common for structured data. Therefore, we will compare our techniques for efficient similarity search with the M-tree in the following chapters.
2.3.2 Multi-step Query Processing
The complexity of the similarity distance measure is often a problem for efficient query processing in similarity search applications. Index structures are one way to exclude unnecessary parts of the database from scanning, which reduces the number of necessary similarity distance calculations. Another way to reach this reduction goal is to employ a multi-step query processing architecture.
To reduce the number of necessary distance calculations, the query processing in a multi-step query processing architecture, as depicted in figure 2.6, is performed in two or more steps. The first step is a filter step which returns a number of candidate objects from the database. For those candidate objects, the exact similarity distance is then determined in the refinement step and the objects fulfilling the query predicate are reported. To reduce the overall search time, the filter step has to fulfill certain constraints. First, it is essential that the filter predicate is considerably easier to evaluate than the exact similarity measure. Second, a substantial part of the database objects must be filtered out. Obviously, it depends on the complexity of the similarity measure which filter selectivity is sufficient. Only if both conditions are satisfied is the performance gain through filtering greater than the cost of the extra processing step.

[Figure 2.6: Schema of a multi-step query processing architecture (filter, candidates, refinement, result).]
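The interplay of the two conditions on the filter step can be made concrete with a simple cost comparison. The following is only a sketch under assumptions that are not part of the text: the database of N objects is scanned sequentially, one filter evaluation costs c_f, one exact distance calculation costs c_e, and a fraction σ of the objects passes the filter step.

```latex
% Assumed cost model (illustrative, not taken from the text):
% N      = number of database objects
% c_f    = cost of one filter distance evaluation
% c_e    = cost of one exact distance calculation
% \sigma = fraction of objects remaining as candidates after the filter step
\[
  \underbrace{N \, c_f}_{\text{filter step}}
  + \underbrace{\sigma N \, c_e}_{\text{refinement step}}
  \;<\;
  \underbrace{N \, c_e}_{\text{scan without filter}}
  \quad\Longleftrightarrow\quad
  \frac{c_f}{c_e} \;<\; 1 - \sigma
\]
```

Under this model, a filter that lets 10% of the objects pass pays off as long as one filter evaluation costs less than 90% of an exact distance calculation. The more selective the filter is, the more expensive it may be.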
Additionally, the completeness of the filter step is essential. Completeness in this context means that all database objects satisfying the query condition are included in the candidate set. In other words, it must be guaranteed that no false drops (false dismissals) occur during the filter step. Available similarity search algorithms guarantee completeness if the distance function in the filter step fulfills the lower-bounding property.
Definition 2.5 (lower-bounding property) For any two objects $p$ and $q$, a lower-bounding distance function $d_{lb}(p, q)$ in the filter step has to return a value that is not larger than the exact distance $d_e(p, q)$ of $p$ and $q$, i.e. $\forall p, q : d_{lb}(p, q) \le d_e(p, q)$.
With a lower-bounding distance function it is possible to safely filter out all database objects which have a filter distance larger than the current query range, because the exact similarity distance of those objects is at least as large as their filter distance and therefore also exceeds the query range.
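The filter and refinement steps for a range query can be sketched as follows. This is a minimal illustration assuming a sequential scan in the filter step, not the algorithm of any cited work. The function names are hypothetical, and the absolute difference in the first coordinate is used as an example filter only because it trivially lower-bounds the Euclidean distance.

```python
import math

def exact_dist(p, q):
    """Exact similarity distance: here the Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def filter_dist(p, q):
    """Cheap filter distance: difference in the first coordinate only.
    It never exceeds the Euclidean distance, so it is lower-bounding."""
    return abs(p[0] - q[0])

def multi_step_range_query(database, query, query_range, filter_dist, exact_dist):
    """Return all objects within query_range of the query, together with
    the number of exact distance calculations that were needed."""
    # Filter step: discard objects whose lower bound already exceeds the range.
    candidates = [o for o in database if filter_dist(query, o) <= query_range]
    # Refinement step: evaluate the exact distance for the candidates only.
    results = [o for o in candidates if exact_dist(query, o) <= query_range]
    return results, len(candidates)

database = [(0.0, 0.0), (1.0, 5.0), (3.0, 0.0), (0.5, 0.5)]
results, exact_calls = multi_step_range_query(
    database, (0.0, 0.0), 1.0, filter_dist, exact_dist
)
```

In this example the object (3.0, 0.0) is discarded by the filter step without an exact distance calculation, while (1.0, 5.0) passes the filter and is only rejected in the refinement step. Because the filter distance is lower-bounding, the result is identical to that of a scan using the exact distance alone.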
Using a multi-step query architecture requires efficient algorithms that actually use the filter steps. Agrawal, Faloutsos and Swami proposed such an algorithm for range queries [AFS93]. In [SK98] and [KSF+98], multi-step algorithms for k-nearest-neighbor search were presented which are optimal in the sense that the minimal number of exact distance calculations is performed during query processing. We employ the latter algorithms in order to ensure efficient query processing whenever applying a multi-step query processing architecture.
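The general idea of a multi-step k-nearest-neighbor search that stops as early as the lower-bounding property permits can be sketched as follows. This is a simplified illustration in the spirit of such algorithms, not a reproduction of the cited ones. As an assumption, a fully sorted list stands in for an incremental ranking by filter distance, which in practice would be delivered by an index structure. All function names are hypothetical.

```python
import heapq
import math

def exact_dist(p, q):
    """Exact similarity distance: here the Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def filter_dist(p, q):
    """Lower-bounding filter distance: difference in the first coordinate."""
    return abs(p[0] - q[0])

def multi_step_knn(database, query, k, filter_dist, exact_dist):
    """Return the k nearest neighbors as (distance, object) pairs and the
    number of exact distance calculations that were performed."""
    # Ranking by ascending filter distance; the index i breaks ties so that
    # the objects themselves are never compared.
    ranking = sorted(
        (filter_dist(query, o), i, o) for i, o in enumerate(database)
    )
    best = []  # max-heap of the k best so far, stored as (-exact, i, object)
    exact_calls = 0
    for lower_bound, i, obj in ranking:
        kth = -best[0][0] if len(best) == k else math.inf
        # All remaining objects have a filter distance of at least lower_bound,
        # hence an exact distance above the current k-th distance: stop.
        if lower_bound > kth:
            break
        d = exact_dist(query, obj)
        exact_calls += 1
        if len(best) < k:
            heapq.heappush(best, (-d, i, obj))
        elif d < kth:
            heapq.heapreplace(best, (-d, i, obj))
    neighbors = sorted(((-nd, obj) for nd, _, obj in best), key=lambda t: t[0])
    return neighbors, exact_calls

database = [(0.0, 0.0), (1.0, 5.0), (3.0, 0.0), (0.5, 0.5), (4.0, 4.0)]
neighbors, exact_calls = multi_step_knn(
    database, (0.0, 0.0), 2, filter_dist, exact_dist
)
```

In this example the search terminates after two exact distance calculations, because the third object in the ranking already has a filter distance larger than the distance of the current second nearest neighbor.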