For the intersectboolean predicate, it suffices to find a single intersecting query and database container pair in order to issue the database id. Obviously, it is desirable to detect such intersecting pairs as early as possible in order to avoid unnecessary
blobintersection tests. In this section, we present two optimizations striking for this
goal. First, we introduce a fast second filter step which tries to determine intersecting pairs without examining the BLOBs. This test is entirely based on aggregated infor- mation of the gray containers. Secondly, we introduce a probability model which leads to an ordering for the candidate pairs such that the most promising blobinter-
section tests are carried out first. In order to put these optimizations into practice, we
pass D(Cgray) and H(Cgray) as additional parameters to the user-defined aggregate function table. Thus, the following two optimizations can easily be integrated into this user-defined aggregate function. If the fast second filter step determines an inter- secting container pair, all other candidate pairs are deleted so that the resulting table of candidate pairs, called ctable, consists only of one intersecting pair. Nevertheless, there might be database objects where this second filter step does not determine an intersection for any of the corresponding candidate pairs. In this case, we sort the candidate pairs at the end of our user-defined aggregation function table such that the most promising blobintersection tests are carried out first.
Optimizations for one-dimensional gray intervals. Let us first mention that a gray interval with maximum density is also called a black interval. In this paragraph, we will discuss how gray and black intervals have to look like so that we can decide whether two interlacing intervals intersect each other or not without accessing their BLOBs. These optimizations are only suitable for the intersectboolean-predicate.
Two Black Intervals. If two black intervals interlace, they necessarily intersect as
well.
Black and Gray Intervals. In this case, the situation is a little bit more complicated.
In almost all cases where a black interval Iblack = (lblack, ublack) interlaces a gray interval Igray, with H(Igray) = (lgray, ugray), there is an intersection of Iblack and Igray as well. If any of the two conditions depicted in Table 3 holds, then Iblack and Igray intersect.
Object Relational Query Processing 141
Two Gray Intervals. Obviously, two gray intervals Igray and I’gray, with H(Igray) = (lgray, ugray) and H(I’gray)=(l’gray, u’gray), which interlace do not have to intersect. Fortunately, there are two cases where we can assert that they intersect without exam- ining the detailed black interval sequences. The two cases are illustrated in Table 4.
No intersection. There are only two situations, depicted in Table 5, where we can
determine that two interlacing intervals do not intersect. In this case, the interlacing
Condition Explanation Figure
L(Iblack) >
G(Igray)
If the black interval is longer than the maxi- mum gap between two black intervals of the detailed black interval sequence of Igray than the two intervals intersect.
uinterlace = ugray or
linterlace = lgray
If one of the two conditions presented in the box on the left holds, then the black and gray interval intersect. This is due to the fact that the gray intervals end and start with black intervals.
Table 3: Intersection between an interlacing black and gray interval.
Condition Explanation Figure
ugray = u’gray or
lgray = l’gray or
lgray = u’gray
If one of the three conditions depicted in the box on the left holds, then the two gray intervals intersect (gray intervals start and end with black intervals). This test is similar to the polygon boundary test in [HJR 97].
Nw(Igray) +
Nw(I’gray) <
Linterlace
If the sum of the number of the “white cells” of two gray interlacing intervals is smaller than the length of the interlacing area, then the two intervals necessarily intersect. This is the generalization of the first case of Table 3. This test is similar to the false area
test in [BKSS 94].
Table 4: Intersection between two interlacing gray intervals.
smaller than G(Igray) G(Igray) larger than linterlace=lgray uinterlace=ugray l’gray = ugray
u’gray = ugray
l’gray = lgray
L (Igray)
L’ (Igray)
Linterlace
+
interval pair is not added to ctable, i.e. we do not carry out the blobintersection test for this interval pair.
False Area Test. In this paragraph we generalize the false area test [BKSS 94] to multi-dimensional intervals, i.e. boxes and tiles. If there exist two containers which interlace, and both containers have a high density, i.e. the number of white cells is rather small, then we know that the two objects intersect without accessing the data stored in the BLOB.
Lemma 3. Let O = (id, 〈 C1 , ..., Cm 〉) and O’ = (id’, 〈 C’1 , ..., C’m’ 〉) be two objects
with their hulls H(Cgray) = and H(C’gray) = . Then the
following statement holds:
Proof. Let be the intersection volume
between two gray containers Ci and C’j. Then and
holds. As ≥ >
V holds, there must be at least one object voxel v belonging to O and O’, i.e. O and O’ intersect.■
Condition Explanation Figure
Nb(Igray) = 2 and
lgray < l’gray and
ugray > u’gray
If Igray consists only of two black cells and
I’gray is totally “included” in Igray, then we know that the two intervals cannot intersect each other, although they interlace.
Nb(Igray) =
Nb(I’gray) = 2 and
If both gray intervals consist only of two black cells and, furthermore, have distinct interval bounds, then the two intervals cer- tainly do not intersect.
Table 5: No intersection between two interlacing gray intervals.
u’gray < ugray
Lgray L’gray lgray < l’gray
Nb(Igray)= 2
lgray≠l’gray≠ ugray≠u’gray
u’gray=ugray
Lgray L’gray l’gray=lgray Nb(Igray)= 2 N’b(Igray)= 2 Xd k=1 hl k huk [ , ] Xd k=1 h’l k h’uk [ , ] i {1..m} j {1..m’}:Nw( )Ci Nw( )C’j (min h( ku,h’uk)–max h( lk,h’lk)+1) k=1 d
∏
< + ∈ ∃ ∈ ∃ intersect⇒ (O O', ) = true V (min h( uk,h’uk)–max h( kl,h’kl)+1) k=1 d ∏ = Nb( )Ci ≥V–Nw( )Ci Nb( )C’j ≥V–Nw( )C’j Nb( )Ci +Nb( )C’j V–Nw( )Ci +V–Nw( )C'jSpatial Join 143
Ranking. As shown above, we can pinpoint, based on relatively little information, whether two container pairs intersect or not. Nevertheless, there might be cases where we cannot do this for any of the database and query candidate pair. But if the hulls of two gray containers intersect, it is still helpful if we can predict how likely an object intersection according to Definition 18 might be in order to rank the pairs properly in the set of all candidate pairs belonging to the same database object.
Lemma 4. Let Cgray and C’gray be two gray intervals with densities d = D(Cgray),
d’ = D(C’gray) and hulls H(Cgray) = and H(C’gray) = . Fur-
thermore, let denote the intersec-
tion volume. Then the probability for an object intersection is:
Proof. As we assume that the black cells of both gray containers are equally dis- tributed, we can conclude that N = d×V black cells of Cgray are included in the testing area and likewise N’ = d’×V black cells from C’gray. The number of all dif- ferent possibilities how N’ black cells of C’gray can be placed over the V cells of the testing area is . Assuming that all N black cells of Cgray are already distributed over V, then the number of different possibilities how N’ black cells of
C’gray can be placed over the remaining V-N white cells such that no intersection
occurs is .
Thus, the probability for a non-intersection is equal to . ■
6.5 Spatial Join
One of the most common query types in spatial database management systems is the spatial intersection join [GG 98]. This join retrieves all pairs of intersecting ob- jects. A usual spatial join example of 2D geographical data is “find all cities which are crossed by a river”. In the automobile industry, spatial join processing of more complex 3D high-resolution objects is required. One query example is to find all temperature sensitive parts touching hot parts of the engine or the exhaust pipe. The high complexity of the objects and the fact that these objects are highly clustered yield a great challenge for efficient join processing.
Xd k=1 hl k huk [ , ] Xd k=1 h'l k h'uk [ , ] V (min h( uk,h'uk)–max h( kl,h'kl)+1) k=1 d
∏
= >0 P 1 V–d×V d’×V V d’×V ⁄ – = V N’ V d’×V = V–N N’ V–d×V d’×V = V–d×V d’×V V d’×V ⁄In this section, we demonstrate the benefits of our cost-based decompositioning approach of high-resolution spatial objects by integrating it into a simple nested-loop join procedure and a more efficient sort-merge join variant. Both methods do not assume the presence of pre-existing spatial indices on the relations.
We start with two relations R and S, both containing a set of tuples (id, mbr, link), where id denotes a unique object identifier, mbr denotes the minimal bounding rect- angle conservatively approximating the respective object and link refers to an exter- nal file containing the complete voxel set of the object (cf. Figure 75a). In this sec- tion, we assume that the voxel representation of the objects is accurate enough to determine intersecting objects without any further refinement step. In order to carry out the intersection tests efficiently, we decompose the high-resolution voxelized objects. We store the generated approximations in an auxiliary temporary relation (cf. Figure 75b), which allows us to reload certain approximations on demand keep- ing the main-memory footprint small.
The remainder of the section is organized as follows: In Section 6.5.1, we shortly present the related work in the area of spatial joins discussing three different classes dependent on the availability of index structures. In Section 6.5.2 and Section 6.5.3, we introduce a cost-based grouping approach into gray intervals, quite similar to the one presented in Section 6.3. The main difference is that we do not assume a potential query distribution, but use the statistics of the join partner relation as input parameter for the grouping algorithm. In Section 6.5.4, we introduce two different join algo- rithms based on our cost-based decompositioning. The experimental evaluation is deferred to Section 6.6.5, where we will present the benefits of our approach.
. . .
Figure 75: Spatial join procedure.
a) Object relation, b) Auxiliary relation with decomposed object approximations
Object Relation a) id A B C mbr link ’file_A’ ’file_B’ ’file_C’ Auxiliary Relation A B b) approx id
Spatial Join 145
6.5.1 Related Work
In this section, we will shortly discuss different aspects of efficient spatial join processing of complex spatial objects. Numerous spatial join algorithms have been proposed over the last decade. Most of them rely on the paradigm of multi-step query processing [BKSS 94]. A fast filter step excludes all objects that cannot satisfy the join predicate. The subsequent refinement step is applied to the join candidate pairs which are returned from the filter step. Thereby, the main focus of research is on the
filter step which is applied to geometric object approximations. On the basis of the
availability of indices for processing the filter step, spatial join methods operating on two relations can be divided into three classes:
• Class 1: Index on both relations • Class 2: Index on one relation • Class 3: No indices
The common solutions for the spatial join methods of Class 1 are the algorithms based on matching two R-trees as presented in [BKS 93b]. In the last few years, the international research community has focused on methods of Class 2 and Class 3. A simple Class 2 approach is the index nested loop, where each tuple of the non- indexed relation is used as query applied to the indexed relation. In [LR 94] seeded trees were introduced in order to process spatial joins efficiently when only one R-tree is available. The authors propose to create a second R-tree using the available tree as a skeleton and apply thereafter a Class 1 algorithm. For spatial join algorithms of Class 3, initially no indices are available which could be used to improve the query performance. Several techniques have been proposed which partition the tuples into buckets and then use hash based techniques, e.g. the spatial-hash join [LR 96] or the
partition based spatial merge join [PD 96]. The scalable sweeping-based spatial join
[APR+ 98] is w.r.t. worst-case efficiency the most promising algorithm for process- ing spatial joins. Let us note that we use a simplified version of this algorithm as starting point for our sort-merge join variant presented in Section 6.5.4. Furthermore, there exist different other approaches to improve Class 3 join algorithms, e.g. in [DS 00] the problem of redundancy and duplicate detection was investigated and in [DSTW 02] a generic technique called progressive merge join (PMJ) was introduced that eliminates the blocking behavior of sorted-based join algorithms.
Most of the approaches presented in the literature focus on the efficient computa- tion of the filter step where MBRs are used for approximating spatial objects. Our approach deals with very complex 3D objects, where the MBRs form rather poor approximations, leading to low filter selectivities. In the following sections, we will present a cost-based decompositioning algorithm for spatial join processing, which aims at finding an optimal trade-off between low redundancy and high approxima- tion quality.
6.5.2 Cost Model
For our decompositioning algorithm we take the estimated join cost between a gray interval Igray and a join-partner relation T into account. Let us note that T can be either of the tables R or S (cf. Figure 75a), or any temporary table containing derived information from the original tables R and S (cf. Figure 75b). The overall join cost
costjoin for a gray interval Igray and a join-partner relation T are composed of two parts, the filter cost costfilter and the refinement cost costrefine:
costjoin(Igray,T) = costfilter(Igray,T) + costrefine(Igray,T).
Filter Cost. The costfilter(Igray,T) can be computed by the expected number of gray intervals Igray,T of the join partner relation T. We penalize each intersection test by a constant cf which reflects the cost related to the access of one gray interval Igray,T and the evaluation of the join predicate for the pair (H(Igray),H(Igray,T)):
costfilter(Igray,T) = Ngray(T) · cf,
where Nvoxel(T) (number of voxels) ≥ Ngray(T) (number of gray intervals) ≥
Nobject(T) (number of objects) holds for the join-partner relation T. The value of the parameter cf depends on the used system.
Refinement Cost. The cost of the refinement step costrefine is determined by the selectivity of the filter step. For each candidate pair resulting from the filter step, we have to retrieve the exact geometry B(Igray) in order to verify the intersection predi- cate. Consequently, our cost-based decompositoning algorithm is based on the fol- lowing two parameters:
• Selectivity σfilterof the filter step.
Spatial Join 147
The refinement cost of a join related to a gray interval Igray can be computed as follows:
costrefine(Igray, T) = Ngray(T) · σfilter(Igray,T) · costeval(Igray).
In Section 6.3.4, it was shown how to calculate the evaluation cost costeval. In the following, we show how to estimate the selectivity of the filter step σfilter. We use simple statistics of the join-partner relation T to estimate the selectivity σfilter(Igray,T). In order to cope with arbitrary interval distributions, histograms can be employed to capture the data characteristics at any desired resolution (cf. Section 4.3.1). The selectivity σfilter(Igray,T) related to a gray interval Igray can be determined by using an appropriate interval histogram H(T,ν) of the join partner relation T. Based on H(T,ν), we compute a selectivity estimate by evaluating the intersection of Igray with each bucket span bi,ν (cf. Definition 7 and Figure 38).
σfilter(Igray, T) = ,
Join cost. To sum up, the join cost costjoin(Igray) related to a gray interval Igray and a join-partner relation T can be expressed as follows:
costjoin(Igray,T) =Ngray(T) · (cf +σfilter(Igray,T) ·costeval(Igray)).
6.5.3 Decompositioning Algorithm
For each object, there exist a lot of different possibilities to decompose the voxel- ized object into a gray interval sequence.
Lemma 5. Let Ovoxel= (id, {v1, ..., vn}) be a voxelized object and be a space filling curve. Furthermore, let W = {(l, u) ∈ IN2, l ≤ u} be the domain of intervals and let b1 = (l1, u1), …, bn = (ln, un) ∈ W be a sequence of intervals where
ui+ 1 < li+1 representing the set {ρ(v1), ...,ρ(vn)}. Then, there exist O(2n) different
gray interval sequences Ogray= (id, 〈 , , …,
〉).
Proof. The interval sequence representing Ovoxel consists of n-1 “gaps”. For each of these gaps we can decide whether it is included in a gray interval, or whether it separates two gray intervals. Thus we have 2n-1 different possible gray interval se- quences. ■ overlap H I( ( gray),bi,ν) β ---⋅ni i=1 ν
∑
ni i=1 ν∑
--- ρ:INd→IN bi 0+1,...,bi1 〈 〉 bi 1+1,...,bi2 〈 〉 bi m–1+1,...,bim 〈 〉Based on the formulas for join cost related to a gray interval Igray and a join-part- ner relation T, we can find a cost optimum decompositioning algorithm. Unfortunate- ly, as shown in the above lemma, there exist exponentially many decompositioning possibilities, which results in a high runtime of an optimum cost-based decomposi- tioning algorithm. In this section, we will present a greedy algorithm with a guaran- teed worst-case runtime complexity of O(n) which produces decompositions helping to accelerate the query process considerably (cf. Section 6.6.5).
For fulfilling the grouping rules presented in Section 6.3, we introduce the follow- ing top-down decompositioning algorithm for gray intervals, called JoinGroupCon
(cf. Figure 76). JoinGroupCon is a recursive algorithm which starts with a gray inter- val Igray initially covering the complete object. In each step of our algorithm, we look
for the longest remaining gap. We carry out the split at this gap, if the estimated join cost caused by the decomposed intervals is smaller than the estimated cost caused by our input interval Igray. The expected join cost costjoin(Igray,T) can be computed as described above. Data compressors which have a high compression rate and a fast decompression method, result in an early stop of the JoinGroupCon algorithm gener- ating a small number of gray intervals. Let us note that the inequality “costgray >
costdec” in Figure 76 is independent of Ngray(T), and thus Ngray(T) is not required