represents the average number of duplicates that the mapper phases of A generate per data point $p_i \in D$.
Using the supporting area partitioning strategy, each data point has to be transmitted at least once from mappers to reducers, as a core point of one cell, and possibly many more times, as a support point of other cells. Therefore, the efficiency of a partitioning method can be modeled using the notion of "Duplication Rate", which refers to the average number of duplicates mappers need to create for each input data point. The larger the duplication rate, the more data points must be transmitted from mappers to reducers, and thus the higher the communication and computation costs. The duplication rate is defined next.
Obviously, if we assign the whole data set to each partition as its supporting area, each reducer can independently discover the kNN of its core points. However, the duplication rate would then be extremely high, so this is not a practical solution. Next we present our pivot-based partitioning method, which yields a lower duplication rate than this naive approach by including only those support points that have the potential to be in the kNN of the core points in a given cell. Applying this partitioning method in our DLOF approach, our pivot-based partitioning LOF approach (PDLOF) provides the first full-fledged distributed LOF solution.
19.1 Pivot-Based Partitioning
The central idea of pivot-based partitioning is to divide the dataset by choosing a small set of $n$ initial points, or pivots, from the domain space in a pre-processing step. Each input point in the dataset is then assigned to its closest pivot. Grouping data according to their proximity to these $n$ pivots results in a division of the data space into $n$ disjoint cells. This division corresponds to a Voronoi diagram, which is depicted in Figure 19.1 for $n = 5$ pivots. A formal definition of a Voronoi cell is as follows:
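Definition 19.2 (Voronoi Cell) Given a dataset $D$ and a set of pivots $P = \{p_1, p_2, \ldots, p_n\}$, we have $n$ corresponding Voronoi cells $V_1, \ldots, V_n$, where $V_1 \cup V_2 \cup \ldots \cup V_n = D$, $V_i \cap V_j = \emptyset$ if $i \neq j$, and

$$V_i = \{\, q \in D \mid distance(q, p_i) \leq distance(q, p_j) \;\; \forall j \neq i \,\}$$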
Figure 19.1: DLOF: Voronoi Diagram-Based Partitioning
By partitioning data into Voronoi cells, nearby data points are grouped together. Therefore, the locality of the data points is preserved. However, the nearest neighbors of some points, such as those at the edge of a Voronoi cell, may still fall in other partitions. A supporting area is required to determine the kNN of such points, as shown in Fig. 19.1.
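The assignment of points to their closest pivots can be illustrated with a minimal single-machine sketch. It is not the distributed implementation described in this chapter; the function name `assign_to_pivots`, the Euclidean metric, and the tie-breaking rule (lowest pivot index wins) are assumptions made for illustration.

```python
import math

def assign_to_pivots(points, pivots):
    """Group each point with its closest pivot.

    Hypothetical helper: Euclidean distance is assumed, and ties are
    broken in favor of the lowest pivot index so the cells stay disjoint.
    """
    cells = [[] for _ in pivots]
    for q in points:
        nearest = min(range(len(pivots)), key=lambda i: math.dist(q, pivots[i]))
        cells[nearest].append(q)
    return cells

# Toy data: two pivots on the x-axis and four points.
pivots = [(0.0, 0.0), (10.0, 0.0)]
points = [(1.0, 1.0), (9.0, 1.0), (4.0, 0.0), (6.0, 0.0)]
cells = assign_to_pivots(points, pivots)
```

In this toy example the points (4, 0) and (6, 0) sit near the cell boundary, which is exactly the situation where neighbors may lie in the other cell and a supporting area is needed.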
A major benefit of using a pivot-based strategy is that, in the process of partitioning the data, we can learn information about each cell, namely the distance from each point to the pivots. This information can be utilized to derive bounds on the possible distance from any point in a partition to its neighbors, and therefore a bound on the k-distance of all points in the cell. This bound can then be utilized to determine which points must be included in the supporting area of the cell.
To establish this bound, we first introduce the upper bound on the distance from one point in a Voronoi cell $V_j$ to any point in a Voronoi cell $V_i$.
Definition 19.3 Given a Voronoi cell $V_i$ with pivot $p_i$, the upper bound on the distance from one point $s \in V_j$, $i \neq j$, to any point in $V_i$, denoted as $ub(s, V_i)$, is:

$$ub(s, V_i) = maxdist(V_i) + distance(p_i, p_j) + distance(p_j, s)$$

where $maxdist(V_i)$ is the greatest distance from the pivot $p_i$ to any point within its cell $V_i$.
Figure 19.2: DLOF: Upper Bound On K-distance For Points in Partition $V_i$
The geometric meaning of this bound is illustrated in Figure 19.2. Intuitively, given one point $t$ in a Voronoi cell $V_i$, in the worst case its distance to one point $s$ in another Voronoi cell $V_j$, denoted $ub(t, s)$, is the distance between the pivot $p_i$ of $V_i$ and the pivot $p_j$ of $V_j$, plus the distance between $t$ and its pivot $p_i$ and the distance between $s$ and its pivot $p_j$. This worst case happens only when (1) $t$, $p_i$, $p_j$, and $s$ can be connected by one straight line, and (2) $s$ and $t$ are located on opposite sides of their corresponding pivots. The upper bound on the distance from $s \in V_j$ to any point in $V_i$, $ub(s, V_i)$, is then $ub(t_{max}, s)$, where $t_{max}$ is the point in Voronoi cell $V_i$ furthest from pivot $p_i$. The formal proof can be found in [68].
Calculating this upper bound is straightforward if, for each Voronoi cell, we track and maintain the point that is furthest from its pivot during pivot partitioning.
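The bound of Definition 19.3 can be sketched in a few lines. This is an illustrative single-machine version under the assumption of Euclidean distance; the helper names `maxdist` and `ub` mirror the notation of the definition, but their signatures are hypothetical.

```python
import math

def maxdist(cell_points, pivot):
    """Greatest distance from a pivot to any point in its own cell."""
    return max((math.dist(pivot, t) for t in cell_points), default=0.0)

def ub(s, p_j, p_i, maxdist_i):
    """Upper bound on the distance from s (in the cell of p_j) to any point
    in the cell of p_i: maxdist(V_i) + distance(p_i, p_j) + distance(p_j, s)."""
    return maxdist_i + math.dist(p_i, p_j) + math.dist(p_j, s)

# Toy worst case: all points are collinear, with s and the furthest point
# of V_i on opposite sides of their respective pivots.
p_i, p_j = (0.0, 0.0), (10.0, 0.0)
V_i = [(-2.0, 0.0), (1.0, 0.0)]
s = (13.0, 0.0)

bound = ub(s, p_j, p_i, maxdist(V_i, p_i))
actual_max = max(math.dist(s, t) for t in V_i)
```

In this collinear arrangement the bound coincides with the largest actual distance from $s$ to a point of $V_i$. In other arrangements it is generally looser, but by the triangle inequality it is never smaller than the actual distance.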
Utilizing this bound, we can derive an upper bound $\phi$ of the k-distance for all points in a Voronoi cell $V_i$.
First, we find a set of $k$ points $S_j$ in each Voronoi cell $V_j$ with the smallest upper bound distances to all points in Voronoi cell $V_i$. By Definition 19.3, these $k$ points in fact correspond to the kNN of pivot $p_j$, because $maxdist(V_i)$ and $distance(p_i, p_j)$ are fixed for a given pair of cells and only $distance(p_j, s)$ varies with $s$. Similar to the furthest point from $p_j$, these $k$ points can be discovered and maintained in the partitioning process.

Second, after we acquire the $k \cdot (n-1)$ points $S$ from the $n-1$ Voronoi cells (excluding $V_i$ itself), we find the $k$ points $\{s_1, \ldots, s_k\}$ from $S$ with the smallest $ub(s_i, V_i)$ as the kNN of all points in $V_i$, denoted as $KNN(V_i)$. Then the upper bound k-distance can be determined by Lemma 19.1.
Lemma 19.1 (Upper Bound k-distance) For each Voronoi cell $V_i$, the upper bound k-distance $\phi_i$ for all points $t \in V_i$ is given as:

$$\phi_i = \max_{\forall s \in KNN(V_i)} \lVert ub(s, V_i) \rVert$$
Intuitively, for any point $t$ in $V_i$ we can find at least $k$ points around $t$ within distance $\phi_i$, since this range includes at least the $k$ points in $KNN(V_i)$. Naturally, the actual kNN of $t$ cannot lie outside this range. Thus the distance by which each cell must be extended to include support points can be safely bounded by $\phi_i$. $\phi_i$ can be calculated almost for free if the furthest point and the kNN of each pivot $p_i$ are maintained in the partitioning process, as discussed above.
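The two steps and Lemma 19.1 can be combined into a short sketch. It assumes Euclidean distance, cells stored as a list aligned with the pivot list, and at least $k$ candidate points in the other cells; the function name `phi` and its signature are hypothetical and not taken from the original implementation.

```python
import heapq
import math

def phi(i, pivots, cells, k):
    """Upper bound on the k-distance of every point in cell i (Lemma 19.1).

    Assumes cells[j] holds the points assigned to pivots[j].
    """
    p_i = pivots[i]
    maxdist_i = max((math.dist(p_i, t) for t in cells[i]), default=0.0)

    candidate_bounds = []
    for j, p_j in enumerate(pivots):
        if j == i:
            continue
        # Step 1: S_j is the kNN of pivot p_j within its own cell.
        s_j = heapq.nsmallest(k, cells[j], key=lambda s: math.dist(p_j, s))
        # ub(s, V_i) from Definition 19.3 for every s in S_j.
        candidate_bounds.extend(
            maxdist_i + math.dist(p_i, p_j) + math.dist(p_j, s) for s in s_j
        )

    # Step 2: keep the k candidates with the smallest ub(s, V_i).
    best = heapq.nsmallest(k, candidate_bounds)
    if len(best) < k:
        raise ValueError("fewer than k candidate points outside cell i")
    # Lemma 19.1: phi_i is the largest bound among those k points.
    return max(best)

pivots = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
cells = [
    [(1.0, 0.0), (0.0, 2.0)],    # V_0
    [(10.0, 1.0), (12.0, 0.0)],  # V_1
    [(0.0, 11.0), (0.0, 13.0)],  # V_2
]
phi_0 = phi(0, pivots, cells, k=2)
```

For cell 0 the candidate bounds are 13 and 14 from cell 1 and 13 and 15 from cell 2, so the two smallest are both 13 and the sketch yields a bound of 13. Only per-cell summaries (the furthest point and the kNN of each pivot) are needed, which is why the bound is cheap to obtain during partitioning.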
Note that in step (1) the $k$ points $S_j$ found in a single Voronoi cell $V_j$ are in fact already sufficient to bound the kNN of all points of Voronoi cell $V_i$, since the largest $ub(s \in S_j, V_i)$ also covers at least $k$ points around any point $t$ in $V_i$. However in step (2) we further