
represents the average number of duplicates that the mapper phases of A generate per data point $p_i \in D$.

Using the supporting area partitioning strategy, each data point has to be transmitted at least once from mappers to reducers, as a core point of one cell, and possibly many more times, as a support point of other cells. Therefore, the efficiency of a partitioning method can be modeled using the notion of "Duplication Rate", which refers to the average number of duplicates that mappers need to create for each input data point. The larger the duplication rate, the more data points must be transmitted from mappers to reducers, and thus the higher the communication and computation costs. The duplication rate is defined next.

Obviously, if we assign the whole dataset to each partition as its supporting area, each reducer is able to independently discover the kNN for its core points. However, the duplication rate will be extremely high, so this is not a practical solution. Next we present our pivot-based partitioning method, which yields a lower duplication rate than this naive approach by only including support points that have the potential to be in the kNN of the core points in a given cell. By applying this partitioning method in our DLOF approach, our pivot-based partitioning based LOF approach (PDLOF) provides the first full-fledged distributed LOF solution.
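To make the cost of the naive strategy concrete, the following minimal sketch computes a duplication rate from the number of records each partition receives. It assumes that the single copy of a point sent as a core point counts toward the rate; the function name `duplication_rate` and the sample sizes are hypothetical and chosen only for illustration.

```python
def duplication_rate(num_points, records_per_partition):
    """Average number of copies the mappers emit per input point.

    Assumption: each record a partition receives (core or support)
    counts as one emitted copy.
    """
    return sum(records_per_partition) / num_points


num_points, num_partitions = 1000, 8

# Naive strategy: every partition receives the whole dataset.
naive = duplication_rate(num_points, [num_points] * num_partitions)

# Hypothetical supporting-area strategy: 125 core points plus
# 50 support points per partition.
with_support = duplication_rate(num_points, [125 + 50] * num_partitions)

print(naive, with_support)
```

Under these assumed numbers the naive rate equals the number of partitions, while the supporting-area rate stays close to one.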

19.1 Pivot-Based Partitioning

The central idea of pivot-based partitioning is to divide the dataset by choosing a small set of $n$ initial points, or pivots, from the domain space in a pre-processing step. Each input point in the dataset is then assigned to its closest pivot. Grouping data according to their proximity to these $n$ pivots results in a division of the data space into $n$ disjoint cells that form a Voronoi diagram, which is depicted in Figure 19.1 for $n = 5$ pivots. A formal definition of a Voronoi cell is as follows:

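Definition 19.2 Voronoi Cell. Given a dataset $D$ and a set of pivots $P = \{p_1, p_2, \ldots, p_n\}$, we have $n$ corresponding Voronoi cells $V_1, \ldots, V_n$ where $V_1 \cup V_2 \cup \ldots \cup V_n = D$. If $i \neq j$, $V_i \cap V_j = \emptyset$ and

\[ V_i = \{\, q \mid distance(q, p_i) \le distance(q, p_j) \,\} \quad \forall q \in D,\ i \neq j \]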

Figure 19.1: DLOF: Voronoi Diagram-Based Partitioning

By partitioning data into Voronoi cells, nearby data points are grouped together. Therefore the locality of the data points is preserved. However, the nearest neighbors of some points, such as the points at the edge of a Voronoi cell, may still fall in other partitions. A supporting area is required to determine the kNN of such points, as shown in Figure 19.1.
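The assignment of points to their closest pivot can be sketched as follows. This is an illustrative single-machine version, not the distributed mapper itself; it assumes Euclidean distance and breaks ties in favor of the pivot with the lowest index, and the name `assign_to_pivots` is hypothetical.

```python
import math


def assign_to_pivots(points, pivots):
    """Group each point with its closest pivot.

    Assumptions: Euclidean distance; ties go to the lowest pivot index,
    so the resulting cells are disjoint.
    """
    cells = [[] for _ in pivots]
    for q in points:
        nearest = min(range(len(pivots)), key=lambda i: math.dist(q, pivots[i]))
        cells[nearest].append(q)
    return cells


pivots = [(0.0, 0.0), (10.0, 0.0)]
points = [(1.0, 1.0), (9.0, 1.0), (4.0, 0.0), (6.0, 0.0)]
cells = assign_to_pivots(points, pivots)
print(cells)
```

The points (4, 0) and (6, 0) end up in different cells although they are close to each other, which is the edge situation that motivates the supporting area.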

A major benefit of using a pivot-based strategy is that, in the process of partitioning the data, we learn information about each cell, namely the distance from each point to the pivots. This information can be utilized to derive bounds on the possible distance from any point in a partition to its neighbors, and therefore a bound on the k-distance of all points in the cell. This bound can then be utilized to determine which points must be included in the supporting area of the cell.


To establish this bound, we first introduce the upper bound on the distance from one point in a Voronoi cell $V_j$ to any point in a Voronoi cell $V_i$.

Definition 19.3 Given a Voronoi cell $V_i$ with pivot $p_i$, the upper bound on the distance from one point $s \in V_j$, $i \neq j$, to any point in $V_i$, denoted as $ub(s, V_i)$, is:

\[ ub(s, V_i) = maxdist(V_i) + distance(p_i, p_j) + distance(p_j, s) \]

where $maxdist(V_i)$ is the greatest distance from the pivot $p_i$ to any point within its cell $V_i$.

Figure 19.2: DLOF: Upper Bound On K-distance For Points in Partition $V_i$

The geometric meaning of this bound is illustrated in Figure 19.2. Intuitively, given one point $t$ in a Voronoi cell $V_i$, in the worst case its distance to one point $s$ in another Voronoi cell $V_j$, $ub(t, s)$, is the distance between the pivot $p_i$ of $V_i$ and the pivot $p_j$ of $V_j$, plus the distance between $t$ and its pivot $p_i$ and the distance between $s$ and its pivot $p_j$. This worst case happens only when (1) $t$, $p_i$, $p_j$, and $s$ lie on one straight line, and (2) $s$ and $t$ are located on opposite sides of their corresponding pivots. Then the upper bound on the distance from $s \in V_j$ to any point in $V_i$, $ub(s, V_i)$, is $ub(t_{max}, s)$, where $t_{max}$ is the point in Voronoi cell $V_i$ furthest from pivot $p_i$. The formal proof can be found in [68].
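As an informal sketch of why the bound holds (this is not the formal proof of [68]), assume that $distance$ is a metric and therefore satisfies the triangle inequality. Then, for any $t \in V_i$ and $s \in V_j$:

```latex
\begin{align*}
distance(t, s) &\le distance(t, p_i) + distance(p_i, p_j) + distance(p_j, s) \\
               &\le maxdist(V_i) + distance(p_i, p_j) + distance(p_j, s) \\
               &= ub(s, V_i)
\end{align*}
```

The first step applies the triangle inequality twice, and the second step uses $distance(t, p_i) \le maxdist(V_i)$, which holds for every $t \in V_i$ by the definition of $maxdist(V_i)$.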

Calculating this upper bound is straightforward if, for each Voronoi cell, we track and maintain the point that is furthest from its pivot while the pivot partitioning is being processed.
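The bound of Definition 19.3 can be sketched in a few lines. The sketch assumes Euclidean distance; the helper names `max_dist` and `upper_bound` and the sample coordinates are hypothetical, and `max_dist` recomputes the furthest point from scratch rather than maintaining it incrementally during partitioning.

```python
import math


def max_dist(pivot, cell):
    """Greatest distance from a pivot to any point in its cell (0.0 if empty)."""
    return max((math.dist(pivot, q) for q in cell), default=0.0)


def upper_bound(s, pivot_j, pivot_i, maxdist_i):
    """ub(s, V_i) = maxdist(V_i) + distance(p_i, p_j) + distance(p_j, s)."""
    return maxdist_i + math.dist(pivot_i, pivot_j) + math.dist(pivot_j, s)


pivot_i, cell_i = (0.0, 0.0), [(1.0, 0.0), (0.0, 2.0)]
pivot_j, s = (10.0, 0.0), (11.0, 0.0)

ub = upper_bound(s, pivot_j, pivot_i, max_dist(pivot_i, cell_i))
print(ub)

# The bound dominates the true distance from s to every point of V_i.
print(all(math.dist(t, s) <= ub for t in cell_i))
```

Only the per-cell value `maxdist_i` and the two pivots are needed to evaluate the bound for a point $s$, so no point of $V_i$ has to be inspected.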


Utilizing this bound, we can derive an upper bound $\phi$ of the k-distance for all points in a Voronoi cell $V_i$.

First, in each Voronoi cell $V_j$ we find a set $S_j$ of $k$ points with the smallest upper bound distances to all points in Voronoi cell $V_i$. By Definition 19.3, these $k$ points in fact correspond to the kNN of pivot $p_j$, because for a fixed pair of cells only the term $distance(p_j, s)$ varies with $s$. Like the point furthest from $p_j$, these $k$ points can be discovered and maintained during the partitioning process.

Second, after we acquire the $k(n-1)$ points $S$ from the $n-1$ Voronoi cells (excluding $V_i$ itself), we find the $k$ points $\{s_1, \ldots, s_k\}$ from $S$ with the smallest $ub(s_i, V_i)$ as the kNN of all points in $V_i$, denoted as $KNN(V_i)$. Then the upper bound k-distance can be determined by Lemma 19.1.

Lemma 19.1 Upper Bound k-distance. For each Voronoi cell $V_i$, the upper bound k-distance $\phi_i$ for all points $t \in V_i$ is given as:

\[ \phi_i = \max_{\forall s \in KNN(V_i)} \lVert ub(s, V_i) \rVert \]

Intuitively, for any point $t$ in $V_i$ we can find at least $k$ points around $t$ within distance $\phi_i$, since this range includes at least the $k$ points in $KNN(V_i)$. Naturally, the actual kNN of $t$ cannot lie outside this range. Thus the distance by which each cell must be extended to include support points can be safely bounded by $\phi_i$. $\phi_i$ can be calculated almost for free if the furthest point and the kNN of each pivot $p_i$ are maintained in the partitioning process, as discussed above.
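The two steps and Lemma 19.1 can be combined into one small sketch. It assumes Euclidean distance and in-memory cells, and it recomputes the furthest point and the kNN of each pivot instead of reusing values maintained during partitioning; the names `knn_of_pivot` and `phi` are hypothetical.

```python
import heapq
import math


def knn_of_pivot(pivot, cell, k):
    """The k points of a cell closest to its own pivot (the set S_j)."""
    return heapq.nsmallest(k, cell, key=lambda q: math.dist(pivot, q))


def phi(i, pivots, cells, k):
    """Upper bound on the k-distance of every point in cell i."""
    maxdist_i = max((math.dist(pivots[i], q) for q in cells[i]), default=0.0)
    bounds = []
    for j, (p_j, cell_j) in enumerate(zip(pivots, cells)):
        if j == i:
            continue  # step 1 only looks at the other n - 1 cells
        for s in knn_of_pivot(p_j, cell_j, k):
            # ub(s, V_i) from Definition 19.3
            bounds.append(maxdist_i + math.dist(pivots[i], p_j) + math.dist(p_j, s))
    smallest = heapq.nsmallest(k, bounds)  # step 2: the k smallest bounds
    if len(smallest) < k:
        raise ValueError("fewer than k candidate points in the other cells")
    return max(smallest)


pivots = [(0.0, 0.0), (10.0, 0.0), (0.0, 20.0)]
cells = [
    [(1.0, 0.0), (0.0, 2.0)],
    [(11.0, 0.0), (10.0, 3.0)],
    [(0.0, 21.0), (1.0, 20.0)],
]
phi_0 = phi(0, pivots, cells, k=2)
print(phi_0)
```

In this example both selected points come from the second cell, because the third pivot is far enough away that its points never enter the $k$ smallest bounds.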

Note that in step (1) the $k$ points $S_j$ found in one Voronoi cell $V_j$ are in fact already sufficient to bound the kNN of all points of Voronoi cell $V_i$, since utilizing the largest $ub(s \in S_j, V_i)$ is also able to cover at least $k$ points around any point $t$ in $V_i$. However in step (2) we further