
As stated in Chapter 1, the union of two sets A and B, A ∪ B, is the set {x | x ∈ A ∨ x ∈ B} (Definition 1.2.2). Approximating the size of the union of two sets can be achieved in the sketching and streaming models by using results shown in [FM85] and developed in [AMS96]. It can also be achieved by using techniques based on the vector norm approximations that we have already discussed.

In the unaggregated streaming model, we might also want to count the number of distinct elements in the stream. This turns out to be almost identical to computing the size of the union of two sets in the same model.

Probabilistic Counting

This method to compute the number of distinct items seen in a stream of elements was first elaborated in [FM85]. Let n be the size of the universe from which the elements are being drawn. Without loss of generality, the authors assume that these are represented as integers in the range 1 to n. The aim is to compute how many distinct values are seen, that is, how many integers are represented in the stream. This is precisely the quantity $F_0$ of the stream as described in Section 2.2.2.

Using space $O(\frac{1}{\epsilon^2} \log n \log \frac{1}{\delta})$, it is possible to compute an approximation of the number of distinct values in an unordered, unaggregated stream.

Proof. The procedure at the core of the estimation is as follows: for each element x which arrives in the stream for set A, compute a randomly chosen hash function of x, hash(x), mapping onto [1 . . . n]. In their analysis, Alon et al. [AMS96] pick Linear Congruential Generators for this hash function, that is, functions of the form hash(x) = ((ax + b) mod p) mod n, with the parameters a and b picked uniformly at random from the range [1 . . . p], where p is a prime chosen in the range n ≤ p < 2n. From this, compute the function zeros(hash(x)), which is the number of consecutive zero bits counting from the rightmost (least significant) bit. This has the property that, over all x, $\Pr[\mathrm{zeros}(\mathrm{hash}(x)) = i] = 2^{-(i+1)}$. We keep a bit vector bits, and every time an element x is seen, we set bits[zeros(hash(x))] ← 1. Following the processing of the sequence of values, we find minzero as the smallest index whose entry in the vector bits is zero. If this procedure is run on a stream of values, then $O(2^{\mathrm{minzero}})$ is a good approximation for the number of distinct values in the stream, using only O(log n) space to store bits, plus O(log n) working space.

This procedure can then be made into an (ε, δ)-approximation by appropriate repetition and averaging for $m = O(\frac{1}{\epsilon^2} \log \frac{1}{\delta})$ different hash functions (see [FM85, AMS96, BYJK+02] for details), yielding a scheme with space requirements $O(\frac{1}{\epsilon^2} \log n \log \frac{1}{\delta})$. Our output is two raised to the power of the average value of minzero, scaled by an appropriate scaling constant. This scaling factor is given in [FM85] as 1.2928. The algorithm implementing this is shown in Algorithm 2.3.1.

An important property of this scheme is that if the same element occurs multiple times in the stream, it does not affect the result. This is because if x is seen twice, hash(x) remains the same, and so seeing x again will not change minzero. So this procedure can be used to approximately count the number of distinct elements in an unordered, unaggregated stream. This is especially important in database applications, where it is useful to maintain approximate information about database relations to allow query planning and optimisation, and even approximate query answering.

We can cast this as a sketch algorithm for set union size, by using the bit vector as the sketch for the set that has been processed. It is straightforward to combine two such sketches to find the size of the union of two streams: for each pair of bit vectors bitsA and bitsB for streams A and B, we find minzero for (bitsA ∨ bitsB). This extends to multiple sets in the obvious way.
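The procedure described in the proof can be sketched in Python as follows. This is an illustrative sketch only, not Algorithm 2.3.1 itself: the universe size (n = 2^16), the prime p = 65537, the number of hash functions, the seed, and all function names are assumptions made for the example. The multiplier a is drawn from [1 . . . p − 1] rather than [1 . . . p], since a = p would make the hash constant.

```python
import random

LOG_N = 16
N = 1 << LOG_N   # universe size n (assumed to be a power of two here)
P = 65537        # a prime with n <= p < 2n
SCALE = 1.2928   # scaling constant quoted from [FM85]

def zeros(v):
    """Count consecutive zero bits starting from the least significant bit."""
    if v == 0:
        return LOG_N  # cap the count when every bit is zero
    count = 0
    while v & 1 == 0:
        v >>= 1
        count += 1
    return count

def make_hashes(m, seed):
    """Draw m pairs (a, b) defining hash(x) = ((a*x + b) mod p) mod n."""
    rng = random.Random(seed)
    return [(rng.randint(1, P - 1), rng.randint(1, P)) for _ in range(m)]

def update(sketch, hashes, x):
    """Set bits[zeros(hash(x))] to 1 in each of the m bit vectors."""
    for j, (a, b) in enumerate(hashes):
        h = ((a * x + b) % P) % N
        sketch[j] |= 1 << zeros(h)

def build(stream, hashes):
    """Process a stream; each bit vector is stored as a Python int."""
    sketch = [0] * len(hashes)
    for x in stream:
        update(sketch, hashes, x)
    return sketch

def minzero(bits):
    """Smallest index whose bit is still zero."""
    i = 0
    while bits & (1 << i):
        i += 1
    return i

def estimate(sketch):
    """Scaled power of two of the average minzero over all hash functions."""
    mean = sum(minzero(bits) for bits in sketch) / len(sketch)
    return SCALE * 2 ** mean

def merge(sk_a, sk_b):
    """Sketch of the union: bitwise OR of corresponding bit vectors."""
    return [x | y for x, y in zip(sk_a, sk_b)]

hashes = make_hashes(m=32, seed=1)
sk_a = build([1, 2, 3, 2, 1, 5, 8], hashes)
sk_b = build([8, 9, 10, 10], hashes)
sk_union = merge(sk_a, sk_b)
print(estimate(sk_a), estimate(sk_union))
```

Both streams must be processed with the same hash functions for the merge to be meaningful. Repeated elements leave the bit vectors unchanged, so the merged sketch is identical to the one obtained by processing the union directly.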

Hence, this approach can be used to compute the size of a set in the streaming model, with repeated information; and the size of the union of two (or more) sets can be approximated in the streaming and sketch models. Recent work has taken essentially the same approach and, using the same idea at the core (a hash function where the probability of returning i is related to $2^{-i}$), keeps a small sample of distinct elements. This is reported in [GT01], with experimental work in [Gib01].

Union Computation via Hamming Norms

An alternative approach to finding the union of two streams is given in [CDIM02]. We first observe that, for a single stream, the number of distinct items is given by the Hamming norm of a vector, defined as $\|a\|_H = \sum_{i=1}^{n} (a_i \neq 0)$, that is, the number of non-zero entries of a, in Definition 1.2.7. By using a sufficiently small value of p, we can use the $L_p$ norm in order to approximate the Hamming norm in the unaggregated streaming model.

Theorem 2.3.3 The Hamming norm can be approximated by finding the $L_p$ norm of a vector a for sufficiently small p > 0, provided we have a limit on the size of each entry in the vector.

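Proof. We provide an alternative mathematical definition for the Hamming norm. We want to find $|\{i \mid a_i \neq 0\}|$. Observe that $a_i^0 = 1$ if $a_i \neq 0$; we can define $a_i^0 = 0$ for $a_i = 0$. Thus, the Hamming norm of a vector a is given by $\|a\|_H = \sum_i a_i^0$. This is similar to the definition of the $L_p$ norm of a vector, which is defined as $(\sum_i |a_i|^p)^{1/p}$. Define the $L_0$ norm of a as $\sum_i a_i^0$. We show that the $L_0$ norm (the Hamming norm) of a can be approximated by $(L_p)^p$ for sufficiently small p.

We consider $\sum_i |a_i|^p = (L_p)^p$ for a small value of p (p > 0). If, for all i, we have that $|a_i| \leq U$ for some upper bound U, then

$$\sum_i a_i^0 \;\leq\; \sum_i |a_i|^p \;\leq\; \sum_i U^p a_i^0 \;=\; U^p \sum_i a_i^0 \;\leq\; (1 + \epsilon) \sum_i a_i^0$$

if we set $p \leq \log(1 + \epsilon) / \log U \approx \epsilon / \log U$. The first inequality holds when every non-zero entry satisfies $|a_i| \geq 1$, as is the case for integer-valued vectors.

Hence, if we can arrange to be able to find the quantity $\sum_i |a_i|^p$, then we can approximate the Hamming norm up to a $(1 + \epsilon)$ factor.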
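The bound in this proof can be checked numerically. The short Python sketch below computes the sum of p-th powers exactly rather than through a sketch, so it only illustrates the inequality, not the streaming algorithm. The example vector, the values of ε and U, and the function names are assumptions chosen for illustration. Entries are integers, so every non-zero entry has absolute value at least 1.

```python
import math

def hamming_norm(a):
    """Number of non-zero entries of a."""
    return sum(1 for v in a if v != 0)

def lp_power(a, p):
    """Sum of |a_i|^p over the non-zero entries, i.e. (L_p)^p for p > 0."""
    return sum(abs(v) ** p for v in a if v != 0)

def choose_p(eps, upper):
    """Largest p allowed by the proof: log(1 + eps) / log(U), for U > 1."""
    return math.log(1 + eps) / math.log(upper)

a = [0, 3, -7, 0, 1, 100, 0, -42]   # integer entries, negatives allowed
eps, U = 0.1, 100                    # |a_i| <= U for every i
p = choose_p(eps, U)

exact = hamming_norm(a)
approx = lp_power(a, p)
print(p, exact, approx)
```

With these values the quantity `approx` lies between the exact Hamming norm and (1 + ε) times it, as the chain of inequalities requires.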

Corollary 2.3.1 We can make sublinear sized sketches in the unaggregated streaming model to allow the approximation of the vector Hamming distance.

This corollary follows based on a few observations. Firstly, these sketches can be computed in the unaggregated streaming model since we can use sketches for the $L_p$ norm which are computable in that model. Secondly, since these sketches are generated by a linear function (the dot product), it is easy to combine two sketches of vectors a, b to get a sketch of a − b, whose Hamming norm is the vector Hamming distance of the original vectors. The space required is the same as in Theorem 2.2.3, that is, $O(\frac{1}{\epsilon^2} \log \frac{1}{\delta})$. We can also find the union of two streams representing a and b by noting that the quantity we seek is $\|a + b\|_H$. This can be achieved by sketching a and b separately, generating sk(a, r) and sk(b, r), and then creating sk(a + b, r) = sk(a, r) + sk(b, r). This follows, since these sketches are composable, by Observation 2.2.1.
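The composability used here follows from linearity alone, which the following Python sketch illustrates. It is a simplified stand-in under stated assumptions: the random vectors are Gaussian, whereas an actual $L_p$ sketch for small p would draw from a p-stable distribution, and no norm estimation is attempted. The function `sk`, its parameter `k`, and the example vectors are hypothetical.

```python
import math
import random

def sk(a, r, k=8):
    """Hypothetical linear sketch: k dot products of a with pseudo-random
    vectors generated from the seed r (Gaussian entries, for illustration)."""
    rng = random.Random(r)
    return [sum(rng.gauss(0.0, 1.0) * v for v in a) for _ in range(k)]

def hamming_norm(a):
    """Number of non-zero entries of a."""
    return sum(1 for v in a if v != 0)

a = [1, 0, 2, 0, 0]
b = [1, 3, 0, 0, 0]
r = 42  # both sketches must share the same random seed

a_plus_b = [x + y for x, y in zip(a, b)]
a_minus_b = [x - y for x, y in zip(a, b)]

# Combining sketches entry by entry gives the sketch of the combined vector.
sum_of_sketches = [x + y for x, y in zip(sk(a, r), sk(b, r))]
diff_of_sketches = [x - y for x, y in zip(sk(a, r), sk(b, r))]

print(hamming_norm(a_plus_b))   # size of the union of the two supports
print(hamming_norm(a_minus_b))  # Hamming distance between a and b
```

Up to floating-point rounding, `sum_of_sketches` agrees with `sk(a_plus_b, r)` and `diff_of_sketches` with `sk(a_minus_b, r)`, which is why the two streams can be sketched separately and combined afterwards.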

Note that this approach to Hamming norms and related quantities is rather more general than probabilistic counting, since it still functions even if entries in a and b are negative. In probabilistic counting it is undefined and unclear what to do when entries are negative. Here, however, the treatment is straightforward and well defined: since we are using $L_p$ norms, if we find the norm of a vector with negative entries using a sketch, each negative entry simply counts one towards the total of non-zero entries. This behaviour is useful in many situations where these sketches are computed in a distributed fashion, or when negative values are a reasonable part of the input.
