
Numerous studies in the web caching field have concluded that web access follows a Zipf-like distribution [L. 99, L. 98]. That is, the relative probability of a request for a document is inversely proportional to its popularity rank $i$ ($i = 1, \dots, N$). The probability $P_d(i)$ of a request for the $i$-th most popular document is proportional to $1/i^{\alpha}$ ($0 < \alpha \le 1$). In our context, we think it is appropriate to model the XML document access pattern using this Zipf-like distribution. Since $\alpha$ in the distribution model implies document access skewness³, it is closely related to $K_t$.

Independent of the document access pattern, we also observe that the average distance of query pairs within a given document likely indicates the skewness of the query region access distribution. That is, the smaller the average distance, the more intensely query accesses concentrate on hot regions. Hence, the indicator of query region access skewness $K_l$ is

$$K_l = \frac{\sum_{i<j} \left(1/D(Q_i, Q_j)\right)^{\theta}}{m(m-1)/2}.$$

In this formula, the numerator is the sum of the inverses of the distances (of all query pairs within a certain document cluster) adjusted by a parameter $\theta$. The denominator is the number of all possible combinations of query pairs, assuming $m$ is the number of queries accessing the given document. Supposing the region access distribution follows a similar Zipf-like model, the probability $P_r(j)$ of a request for the $j$-th most popular region is proportional to $1/j^{\beta}$ ($0 < \beta \le 1$, with $\beta$ closely related to $K_l$).

³ When $\alpha$ is close to 1, the top 1 popular document gets twice as many query accesses as the second most popular document.
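As a rough illustration of the skewness indicator, the sketch below averages the inverse pairwise distances over all query pairs. It assumes the adjusting parameter enters as an exponent (named `theta` here, a hypothetical name), and `gap_distance` is an invented stand-in for the distance function $D$, which the text does not define at this point.

```python
from itertools import combinations

def region_skewness(queries, distance, theta=1.0):
    """K_l sketch: mean of (1 / D(Qi, Qj)) ** theta over all unordered query pairs.

    `theta` as an exponent is an assumption about how the parameter adjusts the sum.
    """
    pairs = list(combinations(queries, 2))
    if not pairs:
        raise ValueError("need at least two queries")
    total = sum((1.0 / distance(a, b)) ** theta for a, b in pairs)
    return total / len(pairs)  # len(pairs) equals m * (m - 1) / 2

def gap_distance(a, b):
    """Hypothetical distance: queries are 1-D positions; +1 avoids division by zero."""
    return abs(a - b) + 1.0

clustered = [10, 11, 12, 13]   # queries concentrated on one hot region
spread = [10, 40, 70, 100]     # queries scattered over the document

if __name__ == "__main__":
    print(region_skewness(clustered, gap_distance))
    print(region_skewness(spread, gap_distance))
```

With these made-up inputs, the clustered queries yield the larger indicator, matching the observation that a smaller average distance signals stronger concentration on hot regions.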

For Zipf-like distributions, the cumulative probability that one of the top $k$ documents (among the total $N$ documents) is accessed is given asymptotically by $\Psi(k) = \sum_{i=1}^{k} \Omega / i^{\alpha}$, where $\Omega = \left(\sum_{i=1}^{N} 1/i^{\alpha}\right)^{-1} \approx (1-\alpha)/N^{1-\alpha}$. Thus $\Psi(k) \approx (k/N)^{1-\alpha}$ (when $\alpha = 1$, $\Psi(k) \approx \ln k / \ln N$). Because $k/N < 1$, a larger $\alpha$ increases $\Psi(k)$, meaning more queries focus on a few hot documents. The probability $P_d(i)$ of an access to the $i$-th most popular document is $P_d(i) = \Omega / i^{\alpha} \approx \frac{1-\alpha}{N} \left(\frac{N}{i}\right)^{\alpha}$. Similarly, for the probability $P_r(j)$ of a query request for the $j$-th most popular region within a particular document, we have $P_r(j) \approx \frac{1-\beta}{M} \left(\frac{M}{j}\right)^{\beta}$, where $M$ is the number of query regions in a document and $\beta$ is the parameter that suits the region access distribution in that document.
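To get a feel for how close the asymptotic form is, the following sketch compares the exactly normalized cumulative probability of the top $k$ documents with the approximation $(k/N)^{1-\alpha}$. The function names and the sample values of `n`, `k`, and `alpha` are illustrative choices, not taken from the text.

```python
def zipf_probs(n, alpha):
    """Exactly normalized Zipf-like probabilities Omega / i**alpha for i = 1..n."""
    weights = [1.0 / i ** alpha for i in range(1, n + 1)]
    omega = 1.0 / sum(weights)
    return [omega * w for w in weights]

def cumulative_exact(k, n, alpha):
    """Probability that a request hits one of the top-k documents."""
    return sum(zipf_probs(n, alpha)[:k])

def cumulative_approx(k, n, alpha):
    """Asymptotic form (k / N) ** (1 - alpha), intended for 0 < alpha < 1."""
    return (k / n) ** (1.0 - alpha)

if __name__ == "__main__":
    n = 10_000
    for alpha in (0.6, 0.8):
        for k in (100, 1000):
            exact = cumulative_exact(k, n, alpha)
            approx = cumulative_approx(k, n, alpha)
            print(f"alpha={alpha} k={k} exact={exact:.3f} approx={approx:.3f}")
```

For a fixed $k$, both columns grow as `alpha` increases, which is the skew effect described above.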

In a query-based caching environment, we are concerned about the popularity ranking of query regions across documents. First, we look at the overall probability $P(i,j)$ of a query request for the $j$-th most popular region within the $i$-th most popular document. Supposing $P_d(i)$ and $P_r(j)$ are independent of each other, we obtain $P(i,j) = P_d(i)\,P_r(j) \approx \frac{(1-\alpha)(1-\beta)}{NM} \left(\frac{N}{i}\right)^{\alpha} \left(\frac{M}{j}\right)^{\beta}$.

6.3. THE ANALYSIS OF CACHE PERFORMANCE 149

If the situation is simpler and a uniform $\alpha$ suits both the document and all the query region access distributions, $P(i,j) \approx \frac{(1-\alpha)^2}{MN} \left(\frac{MN}{ij}\right)^{\alpha}$, which implies a multivariate Zipf-like distribution. We infer from this equation that if two query regions have the same product $ij$ (i.e., the document popularity rank times the local query region popularity rank), then they have the same overall popularity. For such a multivariate distribution, we have the cumulative probability $\Psi(k) = \sum_{i=1}^{t} \sum_{j=1}^{u} P(i,j)$, where $tu \approx k$.

This model assumes that the query requests are independent and that both the document and query region access patterns follow the Zipf-like distribution with the same parameter. It may not be very realistic, but the model is tractable and is sufficient to help us understand how the hit ratio can be influenced by various factors.
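The simplified model can be sketched in a few lines: under independence and a single shared parameter, the joint probability depends only on the product of the two ranks, so rank pairs with equal products fall into one popularity group. The helper names below are hypothetical, and the normalization is computed exactly rather than through the asymptotic constant.

```python
from collections import defaultdict

def zipf_norm(n, alpha):
    """Exact normalization constant (sum of 1 / i**alpha for i = 1..n) ** -1."""
    return 1.0 / sum(1.0 / i ** alpha for i in range(1, n + 1))

def joint_prob(i, j, n_docs, m_regions, alpha):
    """P(i, j) = P_d(i) * P_r(j), assuming independence and one shared alpha."""
    p_doc = zipf_norm(n_docs, alpha) / i ** alpha
    p_region = zipf_norm(m_regions, alpha) / j ** alpha
    return p_doc * p_region

def popularity_groups(n_docs, m_regions):
    """Group (i, j) rank pairs by their product i * j, in ascending product order."""
    groups = defaultdict(list)
    for i in range(1, n_docs + 1):
        for j in range(1, m_regions + 1):
            groups[i * j].append((i, j))
    return dict(sorted(groups.items()))

if __name__ == "__main__":
    for product, pairs in list(popularity_groups(6, 6).items())[:6]:
        print(product, pairs)
```

Groups with a smaller product are more popular, so an ideal cache under this model would be filled in ascending order of the product.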

Correlation between Hit Ratio and Cache Size. Studies of web caching have found that when the cache size is infinitely large, the correlation between the access frequency and document size, if any, is weak in general and can be ignored [L. 99]. We believe this finding is valid in our context as well. However, if the cache space is limited, the Zipf-like distribution will be “cut off” and eventually the top $\tau$ most popular query region groups⁴ will ideally fill the cache to its size limit. It is hard, though, to derive $\tau$ from the cache size $C$ due to the factoring problem. If we assume that the query region sizes are the same and the factoring can be continuous (not a realistic assumption, though), $\tau$ is approximated as $\sqrt{2C}$ due to $\sum_{i=1}^{\tau} \sum_{j=1}^{\tau/i} 1 = C$.

⁴ If multiple query regions have the same overall popularity, e.g., $P(1,6) = P(2,3) = P(3,2) = P(6,1)$, we consider them as one query region group.

Thus the cumulative probability $\Psi(C) \approx \sum_{i=1}^{\sqrt{2C}} \sum_{j=1}^{\sqrt{2C}/i} P(i,j) \approx \frac{(1-\alpha)^2}{(MN)^{1-\alpha}} \sum_{i=1}^{\sqrt{2C}} \sum_{j=1}^{\sqrt{2C}/i} \left(\frac{1}{ij}\right)^{\alpha}$. The asymptotic hit ratio $H(C)$ is closely related to $\Psi(C)$. If $\alpha$ is very close to 1, $H(C)$ grows logarithmically with the cache size $C$, i.e., $H(C) \sim \ln C / \ln(MN)$. Otherwise, $H(C)$ cannot easily be approximated by a particular function. However, it is bounded by some polynomial function with a small power, e.g., $H(C) < \left(\frac{C}{MN}\right)^{1-\alpha}$.

Correlation between Hit Ratio and Query Pattern. From the hit ratio function, we can see that the parameter $\alpha$ controls the steepness of the curve. Since $\frac{C}{MN} < 1$, the closer $\alpha$ is to 1, the smaller $1-\alpha$ is and consequently the larger $H(C)$ gets. As we discussed before, $\alpha$ is related to the document access skewness $K_t$ and the query region access skewness $K_l$. Therefore, the more query requests are concentrated on a few hot spots, the higher the hit ratio that can ideally be achieved. We also observe that, if the overall document size is fixed, $N$ and $M$ will increase when the average individual document size and query region sizes decrease. With the same cache size and query skewness, $H(C)$ will become smaller. Due to the close relationship between the average query region size and the query selectivity $S$, as discussed earlier, a larger $S$ implies a larger region and thus a smaller $H(C)$.
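The qualitative trends in this paragraph can be checked numerically with a small sketch. It assumes an idealized cache that holds exactly the `cache_size` most popular (document, region) pairs, equal region sizes, and one shared skew parameter; the function name and all sample values are illustrative and not taken from the text.

```python
def ideal_hit_ratio(cache_size, n_docs, m_regions, alpha):
    """Probability mass of the cache_size most popular (document, region) pairs.

    Assumes independent Zipf-like ranks with one shared alpha and unit-size regions.
    """
    doc_weights = [1.0 / i ** alpha for i in range(1, n_docs + 1)]
    region_weights = [1.0 / j ** alpha for j in range(1, m_regions + 1)]
    # Unnormalized joint weights, most popular first.
    joint = sorted((d * r for d in doc_weights for r in region_weights), reverse=True)
    return sum(joint[:cache_size]) / sum(joint)

if __name__ == "__main__":
    for alpha in (0.5, 0.7, 0.9):
        print("alpha", alpha, ideal_hit_ratio(50, 30, 20, alpha))
    # Same cache size, but more (smaller) documents and regions.
    print("larger N, M", ideal_hit_ratio(50, 60, 40, 0.7))
```

Under these assumptions, the hit ratio rises with `alpha` for a fixed cache size, and it drops when $N$ and $M$ grow while the cache size stays the same.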

