
Numerous studies in the web caching field have concluded that web access follows a Zipf-like distribution [L. 99, L. 98]. That is, the relative probability of a request for a document is inversely proportional to its popularity rank $i$ ($i = 1, \dots, N$). The probability $P_d(i)$ of a request for the $i$-th most popular document is proportional to $1/i^{\alpha}$ ($0 < \alpha \le 1$). In our context, we think it is appropriate to model the XML document access pattern using this Zipf-like distribution. Since $\alpha$ in the distribution model implies document access skewness³, it is closely related to $K_t$.
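To make the Zipf-like document model concrete, the following is a minimal sketch of how the request probabilities could be tabulated. The function name `zipf_probabilities` and the parameter names `n_docs` and `alpha` (the Zipf exponent) are illustrative assumptions, not names taken from the text.

```python
def zipf_probabilities(n_docs, alpha):
    """Return [P_d(1), ..., P_d(N)] where P_d(i) is proportional to 1 / i**alpha.

    n_docs and alpha are hypothetical names for the number of documents N
    and the Zipf exponent (0 < alpha <= 1).
    """
    if n_docs < 1:
        raise ValueError("n_docs must be at least 1")
    # Unnormalized weights 1 / i^alpha for ranks 1..N.
    weights = [1.0 / rank ** alpha for rank in range(1, n_docs + 1)]
    # The normalizing constant makes the probabilities sum to 1.
    norm = 1.0 / sum(weights)
    return [norm * w for w in weights]


if __name__ == "__main__":
    probs = zipf_probabilities(n_docs=1000, alpha=1.0)
    # With alpha = 1, rank 1 is requested twice as often as rank 2.
    print(probs[0] / probs[1])
```

With an exponent of 1 the ratio between the first and second ranks is 2, and smaller exponents flatten the distribution.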

Independent of the document access pattern, we also observe that the average distance of query pairs within a given document likely indicates the skewness of the query region access distribution. That is, the smaller the average distance, the more strongly query accesses concentrate on hot regions. Hence, the indicator of query region access skewness $K_l$ is

$$K_l = \frac{\sum_{i<j} 1/D(Q_i, Q_j)}{m(m-1)/2}.$$

In this formula, the numerator is the sum of the inverses of the distances (of all query pairs within a certain document cluster), adjusted by a parameter. The denominator is the number of all possible combinations of query pairs, assuming $m$ is the number of queries accessing the given document. Supposing the region access distribution follows a similar Zipf-like model, the probability $P_r(j)$ of a request for the $j$-th most popular region is proportional to $1/j^{\beta}$ ($0 < \beta \le 1$, with $\beta$ closely related to $K_l$).

³ When $\alpha$ is close to 1, the top 1 popular document gets twice as many query accesses as the second most popular document.
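The region skewness indicator is an average of inverse pairwise distances, which can be sketched as below. This is only an illustration under stated assumptions: the `distance` callable stands in for the query distance D, the adjusting parameter mentioned in the text is omitted, and all distances are assumed to be strictly positive.

```python
from itertools import combinations


def region_skewness(queries, distance):
    """Average inverse distance over all unordered query pairs.

    `queries` is any sequence of query objects and `distance` is a
    hypothetical callable returning a positive distance for a pair.
    The adjusting parameter from the text is not modeled here.
    """
    pairs = list(combinations(queries, 2))  # m(m-1)/2 unordered pairs
    if not pairs:
        raise ValueError("need at least two queries")
    return sum(1.0 / distance(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy example: queries are positions on a line, distance is the gap.
    gap = lambda a, b: abs(a - b)
    print(region_skewness([0, 1, 2], gap))      # tightly clustered queries
    print(region_skewness([0, 10, 20], gap))    # widely spread queries
```

Tightly clustered queries give a larger indicator value than widely spread ones, matching the observation that a smaller average distance signals stronger concentration on hot regions.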

For Zipf-like distributions, the cumulative probability that one of the top $k$ documents (among the total $N$ documents) is accessed is given asymptotically by $\Phi(k) = \sum_{i=1}^{k} \Omega / i^{\alpha}$, where $\Omega = \left(\sum_{i=1}^{N} 1/i^{\alpha}\right)^{-1} \approx (1-\alpha)/N^{1-\alpha}$. Thus $\Phi(k) \approx (k/N)^{1-\alpha}$ (when $\alpha = 1$, $\Phi(k) \approx \ln k / \ln N$). Because $k/N < 1$, a larger $\alpha$ increases $\Phi(k)$, meaning more queries focus on a few hot documents. The probability $P_d(i)$ of an access to the $i$-th most popular document is $P_d(i) = \Omega / i^{\alpha} \approx \frac{1-\alpha}{N}\left(\frac{N}{i}\right)^{\alpha}$. Similarly, for the probability $P_r(j)$ of a query request for the $j$-th most popular region within a particular document, we have $P_r(j) \approx \frac{1-\beta}{M}\left(\frac{M}{j}\right)^{\beta}$, where $M$ is the number of query regions in a document and $\beta$ is the parameter that suits the region access distribution in that document.
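The asymptotic forms quoted above can be recovered with a standard integral approximation. The sketch below writes $\alpha$ for the Zipf exponent, $\Omega$ for the normalizing constant, and $\Phi(k)$ for the cumulative probability of the top $k$ documents; it assumes $0 < \alpha < 1$ and large $k$ and $N$, and the numbers in the last line are illustrative only.

```latex
\begin{align*}
\sum_{i=1}^{N} i^{-\alpha}
  &\approx \int_{1}^{N} x^{-\alpha}\,dx
   = \frac{N^{1-\alpha} - 1}{1-\alpha}
   \approx \frac{N^{1-\alpha}}{1-\alpha}, \\
\Omega
  &= \Bigl(\sum_{i=1}^{N} i^{-\alpha}\Bigr)^{-1}
   \approx \frac{1-\alpha}{N^{1-\alpha}}, \\
\Phi(k)
  &= \Omega \sum_{i=1}^{k} i^{-\alpha}
   \approx \frac{1-\alpha}{N^{1-\alpha}} \cdot \frac{k^{1-\alpha}}{1-\alpha}
   = \Bigl(\frac{k}{N}\Bigr)^{1-\alpha}, \\
\text{e.g. } N = 10^{4},\ k = 100,\ \alpha = 0.8:\quad
\Phi(k) &\approx (0.01)^{0.2} \approx 0.40 .
\end{align*}
```

The approximation drops the constant term of the integral, so it is loose for small $k$ and as $\alpha$ approaches 1.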

In a query-based caching environment, we are concerned about the popularity ranking of query regions across documents. First, we look at the overall probability $P(i,j)$ of a query request for the $j$-th most popular region within the $i$-th most popular document. Supposing $P_d(i)$ and $P_r(j)$ are independent of each other, we obtain $P(i,j) = P_d(i)\,P_r(j) \approx \frac{(1-\alpha)(1-\beta)}{MN}\left(\frac{N}{i}\right)^{\alpha}\left(\frac{M}{j}\right)^{\beta}$.

6.3. THE ANALYSIS OF CACHE PERFORMANCE 149

If the situation is simpler and a uniform $\alpha$ suits both the document and all the query region access distributions, $P(i,j) \approx \frac{(1-\alpha)^2}{MN}\left(\frac{MN}{ij}\right)^{\alpha}$, which implies a multivariate Zipf-like distribution. We infer from this equation that if two query regions have the same product $ij$ (i.e., the document popularity rank times the local query region popularity rank), then they have the same overall popularity. For such a multivariate distribution, we have the cumulative probability $\Phi(k) = \sum_{i=1}^{t}\sum_{j=1}^{u} P(i,j)$, where $tu \le k$.
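The claim that regions sharing the same rank product are equally popular can be checked with a short sketch. The function names and the parameter names `n_docs`, `n_regions`, and `alpha` are assumptions made for this illustration; the formula is the uniform-exponent approximation of the joint probability stated in the text.

```python
from collections import defaultdict


def joint_popularity(i, j, n_docs, n_regions, alpha):
    """Approximate P(i, j) = (1 - alpha)^2 / (M N) * (M N / (i j))^alpha."""
    mn = n_docs * n_regions
    return (1 - alpha) ** 2 / mn * (mn / (i * j)) ** alpha


def region_groups(max_doc_rank, max_region_rank):
    """Group (document rank, region rank) pairs by their product i * j."""
    groups = defaultdict(list)
    for i in range(1, max_doc_rank + 1):
        for j in range(1, max_region_rank + 1):
            groups[i * j].append((i, j))
    return dict(groups)


if __name__ == "__main__":
    groups = region_groups(6, 6)
    for i, j in groups[6]:
        # Every pair with i * j == 6 has the same overall popularity.
        print((i, j), joint_popularity(i, j, n_docs=100, n_regions=100, alpha=0.8))
```

Pairs such as (1, 6), (2, 3), (3, 2), and (6, 1) fall into one group because the probability depends on the ranks only through their product.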

This model assumes that the query requests are independent and that both the document and query region access patterns follow the Zipf-like distribution with the same parameter. It may not be very realistic, but the model is tractable and sufficient to help us understand how the hit ratio is influenced by various factors.

Correlation between Hit Ratio and Cache Size. Studies of web caching have found that when the cache size is infinitely large, the correlation between the access frequency and document size, if any, is weak in general and can be ignored [L. 99]. We believe this finding is valid in our context as well. However, if the cache size is limited, the Zipf-like distribution will be "cut off" and, ideally, the top $\theta$ most popular query region groups⁴ will eventually fill the cache to its size limit. It is hard, however, to derive $\theta$ from the cache size $C$ due to the factoring problem. If we assume that the query region sizes are the same and the factoring can be continuous (not a realistic assumption, though), $\theta$ is approximated as $\sqrt{2C}$ due to $\sum_{i=1}^{\theta}\sum_{j=1}^{\theta/i} 1 = C$.

⁴ If multiple query regions have the same overall popularity, e.g., $P(1,6) = P(2,3) = P(3,2) = P(6,1)$, we consider them as one query region group.

Thus the cumulative probability is $\Phi(C) \approx \sum_{i=1}^{\sqrt{2C}}\sum_{j=1}^{\sqrt{2C}/i} P(i,j) \approx \frac{(1-\alpha)^2}{(MN)^{1-\alpha}} \sum_{i=1}^{\sqrt{2C}}\sum_{j=1}^{\sqrt{2C}/i} \left(\frac{1}{ij}\right)^{\alpha}$. The asymptotic hit ratio $H(C)$ is closely related to $\Phi(C)$. If $\alpha$ is very close to 1, $H(C)$ grows logarithmically with the cache size $C$, i.e., $H(C) \sim \ln\frac{C}{MN}$. Otherwise, $H(C)$ cannot easily be approximated by a particular function. However, it is bounded by some polynomial function with a small power, e.g., $H(C) < \left(\frac{C}{MN}\right)^{1-\alpha}$.
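The double sum for the cumulative probability is easy to evaluate numerically, which gives a feel for how it compares with the polynomial bound. The sketch below is an illustration under assumptions: the cache size is measured in equal-sized query regions, the cutoff is taken as the square root of twice the cache size as in the text, and the function and parameter names are hypothetical. The clamping of ranks to the number of documents and regions is an added guard, not part of the original formula.

```python
import math


def cumulative_hit_probability(cache_size, n_docs, n_regions, alpha):
    """Sum the approximate P(i, j) over all rank pairs inside the cutoff."""
    cutoff = math.sqrt(2 * cache_size)
    mn = n_docs * n_regions
    prefactor = (1 - alpha) ** 2 / mn ** (1 - alpha)
    total = 0.0
    # Ranks cannot exceed the number of documents or regions (added guard).
    for i in range(1, min(int(cutoff), n_docs) + 1):
        for j in range(1, min(int(cutoff / i), n_regions) + 1):
            total += (1.0 / (i * j)) ** alpha
    return prefactor * total


def polynomial_bound(cache_size, n_docs, n_regions, alpha):
    """Upper bound (C / (M N))^(1 - alpha) quoted in the text."""
    return (cache_size / (n_docs * n_regions)) ** (1 - alpha)


if __name__ == "__main__":
    for size in (50, 200, 800):
        phi = cumulative_hit_probability(size, 100, 100, 0.8)
        bound = polynomial_bound(size, 100, 100, 0.8)
        print(size, phi, bound)
```

For these illustrative settings the summed probability stays below the bound and increases as the cache grows.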

Correlation between Hit Ratio and Query Pattern. From the hit ratio function, we can see that the parameter $\alpha$ controls the steepness of the curve. Since $\frac{C}{MN} < 1$, the closer $\alpha$ is to 1, the smaller $1-\alpha$ is and consequently the larger $H(C)$ gets. As we discussed before, $\alpha$ is related to the document access skewness $K_t$ and the query region access skewness $K_l$. Therefore, the more query requests are concentrated on a few hot spots, the higher the hit ratios that can be achieved, ideally. We also observe that, if the overall document size is fixed, $N$ and $M$ will increase when the average individual document size and query region sizes decrease. With the same cache size and query skewness, $H(C)$ will become smaller. Due to the close relationship between the average query region size and the query selectivity $S$, as discussed earlier, a larger $S$ implies a larger region and thus a smaller $H(C)$.
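A small worked example shows both effects on the bound $(C/MN)^{1-\alpha}$. The values of $C$, $MN$, and $\alpha$ below are illustrative assumptions, not figures from the study, and the results are rounded to two decimals.

```latex
\begin{align*}
\frac{C}{MN} = 0.005,\ \alpha = 0.6:&\quad
  \Bigl(\frac{C}{MN}\Bigr)^{1-\alpha} = 0.005^{0.4} \approx 0.12, \\
\frac{C}{MN} = 0.005,\ \alpha = 0.9:&\quad
  \Bigl(\frac{C}{MN}\Bigr)^{1-\alpha} = 0.005^{0.1} \approx 0.59, \\
\frac{C}{MN} = 0.0025,\ \alpha = 0.9:&\quad
  \Bigl(\frac{C}{MN}\Bigr)^{1-\alpha} = 0.0025^{0.1} \approx 0.55 .
\end{align*}
```

Raising $\alpha$ at a fixed ratio lifts the bound sharply, while doubling $MN$ at a fixed cache size lowers it only slightly.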
