We have explained how to construct a language model for each resource X, whether an entity or a relation. We now associate each resource X in our knowledge base with a list of candidate substitutions, defined as follows.

Definition 4.13 : Substitution List

Given a resource X, a substitution list L is a list of resources Y ordered by their similarity to X.

We first explain how to compute the similarity between two given resources X and Y, and then explain how we construct the substitution lists.

Similarity between Resources

The similarity between two resources X and Y is computed as the distance between their language models. Specifically, we use the square root of the Jensen-Shannon divergence (JS divergence) between the language models of the two resources X and Y, which is a metric, to measure the distance between the two resources. The JS divergence is defined as follows.

Definition 4.14 : Jensen-Shannon Divergence

The Jensen-Shannon divergence is a symmetric measure of the distance between two probability distributions P and Q.

Given two probability distributions P and Q, the JS divergence between them is computed as follows:

\[ JS(P \,\|\, Q) = \tfrac{1}{2}\, KL(P \,\|\, M) + \tfrac{1}{2}\, KL(Q \,\|\, M) \tag{4.8} \]

where KL(R||S) is the Kullback-Leibler divergence (KL divergence) between two probability distributions R and S, which is computed as follows:

\[ KL(R \,\|\, S) = \sum_{w} R(w) \log \frac{R(w)}{S(w)} \tag{4.9} \]

and

\[ M = \tfrac{1}{2}(P + Q) \tag{4.10} \]

We use the square root of the JS divergence because it is a metric and, with base-2 logarithms, is bounded between 0 and 1. It can thus be used to measure the similarity between two resources: the smaller the distance, the more similar the resources.
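The distance above can be sketched in a few lines of Python. This is a minimal illustration rather than the framework's actual implementation: it assumes language models are plain dictionaries mapping terms to probabilities, the standard form of the JS divergence with a weight of 1/2 on each KL term, and base-2 logarithms. The function names `kl_divergence` and `js_distance` are hypothetical.

```python
import math


def kl_divergence(r, s):
    """KL(R||S) in bits. Terms with R(w) = 0 contribute nothing.

    Assumes S(w) > 0 wherever R(w) > 0, which always holds when S is
    the midpoint distribution M built in js_distance below.
    """
    return sum(p * math.log2(p / s[w]) for w, p in r.items() if p > 0)


def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence between P and Q."""
    vocab = set(p) | set(q)
    # Midpoint distribution M = 1/2 (P + Q), as in Equation 4.10.
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    js = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(js, 0.0))
```

Under these assumptions, two identical distributions are at distance 0 and two distributions with disjoint support are at distance 1, which matches the bounds stated above.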

Substitution Lists Construction

We have so far shown how to represent a resource and how to measure the similarity between two resources. To recap, for each resource X in the knowledge base KB, we construct a language model. The similarity between two resources X and Y is then computed as the distance between their language models, namely the square root of the Jensen-Shannon divergence (JS divergence) between the two language models. A substitution list for a resource X is then simply a ranked list of the other resources, ordered by increasing square-root JS divergence between their language models and the language model of X, so that the most similar resource comes first.
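The construction just described amounts to scoring every other resource and sorting. The sketch below assumes dictionary-based language models, base-2 logarithms, and the standard 1/2-weighted JS divergence; the toy models in `models` and the function name `substitution_list` are invented for illustration and do not come from the knowledge base used in this work.

```python
import math


def js_distance(p, q):
    """Square root of the JS divergence (base-2 logs, 1/2 weights)."""
    js = 0.0
    for w in set(p) | set(q):
        pw, qw = p.get(w, 0.0), q.get(w, 0.0)
        mw = 0.5 * (pw + qw)
        if pw > 0:
            js += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            js += 0.5 * qw * math.log2(qw / mw)
    return math.sqrt(max(js, 0.0))


def substitution_list(x, models):
    """Rank all resources other than x by increasing distance to x."""
    scored = [(y, js_distance(models[x], models[y])) for y in models if y != x]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical toy language models (term -> probability).
models = {
    "directed": {"film": 0.6, "director": 0.4},
    "produced": {"film": 0.6, "producer": 0.4},
    "bornIn": {"city": 0.7, "birth": 0.3},
}

ranked = substitution_list("directed", models)
for resource, distance in ranked:
    print(f"{resource}: {distance:.3f}")
```

Here `produced` ranks ahead of `bornIn` because its toy model shares probability mass with that of `directed`, whereas `bornIn` shares none and sits at the maximum distance.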

Adding Variables to Substitution Lists

Recall that a triple-pattern query can be reformulated by replacing one of the resources that appear in it with a variable. We interpret replacing a resource with a variable as being equivalent to replacing that resource with any other resource in the knowledge base.

| Resource | Academy Award for Best Actor | Thriller | directed | bornIn |
|---|---|---|---|---|
| 1 | BAFTA Award for Best Actor | Crime | actedIn | livesIn |
| 2 | Golden Globe Award for Best Actor Drama | Horror | created | originatesFrom |
| 3 | var | Action | produced | var |
| 4 | Golden Globe Award for Best Actor Musical or Comedy | Mystery | var | diedIn |
| 5 | New York Film Critics Circle Award for Best Actor | var | type | isCitizenOf |

Table 4.1: Example resources and their top-5 substitutions

To handle variable substitutions for a resource X, we construct a special language model that represents all other resources: a mixture of the language models of all resources in the knowledge base other than X. The similarity between the resource X and a variable is then computed using the square root of the JS divergence between the language model of X and this mixture model. With this technique, a variable is simply another entry in the substitution list of resource X.
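A minimal sketch of the variable entry follows. The text does not state how the mixture components are weighted, so uniform weights are assumed here; the dictionary-based models, the toy data, and the names `mixture_model` and `substitution_list_with_var` are likewise hypothetical.

```python
import math


def js_distance(p, q):
    """Square root of the JS divergence (base-2 logs, 1/2 weights)."""
    js = 0.0
    for w in set(p) | set(q):
        pw, qw = p.get(w, 0.0), q.get(w, 0.0)
        mw = 0.5 * (pw + qw)
        if pw > 0:
            js += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            js += 0.5 * qw * math.log2(qw / mw)
    return math.sqrt(max(js, 0.0))


def mixture_model(models, exclude):
    """Mixture of all language models except `exclude`.

    Uniform mixture weights are an assumption of this sketch.
    """
    others = [lm for name, lm in models.items() if name != exclude]
    weight = 1.0 / len(others)
    mixture = {}
    for lm in others:
        for w, p in lm.items():
            mixture[w] = mixture.get(w, 0.0) + weight * p
    return mixture


def substitution_list_with_var(x, models):
    """Substitution list of x with the variable as one more entry."""
    entries = [(y, js_distance(models[x], models[y])) for y in models if y != x]
    entries.append(("var", js_distance(models[x], mixture_model(models, x))))
    return sorted(entries, key=lambda entry: entry[1])


# Hypothetical toy language models (term -> probability).
models = {
    "directed": {"film": 0.6, "director": 0.4},
    "produced": {"film": 0.6, "producer": 0.4},
    "bornIn": {"city": 0.7, "birth": 0.3},
}

with_var = substitution_list_with_var("directed", models)
```

On these toy models the variable lands between `produced` and `bornIn`: the mixture is closer to `directed` than the unrelated `bornIn` model, but farther than the closely related `produced` model.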

Table 4.1 shows example resources from an RDF knowledge base about movies. For each resource, it shows the top-5 substitutions from the resource substitution list. The entry var represents the variable substitution. As previously explained, a variable substitution indicates that there were no other specific substitutions which had a higher similarity to the given resource.

4.2. Query Reformulation Framework

Pruning the Substitution Lists

Maintaining a complete substitution list for every resource in the knowledge base is impractical when these lists are long. Recall that the substitution list of a given resource contains all other resources in the knowledge base together with their similarities to that resource, so the lists can be extremely long in large knowledge bases. Pruning them is thus crucial to avoid a storage bottleneck, since the lists must be stored alongside the knowledge base. Pruning also benefits query processing: the query reformulation algorithm described next scans these lists to generate reformulated queries, which are then evaluated, and pruning keeps the number of such reformulated queries reasonable.

The most basic way to prune substitution lists is to use a pruning threshold, that is, to cut off the tail of a list as soon as the similarity score between a substitution and the resource the list belongs to falls below a predefined threshold. In our framework, we use the score of the variable substitution as the threshold. More precisely, any substitution ranked below the variable substitution is pruned and removed from the substitution list. As an example, consider the substitution list for the relation directed shown in Table 4.1. This substitution list can be pruned after the fourth entry. This is intuitive: substitutions ranked below the variable substitution are so dissimilar from the given resource that they may as well be ignored and represented by the variable substitution.
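The pruning rule can be sketched as a single pass over a ranked list. The entry order below follows the list for directed in Table 4.1, but the distance values are invented for illustration, and the function name `prune_at_variable` and the `var_label` parameter are hypothetical.

```python
def prune_at_variable(ranked, var_label="var"):
    """Keep entries up to and including the variable substitution.

    `ranked` is a list of (resource, distance) pairs sorted by
    increasing distance. If the variable entry is absent, the list is
    returned unchanged.
    """
    pruned = []
    for resource, distance in ranked:
        pruned.append((resource, distance))
        if resource == var_label:
            break
    return pruned


# Order taken from Table 4.1; the distances are hypothetical.
directed_list = [
    ("actedIn", 0.41),
    ("created", 0.47),
    ("produced", 0.52),
    ("var", 0.60),
    ("type", 0.83),
]

pruned_list = prune_at_variable(directed_list)
```

The pruned list ends at the variable entry, so `type` is dropped and four entries remain, in line with the example in the text.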