We have explained how to construct a language model for each resource X, whether an entity or a relation. Now, we associate each resource X in our knowledge base with a list of candidate substitutions, which is defined as follows.
Definition 4.13 : Substitution List
Given a resource X, a substitution list L consists of a set of resources Y which are ordered by their similarity to the given resource.
We first explain how to compute the similarity between two given resources X and Y and then explain how we construct the substitution lists.
Similarity between Resources
The similarity between two resources X and Y is computed from the distance between their language models: the smaller the distance, the more similar the resources. Specifically, we use the square root of the Jensen-Shannon divergence (JS divergence) between the language models of the two resources X and Y, which is a metric, to measure the distance between the two resources. The JS divergence is defined as follows.
Definition 4.14 : Jensen-Shannon Divergence
The Jensen-Shannon divergence is a symmetric measure of the distance between two probability distributions P and Q.
Given two probability distributions P and Q, the JS divergence between them is computed as follows:
$$JS(P \,\|\, Q) = KL(P \,\|\, M) + KL(Q \,\|\, M) \tag{4.8}$$

where KL(R||S) is the Kullback-Leibler divergence (KL divergence) between two probability distributions R and S, which is computed as follows:

$$KL(R \,\|\, S) = \sum_{w} R(w) \log \frac{R(w)}{S(w)} \tag{4.9}$$

and

$$M = \frac{1}{2}(P + Q) \tag{4.10}$$
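We use the square root of the JS divergence because, unlike the divergence itself, it is a metric. Under the conventional normalization (each KL term weighted by 1/2, base-2 logarithms) it takes values between 0 and 1. It can thus be used to measure how similar two resources are: the smaller the distance, the more similar the resources.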
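The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: language models are assumed to be plain dictionaries mapping words to probabilities, base-2 logarithms are an assumption, and the function names and toy distributions are hypothetical. The code follows Equation 4.8 as written; halving the result of `js_divergence` gives the conventionally normalized form, whose square root is bounded by 1 for base-2 logarithms.

```python
import math


def kl_divergence(r, s):
    """KL(R||S) with base-2 logs; words with R(w) = 0 contribute nothing."""
    return sum(rw * math.log2(rw / s[w]) for w, rw in r.items() if rw > 0)


def js_divergence(p, q):
    """JS divergence as written in Eq. 4.8, with M = (P + Q) / 2 (Eq. 4.10)."""
    vocabulary = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocabulary}
    return kl_divergence(p, m) + kl_divergence(q, m)


def js_distance(p, q):
    """Square root of the JS divergence; clamps tiny negative rounding errors."""
    return math.sqrt(max(js_divergence(p, q), 0.0))


# Toy unigram models (illustrative numbers only, not taken from any dataset).
thriller = {"murder": 0.5, "suspense": 0.3, "chase": 0.2}
crime = {"murder": 0.6, "police": 0.3, "chase": 0.1}
musical = {"song": 0.7, "dance": 0.3}
```

With these toy models, `js_distance(thriller, crime)` is smaller than `js_distance(thriller, musical)`, because the first pair shares vocabulary while the second pair shares none.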
Substitution Lists Construction
We have so far shown how to represent a resource and how to measure the similarity between two resources. To recap, for each resource X in the knowledge base KB, we construct a language model. The similarity between two resources X and Y is then computed from the distance between the language models of the two resources. Specifically, we use the square root of the Jensen-Shannon divergence (JS divergence) between the two language models. A substitution list for a resource X is then simply a list of the other resources, ranked in increasing order of the square root of the JS divergence between their language models and the language model of X, so that the most similar resources come first.
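The ranking step can be sketched as follows. This is only an illustration under stated assumptions: `build_substitution_list` is a hypothetical helper, language models are assumed to be word-to-probability dictionaries, and the distance function is passed in as a parameter. To keep the sketch self-contained it uses total variation distance as a stand-in; in the framework described here the distance would be the square root of the JS divergence.

```python
def build_substitution_list(x, language_models, distance):
    """Return the resources other than x, closest first, with their distances."""
    scored = [
        (y, distance(language_models[x], model))
        for y, model in language_models.items()
        if y != x
    ]
    return sorted(scored, key=lambda pair: pair[1])


def total_variation(p, q):
    """Stand-in distance, used only to keep this sketch self-contained."""
    words = set(p) | set(q)
    return 0.5 * sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in words)


# Toy language models (illustrative numbers only).
models = {
    "Thriller": {"murder": 0.5, "suspense": 0.3, "chase": 0.2},
    "Crime": {"murder": 0.6, "police": 0.3, "chase": 0.1},
    "Musical": {"song": 0.7, "dance": 0.3},
}
thriller_list = build_substitution_list("Thriller", models, total_variation)
```

Sorting by ascending distance puts the most similar resource at the head of the list, which matches the ordering required by Definition 4.13.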
Adding Variables to Substitution Lists
Recall that a triple-pattern query can be reformulated by replacing one of the resources that appear in it with a variable. We interpret replacing a resource with a variable as being equivalent to replacing that resource with any other resource in the knowledge base.

| Rank | Academy Award for Best Actor | Thriller | directed | bornIn |
|------|------------------------------|----------|----------|--------|
| 1 | BAFTA Award for Best Actor | Crime | actedIn | livesIn |
| 2 | Golden Globe Award for Best Actor Drama | Horror | created | originatesFrom |
| 3 | var | Action | produced | var |
| 4 | Golden Globe Award for Best Actor Musical or Comedy | Mystery | var | diedIn |
| 5 | New York Film Critics Circle Award for Best Actor | var | type | isCitizenOf |

Table 4.1.: Example resources and their top-5 substitutions
To handle variable substitutions, we construct, for each resource X, a special language model that represents all other resources in the knowledge base: a mixture of the language models of all resources other than X. The similarity between the resource X and a variable is then computed using the square root of the JS divergence between the language model of X and this mixture model. Using this technique, a variable is simply another entry in the substitution list of resource X.
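A minimal sketch of the mixture model for the variable entry is given below. The text does not state the mixture weights, so uniform weights are an assumption here, as are the function name `mixture_of_others` and the dictionary representation of language models.

```python
def mixture_of_others(x, language_models):
    """Uniform mixture of every language model except that of x.

    Uniform weights are an assumption; the text does not specify them.
    """
    others = [model for y, model in language_models.items() if y != x]
    if not others:
        raise ValueError("need at least one resource other than x")
    mixture = {}
    for model in others:
        for word, prob in model.items():
            # Each model contributes an equal share of its probability mass.
            mixture[word] = mixture.get(word, 0.0) + prob / len(others)
    return mixture


# Toy language models (illustrative numbers only).
toy_models = {
    "Thriller": {"murder": 0.5, "suspense": 0.3, "chase": 0.2},
    "Crime": {"murder": 0.6, "police": 0.3, "chase": 0.1},
    "Musical": {"song": 0.7, "dance": 0.3},
}
var_model = mixture_of_others("Thriller", toy_models)
```

The distance between the language model of X and this mixture is the score of the var entry, which is then ranked alongside the other substitutions.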
Table 4.1 shows example resources from an RDF knowledge base about movies. For each resource, it shows the top-5 substitutions from the substitution list of that resource. The entry var represents the variable substitution. The position of var in a list indicates that none of the remaining specific substitutions is more similar to the given resource than the variable is.
4.2. Query Reformulation Framework
Pruning the Substitution Lists
Maintaining a substitution list for every resource in the knowledge base becomes impractical when these lists are long. Recall that the substitution list of a given resource contains all other resources in the knowledge base together with their similarities to that resource. These lists can thus be extremely long in large knowledge bases. Pruning them is crucial to avoid a storage bottleneck, since the lists must be stored alongside the knowledge base. Pruning also benefits query processing: the query reformulation algorithm described next scans these lists to generate reformulated queries, which are then evaluated, and pruning keeps the number of such reformulated queries reasonable.
The most basic way to prune substitution lists is to use a pruning threshold: cut off the tail of the list as soon as the similarity between a substitution and the resource the list belongs to falls below a predefined threshold. In our framework, we use the score of the variable substitution as the threshold. More precisely, any substitution ranked below the variable substitution is pruned and removed from the substitution list. As an example, consider the substitution list for the relation directed shown in Table 4.1. This substitution list can be pruned after the fourth entry. This is intuitive: substitutions ranked below the variable substitution are so dissimilar from the given resource that they may as well be ignored and represented by the variable substitution.
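The pruning rule can be sketched as follows, using the list for directed as read from Table 4.1. The function name and the `"var"` marker are illustrative assumptions, and leaving a list unchanged when it contains no variable entry is a choice made for this sketch rather than something stated in the text.

```python
VAR = "var"  # marker for the variable substitution, as written in Table 4.1


def prune_at_variable(substitution_list, var=VAR):
    """Keep entries up to and including the variable; drop everything below it."""
    if var not in substitution_list:
        # Assumption: a list without a variable entry is left unchanged.
        return list(substitution_list)
    cutoff = substitution_list.index(var) + 1
    return list(substitution_list[:cutoff])


# Top-5 substitutions for the relation directed, in rank order.
directed = ["actedIn", "created", "produced", "var", "type"]
pruned = prune_at_variable(directed)
```

Here `pruned` keeps the first four entries and drops `type`, which matches pruning the list after its fourth entry.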