5. RESEARCH FRAMEWORK
5.1. THEORETICAL FRAMEWORK
5.1.7 Zero-access principles
Besides the design choices for dimension projection, association strength and relevance function presented above, a crucial question is how to define the concept space with respect to which documents will be indexed. In the original ESA model applied to Wikipedia, concepts correspond to single articles. For CL-ESA, we extended concept definitions by introducing interlingual concepts having language-specific text signatures (see Section IV.3). These text signatures are defined by the content of articles and the structure given by categories in Wikipedia databases in different languages.
In the following, we propose two new approaches to define concepts based on Wikipedia: Cat-ESA and Tree-ESA. These are novel extensions of CL-ESA inspired by the model introduced by Liberman and Markovitch [2009], who presented a measure of semantic relatedness based on a concept space spanned by Wikipedia categories. In this thesis, we integrate the category-based concept space into our cross-lingual ESA framework, which allows us to find optimal design choices and parameters for this specific concept space. The model of Liberman and Markovitch [2009] has so far not been applied to IR or to CLIR/MLIR. Our further contribution is the evaluation of category-based concept spaces in MLIR tasks. Finally, we introduce Tree-ESA, which additionally exploits the hierarchical structure of categories.

[Figure IV.7 (image not reproduced): concept vectors of the example text "The transport of bicycles on trains" in three panels, ESA, Cat-ESA and Tree-ESA. The panels show the articles Monorail, Train, Kick scooter, Bicycle and Automobile and the categories Transportation, Vehicles, Rail transport and Human-powered vehicles, together with the association strengths of the resulting concept vectors.]

Figure IV.7: ESA vectors based on different definitions of the concept space. The original ESA model is based on articles. Concepts in Cat-ESA are defined by categories. The textual description of each category is thereby built using the articles of the category. For Tree-ESA, sub-category relations are additionally used to define the textual descriptions.
Cat-ESA relies on categories of Wikipedia to define concepts. In this model, only links assigning articles to categories are considered, while relations between categories are not used.
Tree-ESA uses sub-category relations to propagate textual descriptions of concepts along the category hierarchy. Figure IV.7 contains examples of concept vectors based on different definitions of the concept space. Again, the articles and categories presented in Figure IV.1 are used in this example. The ESA model defines concepts by articles, for example Automobile. Using Cat-ESA, concepts correspond to categories, which abstract from single articles. Here, all articles in the text signature of a category are used to compute the association strength, but each article has a smaller overall weight because the text signature consists of several articles. Finally, Tree-ESA exploits the sub-category structure of Wikipedia and considers all articles along the category tree to define the text signatures.
The intuition behind these concept models is that they become increasingly language-independent the more the concept descriptions abstract from single articles. Therefore, indexing documents with respect to Wikipedia categories instead of Wikipedia articles might be a good choice for cross-lingual retrieval. Missing language links between articles, or existing language links between articles that describe different concepts, may have a significant influence on the performance of CL-ESA. When using Cat-ESA with many articles in each category, these problems should have a smaller impact, because a single missing or wrong link affects only a small part of a category's text signature. In Tree-ESA, the descriptions of categories are even larger, as they also contain the articles of subcategories. Our hypothesis is that the category-based representations used in Cat-ESA and Tree-ESA are better candidates for MLIR document models, as the category structure is more stable across languages than the structure of articles. Our results indeed support this hypothesis.
Category ESA. Category ESA (Cat-ESA) is based on a set of categories $\Gamma = \{\gamma_1, \ldots, \gamma_o\}$. We define the function
\[ \mathrm{MEMBERS} : \Gamma \to 2^{C} \]
that maps a category $\gamma$ to all articles contained in the category, which is a subset of the set of articles $C$.
Instantiated for Wikipedia as in our case, the categories Γ correspond to interlingual categories IC(W) as defined in Section IV.3. The category membership function MEMBERS is then essentially defined by category links. These links are part of Wikipedia and assign articles to categories in a specific language l:
\[ CL_l : A(W_l) \to C(W_l) \]
Using equivalence classes of articles and categories as defined above, these links can be generalized to interlingual articles and categories. More details about mining these links from Wikipedia will be presented in Section IV.5. As articles potentially contain more than one category link, the sets of interlingual articles of categories may not be disjoint.
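The generalization of language-specific category links to interlingual articles and categories can be illustrated with a minimal sketch. It is not the mining procedure of Section IV.5. All data, the identifiers such as `a:bicycle` and `c:hpv`, and the function name `interlingual_members` are hypothetical and only serve the illustration. The sketch assumes that a link found in any language version contributes to the membership of the corresponding interlingual category.

```python
from collections import defaultdict

# Hypothetical category links per language: (article title, category title).
CATEGORY_LINKS = {
    "en": [("Bicycle", "Human-powered vehicles"), ("Bicycle", "Vehicles")],
    "de": [("Fahrrad", "Muskelkraftbetriebenes Fahrzeug")],
}

# Hypothetical equivalence classes: (language, title) -> interlingual id.
ARTICLE_CLASS = {
    ("en", "Bicycle"): "a:bicycle",
    ("de", "Fahrrad"): "a:bicycle",
}
CATEGORY_CLASS = {
    ("en", "Human-powered vehicles"): "c:hpv",
    ("de", "Muskelkraftbetriebenes Fahrzeug"): "c:hpv",
    ("en", "Vehicles"): "c:vehicles",
}


def interlingual_members(links, article_class, category_class):
    """Lift language-specific category links to interlingual ids."""
    members = defaultdict(set)
    for lang, pairs in links.items():
        for article, category in pairs:
            a = article_class.get((lang, article))
            g = category_class.get((lang, category))
            # Links whose endpoints have no interlingual class are skipped.
            if a is not None and g is not None:
                members[g].add(a)
    return dict(members)


MEMBERS = interlingual_members(CATEGORY_LINKS, ARTICLE_CLASS, CATEGORY_CLASS)
```

In this toy data the interlingual article `a:bicycle` ends up in both `c:hpv` and `c:vehicles`, which shows why the article sets of different categories need not be disjoint.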
In contrast to CL-ESA, the concept space of Cat-ESA is then spanned by Γ and not by C. The text signature τ of category γ is defined as the union of text signatures of all interlingual articles that are linked to one of the categories in the interlingual category:
\[ \tau_{\text{Cat-ESA}}(\gamma, l) := \bigcup_{c \in \mathrm{MEMBERS}(\gamma)} \tau_{\text{CL-ESA}}(c, l) \]
When computing term statistics, this union is equivalent to the concatenation of the articles.
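The following minimal sketch illustrates this definition on toy data. The article texts, the category assignments and the function names `tau_cl_esa`, `tau_cat_esa` and `term_counts` are assumptions made for the example, not part of the original model description. The text signature of a category is built by concatenating the signatures of its member articles, and term statistics are then computed on the concatenation.

```python
from collections import Counter

# Hypothetical toy data: interlingual article -> language -> text signature.
ARTICLE_TEXT = {
    "Bicycle": {"en": "bicycle pedal wheel", "de": "fahrrad pedal rad"},
    "Kick scooter": {"en": "scooter wheel kick", "de": "tretroller rad"},
    "Train": {"en": "train rail wagon", "de": "zug schiene wagen"},
}

# MEMBERS: category -> set of member articles (hypothetical assignment).
MEMBERS = {
    "Human-powered vehicles": {"Bicycle", "Kick scooter"},
    "Rail transport": {"Train"},
}


def tau_cl_esa(article, lang):
    """Text signature of a single interlingual article in one language."""
    return ARTICLE_TEXT[article].get(lang, "")


def tau_cat_esa(category, lang):
    """Concatenate the signatures of all member articles of a category."""
    # Sorting only makes the concatenation order deterministic.
    return " ".join(tau_cl_esa(c, lang) for c in sorted(MEMBERS[category]))


def term_counts(text):
    """Term frequencies of a whitespace-tokenized text."""
    return Counter(text.split())


counts = term_counts(tau_cat_esa("Human-powered vehicles", "en"))
```

Here the term "wheel" occurs in both member articles, so its frequency in the category signature is the sum of its frequencies in the individual articles.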
Category Tree ESA. For Category Tree ESA (Tree-ESA), the categories as described for Cat-ESA are part of a tree structure. Given a single root category $\gamma_r$ and a sub-category relation $\mathrm{SUB} : \Gamma \to 2^{\Gamma}$, all other categories can be reached from the root category. The function $\mathrm{TREE} : \Gamma \to 2^{\Gamma}$ maps a category $\gamma$ to the subtree rooted in $\gamma$ and is recursively defined as:
\[ \mathrm{TREE}(\gamma) := \{\gamma\} \cup \bigcup_{\gamma' \in \mathrm{SUB}(\gamma)} \mathrm{TREE}(\gamma') \]
As all categories are part of a tree structure without cycles, this recursion stops at the leaf category nodes, that is,
\[ \mathrm{TREE}(\gamma) := \{\gamma\} \quad \text{if } \gamma \text{ is a leaf} \]
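The recursion can be written down directly. The sketch below uses a hypothetical sub-category relation over some of the category names from Figure IV.7. The actual hierarchy of Figure IV.1 is not reproduced here, so the edges are an assumption.

```python
# Hypothetical sub-category relation SUB: category -> set of sub-categories.
SUB = {
    "Transportation": {"Vehicles", "Rail transport"},
    "Vehicles": {"Human-powered vehicles"},
}


def tree(category, sub):
    """Return the set of categories in the subtree rooted at `category`.

    Assumes `sub` is cycle-free; otherwise the recursion does not terminate.
    """
    subtree = {category}
    # A leaf has no entry in `sub`, so the loop body is skipped
    # and the recursion stops with the singleton {category}.
    for child in sub.get(category, ()):
        subtree |= tree(child, sub)
    return subtree


whole_tree = tree("Transportation", SUB)
```

For a leaf such as "Rail transport" the function returns only the category itself, which corresponds to the base case of the definition.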
Again, category links in Wikipedia, linking one category to another, can be generalized to interlingual categories and therefore be used to define the sub-category relation. The association strength of document d to category γ is then not only based on concepts in γ but also on the concepts of all subcategories. This results in the following definition of the text signature function τ:
\[ \tau_{\text{Tree-ESA}}(\gamma, l) := \bigcup_{\gamma' \in \mathrm{TREE}(\gamma)} \tau_{\text{Cat-ESA}}(\gamma', l) \]
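A minimal sketch of this text signature on toy data follows. The article texts, the category assignments, the hierarchy and all function names are hypothetical, and a single language is used for brevity. The text does not state how an article that belongs to several categories of the same subtree is counted. The sketch assumes that the set of articles is collected first, so that each article contributes once.

```python
from collections import Counter

# Hypothetical toy data for a single language.
ARTICLE_TEXT = {
    "Bicycle": "bicycle pedal wheel",
    "Train": "train rail wheel",
    "Automobile": "car engine wheel",
}
MEMBERS = {
    "Vehicles": {"Automobile"},
    "Human-powered vehicles": {"Bicycle"},
    "Rail transport": {"Train"},
}
SUB = {"Vehicles": {"Human-powered vehicles"}}


def tree(category):
    """All categories in the subtree rooted at `category` (cycle-free SUB)."""
    subtree = {category}
    for child in SUB.get(category, ()):
        subtree |= tree(child)
    return subtree


def tree_esa_articles(category):
    """Articles of the category and of all categories below it."""
    articles = set()
    for g in tree(category):
        articles |= MEMBERS.get(g, set())
    return articles


def tree_esa_term_counts(category):
    """Term statistics of the Tree-ESA signature (each article counted once)."""
    counts = Counter()
    for article in tree_esa_articles(category):
        counts.update(ARTICLE_TEXT[article].split())
    return counts
```

With this data, the signature of "Vehicles" contains the terms of both Automobile and Bicycle, whereas under Cat-ESA it would contain only those of Automobile.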
Tree-ESA requires a tree-shaped category structure. The Wikipedia category structure does not satisfy this requirement, as some links introduce cycles. In Section IV.5, we will describe our pruning method that filters such category links to make the Wikipedia category structure usable for the proposed Tree-ESA model.
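The pruning method itself is described in Section IV.5 and is not reproduced here. Purely as an illustration of the requirement, the following sketch shows one generic way to obtain a tree from a category graph with cycles: a breadth-first traversal from the root that keeps only the link through which each category is first reached. This is an assumed stand-in technique, not the method proposed in this thesis, and the function name `prune_to_tree` and the toy graph are hypothetical.

```python
from collections import deque


def prune_to_tree(root, sub):
    """Keep, for every reachable category, only the first link reaching it.

    `sub` maps a category to its linked sub-categories and may contain
    cycles. The result maps each reachable category to a list of children
    and is a tree rooted at `root`.
    """
    tree_sub = {root: []}
    queue = deque([root])
    while queue:
        parent = queue.popleft()
        # Sorting only makes the choice of kept links deterministic.
        for child in sorted(sub.get(parent, ())):
            if child in tree_sub:
                # This link would close a cycle or add a second parent.
                continue
            tree_sub[child] = []
            tree_sub[parent].append(child)
            queue.append(child)
    return tree_sub


# Toy graph with cycles: A -> B -> A and B -> C -> B.
graph = {"A": {"B", "C"}, "B": {"C", "A"}, "C": {"B"}}
pruned = prune_to_tree("A", graph)
```

A breadth-first traversal keeps each category as close to the root as possible. Other choices, such as depth-first traversal or filtering by link type, would produce different trees.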