REFERENCE FRAMEWORK
2.1 THEORETICAL FRAMEWORK
2.1.3 Scientific foundation
2.1.3.3 Sexuality and values
Sew’s hyperlink propagation pipeline takes a Wikipedia dump as input and outputs a sense-annotated corpus, built upon the original textual content of Wikipedia, where word senses and named entity mentions are linked to the sense inventory of BabelNet. Standard preprocessing is first applied to the input corpus, including tokenization, part-of-speech tagging and lemmatization. At this preliminary stage we also discard disambiguation pages, ‘List of’ articles and pages of common surnames,6 as they typically contain few lines of meaningful text and tend to introduce noise in the propagation process.

After preprocessing, we apply a cascade of hyperlink propagation heuristics to each Wikipage in the input corpus. Each propagation heuristic, when applied, identifies a list of BabelNet synsets Sp to be propagated across a given Wikipage p; then, for each synset s ∈ Sp, occurrences of any lexicalization of s are detected, annotated with s, and added as new hyperlinks for p.7 All propagation heuristics share a common assumption: given an ambiguous mention m within a Wikipage p, every occurrence of m across p refers to the same sense (one-sense-per-page assumption) and hence can be annotated using the same synset. This assumption is a Wikipedia-specific version of the one-sense-per-discourse assumption (Yarowsky, 1995) and, albeit simple, tends to be surprisingly accurate given the nature and structure of Wikipedia.8

| Heuristic | Symbol | Type | Scope |
|---|---|---|---|
| Original Hyperlink | HL | - | Wikipedia |
| Surface Mention Propagation | SP | Intra-page | Wikipedia |
| Lemmatized Mention Propagation | LP | Intra-page | Wikipedia |
| Person Mention Propagation | PP | Intra-page | Wikipedia |
| Wikipedia Inlink Propagation | WIL | Inter-page | Wikipedia |
| BabelNet Inlink Propagation | BIL | Inter-page | BabelNet |
| Category Propagation | CP | Inter-page | Wikipedia |
| Monosemous Content Word | MP | - | BabelNet |

Table 4.1. Summary of the hyperlink propagation heuristics used in Sew.

6 https://en.wikipedia.org/wiki/Lists_of_most_common_surnames
7 Thanks to BabelNet’s inter-resource mappings, each hyperlinked Wikipage can be unambiguously mapped to the corresponding Babel synset, and vice versa. Thus, in the present section we use the terms ‘propagated hyperlink’ and ‘sense annotation’ interchangeably.
As we apply a heuristic h to a given Wikipage p, we characterize h as being either intra-page (when it propagates synsets that already occur as hyperlinks within p itself) or inter-page (when it exploits the connection of p with other Wikipages or categories). Also, we refer to the scope of h as either Wikipedia (when all synsets propagated by h identify a specific Wikipedia page) or BabelNet (when h propagates synsets that may not have an associated Wikipedia page).
After all heuristics have been applied, we enforce a conservative policy to remove overlapping mentions and duplicates (i.e. multiple sense annotations associated with the exact same fragment of text). We deal with overlaps by penalizing inter-page annotations in favor of intra-page ones, and by preferring the longest match in case of overlapping annotations of the same type. Similarly, we deal with duplicates by preferring intra-page annotations over inter-page ones, consistently with the one-sense-per-page assumption. Finally, if the mention is still ambiguous, all its sense annotations are discarded.

All the propagation heuristics composing the pipeline of Sew are summarized in Table 4.1. Most of them are based on methods that proved to be robust and effective in previous work for a variety of purposes: a one-sense-per-page assumption is used by Wu and Giles (2015) to develop sense-aware Wikipedia-based word representations; Wikipedia categories have been exploited for propagating semantic relations (Nastase and Strube, 2008), learning topic hierarchies (Hu et al., 2015) and building taxonomies (Flati et al., 2014); finally, ingoing links to Wikipedia pages played a key role in the semantic representations of Nasari (Section 2.2.3.3).
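As an illustration, the conservative policy described above can be sketched in Python. This is a minimal sketch under assumptions that the text does not state: annotations are represented as character spans, overlaps are resolved greedily, and the `Annotation` class, the `resolve` function and the synset identifiers are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int        # inclusive character offset of the mention
    end: int          # exclusive character offset of the mention
    synset: str       # hypothetical synset identifier
    intra_page: bool  # True if produced by an intra-page heuristic


def resolve(annotations):
    """Apply the overlap and duplicate policy to a list of annotations."""
    # Group annotations by exact span: each group is a set of duplicates.
    by_span = {}
    for ann in annotations:
        by_span.setdefault((ann.start, ann.end), []).append(ann)

    # Overlaps: spans carrying an intra-page annotation come first,
    # then longer spans; spans are kept greedily (an assumption).
    def rank(span):
        has_intra = any(a.intra_page for a in by_span[span])
        return (not has_intra, -(span[1] - span[0]), span[0])

    kept_spans = []
    for span in sorted(by_span, key=rank):
        if all(span[1] <= k[0] or k[1] <= span[0] for k in kept_spans):
            kept_spans.append(span)

    result = []
    for span in sorted(kept_spans):
        group = by_span[span]
        # Duplicates: prefer intra-page annotations over inter-page ones.
        if any(a.intra_page for a in group):
            group = [a for a in group if a.intra_page]
        # Still ambiguous: discard every annotation of this mention.
        if len({a.synset for a in group}) == 1:
            result.append(group[0])
    return result
```

In this sketch a short intra-page annotation wins over a longer inter-page one that overlaps it, whereas between two overlapping annotations of the same type the longer one is kept.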
4.1.1.1 Intra-page Propagation Heuristics
Intra-page propagation heuristics collect a list of synsets Sp from the original hyperlinks occurring in Wikipage p (including the synset associated with p itself) and then propagate Sp by looking for potential mentions matching any lexicalization of a synset in Sp. Every mention discovered this way is then added to the list of propagated hyperlinks for p, provided that part-of-speech tags are consistent. However, as potential mentions may contain punctuation or occur in some inflected form, propagation is performed as a two-pass procedure: a surface mention propagation (SP) over the original text of p before preprocessing, and a lemmatized mention propagation (LP) over tokenized and lemmatized text.9

8 98% of the Wikipedia pages support the one-sense-per-page assumption, according to the estimation of Wu and Giles (2015).
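The two-pass procedure can be illustrated with a small Python sketch. It is not the implementation used in Sew: the function names, the synset identifiers and the pre-computed lemma list are hypothetical, matching is case-sensitive and whole-word by assumption, and the part-of-speech consistency check is omitted for brevity.

```python
import re


def find_mentions(text, lexicalization):
    """Yield (start, end) character spans of whole-word matches in text."""
    pattern = r"(?<!\w)" + re.escape(lexicalization) + r"(?!\w)"
    for match in re.finditer(pattern, text):
        yield match.span()


def surface_pass(raw_text, lexicalizations):
    """SP: match lexicalizations against the original, unprocessed text."""
    hits = []
    for synset, forms in lexicalizations.items():
        for form in forms:
            hits.extend((s, e, synset) for s, e in find_mentions(raw_text, form))
    return sorted(hits)


def lemma_pass(lemmas, lexicalizations):
    """LP: match lexicalizations against the lemmatized tokens (token spans)."""
    hits = []
    for synset, forms in lexicalizations.items():
        for form in forms:
            target = form.split()
            n = len(target)
            for i in range(len(lemmas) - n + 1):
                if lemmas[i:i + n] == target:
                    hits.append((i, i + n, synset))
    return sorted(hits)
```

For example, with the lexicalization ‘United States of America’ and the lemma sequence `["the", "unite", "state", "of", "America", ...]`, only the surface pass finds the full mention, while the lemmatized pass is the one that recovers the inflected form ‘states’ from the lexicalization ‘state’. Any overlap between the two passes is left to the conservative policy applied at the end of the pipeline.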
Moreover, we designed a specific heuristic to propagate person mentions (PP). This heuristic can be seen as a specialized version of coreference resolution restricted to person entities: if a synset s ∈ Sp identifies a person according to the BabelNet entity typing, we allow potential mentions to match lexicalizations of s partially (i.e. only the first name, or only the last name). Each partial mention is then validated by checking its surrounding word tokens against a pre-computed set of first and last names, drawn from Wikipedia itself,10 and added as a sense annotation only if the surrounding tokens do not match any person name. This prevents us from annotating false positives (e.g. siblings of the person identified by s).
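The validation step of the person mention heuristic can be sketched as follows. This is a minimal sketch, assuming that the ‘surrounding word tokens’ are the immediate left and right neighbours and that a person lexicalization has the form ‘first name ... last name’; the function name, the window size and the synset identifier are hypothetical.

```python
def propagate_person(tokens, full_name, synset, known_names):
    """Annotate partial mentions (first or last name only) of a person.

    A token matching the first or last name is accepted only if neither
    neighbouring token is itself a known first or last name.
    """
    name_tokens = full_name.split()
    parts = {name_tokens[0], name_tokens[-1]}
    annotations = []
    for i, token in enumerate(tokens):
        if token not in parts:
            continue
        left = tokens[i - 1] if i > 0 else None
        right = tokens[i + 1] if i + 1 < len(tokens) else None
        # A neighbouring person name suggests a different person
        # (or a full mention, which the other passes already handle).
        if left in known_names or right in known_names:
            continue
        annotations.append((i, synset))
    return annotations
```

With the tokens `["Obama", "said", "that", "Michelle", "Obama", "agreed"]` on a page about Barack Obama, only the first ‘Obama’ is annotated; the second one is preceded by another known first name and is therefore skipped.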
4.1.1.2 Inter-page Propagation Heuristics
Inter-page heuristics exploit the connections of p inside Wikipedia and BabelNet. Once synsets to be propagated are collected in Sp, we apply the same propagation procedure described in the previous section for intra-page heuristics. We exploited three inter-page heuristics:
• The Wikipedia Inlink Propagation (WIL) heuristic collects ingoing links to p inside Wikipedia, that is, other Wikipages where p is mentioned and hyperlinked, and adds the corresponding BabelNet synsets to Sp;
• The BabelNet Inlink Propagation (BIL) heuristic, similarly to WIL, leverages ingoing links to the synset sp that identifies p in the BabelNet semantic network. These might include, in particular, hyperlinks inside Wikipedias in languages other than English, as well as connections of sp drawn from other resources integrated in BabelNet (cf. Section 2.1.3);
• The Category Propagation (CP) heuristic propagates hyperlinks across Wikipages that belong to the same Wikipedia categories as p. Intuitively, Wikipages belonging to the same categories tend to mention the same entities. This heuristic is based on three successive steps:
1. Given a Wikipedia category c, CP harvests all hyperlinks appearing at least twice across all Wikipages associated with c, collects them into the set Sc, and then ranks them by frequency count;
2. In order to filter out categories that are too broad or uninformative (e.g. Living people), CP associates with each category c a probability distribution fc over hyperlinks, and computes the entropy H(c) of this distribution as:

$$H(c) = -\sum_{h \in S_c} f_c(h) \log_2 f_c(h) \tag{4.1}$$

where fc(h) is computed as the normalized frequency count of h in Sc. Ranking categories by their entropy values makes it possible to discriminate between broader categories, where a large number of less related hyperlinks appear with relatively small counts (hence higher H), and more specific categories, where fewer related hyperlinks occur with relatively higher counts (and lower H);
3. Finally, given a Wikipage p, CP considers each category cp associated with p where H(cp) is below a predefined threshold ρH,11 and adds to Sp all the synsets that identify hyperlinks in Scp.

9 A common example is the mention m = ‘United States of America’: since only shallow preprocessing is applied to the input text (and, in particular, no NER), a lemmatization step would reduce m to ‘unite state of America’, which is not a valid lexicalization of the corresponding Babel synset. Similar observations apply to song, book or movie titles.
10

| | # Annotations | # Senses | # Documents | Sense Inventory |
|---|---|---|---|---|
| Wikipedia | 71,457,658 | 2,898,503 | 4,313,373 | Wikipedia |
| Sew (all) | 250,325,257 | 4,098,049 | 4,313,373 | BabelNet |
| Sew | 206,475,360 | 4,071,902 | 4,313,373 | BabelNet |
| WordNet | 116,079,163 | 67,774 | 4,313,373 | WordNet |
| Wikipedia | 162,614,753 | 4,020,979 | 4,313,373 | Wikipedia |
| Wikilinks | 40,323,863 | 2,933,659 | 10,893,248 | Wikipedia |
| FACC1 | 11,240,817,829 | 5,114,077 | 1,104,053,884 | Freebase |
| OMSTI | 1,357,922 | 31,956 | 62,815 | WordNet |
| MASC | 286,416 | 23,175 | 392 | BabelNet |

Table 4.2. Global statistics of Sew in comparison with other sense-annotated corpora. ‘Wikipedia’ (first row) refers to the English dump of November 2014, while ‘Sew (all)’ (second row) refers to the corpus before applying the conservative policy.
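The three steps of the Category Propagation heuristic can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, ‘at least twice’ is interpreted as a total count of at least two over the pages of the category, and the threshold is left as a parameter because its value is not given here.

```python
import math
from collections import Counter


def category_hyperlinks(pages_links, min_count=2):
    """Step 1: count hyperlinks over the pages of a category and keep
    those occurring at least min_count times (the set S_c with counts)."""
    counts = Counter(link for links in pages_links for link in links)
    return Counter({h: n for h, n in counts.items() if n >= min_count})


def entropy(counts):
    """Step 2: H(c) as in Eq. 4.1, with f_c(h) the normalized count of h."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def category_propagation(page_categories, category_counts, threshold):
    """Step 3: collect the hyperlinks of every category of the page whose
    entropy is below the threshold."""
    synsets = set()
    for category in page_categories:
        counts = category_counts[category]
        if counts and entropy(counts) < threshold:
            synsets.update(counts)
    return synsets
```

The frequency ranking of step 1 is available through `Counter.most_common()`. A category whose hyperlinks are spread uniformly over many targets obtains a high entropy and is filtered out, while a category dominated by a few frequent hyperlinks passes the threshold.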
In the last stage of the pipeline, after both intra-page and inter-page heuristics have been applied, we additionally exploit a Monosemous Content Word (MP) heuristic to propagate verb, adjective and adverb senses that are monosemous according to the sense inventory.
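A minimal sketch of the monosemous heuristic is given below. It assumes that the sense inventory can be queried as a mapping from (lemma, part-of-speech) pairs to lists of senses; this mapping, the tag names and the synset identifiers are hypothetical and do not correspond to an actual BabelNet API.

```python
# Part-of-speech tags targeted by the heuristic (tag names are an assumption).
TARGET_POS = {"VERB", "ADJ", "ADV"}


def monosemous_annotations(tagged_lemmas, sense_inventory):
    """Annotate verbs, adjectives and adverbs that have exactly one sense.

    tagged_lemmas is a list of (lemma, pos) pairs; the result is a list of
    (token_index, sense) pairs.
    """
    annotations = []
    for i, (lemma, pos) in enumerate(tagged_lemmas):
        if pos not in TARGET_POS:
            continue
        senses = sense_inventory.get((lemma, pos), [])
        if len(senses) == 1:
            annotations.append((i, senses[0]))
    return annotations
```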