Based on the implicit network of Wikipedia as constructed above, we first investigate a number of example applications. Due to the versatility of the representation with regard to the types of entities that can be used in a query, we only consider a selection of possible application scenarios. However, all node-based rankings as described in Chapter 3.2 are viable, and are implemented in our query interface. For the explanation of the following examples and the subsequent evaluation, we introduce the concept of a subquery. Here, a subquery simply entails the splitting of input strings for entities of type location and actor into their components, which are then included in the query as well. The queries thus benefit from the completeness condition that is applied during the construction of the network. In the following, we use the syntax
⟨OS : (QS, value)*⟩
to describe queries, where OS, QS ∈ {Loc, Org, Act, Dat, T, S, D} are the desired type of output set and the type of query entities, respectively, while value is the name of the query entity. Based on this syntax, we show a couple of results for example queries in the following, split into three primary use cases.
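The subquery expansion described above can be sketched as follows. This is a minimal illustration rather than the implementation behind the query interface: the names `Query` and `expand_subqueries` are hypothetical, and splitting names on whitespace is an assumption, since the text only states that input strings of locations and actors are split into their components.

```python
from __future__ import annotations

from dataclasses import dataclass

# Only locations and actors are split into components (see the text above).
SPLITTABLE_TYPES = {"Loc", "Act"}


@dataclass(frozen=True)
class Query:
    """A query of the form <OS : (QS, value)*> (hypothetical container)."""

    output_type: str                        # OS, the type of the output set
    entities: tuple[tuple[str, str], ...]   # sequence of (QS, value) pairs


def expand_subqueries(query: Query) -> Query:
    """Add the components of location and actor names as extra query entities.

    Assumption: components are obtained by splitting on whitespace.
    """
    expanded: list[tuple[str, str]] = []
    for entity_type, value in query.entities:
        candidates = [value]
        if entity_type in SPLITTABLE_TYPES:
            parts = value.split()
            if len(parts) > 1:
                candidates.extend(parts)
        for candidate in candidates:
            pair = (entity_type, candidate)
            if pair not in expanded:  # keep the first occurrence only
                expanded.append(pair)
    return Query(query.output_type, tuple(expanded))


if __name__ == "__main__":
    query = Query("Doc", (("Act", "Albert Einstein"),))
    print(expand_subqueries(query).entities)
```

Under these assumptions, a query for the actor Albert Einstein also includes the components Albert and Einstein, while entities of other types, such as organizations, are passed through unchanged.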
Browsing
The most straightforward application of the network is the ability to browse the connections between entities. In Table 3.2, we show the top-ranked results of three queries centred on the entity Edward Snowden. By including Snowden as the only query entity, we obtain a list of organizations that are closely tied to him, including the NSA and USIS, the company that vetted Snowden prior to his employment. Duplicate entries for the different
3.4 Event Completion on Wikipedia Data
⟨Org : (Act, Edward Snowden)⟩
rank  organizations                score
1     nsa                          1.000
2     national security agency     0.288
3     gchq                         0.182
4     us national security agency  0.083
5     usis                         0.043

⟨Org : (Act, Edward Snowden), (Act, Barack Obama)⟩
rank  organizations                score
1     nsa                          0.546
2     senate                       0.503
3     congress                     0.340
4     republican                   0.290
5     democratic party             0.283

⟨Ter : (Act, Edward Snowden)⟩
rank  terms                        score
1     surveil                      1.000
2     leak                         0.985
3     document                     0.610
4     whistleblow                  0.532
5     contractor                   0.496
Table 3.2: The five top-ranked results for three queries centred on Edward Snowden. Weights are given as the normalized directed importance weights ω⃗, or their combination ϱ in queries with multiple entities. All terms are stemmed.
spellings of the NSA are artefacts from the named entity recognition step, which suggests that the implicit network representation can also be used as a tool for disambiguation or co-reference resolution. Moving beyond single-entity queries, when we include Barack Obama as a second query entity, the focus of the results shifts from security agencies to politics, where the Snowden incident was discussed. In contrast to querying for entities, we can also select the set of terms as output, as shown at the bottom of Table 3.2. In this example, we find that the extracted terms provide a solid first impression of what made Snowden famous. Thus, we find that terms in the network can serve to describe entities or their relations. Overall, such browsing can be used as a tool for following connections and co-occurrences of entities through the data to explore events or relations, much like a knowledge graph. Due to the query speed, an interactive exploration is feasible.
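To make the browsing queries more concrete, the following sketch ranks nodes of a chosen output type by their connection to one or more query entities. It is a toy illustration under explicit assumptions: the edge weights are made up, the function `rank` is hypothetical, and summing the weights is only a stand-in for the combination ϱ defined in Chapter 3.2, which is not reproduced here.

```python
from collections import defaultdict

# Hypothetical toy network: (source node, target node) -> directed weight.
# Nodes are (type, name) pairs; all weights are invented for illustration.
EDGES = {
    (("Act", "edward snowden"), ("Org", "nsa")): 0.75,
    (("Act", "edward snowden"), ("Org", "gchq")): 0.25,
    (("Act", "barack obama"), ("Org", "nsa")): 0.25,
    (("Act", "barack obama"), ("Org", "senate")): 0.75,
    (("Act", "edward snowden"), ("Ter", "leak")): 0.5,
}


def rank(output_type, query_nodes, edges=EDGES, top_k=5):
    """Rank nodes of output_type by their summed weight from the query nodes.

    Assumption: weights of multiple query entities are combined by a plain
    sum, which is a placeholder and not the combination used in the thesis.
    """
    scores = defaultdict(float)
    for (source, target), weight in edges.items():
        if source in query_nodes and target[0] == output_type:
            scores[target[1]] += weight
    # Sort by descending score, breaking ties alphabetically by name.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:top_k]


if __name__ == "__main__":
    snowden = ("Act", "edward snowden")
    obama = ("Act", "barack obama")
    print(rank("Org", {snowden}))
    print(rank("Org", {snowden, obama}))
```

In this toy setting, adding a second query entity pulls nodes connected to both entities to the top and introduces nodes that are tied only to the second entity, which mirrors the shift in focus observed in Table 3.2.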
3 Implicit Entity Networks
Summarization
Based on the relationships of entities and terms in the graph, extracting descriptions for entities is no different from querying for any other type of target node. However, we can also query for sentences that contain the relevant entities directly to obtain extractive summaries. For example, the resulting top-ranked sentence for the query
⟨Sen : (Act, Edward Snowden), (Org, NSA)⟩
is “In early 2013, thousands of thousands of classified documents were disclosed by NSA contractor Edward Snowden”, which summarizes the relationship nicely. While the current approach of locating sentences that contain the specified entities is straightforward, more intricate summarization metrics can be adapted to the existing graph structure, as we discuss in detail in Chapter 4.3.
Entity and concept linking
Since we constructed the graph from Wikipedia texts, we can also use it to recommend documents for entities, and effectively link entities to Wikipedia pages. The query
⟨Doc : (Act, Edward Snowden)⟩
unsurprisingly yields Snowden’s Wikipedia page as the top-ranked result. However, we can link more complex or combined concepts as well, for example with the query
⟨Doc : (Act, Edward Snowden), (Ter, Surveillance)⟩.
For this query, the Wikipedia page for Global surveillance disclosures since 2013 is placed ahead of Snowden’s page in the ranking, since it lists the surveillance activities that Snowden uncovered in greater detail than the page describing Snowden himself. In this context, the concept of subqueries is helpful, since persons are rarely referred to repeatedly by their full name on their own page. For example, the query
⟨Doc : (Act, Albert Einstein)⟩
ranks the Wikipedia page for the Albert Einstein Transfer Vehicle, which was used to supply the International Space Station, highest. The reason for this is that the vehicle is always referred to by its full name, while Albert Einstein himself is more often referred to as Einstein. Allowing
date        event description
1918-07-08  Ernest Hemingway, Red Cross volunteer, wounded in Italy
1960-10-12  Nikita Khrushchev pounded his desk with a shoe during a UN speech
1966-09-08  "Star Trek" debuted (NBC)
1973-10-06  Israel attacked by Egypt and Syria
1866-11-20  Bicycle with a rotary crank patented (Pierre Lallemont)
1960-08-19  Francis Gary Powers, U-2 spy plane pilot, convicted in a Moscow court
Table 3.3: Examples of events from the event participation evaluation data set that are annotated for actors, locations and organizations, together with the date on which the event took place. Note that these events are not necessarily represented by a single sentence in Wikipedia.
subqueries fixes this, and results in the physicist’s page being top-ranked. While the transport vehicle should never have been tagged as a person during named entity recognition, such an error is common even with state-of-the-art tools. However, our findings indicate that implicit networks could be used to detect such inconsistencies. We encounter the task of document linking again for the recommendation of documents based on selected entities in Chapter 4.2, and for the linking of network topics to news articles in Chapter 6, where we provide additional use cases and examples.