As we have demonstrated in the previous chapters, tasks that can be formulated as entity or term rankings, or as the extraction of weighted entity graph patterns, can be addressed with an implicit network model. In particular, events as dyadic or triadic structures of entities can be queried efficiently, as we showed in Chapter 3. Due to the transitivity of edge aggregation (edges can always be aggregated further), all entity-centric exploration methods designed for the static implicit network also work with the context-enriched dynamic model. We thus focus on novel exploration methods that utilize temporal data and the context of entity mentions to extract evolving entity-centric topics from document streams. To this end, we show exploratory results on a large stream of entangled news articles from several news outlets.

5.4.1 Entangled news stream data

Since news streams are a typical example of complex, entangled document streams from multiple sources, we use them as data for our exploration. We first describe the acquisition and preparation of the news data, as well as the construction of the graph representation.

5.4 Entity Context Exploration

Data collection

Since we want to analyze the model on entangled news streams from multiple outlets, standard corpora from a single outlet such as the New York Times corpus [159] cannot be used. Instead, we collect articles from the RSS feeds of international outlets with a focus on high-quality news. To extract the content, we use manually created rules, since these enable a clean extraction of article contents (including multi-page articles) at a level that automatic boilerplate removal does not support [180].

Specifically, we use articles from 14 English-speaking news outlets located in the U.S. (CNN, Los Angeles Times, New York Times, USA Today, CBS News, The Washington Post, International Business Times), Great Britain (BBC, The Independent, Reuters, Sky News, The Telegraph, The Guardian), and Australia (Sydney Morning Herald). The RSS feeds of these outlets differ, but we focus on feeds that cover political news. The time frame for our data collection is June 1 to November 30, 2016. We remove articles that have fewer than 200 or more than 20,000 characters (due to limitations of the named entity recognition framework). We also remove articles that contain more than 100 disambiguated entities (typically, these are not articles but lists of real estate in weekend editions of newspapers). The final collection contains 127,485 time-stamped documents over a period of six months, with a total of 5.4M sentences.
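The filtering rules above can be sketched as a simple predicate. This is a minimal illustration, not the original pipeline: the `Article` record and its field names are hypothetical, and counting every entry of `entity_ids` (rather than only distinct entities) is an assumption. Only the three thresholds come from the text.

```python
from dataclasses import dataclass

MIN_CHARS = 200       # from the text: shorter articles are removed
MAX_CHARS = 20_000    # from the text: longer articles are removed
MAX_ENTITIES = 100    # from the text: articles with more entities are removed


@dataclass
class Article:
    # Hypothetical record layout, not taken from the original pipeline.
    text: str
    entity_ids: list  # disambiguated entity IDs found in the article


def keep_article(article: Article) -> bool:
    """Return True if the article passes the length and entity-count filters."""
    n_chars = len(article.text)
    if n_chars < MIN_CHARS or n_chars > MAX_CHARS:
        return False
    # Assumption: every list entry counts towards the limit of 100 entities.
    return len(article.entity_ids) <= MAX_ENTITIES
```

Applied to a collected feed, `[a for a in articles if keep_article(a)]` would yield the retained documents.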

Data preparation

Similar to the Wikipedia implicit network, we again focus on named entities of the types location, organization, person (actor), and date, since these correspond well to the central entities of news events. Data preparation then consists of five steps: recognition of named entities, entity linking, entity classification, part-of-speech and sentence tagging, and temporal tagging. For the recognition and disambiguation of named entities to Wikidata IDs, we use the Ambiverse natural language understanding suite [93]. To classify named entities into actors, locations, and organizations, it would be possible to use Wikidata hierarchies directly, but this can be problematic due to their constantly evolving structure [176]. Therefore, we map Wikidata IDs to YAGO3 entities [124] and classify them according to the YAGO hierarchy, since it is derived from WordNet hierarchies and easier to handle (see Chapter 2.4). For actors, we use the class wordnet_person_100007846, and for organizations wordnet_social_group_107950920. For locations, no comprehensive WordNet class exists, so we use yagoGeoEntity, which was designed specifically for this purpose [96]. For the extraction and normalization of temporal expressions, we run HeidelTime in the news domain setting [188]. Finally, for sentence splitting and part-of-speech tagging, we use the Stanford POS tagger [200].

Implicit network construction

To construct the network, we proceed as described in Chapter 5.3. We again use stemming instead of lemmatization, with the same reasoning as in the case of Wikipedia in Chapter 3.3, and rely on the Porter stemming algorithm [149]. We impose a minimum word length of 4 characters for terms, and set the window size for the extraction of entity cooccurrences to c = 5. For the term embeddings that encode the cooccurrence context, we use Google’s pre-trained 300-dimensional word2vec [131] word embeddings as an out-of-domain source that is trained on a much larger corpus of news articles.
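To make the two parameters concrete, the following is a rough sketch of term filtering and window-based cooccurrence extraction. It rests on assumptions that are not stated in this section: the window c is taken to be measured in sentences, mentions are given as hypothetical (sentence_index, entity_id) pairs, and stemming is omitted. The actual construction is the one described in Chapter 5.3.

```python
from itertools import combinations

WINDOW = 5        # c = 5 from the text; assumed here to be counted in sentences
MIN_TERM_LEN = 4  # minimum word length for terms, from the text


def filter_terms(tokens):
    """Keep only tokens that meet the minimum term length."""
    return [t for t in tokens if len(t) >= MIN_TERM_LEN]


def cooccurrences(mentions, window=WINDOW):
    """Return (entity_a, entity_b, distance) for mentions at most `window` apart.

    `mentions` is a list of (sentence_index, entity_id) pairs for one document
    (a hypothetical input format chosen for this sketch).
    """
    ordered = sorted(mentions)
    pairs = []
    for (i, a), (j, b) in combinations(ordered, 2):
        distance = j - i  # non-negative because the mentions are sorted
        if a != b and distance <= window:
            pairs.append((a, b, distance))
    return pairs
```

Each returned pair would correspond to one parallel edge prior to aggregation, which is consistent with the edge counts reported for the resulting network.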

The resulting network then has 5.7K dates, 27.7K locations, 72.0K actors, 19.6K organizations, and 351K terms, which are connected by 83.4M parallel edges (prior to aggregation). While this data is therefore substantially smaller than the Wikipedia implicit network, it contains six months of news, which is a reasonable time frame for our analyses.

5.4.2 Contextual topic evolution

To highlight the exploratory possibilities of the model, we demonstrate the extraction of evolving contextual topics. To this end, we extract topics that best describe the individual contexts in which two entities are mentioned together and consider their evolution over time. Naturally, multiple such contexts may exist for any given pair of entities, which is reflected by the multiple parallel edges.

Contextual topics

We first provide a description of our approach to the extraction of contextual topics, before we consider their evolution. Recall that a context vector κ(a) is associated with each aggregated edge a = (v, w). We define a contextual topic of edge a as a weighted list of terms that describe the context in which entities v and w occur in instances included in a. To extract the contextual topics for each aggregated edge between these entities, we retrieve all terms T_x = N(v) ∩ N(w) ∩ T in the joint neighbourhood of the two nodes, along with all edges that connect them to v or w. We aggregate these edges such that each term x is connected to both v and w by exactly one edge, which we denote with a_v and a_w. Based on these triangular structures, we obtain a ranking score for each term x ∈ T_x in relation to edge a as

\[
\varrho_t\bigl(x \mid a = (v,w)\bigr) := \min\bigl\{\operatorname{sim}(\kappa(a), \kappa(a_v)),\ \operatorname{sim}(\kappa(a), \kappa(a_w))\bigr\} \tag{5.10}
\]

Intuitively, we rank terms by how closely the context in which they occur with an entity matches the context in which the two entities occur together. We create such a ranking of terms for all aggregated edges between v and w. For each such edge, we select the k top-ranked adjacent terms to describe the topic. Thus, we obtain a natural language description for each of the edges between the two entities. Since edges are aggregated based on context similarity, the assumption is that the terms then describe the context of an aggregated edge, and that each aggregated edge in turn represents a topic. Furthermore, since edges are also attributed with temporal information in the form of publication dates of the articles that induced these edges, we can consider the evolution of edge-centric topics over time, which we do in the following.
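The ranking in Eq. 5.10 and the subsequent top-k selection can be sketched in a few lines. The function names, the dictionary layout of `term_contexts`, and the use of plain Python lists as context vectors are assumptions made for illustration. Cosine similarity is used as the default `sim`, matching the choice made for the exploration results.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_terms(edge_context, term_contexts, k=5, sim=cosine):
    """Score terms as in Eq. 5.10 and return the k best as (term, score) pairs.

    edge_context: context vector kappa(a) of the aggregated edge a = (v, w).
    term_contexts: maps each term x in the joint neighbourhood to the pair
        (kappa(a_v), kappa(a_w)) of context vectors of its edges to v and w.
        This layout is hypothetical.
    """
    scores = {
        term: min(sim(edge_context, ctx_v), sim(edge_context, ctx_w))
        for term, (ctx_v, ctx_w) in term_contexts.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
```

Taking the minimum means that a term scores highly only if its context matches the edge context with respect to both entities, which is what the triangular structure is meant to capture.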

Exploration results

To demonstrate the expressiveness of contextual topics, we show a timeline visualization of contextual topics for pairs of entities. To extract these topics, we use a cosine similarity of the context vectors and rank the adjacent terms as described above. Then, we assign to each aggregated edge between the two entities the k = 5 top-ranked terms as topic descriptors. We select the three top edges by multiplicity (that is, the aggregated edges with the highest λ values). Since each such edge is associated with a set of publication times, we can plot the evolution of the topics over time. The results for two entity pairs are shown in Figure 5.2. At the top, we see the evolution of contextual topics for Brazil and the IOC (that is, the International Olympic Committee, which is the organization to which the Olympic Games are linked in this data set). One can easily identify contexts as dealing with corruption, sports, and the awarding of medals. Specifically, the award topic spikes precisely at the date of the games. The second example shows the relation of Prime Minister David Cameron to the United Kingdom during the beginning of the Brexit crisis. While all three topics are related to this issue, the referendum topic spikes at the proper date, and the drastic shift of the remaining topics towards Cameron’s resignation occurs only after the result of the referendum is announced.
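The selection of edges and the timeline data behind such a plot could be computed roughly as follows. This is a sketch under assumptions: the edge dictionaries with a `multiplicity` key are a hypothetical representation, publication dates are binned per calendar day, and counts are scaled by the peak bin so that values lie between 0 and 1. The exact binning and normalization used for the figure are not specified in this section.

```python
from collections import Counter
from datetime import date


def top_edges(edges, n=3):
    """Return the n aggregated edges with the highest multiplicity (lambda)."""
    return sorted(edges, key=lambda e: e["multiplicity"], reverse=True)[:n]


def relative_frequency(pub_dates):
    """Count publication dates per day and scale the counts by the peak day.

    Scaling by the peak is an assumption; it maps the busiest day to 1.0.
    """
    counts = Counter(pub_dates)
    if not counts:
        return {}
    peak = max(counts.values())
    return {day: n / peak for day, n in sorted(counts.items())}
```

For example, an edge with two articles on `date(2016, 6, 23)` and one on `date(2016, 6, 24)` would yield the values 1.0 and 0.5 for those two days.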

Overall, the intuitive notion of aggregated edges as contexts in which entities are mentioned corresponds well with our observations. Thus, term-based topic descriptors can

5 Dynamic Implicit Entity Networks

[Figure 5.2, plot placeholder. Two timeline panels; x-axis: months June to October; y-axis: relative frequency of mentions (0.00 to 1.00).
Panel "Topics for Brazil (Q155) − IOC (Q40970)", stemmed legend terms: region, decad, crisis, insist, corrupt, olymp, game, athlet, sport, event, silver, bronz, gold, medal, medalist.
Panel "Topics for David Cameron (Q192) − UK (Q145)", stemmed legend terms: brexit, nation, favour, demand, govern, referendum, ukip, vote, westminst, campaign, prime, minist, leader, resign, pro−brexit.]

Figure 5.2: Evolution of the context of edges as contextual topics for two selected entity pairs with streaming aggregation (using cosine similarity and an aggregation threshold of th = 0.65). Shown is the relative aggregated frequency of publication dates for the three edges with the highest multiplicity λ. Contexts are derived from the joint neighbourhood of both entities by selecting the k = 5 terms whose context is most similar to the edge context. Q identifiers denote Wikidata IDs. Top: relation between Brazil and the International Olympic Committee. Bottom: relation between David Cameron and the United Kingdom.
