
in the same sentence. As a result, we expect collocations to play a more significant role for word embeddings, which rely on a similarly window-based cooccurrence extraction, than for implicit networks of entities. There are, however, some works on collocations that are at least conceptually related to our model in Chapter 3, since they rely on graph representations of collocations [15, 35] or even on the notion of edge direction [78], similar to the graph-of-word approach. In contrast to the networks that we consider, these graphs are intended to function on a much smaller scale and on a linguistic level that is much closer to the individual words or terms.

A noteworthy aspect of cooccurrence networks, and especially of collocation networks such as the examples above, is the process behind their emergence in contrast to their construction. This contrast has raised the question of whether such networks should indeed be considered complex networks [232]. While it is obvious on the surface that networks can be constructed from word adjacencies and word proximities, the underlying generation process (that is, natural language) is different from those of the complex networks commonly considered in network analysis [144], which often include some notion of conceptual or physical flow along edges. As a result, the application of typical metrics from network science may lack a formal basis, and should therefore be considered critically and in the context of the application. In our contributions, we thus focus on local relationships in the implicit networks, instead of flow-based global network metrics or indices.

2.3 Graphs and Heterogeneous Information Networks

Since our implicit network model is a graph structure, we consider some of the basic terminology for graphs and networks in the following, and discuss related work on knowledge graphs. Knowledge graphs in particular are interesting due to their dual overlap with our own work. On the one hand, they serve as resources for the annotation of entities, as we discuss in Chapter 2.4. On the other hand, the underlying principles of relation and information extraction from unstructured text are conceptually similar to the intuition behind implicit networks, albeit different in methodology.

2.3.1 Foundations of graphs and networks

The terms network and graph are often used interchangeably, which we also do to some degree in the following. Formally, a graph G = (V, E) is a mathematical construct that consists of a set of nodes V (often also referred to as vertices) that are connected by a

2 Background and Related Work

set of edges E (sometimes also called links). Thus, edges are elements of the Cartesian product of the set of nodes with itself, meaning that E ⊆ V × V. Strictly speaking, this results in edges being ordered tuples, such that for two distinct nodes v, w ∈ V, we have (v, w) ≠ (w, v). Such a graph is called directed, since the order of nodes in the tuple matters and there are two reciprocal edges that could connect any two given nodes (one in either direction). Since many applications do not consider a direction of edges, undirected graphs can also be considered, in which edges are formally treated as sets of nodes of size two instead of ordered tuples. On occasion, it can be useful to represent a graph as an adjacency matrix M of dimension |V| × |V|, whose binary entries M_vw indicate whether there exists an edge between nodes v and w.
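The relation between the edge set and the adjacency matrix can be made concrete with a minimal sketch. The function name `adjacency_matrix`, the node labels, and the `directed` flag are illustrative assumptions and are not part of the model described in this work.

```python
def adjacency_matrix(nodes, edges, directed=True):
    """Build a binary |V| x |V| adjacency matrix as a list of lists.

    nodes: ordered list of node labels (fixes the row/column order)
    edges: iterable of (v, w) tuples, i.e. a subset of V x V
    """
    index = {v: i for i, v in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for v, w in edges:
        matrix[index[v]][index[w]] = 1
        if not directed:
            # An undirected edge {v, w} is visible from both endpoints.
            matrix[index[w]][index[v]] = 1
    return matrix


V = ["a", "b", "c"]
E = {("a", "b"), ("b", "c")}  # hypothetical toy edge set

M_directed = adjacency_matrix(V, E, directed=True)
M_undirected = adjacency_matrix(V, E, directed=False)
```

In the directed case the matrix is in general asymmetric, whereas treating the same edges as undirected always yields a symmetric matrix.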

Both nodes and edges may be assigned additional attributes such as weights, which we model as functions f : V → Att_f or g : E → Att_g that map the respective nodes or edges to some corresponding attribute space Att. Typical attributes of graphs that we use in our model are the neighbourhood and the degree. The neighbourhood of a node v is the set of all nodes that are adjacent to v, meaning that there exists an edge that connects the node to v. We denote the set of neighbours with N(v) and let, for an undirected graph,

N(v) := {w ∈ V : (v, w) ∈ E}.    (2.4)

The degree deg(v) := |N(v)| is then the number of neighbours of node v. For directed graphs, the neighbourhood and degree are defined analogously, but we distinguish between the indegree and outdegree of nodes, depending on the direction of edges.
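The definitions of neighbourhood and degree translate directly into code. The following is a minimal sketch that assumes edges are stored as pairs; for the undirected case, the order within a pair is ignored. All function names and the toy edge set are hypothetical.

```python
def neighbourhood(v, edges):
    """N(v) for an undirected graph whose edges are stored as pairs."""
    result = set()
    for a, b in edges:
        if a == v:
            result.add(b)
        if b == v:
            result.add(a)
    return result


def degree(v, edges):
    """deg(v) := |N(v)| for an undirected graph."""
    return len(neighbourhood(v, edges))


def out_degree(v, edges):
    """Number of distinct targets of edges that start at v (directed)."""
    return len({w for (u, w) in edges if u == v})


def in_degree(v, edges):
    """Number of distinct sources of edges that end at v (directed)."""
    return len({u for (u, w) in edges if w == v})


E = {("a", "b"), ("b", "c"), ("a", "c")}  # hypothetical toy edge set
```

Read as an undirected graph, every node in this toy example has degree 2. Read as a directed graph, node a has outdegree 2 and indegree 0, while node c has indegree 2 and outdegree 0.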

While graphs provide the formal representation, and graph theory is concerned with the mathematics of graphs, the term network is typically used in applications that focus on the data or on processes that happen inside the graph. Network analysis is then concerned with modelling the topological structure and the principles of the emergence of graphs that represent observed complex systems. In the following chapters, we utilize only a few additional graph notations or tools from network analysis, which we introduce as they are needed. For a more in-depth background, we refer to the textbook by Newman [144].

As we have already touched upon in Chapter 2.2, the application of many established network-analytic measures should be approached with caution, since the implicit networks in our model constitute a type of cooccurrence network. Therefore, we do not go into detail on the metrics that are frequently applied in network analysis to uncover the topology and global characteristics of such networks, but instead focus on leveraging the local structures for the discovery and retrieval of term relations. Thus, our approach also covers applications that typically fall into the domain of knowledge graphs.


2.3.2 Information networks and knowledge graphs

While almost any system with interconnected components can be modelled as a complex network, an important type of such networks is the information network. In contrast to plain (complex) networks, information networks encode discrete labels as attributes of nodes and edges [191]. Typical examples include scientific collaboration or citation networks that contain authors, publications, and journals or conferences as nodes, and (co-)authorship relations or editorial positions as relations, as well as cinematic networks of movies and actors. From an analytic point of view, such information networks are especially interesting when they have a heterogeneous structure in which nodes and/or edges have multiple different attribute values. For an overview of heterogeneous information networks, we refer to the recent review by Shi et al. [168].

Knowledge graphs

An important type of heterogeneous network for natural language processing is the knowledge graph (or knowledge base), both as a product of language processing and as a resource. While the term knowledge base stems from the contrast to data base, with the intention of highlighting the fact that it stores knowledge and not just data, the knowledge bases that are in widespread use have a graph structure, so the term knowledge graph is equally appropriate. This structure is also reflected in the RDF storage format in use, which represents the contained information as triples [92] that can be interpreted as two nodes and a connecting edge. Based on this structure, knowledge graphs then support a variety of tasks in which the added knowledge is helpful, or allow the inference of information that is not explicitly contained [172].
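The interpretation of a triple as two nodes and a connecting edge can be sketched as follows. The example deliberately uses plain Python tuples rather than an RDF library, and the subject, predicate, and object strings are simplified illustrative labels, not identifiers from any actual knowledge base. It also assumes that every object is itself a node, whereas RDF additionally allows literal values as objects.

```python
from collections import defaultdict

# Illustrative (subject, predicate, object) triples with made-up labels.
triples = [
    ("Alan_Turing", "place_of_birth", "London"),
    ("Alan_Turing", "educated_at", "King's_College_Cambridge"),
    ("London", "country", "United_Kingdom"),
]


def triples_to_graph(triples):
    """Interpret each triple as a directed, labelled edge between two nodes."""
    nodes = set()
    out_edges = defaultdict(list)  # subject -> [(predicate, object), ...]
    for subject, predicate, obj in triples:
        nodes.update((subject, obj))
        out_edges[subject].append((predicate, obj))
    return nodes, dict(out_edges)


nodes, out_edges = triples_to_graph(triples)
```

The predicate plays the role of a discrete edge label, which is what makes the resulting structure an information network in the sense described above.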

There are currently a number of knowledge bases that started as scientific projects and serve as major sources of structured knowledge in scientific publications (and beyond), as well as some more specialized knowledge bases with a narrower focus (for a thorough overview, see [190]). Of the large knowledge bases, DBpedia [14] deserves mention as one of the major representatives of knowledge bases that aim to extract structured knowledge from the semi-structured and unstructured content of Wikipedia. Over the years, it has grown to include knowledge from over 100 language editions of Wikipedia. Of the more specialized knowledge bases, EventKG [76] is related to our work in its focus on entities as the central components of events. EventKG is a temporal event knowledge graph that is itself composed of aggregated event-centric knowledge from larger and more general knowledge bases, and is designed to provide event data for timeline generation or question answering. In our work, we rely on two of the major knowledge bases, namely YAGO [124] and Wikidata [206], which we describe in more detail in the following.

Figure 2.1: Example of an entry in Wikidata for the item that corresponds to Alan Turing. In contrast to traditional knowledge bases, Wikidata does not store facts but claims, which are backed by references and can be ranked by the community. The data model consists of only items and properties (classes do not exist explicitly), which entails that the entire class hierarchy can be edited by the users, causing it to gradually change according to the requirements of new data being added.

YAGO. Similar to DBpedia, YAGO started as a project for the extraction of knowledge from the semi-structured content of Wikipedia, and its name is a tongue-in-cheek acronym for Yet Another Great Ontology. However, despite the apparent redundancy, YAGO includes a unique feature that makes it useful in the classification of entities: its taxonomy. Categories in YAGO are derived by combining classes that are extracted from Wikipedia categories with the WordNet ontology [133], which enables a much cleaner categorization of named entities than in comparable knowledge bases. Several versions of YAGO have been published over the years, each with updated content and more included sources. Recently, the extraction of further contemporary versions of YAGO has been suspended in favour of making the extraction code available as open source.

Wikidata. Unlike previous knowledge bases, Wikidata arose not from the need to extract knowledge from existing sources, but as a knowledge base behind Wikipedia to provide structured knowledge for handling the challenges of updating facts across the multilingual versions of Wikipedia. In contrast to previous approaches to constructing knowledge bases, Wikidata is a collaborative knowledge base into which the users can enter statements manually, although it is increasingly expanded by automated bots [187]. Since its inception, Wikidata has been merged with the historically influential knowledge base Freebase [32] in a collaborative process that had users confirm imported statements [196], and has recently become a major source of structured knowledge well beyond Wikipedia. For our work, however, Wikidata is not interesting merely due to its size, but because of the collaborative process behind it that ensures the addition of emerging entities in a timely fashion, which is relevant for the processing of news streams. Because of this collaborative process, however, Wikidata uses a hierarchy that is less rigid than those of extraction-based knowledge graphs, and features a more dynamic structure. As shown in Figure 2.1, statements can be assigned ranks and qualifiers to model their potentially limited validity periods. The perpetual change of the knowledge graph that this structure entails is both a blessing and a curse when Wikidata is used for linking entities, as we discuss in more detail in Chapter 2.4.

Knowledge extraction at a web scale

In the construction of knowledge graphs, semi-structured information such as tables or Wikipedia infoboxes is an invaluable source of information, but it is limited in scope. To include purely textual sources as well, the extraction of information from unstructured text is necessary. Numerous approaches to such a procedure have been presented over the years, which are too many and too far removed from our approach to cover here (for an overview, see [211]).

However, an especially important concept in this respect is open information extraction, which aims to identify novel entities and relations at a Web scale [16]. Conceptually, this can be achieved by repeatedly chaining the extraction of relations and entities. Newly identified relations are used to identify previously overlooked or emerging entities as one of their arguments, while entities in turn help to locate new relations that have the entities as arguments. As a result, this process is often pattern-based, although newer contributions also approach the problem by jointly embedding entities and relations [154]. In addition to providing knowledge for the extension of knowledge bases, open information extraction also serves to provide structured knowledge for question answering (QA) from unstructured sources [109] as an information retrieval task.
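The alternation between entities and relations described above can be illustrated with a deliberately simplified, pattern-based toy loop. This is a sketch of the general bootstrapping idea only and not the method of any of the cited systems: the function `bootstrap`, the sentence format (two single-token entities separated by a connecting phrase), and the example sentences are all assumptions made for illustration.

```python
import re


def bootstrap(sentences, seed_pairs, rounds=2):
    """Alternate between learning patterns from known entity pairs
    and finding new entity pairs with the learned patterns."""
    pairs = set(seed_pairs)
    patterns = set()
    for _ in range(rounds):
        # Step 1: known entity pairs reveal the phrases that connect them.
        for sentence in sentences:
            for a, b in pairs:
                regex = re.escape(a) + r" (.+) " + re.escape(b)
                match = re.fullmatch(regex, sentence)
                if match:
                    patterns.add(match.group(1))
        # Step 2: known phrases reveal new entity pairs as their arguments.
        for sentence in sentences:
            for pattern in patterns:
                regex = r"(\w+) " + re.escape(pattern) + r" (\w+)"
                match = re.fullmatch(regex, sentence)
                if match:
                    pairs.add((match.group(1), match.group(2)))
    return pairs, patterns


sentences = [
    "Paris is the capital of France",
    "Berlin is the capital of Germany",
    "Berlin lies in Germany",
    "Munich lies in Germany",
]
pairs, patterns = bootstrap(sentences, seed_pairs={("Paris", "France")})
```

Starting from a single seed pair, the first round learns the phrase "is the capital of" and finds the pair (Berlin, Germany). The second round then picks up "lies in" from that pair and finds (Munich, Germany). The toy loop thereby mixes two different relations, which hints at why practical systems need additional constraints on the patterns they accept.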

A downside of these approaches is the rigid concept of relation, which must necessarily be qualifiable as a well-specified relation between two entities that is explicitly mentioned in the text. As a result, ambiguous relation information is often discarded, even though it might be sufficient to satisfy a user’s information need in an exploration of the documents. This focus on qualifiable relations is the primary difference between knowledge graphs and implicit networks. Where knowledge graphs and relation extraction methods target explicit statements to extract discrete relations between entities from a local scope with a focus on precision, implicit networks are designed to extract quantifiable relation strengths between entities. Thus, the former process is more precise in the extraction of relations from individual documents that can be described in an information network, whereas the latter emphasizes a higher recall to identify relevant relations across the document collection that are never explicitly stated in the documents.
