
We begin by giving an intuitive definition of entity-centric sentence extraction, before formalizing the extraction operations based on the implicit network model. For the construction of the implicit network, we consider the same components as in Chapter 3.2.

Descriptive sentence extraction

To introduce the extraction tasks, we begin with the special case of single-entity sentence extraction, which we then extend to the more general multi-entity sentence extraction.

Single-entity sentence extraction. Given a collection of documents $D$ that consist of sentences $S$, a set of entities $E$, and a query entity $q \in E$, we let

$$S_q := \{ s \in S \mid q \in s \} \tag{4.8}$$

be the set of sentences in which entity $q$ is mentioned. The aim is then to identify the sentence $s_q \in S_q$ that generally best describes $q$. Extending this notion, we can also consider the summarization of relationships between entities by focusing on sentences that jointly describe the occurrences of a set of entities.

Multi-entity sentence extraction. Given a collection of documents $D$, sentences $S$, entities $E$, and a subset of query entities $Q \subseteq E$, we let

$$S_Q := \bigcup_{q \in Q} \{ s \in S \mid q \in s \} \tag{4.9}$$

be the set of sentences in which at least one of the entities is mentioned. A descriptive sentence for the set of entities $Q$ then is the sentence $s_Q \in S_Q$ that best describes the joint occurrences of entities from $Q$ in $D$. From this definition, it is clear that single-entity sentence extraction is a special case of multi-entity sentence extraction for $|Q| = 1$. In the following, we thus focus on the more general task.
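The candidate sets of Eq. 4.8 and Eq. 4.9 can be illustrated with a minimal sketch. The toy data and the names `sentence_entities` and `candidate_sentences` are hypothetical and chosen only for illustration; sentences are represented simply by the set of entities they mention.

```python
from typing import Dict, Set

# Hypothetical toy data: sentence id -> set of entities mentioned in it.
sentence_entities: Dict[str, Set[str]] = {
    "s1": {"Paris", "France"},
    "s2": {"Berlin"},
    "s3": {"France", "Berlin"},
    "s4": set(),
}


def candidate_sentences(query: Set[str], mentions: Dict[str, Set[str]]) -> Set[str]:
    """Return S_Q: all sentences that mention at least one query entity."""
    return {s for s, entities in mentions.items() if entities & query}


# |Q| = 1 reproduces the single-entity case S_q.
single = candidate_sentences({"Paris"}, sentence_entities)
# |Q| > 1 yields the union over all query entities.
multi = candidate_sentences({"Paris", "Berlin"}, sentence_entities)
```

With a single query entity the function returns exactly the set $S_q$, which mirrors the observation that the single-entity task is the special case $|Q| = 1$.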

Implicit network construction

For the underlying graph representation, we follow the definitions in Chapter 3.2. That is, we use a graph $G = (V, E)$, in which the set of nodes corresponds to $V := T \cup E \cup D \cup S$. The set of edges $E$ again represents the containment or cooccurrence of terms or entities in sentences, and each of these edges is weighted with asymmetric weights $\vec{\omega}$ that we can use as input for the scoring functions. A schematic view of the graph that visualizes the sentence-centric focus is shown in Figure 4.5. Finally, recall that we denote the neighbourhood of a node $v$ in the graph with $N(v)$.

4.3 Entity-centric Summarization

Figure 4.5: Schematic view of an implicit network for one document with four sentences. Entities (purple) and terms (yellow) as the main components of sentences are highlighted.

Scoring functions for sentences

Based on the implicit entity network, we now introduce realizations of sentence extraction methods. We treat the task as a sentence ranking problem, in which we rank sentences according to their relevance for a set of input query entities $Q \subseteq E$, and then select the top-ranked sentence(s). Formally, we use scoring functions $\varrho : S \to \mathbb{R}$ that allow a ranking of sentences in the collection by their descriptiveness for the input entities. The answer to a query then is a sentence $s \in S$ such that $\varrho(s) \geq \varrho(s')$ for all $s' \in S \setminus \{s\}$. In the following, we describe four different scoring methods in a sequential manner. Thus, with the exception of the last method, each subsequent method includes and builds upon the previous methods.

Entity count (ENCO). As a baseline $\varrho_{\mathrm{enco}}$, we use the method that we proposed in Chapter 3.2 and employed for initial exploration there. Recall that it counts the number of entities from the query set that occur in a sentence. Using set notation for the neighbourhoods, we obtain the score of a sentence $s$ as

$$\varrho_{\mathrm{enco}}(s, Q) := |N(s) \cap Q|. \tag{4.10}$$

For descriptive sentence extraction, this method can serve as a solid baseline, but suffers from two shortcomings. First, it performs poorly in the extraction of descriptive sentences for single query entities, since it assigns equal score to all sentences that contain the entity. Second, it does not consider the context of a sentence beyond the contained entities. Thus, ties between sentences with the same number of entities cannot be broken, which is especially of interest if no sentence exists that contains all query entities $Q$.
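As a rough illustration of the baseline and of its tie problem, the following sketch scores a few sentences with Eq. 4.10. The neighbourhoods in `toy_graph` and the helper `rank_sentences` are hypothetical; each sentence is reduced to the set of entity and term nodes adjacent to it.

```python
from typing import Dict, List, Set


def enco(neighbours: Set[str], query: Set[str]) -> int:
    """Entity count score (Eq. 4.10): size of the overlap of N(s) and Q."""
    return len(neighbours & query)


def rank_sentences(graph: Dict[str, Set[str]], query: Set[str]) -> List[str]:
    """Rank sentence ids by descending ENCO score (stable sort keeps input order on ties)."""
    return sorted(graph, key=lambda s: enco(graph[s], query), reverse=True)


# Hypothetical neighbourhoods N(s): entities and terms adjacent to each sentence.
toy_graph: Dict[str, Set[str]] = {
    "s1": {"Paris", "France", "capital"},
    "s2": {"Paris", "museum"},
    "s3": {"Paris", "France", "river", "Seine"},
}

pair_ranking = rank_sentences(toy_graph, {"Paris", "France"})
single_scores = {s: enco(n, {"Paris"}) for s, n in toy_graph.items()}
```

For the single query entity, every sentence receives the same score, so the ranking carries no information; for the pair, `s1` and `s3` remain tied. Both effects correspond to the two shortcomings named above.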

4 Applications of Implicit Networks

Term influence (TERI). Based on the above observations, we suggest an improved two-component scoring function that corresponds to the sentence ranking that we used in our implementation of EVELIN in Chapter 4.2. The number of entities in the sentence is kept as the first component, while we derive the second component from the set of terms that are most relevant to the query entities. To this end, we consider a ranking of terms in the neighbourhood of a query entity $q$ by the directed edge importance $\vec{\omega}$ of the connecting edge, and let $t_n$ be the $n$-th ranked such term. Let

$$T_n(q) := \{ t \in T \mid \vec{\omega}(t \mid q) \geq \vec{\omega}(t_n \mid q) \}, \tag{4.11}$$

then the set $T_n(q)$ contains the $n$ top-ranked terms in the graph with regard to $q$. From this, we obtain the most relevant terms for a set of query entities $Q$ as

$$T_n(Q) := \bigcup_{q \in Q} T_n(q). \tag{4.12}$$

We then use these terms to represent the context of the query entities and to act as placeholders for query entities in a sentence that does not contain all query entities directly. We still rank by the number of query entities in the sentence first, but break ties by using the $n$ most relevant terms for each entity. Formally, we let

$$\varrho_{\mathrm{teri}}(s, Q, n) := |N(s) \cap Q| + \frac{|N(s) \cap T_n(Q)|}{|T_n(Q)| + 1}. \tag{4.13}$$

Since the first component is an integer and the second component is strictly less than 1, we obtain the score based on the two-component intuition discussed above.
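The construction of $T_n(Q)$ and the resulting score can be sketched as follows. This is a minimal illustration, not the EVELIN implementation: the edge weights in `weights` are invented toy values, and the function names are hypothetical. Terms that tie with the $n$-th ranked term are included, as the threshold in Eq. 4.11 implies.

```python
from typing import Dict, Set

# Hypothetical directed edge weights: entity q -> {term t: omega(t | q)}.
Weights = Dict[str, Dict[str, float]]


def top_terms(q: str, n: int, weights: Weights) -> Set[str]:
    """T_n(q), Eq. 4.11: terms whose weight reaches that of the n-th ranked term."""
    ranked = sorted(weights.get(q, {}).values(), reverse=True)
    if not ranked or n <= 0:
        return set()
    threshold = ranked[min(n, len(ranked)) - 1]
    return {t for t, w in weights[q].items() if w >= threshold}


def top_terms_for_set(query: Set[str], n: int, weights: Weights) -> Set[str]:
    """T_n(Q), Eq. 4.12: union of T_n(q) over all query entities."""
    result: Set[str] = set()
    for q in query:
        result |= top_terms(q, n, weights)
    return result


def teri(neighbours: Set[str], query: Set[str], n: int, weights: Weights) -> float:
    """Term influence score, Eq. 4.13."""
    terms = top_terms_for_set(query, n, weights)
    return len(neighbours & query) + len(neighbours & terms) / (len(terms) + 1)


weights: Weights = {
    "Paris": {"capital": 0.9, "museum": 0.5, "river": 0.2},
    "France": {"capital": 0.8, "republic": 0.6, "wine": 0.1},
}
query = {"Paris", "France"}
relevant = top_terms_for_set(query, 2, weights)

score_a = teri({"Paris", "France", "wine"}, query, 2, weights)
score_b = teri({"Paris", "France", "capital", "republic"}, query, 2, weights)
score_c = teri({"Paris", "capital", "museum", "republic"}, query, 2, weights)
```

In this toy setting the two sentences with both query entities are no longer tied, since the second also contains two relevant terms, while the sentence with only one query entity stays below both regardless of how many relevant terms it contains.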

We find that identifying and using such relevant terms in addition to the query entities works well for sentence selection, but is biased by sentence length. While short and compact sentences are preferable descriptions in practice, both $\varrho_{\mathrm{enco}}$ and $\varrho_{\mathrm{teri}}$ assign a higher weight to sentences that contain more entities (and terms), and thus favour longer sentences. In the following, we consider two normalization schemes.

Normalization by length (NORL). One possible way of normalizing with the length of a sentence $s$ is to directly use the length in characters $\operatorname{len}(s)$. Thus, we introduce the normalized score $\varrho_{\mathrm{norl}}$ as

$$\varrho_{\mathrm{norl}}(s, Q, n) := \frac{1}{\log \operatorname{len}(s)} \left[ |N(s) \cap Q| + \frac{|N(s) \cap T_n(Q)|}{|T_n(Q)| + 1} \right]. \tag{4.14}$$


The addition of 1 in the denominator is due to the border case of sentences that contain no terms. Since we found in our empirical evaluation that we would otherwise give preference to extremely short sentence fragments that contain little more than the entity itself, we use the logarithm of the length (an alternative would be the length of a sentence in words). While this scheme normalizes based on the length of a sentence, it does not account for the number of entities and terms in the sentence overall and does not distinguish between terms and entities in the sentence.
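The effect of the length normalization in Eq. 4.14 can be sketched as below. Several choices here are assumptions: the natural logarithm is used because the text does not state a base, the set $T_n(Q)$ is passed in precomputed as `relevant_terms`, and sentences of at most one character are rejected because their logarithm would be zero or undefined.

```python
import math
from typing import Set


def norl(text: str, neighbours: Set[str], query: Set[str], relevant_terms: Set[str]) -> float:
    """Length-normalized score, Eq. 4.14, with len(s) measured in characters."""
    if len(text) <= 1:
        # Assumption: guard against log(1) = 0 and log(0); not specified in the text.
        raise ValueError("sentence must be longer than one character")
    teri = len(neighbours & query) + len(neighbours & relevant_terms) / (len(relevant_terms) + 1)
    return teri / math.log(len(text))


# Hypothetical example: two sentences with identical entity and term overlap.
query = {"Paris", "France"}
relevant_terms = {"capital", "museum", "republic"}
neighbours = {"Paris", "France", "capital"}

short_text = "Paris is the capital of France."
long_text = "Paris, which many visitors know for its museums, is the capital of France."

short_score = norl(short_text, neighbours, query, relevant_terms)
long_score = norl(long_text, neighbours, query, relevant_terms)
```

Both sentences have the same unnormalized score, so the shorter one ranks higher after normalization. Because the logarithm grows slowly, the penalty for additional characters stays mild, which matches the stated motivation for not dividing by the raw length.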

Normalization by count (NORC). As a final method, we thus include a two-factor normalization that is based on the number of entities and terms per sentence. We define $\varrho_{\mathrm{norc}}$ as

$$\varrho_{\mathrm{norc}}(s, Q, n) := \frac{|N(s) \cap Q|}{|N(s) \cap E|} + \frac{|N(s) \cap T_n(Q)|}{|T_n(Q)| \cdot (|N(s) \cap T| + 1)} \tag{4.15}$$

by normalizing the two components of the scoring function separately. The resulting function effectively measures the fraction of relevant entities and relevant terms that occur in a sentence. The factor $|T_n(Q)|$ in the second term also ensures that the contribution of entities to the final score is larger than the contribution of relevant terms.
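A minimal sketch of Eq. 4.15 follows. The entity and term vocabularies are toy values, $T_n(Q)$ is again passed in precomputed, and returning zero for a component whose denominator would be zero is an assumption made here for robustness, since the text does not discuss those cases.

```python
from typing import Set


def norc(
    neighbours: Set[str],
    query: Set[str],
    entities: Set[str],
    terms: Set[str],
    relevant_terms: Set[str],
) -> float:
    """Count-normalized score, Eq. 4.15."""
    entities_in_s = neighbours & entities
    # Assumption: a sentence without entities contributes 0 for the entity part.
    entity_part = len(neighbours & query) / len(entities_in_s) if entities_in_s else 0.0
    # Assumption: an empty T_n(Q) contributes 0 for the term part.
    if relevant_terms:
        term_part = len(neighbours & relevant_terms) / (
            len(relevant_terms) * (len(neighbours & terms) + 1)
        )
    else:
        term_part = 0.0
    return entity_part + term_part


# Hypothetical vocabularies E and T, query Q, and precomputed T_n(Q).
entities = {"Paris", "France", "Berlin"}
terms = {"capital", "museum", "republic", "wine", "river"}
query = {"Paris", "France"}
relevant_terms = {"capital", "museum", "republic"}

focused = norc({"Paris", "France", "capital"}, query, entities, terms, relevant_terms)
diluted = norc(
    {"Paris", "France", "Berlin", "capital", "wine", "river"},
    query, entities, terms, relevant_terms,
)
```

Both sentences contain the same query entities and the same relevant term, but the second also contains an unrelated entity and two unrelated terms, so each of its components is smaller. This is the intended difference to length normalization: the penalty depends on what else the sentence mentions, not on how many characters it has.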

In the following, we utilize and compare these four methods for the description of entity relations in general and relations between geographic locations in particular.
