If a concept map should be created to provide an overview of a set of documents, the selection of representative concepts and relations, as discussed in the previous section, becomes a crucial part of the task. A large body of research exists for this problem when the overview should be provided in textual form, a task known as text summarization.
The goal of automatic text summarization is to create a short, limited-size text that describes the most important contents of a given text document or set of documents (Nenkova and McKeown, 2011). The size limit is usually defined as the maximal number of words the summary text can have. Within that size limit, a good summary should provide as much information about the input document(s) as possible, should prefer the more important aspects of the content and should be a fluent and natural text.
Different variations of this task have been studied in the NLP community. Single-document summarization (SDS) deals with a single document that should be summarized, multi-document summarization (MDS) with creating a summary for several input documents. Other variations are query-focused summarization, where only contents relevant to a given query should be part of the summary (Dang, 2005), and update summarization, where the existing knowledge of a user is specified as a set of documents and should not be repeated when creating a summary for another, overlapping set of documents (Dang and Owczarzak, 2008). Many computational approaches for these tasks have been developed, presented and evaluated in the Document Understanding Conference (DUC)⁹ and Text Analysis Conference (TAC)¹⁰ series (Over et al., 2007).
⁹ http://duc.nist.gov
¹⁰ http://tac.nist.gov
Chapter 2. Background
In the remainder of this section, we present the main computational approaches to the summarization problem and point to seminal or exemplary papers for each direction. For a comprehensive review of all related work, we refer the reader to the surveys by Nenkova and McKeown (2011), Yao et al. (2017) and Gambhir and Gupta (2017).
2.3.2.1 Extractive Summarization
Extractive summarization systems produce summaries reusing parts (mostly complete sentences) taken from the input documents without modifications. More formally, let $D$ be a set of documents, $\mathcal{S}(D)$ the set of all sentences in $D$ and $\mathcal{L}$ the maximal length of the desired summary. The task is then to select a subset of sentences $S \subset \mathcal{S}(D)$ with $\sum_{s \in S} \mathit{len}(s) \le \mathcal{L}$, where $\mathit{len}(s)$ is the length of $s$ in words. Two subtasks, importance estimation and sentence selection, are usually modeled to create extractive summaries.
Importance Estimation. In order to include the most important information in a summary, the importance $i(s)$ of each sentence $s \in \mathcal{S}(D)$ needs to be estimated. Luhn (1958), the very first work on automatic summarization, used word frequencies to derive importance estimates for sentences. Almost 60 years later, summarization systems using frequency as the only indicator for importance still yield competitive results (Boudin et al., 2015). Edmundson (1969) added the position of a sentence in the document and the presence of predefined cue words as additional indicators. Among many other metrics explored in later work, importance estimates derived from graph structures with the PageRank algorithm (Page et al., 1999) had a particularly large impact. Both TextRank (Mihalcea and Tarau, 2004), which uses a graph representing co-occurring words, and LexRank (Erkan and Radev, 2004), which uses a graph of sentence similarities, have been regularly used as benchmarks. A commonality of all these approaches is that they use one or several hand-designed indicators to derive importance estimates, which makes them unsupervised summarization models.
Given the large number of suggested metrics that indicate importance, supervised summarization systems that use annotated data to learn how to combine different indicators into the best estimate have been explored as well. Early work in this direction was by Kupiec et al. (1995), who combine several features in a Bayesian binary classifier trained to decide if a sentence should be in a summary or not. Later work modeled the problem with probabilistic models such as hidden Markov models (Conroy and O’Leary, 2001) or logistic regression (Hong and Nenkova, 2014) and with support vector machines in classification (Yang et al., 2017) and regression (Li et al., 2007) setups. Typical features include term and document frequencies, sentence lengths, sentence positions, unigrams, bigrams, parts-of-speech, named entities, capitalization and stopwords (Berg-Kirkpatrick et al., 2011, Hong and Nenkova, 2014, Li et al., 2016a, Yang et al., 2017).
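To make the idea of a frequency-based importance indicator concrete, the following is a minimal sketch in Python. It is a simplified scorer in the spirit of frequency-based approaches, not a reproduction of Luhn (1958) or of any cited system; the function names, the tokenization rule and the averaging scheme are assumptions chosen for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def frequency_importance(sentences, stopwords=frozenset()):
    """Score each sentence by the mean relative frequency of its words.

    Frequencies are counted over all given sentences (assumed to be the
    sentences of the input documents). Stopwords are ignored entirely.
    """
    tokens_per_sentence = [
        [t for t in tokenize(s) if t not in stopwords] for s in sentences
    ]
    counts = Counter(t for tokens in tokens_per_sentence for t in tokens)
    total = sum(counts.values())

    scores = []
    for tokens in tokens_per_sentence:
        if not tokens or total == 0:
            scores.append(0.0)  # nothing to score in an empty sentence
            continue
        # Averaging avoids favoring long sentences merely for their length.
        scores.append(sum(counts[t] / total for t in tokens) / len(tokens))
    return scores
```

Averaging over the sentence length is one possible design choice; summing instead would reward longer sentences, which interacts with the length budget handled during sentence selection.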
2.3. NLP Methods Supporting Document Exploration
Recently, neural supervised models for importance estimation have been proposed by several authors. Cheng and Lapata (2016) use a combination of convolutional neural networks (CNNs), recurrent neural networks (RNNs) and attention to classify sentences for SDS. Cao et al. propose a regression model based on recursive neural networks for MDS (Cao et al., 2015) and a CNN-based model with attention and a ranking loss for query-focused summarization (Cao et al., 2016). A two-layer RNN with a set of hand-crafted features is developed by Nallapati et al. (2017). Al-Sabahi et al. (2018) propose a similar hierarchical encoder in combination with an attention mechanism. Compared to traditional supervised models, all of these approaches seem to benefit from the powerful distributed representations that neural networks can learn (Goldberg, 2017). A common trend is the use of an attention mechanism. Apart from that, a broad range of neural architectures has been proposed and none of them has so far been identified as being consistently superior.
Sentence Selection. Once importance estimates for all sentences are available, the remaining task is to select the subset $S \subset \mathcal{S}(D)$ that makes the best summary. This is usually formulated as an optimization problem maximizing the importance within the size limit:
$$S = \arg\max_{S \subset \mathcal{S}(D)} \sum_{s \in S} i(s) \quad \text{s.t.} \quad \sum_{s \in S} \mathit{len}(s) \le \mathcal{L}$$
In other words, one tries to include as many important sentences as possible while not exceeding the size limit. This optimization is difficult, as one has to decide whether it is better to add an important but long sentence to the summary or instead a less important but shorter sentence that leaves more space for additional sentences. To make the best decision, one has to consider the full search space of all subsequent decisions, i.e., optimize globally. The optimization problem is known as the 0-1 knapsack problem and is NP-hard (McDonald, 2007). In the case of MDS, an additional challenge is that sentences from different documents might contain the same information. Thus, only one of them should be in the summary, although all of them are estimated to be equally important. This is typically handled by adding a redundancy penalty to the objective function, leading to an optimization problem that is also NP-hard (McDonald, 2007).¹¹
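The trade-off between one long, important sentence and several shorter ones can be illustrated with the standard dynamic program for the 0-1 knapsack problem, which solves the version without a redundancy term exactly in time proportional to the number of sentences times the length budget. The sketch below is illustrative only: the function name and the toy numbers are assumptions, and sentence lengths are assumed to be positive integers (word counts).

```python
def select_sentences(importances, lengths, max_length):
    """Exact 0-1 knapsack selection without a redundancy term.

    Returns (best total importance, sorted indices of selected sentences).
    Runs in O(n * max_length) time and space.
    """
    n = len(importances)
    # best[j][c]: best total importance using sentences 0..j-1 within budget c
    best = [[0.0] * (max_length + 1) for _ in range(n + 1)]
    for j in range(1, n + 1):
        imp, ln = importances[j - 1], lengths[j - 1]
        for c in range(max_length + 1):
            best[j][c] = best[j - 1][c]  # option 1: skip sentence j-1
            if ln <= c and best[j - 1][c - ln] + imp > best[j][c]:
                best[j][c] = best[j - 1][c - ln] + imp  # option 2: take it

    # Backtrack through the table to recover which sentences were taken.
    selected, c = [], max_length
    for j in range(n, 0, -1):
        if best[j][c] != best[j - 1][c]:
            selected.append(j - 1)
            c -= lengths[j - 1]
    return best[n][max_length], sorted(selected)


# One long, important sentence versus two shorter, less important ones.
score, chosen = select_sentences([10.0, 6.0, 6.0], [10, 5, 5], max_length=10)
print(score, chosen)  # 12.0 [1, 2]
```

In this toy instance, picking the single most important sentence first would exhaust the budget with a total importance of 10, whereas the global optimum combines the two shorter sentences for a total of 12.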
Carbonell and Goldstein (1998) proposed a greedy optimization approach called maximal marginal relevance (MMR). Sentences are added iteratively until the length limit is reached, choosing them based on their importance and redundancy with what is already in the summary. That does not necessarily yield the optimal subset, but it was shown to work well in practice. Other approaches, such as Hatzivassiloglou et al. (2001), rely on sentence clustering to first group redundant sentences together and then use only one sentence per cluster in the summary. Lin and Bilmes (2011) point out that the objective functions discussed here are submodular. For submodular objective functions, greedy optimization algorithms with provable lower bounds exist that guarantee that a greedy solution is at most a constant factor worse than the optimal solution.
¹¹ For the easier version without the redundancy term, there is a pseudo-polynomial algorithm (Kellerer et al.,
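The greedy, redundancy-aware selection described above can be sketched as follows. This is an MMR-style illustration under stated assumptions, not the original formulation of Carbonell and Goldstein (1998): it uses a given importance score in place of query relevance, a simple word-overlap (Jaccard) similarity as the redundancy measure, and a hypothetical `trade_off` parameter to balance the two.

```python
def jaccard(a, b):
    """Word-overlap similarity of two sentences (an assumed, simple choice)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def greedy_mmr(sentences, importances, max_length, trade_off=0.5):
    """Greedily pick sentences by importance minus redundancy.

    At each step, the sentence with the highest score
        trade_off * importance - (1 - trade_off) * max similarity to summary
    is added, considering only sentences that still fit the word budget.
    Returns the indices of the selected sentences in selection order.
    """
    selected, used = [], 0
    candidates = set(range(len(sentences)))

    def score(j):
        redundancy = max(
            (jaccard(sentences[j], sentences[k]) for k in selected),
            default=0.0,
        )
        return trade_off * importances[j] - (1 - trade_off) * redundancy

    while candidates:
        fitting = [
            j for j in candidates
            if used + len(sentences[j].split()) <= max_length
        ]
        if not fitting:
            break
        # Ties are broken in favor of the earlier sentence.
        best = max(fitting, key=lambda j: (score(j), -j))
        selected.append(best)
        used += len(sentences[best].split())
        candidates.remove(best)
    return selected
```

With `trade_off=1.0` the procedure reduces to picking sentences purely by importance; lower values penalize sentences that overlap with what has already been selected. As noted above, such a greedy procedure does not necessarily find the optimal subset.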
Exact solutions can be found by formulating the problem as an integer linear program (ILP), for which a broad range of off-the-shelf solver software exists. McDonald (2007) pioneered this approach, but also showed that it is much more computationally expensive than the greedy alternatives. Gillick and Favre (2009) proposed a new objective function that computes importance and redundancy in terms of included concepts rather than sentences. This has the advantage that the importance and redundancy terms simplify to a single term, which yields ILPs that are more efficient to solve.
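A concept-based ILP of this kind can be sketched as follows. The notation is an assumption made for illustration: $c_i \in \{0,1\}$ indicates whether concept $i$ is covered by the summary, $w_i$ is its weight, $x_j \in \{0,1\}$ indicates whether sentence $s_j$ is selected, and $o_{ij} \in \{0,1\}$ states whether concept $i$ occurs in sentence $s_j$.

```latex
\begin{align*}
\max_{c,\,x} \quad & \sum_{i} w_i \, c_i \\
\text{s.t.} \quad & \sum_{j} \mathit{len}(s_j) \, x_j \le \mathcal{L} \\
& x_j \, o_{ij} \le c_i \quad \forall i, j \\
& \sum_{j} x_j \, o_{ij} \ge c_i \quad \forall i \\
& c_i \in \{0, 1\}, \quad x_j \in \{0, 1\}
\end{align*}
```

Because each concept contributes its weight at most once, no matter how many selected sentences contain it, selecting redundant sentences wastes length budget without increasing the objective. This is why no separate redundancy term is needed.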
2.3.2.2 Abstractive Summarization
Extractive summarization methods have several problems. Using just the existing sentences, they might need to include unimportant details in a summary if something more important only occurs in a sentence together with these details. Moreover, extractive summaries can lack fluency and clarity, as the selected sentences might contain unresolvable pronouns or miss important context. Ordering the sentences in the most coherent way is a difficult problem on its own (Nenkova and McKeown, 2011). Abstractive summarization methods try to circumvent these problems by going beyond the set of existing sentences.
Sentence Modification. Most of the early work has focused on compressing single sentences or fusing several of the original sentences. By dropping unimportant parts from the sentences, the length budget of the summary can be used more efficiently. Both rule-based (Jing, 2000, Zajic et al., 2007) and learned (Knight and Marcu, 2002, Clarke and Lapata, 2007) models were proposed to compress sentences. Sentence fusion techniques (Barzilay and McKeown, 2005, Filippova and Strube, 2008) have also been explored, since compression alone can lead to a summary with an unnaturally large number of short sentences. Rather than using these techniques as preprocessing for extractive models, joint models for selection and compression have also been proposed (Berg-Kirkpatrick et al., 2011, Chali et al., 2017).
Traditional Generation. A summarization paradigm differing more radically from extraction is the generation of completely new sentences. Such models typically first parse the input documents into a symbolic meaning representation, then summarize that representation and finally generate a realization of the summary from it. While this approach gives a system more freedom to produce a good summary, it crucially depends on an intermediate representation with enough representational capacity and on sufficiently good parsing and generation models. An early attempt in this direction was the system of Vanderwende et al. (2004) in DUC 2004. Li (Li, 2015, Li et al., 2016a) proposes an entity-based graph representation well-suited for news documents, from which summaries are then successfully generated. Liu et al. (2015) used abstract meaning representation (AMR) as their intermediate representation, but left the generation step for future work. A proposition-based representation was shown to work well for educational texts, covering the full pipeline of parsing, summarization and generation (Fang and Teufel, 2016, Fang et al., 2016).
Neural Generation. In recent years, the use of neural network models and large-scale training data led to improved performance in various NLP tasks, including text generation tasks such as language modeling (Mikolov, 2012) or machine translation (Cho et al., 2014, Sutskever et al., 2014). The predominant approach of generating text with word-level RNNs was first applied to summarization by Rush et al. (2015). Their framework of using RNN encoder and decoder modules with attention was quickly adopted and refined (Nallapati et al., 2016, Chopra et al., 2016, Wang and Ling, 2016). These models are able to produce much more fluent summaries than previous generative models, and thereby substantially renewed the interest in abstractive summarization. Important extensions to this architecture are copy mechanisms that allow a model to include unknown words from the input in the summaries (Gu et al., 2016, See et al., 2017) and strategies to avoid repetitions in the generated sequences (Suzuki and Nagata, 2017, See et al., 2017). The greatest limitation so far is that most work focuses on SDS from a few sentences to short headlines, as training models for bigger inputs and outputs requires huge amounts of computational resources. In addition, no large-scale training corpora are available for MDS. Very recently, strategies such as pre-summarizing documents with extractive methods (Tan et al., 2017, Liu et al., 2018) or hierarchical encoders (Cohan et al., 2018, Celikyilmaz et al., 2018, Zhang et al., 2018) have been proposed to improve the scalability. These neural models are able to handle SDS examples with, on average, 5,000 input and 220 output words (Cohan et al., 2018) and MDS examples with 10,000 input and 100 output words (Liu et al., 2018).
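How a copy mechanism lets a model emit words outside its vocabulary can be illustrated with the mixture used in pointer-generator style models such as See et al. (2017). The following is a sketch in assumed notation: $p_{\text{gen}} \in [0, 1]$ is a learned switch, $P_{\text{vocab}}$ is the decoder's distribution over a fixed vocabulary, $w_i$ is the input word at position $i$, and $a_i^t$ is the attention weight on that position at decoding step $t$.

```latex
P(w) = p_{\text{gen}} \, P_{\text{vocab}}(w)
     + (1 - p_{\text{gen}}) \sum_{i \,:\, w_i = w} a_i^t
```

A word that is missing from the vocabulary has $P_{\text{vocab}}(w) = 0$, but it can still receive probability mass through the attention weights on the input positions where it occurs.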