Incremental speech processing involves using the information available from the context to constrain an upcoming input (a word, a phrase, a sentence, etc.) and, once that input is heard, integrating it into the prior context so that subsequent input can be constrained more accurately. This cycle continues until the speaker finishes the message. This conceptual description of incremental speech processing fits well within the Bayesian framework of language comprehension. The framework is motivated by Bayes’ theorem, which describes the probability of an event based on prior information and knowledge related to the event. A simple mathematical statement of Bayes’ theorem is as follows:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \qquad (1)$$
where 𝐴 is the target variable and 𝐵 is the context variable on which the target 𝐴 is conditioned. As a simple application to language processing, suppose that a listener hears an adjective-noun phrase such as “yellow banana”. The goal is to model the listener’s internal beliefs about “banana” given the preceding adjective “yellow”. Substituting 𝐴 with “banana” and 𝐵 with “yellow”, we obtain the following:
$$P(\text{"banana"}_{t} \mid \text{"yellow"}_{t-1}) = \frac{P(\text{"yellow"}_{t-1} \mid \text{"banana"}_{t})\,P(\text{"banana"}_{t})}{P(\text{"yellow"}_{t-1})} \qquad (2)$$
where 𝑡 and 𝑡 − 1 indicate the relative position of each word in the phrase. The goal is to model the posterior 𝑃("𝑏𝑎𝑛𝑎𝑛𝑎"|"𝑦𝑒𝑙𝑙𝑜𝑤"), the probability of “banana” given “yellow”. This expression is already useful because it makes explicit the mapping between the goal (the posterior) and the prior. The prior 𝑃("𝑏𝑎𝑛𝑎𝑛𝑎") describes the listener’s beliefs about the target “banana” (i.e. the subjective probability of “banana” alone) before the context “yellow” is known. The likelihood 𝑃("𝑦𝑒𝑙𝑙𝑜𝑤"|"𝑏𝑎𝑛𝑎𝑛𝑎") then evaluates the context “yellow” against the listener’s prior beliefs about the target “banana”. The evidence 𝑃("𝑦𝑒𝑙𝑙𝑜𝑤") works as a context normaliser whose practical role is explained in Footnote 1 in Chapter 1. The concept of belief updating is reflected in the shift from a prior to a posterior at each cycle, until the posterior converges to a delta distribution (probability 1 for the target and 0 otherwise). From a modelling perspective, this Bayesian approach provides useful insight into how prediction may change and develop as new words incrementally unfold in a sentence.
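The decomposition in (2) can be checked numerically. The following is a minimal Python sketch, assuming a toy table of adjective-noun co-occurrence counts; the counts, the `posterior` function and all variable names are invented for illustration and are not taken from any corpus or from the model described in this work.

```python
# Hypothetical co-occurrence counts of (adjective at t-1, noun at t).
# The numbers are invented for illustration only.
counts = {
    ("yellow", "banana"): 30,
    ("yellow", "car"): 10,
    ("ripe", "banana"): 20,
    ("red", "car"): 40,
}


def posterior(noun, adjective, counts):
    """Return P(noun_t | adjective_{t-1}) from the three terms of Bayes' theorem."""
    total = sum(counts.values())
    noun_count = sum(c for (_, n), c in counts.items() if n == noun)
    adjective_count = sum(c for (a, _), c in counts.items() if a == adjective)

    prior = noun_count / total                                  # P(noun)
    likelihood = counts.get((adjective, noun), 0) / noun_count  # P(adjective | noun)
    evidence = adjective_count / total                          # P(adjective)
    return likelihood * prior / evidence


p = posterior("banana", "yellow", counts)
print(f"P(banana | yellow) = {p:.2f}")
```

With these toy counts the prior for “banana” is 0.50 and the posterior after hearing “yellow” is 0.75, which is the same value obtained by dividing the joint count (30) by the count of “yellow” (40) directly. The sketch assumes that both the noun and the adjective occur at least once in the table; otherwise the divisions are undefined.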
Another important aspect of this approach is that it models the cyclical development of prediction in sentence and discourse comprehension. Suppose that we are modelling the listener’s syntactic prediction of the complement structure in the sentence “The intrepid child found the picture”. For illustration purposes, I assume that the complement structure is independent of the subject NP “The intrepid child”, such that it is constrained entirely by the verb “found” in the preceding context. It is then possible to track changes in prediction as follows (Figure 2-1):
Figure 2-1: A simplistic visual illustration of belief updating about the complement syntactic structure across different cycles in time. SCF = subcategorization frame.
In Figure 2-1, Cycle 1 describes the process of incorporating the main verb “found” into prediction. Cycle 2 shows that this verb-incorporated prediction becomes a new prior to constrain the syntactic frames. As a direct object structure is confirmed by the determiner “the”, the prediction cycle ends in Cycle 2 in this example and the prior facilitates the integration of the direct object structure into the sentence. Hence, by tailoring the prediction more specifically to the up-to-date context, this Bayesian model promotes more rapid and accurate integration of the target frame (direct object). It is worth noting that any posterior at
the final cycle (Cycle 2 in this example) converges to a delta distribution, and the process of belief updating becomes conceptually equivalent to integrating the target into the context. In practice, the “target” refers to a specific property (e.g. semantic meaning or grammatical category) of a particular linguistic unit (e.g. a word, a phrase or a clause) that appears after the context.
As shown in (2) and Figure 2-1, incremental speech comprehension proceeds by updating beliefs each time an input (here, the verb) that constrains the target (here, the SCF) is heard.
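The two cycles in Figure 2-1 can be sketched as repeated applications of Bayes’ theorem, with the posterior of one cycle serving as the prior of the next. The Python sketch below is illustrative only: the set of frames, the uniform starting prior and the likelihood values are hypothetical, and the assumption that the determiner “the” is compatible only with a direct object is a simplification that follows the example in the text.

```python
def update(prior, likelihood):
    """One belief-updating cycle: posterior is likelihood * prior, normalised by the evidence."""
    unnormalised = {frame: prior[frame] * likelihood.get(frame, 0.0) for frame in prior}
    evidence = sum(unnormalised.values())
    if evidence == 0:
        raise ValueError("the input has zero probability under every frame")
    return {frame: p / evidence for frame, p in unnormalised.items()}


# Hypothetical subcategorization frames (SCFs) with a uniform prior before the verb.
frames = ["direct_object", "sentential_complement", "prepositional_phrase"]
prior = {frame: 1 / len(frames) for frame in frames}

# Hypothetical likelihoods P(input | frame); the values are invented for illustration.
likelihood_found = {"direct_object": 0.6, "sentential_complement": 0.3, "prepositional_phrase": 0.1}
likelihood_the = {"direct_object": 1.0, "sentential_complement": 0.0, "prepositional_phrase": 0.0}

cycle_1 = update(prior, likelihood_found)  # Cycle 1: incorporate the verb "found"
cycle_2 = update(cycle_1, likelihood_the)  # Cycle 2: the Cycle 1 posterior is the new prior

for name, beliefs in [("Cycle 1", cycle_1), ("Cycle 2", cycle_2)]:
    print(name, {frame: round(p, 2) for frame, p in beliefs.items()})
```

Under these assumed values, the posterior after Cycle 2 places all of its probability on the direct object frame, which corresponds to the delta distribution described above.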
However, as already discussed in Chapter 1, prediction in speech processing is not limited to words but spans a variety of linguistic aspects from perception (phonological-lexical) to cognition (syntax-semantics). Psycholinguistic accounts based on Fodorian modular theory (Fodor, 1983) claim that the processing streams are organized into separate, autonomous modules (Frazier, 1987). Other accounts propose jointly interacting streams (Marslen-Wilson, 1975; Altmann & Steedman, 1988). In this section, I briefly review a recent generative framework proposed by Kuperberg (2016) from a Bayesian perspective. Kuperberg’s framework claims that listeners infer the underlying cause of the observed inputs from a set of hierarchically organized representations (an internal generative model). These representations best explain the statistical properties of the observed inputs based on the listeners’ beliefs about the message that the speaker is trying to convey. The beliefs propagate down to lower levels to tailor the representations by generating probabilistic predictions before the new input is processed. Predictions in these various domains interact hierarchically: for example, predictions about the semantic meanings or syntactic structures of possible continuations could influence predictions about candidate words which could, in turn, affect the expected sequences of phonemes. Once the new input is heard, these probabilistic predictions are evaluated against the bottom-up evidence and the prior beliefs are updated. This top-down prediction scheme facilitates the processing of an input word in a sentence, and the input, in turn, enables flexible updating of the multi-level constraints through bottom-up projections. This process is illustrated in simplified form in Figure 2-2 below.
Figure 2-2: Incremental speech processing of a simple direct object sentence “The giant crocodile attacked the wildebeest” in the light of the BBU generative framework (Kuperberg, 2016). This describes the role played by each input (i.e. a subject noun phrase, a verb and a complement noun phrase) in constructing the event representation (i.e. a message) in a predictive processing framework. Blue arrows indicate “prediction” and orange arrows indicate “update” or “integration”.
Now, the problem reduces to characterizing the two kinds of arrows in Figure 2-2: prediction and update. Under the view of prediction as a graded, probabilistic phenomenon (see Kuperberg & Jaeger, 2016), the conditional probability distribution over the upcoming input directly represents the information used to predict that input (i.e. the constraints). It is also important to quantify the certainty of beliefs because the strength of top-down prediction depends on the certainty with which the beliefs are held (Kuperberg, 2016). Lastly, the difficulty of updating reflects the proportion of variance in the constraints (a.k.a. “pruned probability mass” in Levy (2008, p. 1131)) that cannot be explained by the bottom-up input, the so-called “prediction error”. The human language system aims to minimize this prediction error through an iterative process of predicting and updating throughout a sentence, and eventually obtains converged representations at various levels, each of which best explains the observed sentence. The ways to characterize prediction and to quantify certainty and error are described in the following sections.
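As a rough illustration of how certainty and error could be put into numbers, the Python sketch below uses two standard information-theoretic quantities: Shannon entropy of a belief distribution as a measure of uncertainty, and the Kullback-Leibler divergence from prior to posterior as a measure of how much the beliefs had to change. This is only one common choice and is not necessarily the measure adopted in the following sections; the frame labels and probabilities are hypothetical.

```python
import math


def entropy(beliefs):
    """Shannon entropy in bits; higher values mean less certain beliefs."""
    return -sum(p * math.log2(p) for p in beliefs.values() if p > 0)


def kl_divergence(posterior, prior):
    """KL(posterior || prior) in bits: the size of the belief update.

    Assumes the prior is non-zero wherever the posterior is non-zero.
    """
    return sum(q * math.log2(q / prior[h]) for h, q in posterior.items() if q > 0)


# Hypothetical beliefs over complement frames before and after the disambiguating input.
prior = {"direct_object": 0.6, "sentential_complement": 0.3, "prepositional_phrase": 0.1}
posterior = {"direct_object": 1.0, "sentential_complement": 0.0, "prepositional_phrase": 0.0}

uncertainty_before = entropy(prior)
uncertainty_after = entropy(posterior)
update_cost = kl_divergence(posterior, prior)

print(f"entropy before: {uncertainty_before:.3f} bits")
print(f"entropy after:  {uncertainty_after:.3f} bits")
print(f"update cost:    {update_cost:.3f} bits")
```

In this toy case the posterior is a delta distribution, so the update cost reduces to the negative log probability that the prior assigned to the confirmed frame, -log2(0.6). The 0.4 of probability held by the discarded frames is the mass that is pruned by the input.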
Kuperberg’s BBU framework is a variant of the “predictive coding” framework (Friston, 2005, 2008), which has drawn significant attention in the field of cognitive/perceptual neuroscience. As stated in Kuperberg and Jaeger (2016), “Hierarchical predictive coding in the brain takes the principles of the hierarchical generative framework to an extreme by proposing that the flow of bottom-up information from primary sensory cortices to higher level association cortices constitutes only the prediction error, that is, only information that has not already been “explained away” by predictions that have propagated down from higher level cortices…”. This specific neurobiological hypothesis from the predictive coding account has been tested and corroborated in a series of behavioural and neuroimaging studies of speech perception (Sohoglu, Peelle, Carlyon & Davis, 2012, 2014; Sohoglu & Davis, 2016). These studies consistently reported reduced activity in the superior temporal gyrus (STG) when the speech input (target) was more expected, supporting the claim that the brain is sensitive to the mismatch (error) between expected and actual input.