
5.3 Gross Motion Ability Assessment

5.3.2 Vision System Motion Tracking

In this section, we introduce a series of stochastic processes needed to define several nonparametric Bayesian language models used later in this thesis. We keep the notation of Goldwater (2006), our main source for this presentation.

17 The term nonparametric refers to that ability, and does not mean that these models have no parameters.

18 The training of these models, however, often requires a lot of computation, a fact that, as with artificial neural networks, long hindered their applicability to realistic corpora. This was only overcome in the past two decades with the advent of faster computing units, but most of the theoretical foundations for these methods arose in the 1970s or earlier.

Chinese restaurant process An important stochastic process for nonparametric Bayesian language models is the so-called Chinese restaurant process (CRP), which generates partitions of integers. The analogy goes as follows: each customer i (represented as an integer) sequentially enters a restaurant with an infinite number of tables, each table accommodating a potentially infinite number of customers. When customer i enters, an arrangement $z_{-i}$ of the previous customers is observed, with $K(z_{-i})$ non-empty tables, each already accommodating $n_k(z_{-i})$ customers for $k \in [1, K(z_{-i})]$. The customer either sits at a non-empty table k with probability $P(z_i = k \mid z_{-i})$, or chooses a new one with probability $P(z_i = K(z_{-i}) + 1 \mid z_{-i})$. These terms are defined as follows:

$$P(z_i = k \mid z_{-i}) = \begin{cases} \dfrac{n_k(z_{-i})}{i - 1 + \alpha} & \text{if } 1 \le k \le K(z_{-i}) \\[2ex] \dfrac{\alpha}{i - 1 + \alpha} & \text{if } k = K(z_{-i}) + 1 \end{cases} \tag{2.6}$$

with α ≥ 0 a parameter of the process called the concentration¹⁹ parameter. Larger values of this parameter favor opening new tables, hence a more uniform distribution of customers across the tables and more clusters in the resulting partition. This definition of the CRP also clearly produces a “rich-get-richer” effect, which gets stronger as α gets smaller.
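The seating rule can be illustrated with a short simulation. The following is a minimal Python sketch, not code from the thesis; it assumes the usual CRP rule in which customer i joins occupied table k with probability n_k/(i − 1 + α) and opens a new table with probability α/(i − 1 + α). The function names `crp_probabilities` and `sample_crp` are illustrative.

```python
import random

def crp_probabilities(table_counts, alpha):
    """Seating probabilities for the next customer.

    table_counts[k] is the number of customers already at table k.
    Returns one probability per occupied table, followed by the
    probability of opening a new table.
    """
    n_seated = sum(table_counts)  # equals i - 1 for customer i
    denom = n_seated + alpha
    if denom == 0:
        # First customer with alpha = 0: a new table is the only option.
        return [1.0]
    probs = [n_k / denom for n_k in table_counts]
    probs.append(alpha / denom)
    return probs

def sample_crp(n_customers, alpha, rng=None):
    """Seat n_customers sequentially; return (assignments, table_counts)."""
    rng = rng or random.Random(0)
    assignments, table_counts = [], []
    for _ in range(n_customers):
        probs = crp_probabilities(table_counts, alpha)
        k = rng.choices(range(len(probs)), weights=probs)[0]
        if k == len(table_counts):
            table_counts.append(1)  # open a new table
        else:
            table_counts[k] += 1    # join an occupied table
        assignments.append(k)
    return assignments, table_counts
```

Running `sample_crp` on the same number of customers with a small α (say 0.1) and then a large one (say 10) should show the effect described above: few, large tables in the first case, and more, smaller tables in the second.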

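The probability of a given sequence of table assignments z for n customers is given by:

$$P(z) = \frac{\Gamma(1 + \alpha)}{\Gamma(n + \alpha)} \, \alpha^{K(z) - 1} \prod_{k=1}^{K(z)} \left(n_k(z) - 1\right)! \tag{2.7}$$

with K(z) the total number of tables in the arrangement z, and n_k(z) the number of customers at table k in this arrangement. The Gamma function is defined by $\Gamma(x) = \int_0^\infty u^{x-1} e^{-u} \, du$ for x > 0.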
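As a sanity check on the joint probability, the sketch below evaluates it in log space. It is an illustration rather than the thesis's code, and it assumes the closed form P(z) = Γ(1 + α)/Γ(n + α) · α^(K − 1) · ∏ (n_k − 1)!, restricted to α > 0 so that log α is defined. The name `crp_log_joint` is hypothetical.

```python
from math import lgamma, log

def crp_log_joint(table_counts, alpha):
    """log P(z) for an arrangement with the given table sizes (alpha > 0).

    Assumed closed form:
    P(z) = Gamma(1 + alpha) / Gamma(n + alpha) * alpha**(K - 1) * prod_k (n_k - 1)!
    """
    n = sum(table_counts)
    num_tables = len(table_counts)
    log_p = lgamma(1 + alpha) - lgamma(n + alpha) + (num_tables - 1) * log(alpha)
    # (n_k - 1)! = Gamma(n_k), so its logarithm is lgamma(n_k).
    log_p += sum(lgamma(n_k) for n_k in table_counts)
    return log_p
```

For three customers seated as two at one table and one at another with α = 1, multiplying the sequential seating probabilities gives 1 · 1/2 · 1/3 = 1/6, which is the value the closed form should reproduce. Note that the probability depends only on the table sizes, not on the order in which customers arrived.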

Dirichlet process A Dirichlet process DP(α, G₀), with concentration parameter α and base distribution G₀, is a stochastic process whose sample path is a probability distribution over a measurable set S. For any finite measurable partition B₁, …, Bₙ of S, if X ∼ DP(α, G₀) then

$$\left(X(B_1), \ldots, X(B_n)\right) \sim \mathrm{Dir}\left(\alpha G_0(B_1), \ldots, \alpha G_0(B_n)\right), \tag{2.8}$$

where Dir(·) is the Dirichlet distribution.

In an alternative view of the Dirichlet process, a stick-breaking process makes more explicit the fact that the DP generates discrete distributions with countably infinite support. In this view, the base distribution of the DP independently generates the locations of the probability mass function. The α parameter, for its part, determines the probability of each of these locations: a series of independent random variables β_k is drawn sequentially from a Beta(1, α) distribution; β₁ breaks the unit “stick”, and this portion is the probability mass of the first location drawn from the base distribution; β₂ breaks the remaining portion of the stick, and this defines the probability mass of the second location, etc. Formally, the k-th location receives mass $\beta_k \prod_{j<k} (1 - \beta_j)$. This ensures that the total probability mass will be 1. This also gives another intuition as to why small values of α lead to more “concentrated” probability mass and sparser distributions.

19 Perhaps the term dispersion, sometimes used in place of concentration, seems more natural, since higher values of this parameter lead to a higher dispersion of the customers across the tables. The standard terminology (which we keep) comes from the fact that in a Dirichlet process DP(α, G₀), G₀ corresponds to the mean of the process, and higher values of α lead to distributions that are closer to, or more concentrated around, G₀.
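The stick-breaking construction can be sketched in a few lines of Python. This is a truncated illustration under stated assumptions, not the thesis's implementation: the infinite sequence is cut after `n_atoms` breaks, α is taken to be strictly positive, and `base_draw` is a hypothetical callable standing in for a draw from G₀.

```python
import random

def stick_breaking(alpha, base_draw, n_atoms, rng=None):
    """Truncated stick-breaking sample from DP(alpha, G0), with alpha > 0.

    Returns (atoms, weights, remaining), where `remaining` is the length
    of stick left unbroken after n_atoms breaks.
    """
    rng = rng or random.Random(0)
    atoms, weights = [], []
    remaining = 1.0
    for _ in range(n_atoms):
        beta_k = rng.betavariate(1.0, alpha)  # proportion broken off the rest
        weights.append(remaining * beta_k)    # mass of the k-th location
        remaining *= 1.0 - beta_k             # what is left of the stick
        atoms.append(base_draw(rng))          # location drawn from G0
    return atoms, weights, remaining

def standard_normal(rng):
    """Illustrative base distribution G0 (an assumption, not from the text)."""
    return rng.gauss(0.0, 1.0)
```

With a small α, Beta(1, α) draws tend to be close to 1, so the first few breaks take most of the stick; with a large α the breaks are small and the mass is spread over many locations.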

Another intuitive way to understand the Dirichlet process is to look at it as a “two-stage” CRP model, in which 1) customers are seated according to a CRP with a certain concentration parameter α, as defined by Equation (2.6), and 2) each newly opened table is then labelled with a draw from a distribution G₀. This two-stage CRP model is equivalent²⁰ to a Dirichlet process with concentration parameter α and base distribution G₀. Goldwater (2006) calls the CRP in the first step the adaptor, and the distribution G₀ in the second step the generator.
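The two-stage view translates directly into a sampling loop. The sketch below is illustrative only: it assumes α > 0, uses the standard CRP seating weights (n_k for an occupied table, α for a new one), and takes the generator as a hypothetical callable `generator(rng)` that returns a label drawn from G₀.

```python
import random

def two_stage_crp(n_customers, alpha, generator, rng=None):
    """Sample labels from the adaptor/generator model (assumes alpha > 0)."""
    rng = rng or random.Random(0)
    table_counts = []  # adaptor state: customers per table
    table_labels = []  # one label per table, drawn once from the generator
    outputs = []       # label observed for each customer
    for _ in range(n_customers):
        # Stage 1 (adaptor): weights proportional to n_k, or alpha for a new table.
        weights = table_counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(table_counts):
            # Stage 2 (generator): label the newly opened table with a draw from G0.
            table_counts.append(1)
            table_labels.append(generator(rng))
        else:
            table_counts[k] += 1
        outputs.append(table_labels[k])
    return outputs, table_counts, table_labels

def uniform_generator(rng):
    """Illustrative G0: uniform over a toy lexicon (an assumption)."""
    return rng.choice(["the", "cat", "sat", "mat"])
```

Because the toy generator has only four outcomes, several tables will typically end up carrying the same label once enough tables have been opened.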

Pitman-Yor process A Pitman-Yor process PYP(α, β, G₀) is a generalization of the Dirichlet process DP(α, G₀). In the two-stage view of the DP, the CRP adaptor is modified in such a way that the conditional probability for the i-th customer to sit at table k is defined by:

$$P(z_i = k \mid z_{-i}) = \begin{cases} \dfrac{n_k(z_{-i}) - \beta}{i - 1 + \alpha} & \text{if } 1 \le k \le K(z_{-i}) \\[2ex] \dfrac{K(z_{-i}) \, \beta + \alpha}{i - 1 + \alpha} & \text{if } k = K(z_{-i}) + 1 \end{cases} \tag{2.9}$$

with 0 ≤ β < 1, α > −β (α corresponds to the concentration parameter in the standard CRP), and $K(z_{-i})$ the number of tables already occupied when the i-th customer enters the restaurant. Setting β = 0 recovers the standard CRP. The new parameter of the Pitman-Yor process, β, gives more control over the shape of the tail of the distributions generated by the process. It makes it possible to “save” some probability mass to increase the likelihood of opening new tables, even as the number of customers grows and tends to decrease the probability of opening a new table in the equivalent DP. Hence, the β parameter is often called the discount parameter of the PYP.
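The effect of the discount can be checked numerically. The following sketch assumes the usual Pitman-Yor seating rule, (n_k − β)/(i − 1 + α) for an occupied table and (Kβ + α)/(i − 1 + α) for a new one; the function name `pyp_probabilities` is illustrative and not taken from the thesis.

```python
def pyp_probabilities(table_counts, alpha, beta):
    """Seating probabilities for the next customer under a Pitman-Yor adaptor.

    Returns one probability per occupied table, followed by the
    probability of opening a new table.
    """
    if not (0 <= beta < 1) or alpha <= -beta:
        raise ValueError("requires 0 <= beta < 1 and alpha > -beta")
    n_seated = sum(table_counts)    # equals i - 1 for customer i
    num_tables = len(table_counts)  # K(z_{-i})
    if n_seated == 0:
        return [1.0]                # the first customer always opens a table
    denom = n_seated + alpha
    # Each occupied table gives up a mass of beta / denom ...
    probs = [(n_k - beta) / denom for n_k in table_counts]
    # ... and the total discount K * beta is transferred to the new table.
    probs.append((num_tables * beta + alpha) / denom)
    return probs
```

For table sizes [2, 1] and α = 1, a discount of β = 0.5 raises the new-table probability from 0.25 (the β = 0 case, which is the plain CRP) to 0.5, while the probabilities still sum to 1.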

Application to word generation If G₀ (the generator in the two-stage view of the DP) corresponds to a distribution defined over a lexicon, then tables in the restaurant are labelled with words and each customer entering the restaurant represents a word token. The CRP (the adaptor, responsible for assigning customers to tables), on the other hand, controls word frequencies according to a power-law distribution. We will define such language models more formally in Section 2.4.3. Note that PYP language models differ from DP language models only in the definition of the adaptor (Equation (2.6) vs. Equation (2.9)).

A potential source of confusion is that the generator in a DP (or a PYP) can generate duplicated labels; in other words, multiple tables can share the same label in the Chinese restaurant analogy. In fact, the expected number of CRP tables (in a DP) for a type corresponding to n tokens is approximately $\alpha \log \frac{n + \alpha}{\alpha}$ (Antoniak, 1974, cited by Goldwater (2006)).

For a given number of tokens of a particular type, the average number of tables labelled with that type therefore grows with α. This agrees with the dispersion effect of α already mentioned, with greater values of α leading to the opening of more tables.
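The growth of the table count with α can be made concrete by comparing the logarithmic expression α log((n + α)/α) with the exact expectation for a CRP, which is the sum over customers of the probability that each one opens a new table, α/(α + i − 1). The sketch below is an illustration with hypothetical function names, assuming α > 0.

```python
from math import log

def expected_tables_exact(n, alpha):
    """Exact expected number of tables after n customers in a CRP (alpha > 0).

    Customer i (1-based) opens a new table with probability alpha / (alpha + i - 1).
    """
    return sum(alpha / (alpha + i) for i in range(n))

def expected_tables_approx(n, alpha):
    """Logarithmic approximation alpha * log((n + alpha) / alpha)."""
    return alpha * log((n + alpha) / alpha)
```

Evaluating both functions for a fixed n and increasing α shows the same trend: the expected number of tables rises with α, and for a fixed α it grows only logarithmically with the number of tokens.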

20 The technical explanation can be found in Section 3.6 of Goldwater (2006).
