5. Theoretical Framework
5.1. Violence against women
A common approach to analysing heterogeneous data is to appeal to mixture models. This rich class of models allows us to infer subgroups contained within data without prior information on the subgroup membership of individual observations. A subgroup can be thought of as a cluster of individual observations which form a “homogeneous” group: individual observations within each subgroup are assumed to follow the same underlying distribution.
The simplest models within this class are finite mixture models; see, for example, Everitt and Hand (1981) and Lindsay (1995). Note that, for ease of notation and exposition, we write $f(\cdot \mid \cdot)$ for a density or probability function (depending on whether the quantity is continuous or discrete) but simply refer to these functions as densities. In finite mixture models the parametric density that defines the model, denoted $f(x)$, comprises a fixed (finite) number $N$ of mixture components. More formally, we say a density $f(x)$ is an $N$-component mixture if it takes the form
$$ f(x \mid \psi, \lambda) = \sum_{c=1}^{N} \psi_c f_c(x \mid \lambda_c) \tag{3.1} $$
where $f_1(x), \dots, f_N(x)$ are the component densities and the $\psi_c$ are the mixture weights for each component. Each component density is parameterised by its own parameter $\lambda_c$. In order for this density to be well defined the following constraints must hold: (a) the component densities $f_c(x \mid \lambda_c)$ must all be valid density functions, that is, we require $f_c(x \mid \lambda_c) \geq 0$ for all $x$ and $\int f_c(x \mid \lambda_c)\, dx = 1$ for $c = 1, \dots, N$ (with the integral replaced by a sum in the discrete case), and (b) the mixture weights $\psi_c$ must lie on the $(N-1)$-dimensional simplex, that is, $\psi_c \geq 0$ for $c = 1, \dots, N$ and $\sum_{c=1}^{N} \psi_c = 1$. Assuming these conditions hold, this mixture distribution is defined for any choice of component densities, be they continuous or discrete. In practice, however, these densities are often chosen from the same family.
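As a minimal sketch of (3.1) and its two constraints, the snippet below evaluates a two-component mixture density. It assumes univariate normal component densities, which is an illustrative choice rather than one made in the text; the names `normal_pdf`, `check_weights` and `mixture_density`, and the parameter values, are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x (assumed component family)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def check_weights(psi, tol=1e-9):
    """Constraint (b): raise ValueError unless psi lies on the (N-1)-dimensional simplex."""
    if any(w < 0 for w in psi):
        raise ValueError("mixture weights must be non-negative")
    if abs(sum(psi) - 1.0) > tol:
        raise ValueError("mixture weights must sum to one")

def mixture_density(x, psi, lam):
    """Evaluate f(x | psi, lambda) = sum_c psi_c * f_c(x | lambda_c), as in (3.1)."""
    check_weights(psi)
    return sum(w * normal_pdf(x, mean, sd) for w, (mean, sd) in zip(psi, lam))

# Illustrative values only: N = 2 components, lambda_c = (mean, standard deviation).
psi = [0.3, 0.7]
lam = [(-2.0, 1.0), (3.0, 0.5)]
print(mixture_density(0.0, psi, lam))
```

Because each component integrates to one and the weights sum to one, the mixture itself integrates to one, which can be checked numerically on a fine grid.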
If we have $n$ observations, denoted $x = (x_1, \dots, x_n)$, the (observed data) likelihood is
$$ \pi(x \mid \psi, \lambda) = \prod_{i=1}^{n} \left\{ \sum_{c=1}^{N} \psi_c f_c(x_i \mid \lambda_c) \right\}, \tag{3.2} $$
which, in general, is very complicated: expanding the product over observations yields a sum of $N^n$ terms. It is, however, possible to make the form of the likelihood substantially more straightforward by appealing to data augmentation methods, specifically by introducing latent component/cluster indicator variables, which we now discuss.
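The observed data likelihood (3.2) can still be evaluated directly; a hedged sketch is given below. It works on the log scale and uses the log-sum-exp device for numerical stability, which is a common implementation choice rather than something prescribed in the text. Univariate normal components, strictly positive weights, and all function names and values are assumptions made for illustration.

```python
import math

def log_normal_pdf(x, mean, sd):
    """Log density of a univariate normal distribution (assumed component family)."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)

def log_sum_exp(values):
    """Compute log(sum(exp(v))) without overflow or underflow."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def observed_log_likelihood(xs, psi, lam):
    """Log of (3.2): sum over observations of log sum_c psi_c f_c(x_i | lambda_c).

    Assumes every weight in psi is strictly positive, so that log(psi_c) exists.
    """
    total = 0.0
    for x in xs:
        terms = [math.log(w) + log_normal_pdf(x, mean, sd)
                 for w, (mean, sd) in zip(psi, lam)]
        total += log_sum_exp(terms)
    return total

# Illustrative values only.
xs = [-1.5, 0.2, 3.1]
psi = [0.3, 0.7]
lam = [(-2.0, 1.0), (3.0, 0.5)]
print(observed_log_likelihood(xs, psi, lam))
```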
A common approach when implementing mixture models is to introduce latent cluster indicator variables, here denoted $c = (c_1, \dots, c_n)$, where $c_i = j$ denotes that observation $i$ belongs to component/cluster $j$. Conditional on the latent cluster indicator variables, the model is simplified significantly as the conditional density for observation $x_i$ is simply $f_{c_i}(x_i \mid \lambda_{c_i})$. These random (unobserved) variables follow a categorical distribution defined as $\Pr(c_i = c) = \psi_c$ for $i = 1, \dots, n$, $c = 1, \dots, N$ and denoted $c_i \mid \psi \sim \text{Cat}(\psi)$. Therefore the complete data likelihood is
$$ \pi(x, c \mid \psi, \lambda) = \prod_{i=1}^{n} \psi_{c_i} f_{c_i}(x_i \mid \lambda_{c_i}), $$
as, given the parameters $\lambda, \psi$, the pairs $(x_i, c_i)$ are independent. This form of the likelihood is substantially more straightforward than (3.2) and is the reason latent cluster indicators are typically introduced when fitting mixture models.
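The relationship between the two likelihoods can be illustrated with a small sketch: summing the likelihood conditional on the indicators over every possible allocation $c$ recovers (3.2). The brute-force enumeration below is only feasible for tiny $n$ and is meant as a check, not as a fitting method. Normal components, zero-based cluster labels, and all names and values are assumptions made for illustration.

```python
import itertools
import math

def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution (assumed component family)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def complete_likelihood(xs, cs, psi, lam):
    """pi(x, c | psi, lambda) = prod_i psi_{c_i} f_{c_i}(x_i | lambda_{c_i})."""
    value = 1.0
    for x, c in zip(xs, cs):
        mean, sd = lam[c]  # cluster labels are 0-based here, unlike the text
        value *= psi[c] * normal_pdf(x, mean, sd)
    return value

def observed_likelihood(xs, psi, lam):
    """pi(x | psi, lambda) as in (3.2)."""
    value = 1.0
    for x in xs:
        value *= sum(w * normal_pdf(x, mean, sd) for w, (mean, sd) in zip(psi, lam))
    return value

# Illustrative values only.
xs = [-1.5, 0.2, 3.1]
psi = [0.3, 0.7]
lam = [(-2.0, 1.0), (3.0, 0.5)]

# Sum the joint likelihood over all N^n allocations of observations to clusters.
allocations = itertools.product(range(len(psi)), repeat=len(xs))
marginal = sum(complete_likelihood(xs, cs, psi, lam) for cs in allocations)
observed = observed_likelihood(xs, psi, lam)
print(marginal, observed)
```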
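It follows that Bayesian implementations of finite mixture models, given the latent indicators, are generally of the form
$$
\begin{aligned}
X_i \mid \lambda, c_i, \psi &\sim f_{c_i}(x_i \mid \lambda_{c_i}) \\
c_i \mid \psi &\sim \text{Cat}(\psi) \\
\psi &\sim \text{Dir}(\alpha)
\end{aligned}
\tag{3.3}
$$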
where $i = 1, \dots, n$, and $\text{Dir}(\alpha)$ denotes the Dirichlet distribution with concentration parameters $\alpha = (\alpha_1, \dots, \alpha_N)$ where $\alpha_i > 0$. Note that in the model definition above the mixture components (and the latent cluster indicators) are exchangeable, that is, they can be arbitrarily relabelled while leaving the model specification unchanged (Stephens, 2000). Therefore, within an inference context, it is perhaps not sensible to favour a particular mixture component a priori. This is achieved by choosing the concentration parameters to be $\alpha_i = \alpha = 1$, which gives $\psi \sim \text{Dir}(1)$, that is, the mixture component weights $\psi$ follow a uniform distribution over the $(N-1)$-dimensional simplex.
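The hierarchy (3.3) can be read as a recipe for simulating data, sketched below. The Dirichlet draw uses the standard construction of normalising independent Gamma variables. The normal component family, the fixed component parameters (the text places no prior on $\lambda$ at this point), and the names `sample_dirichlet` and `simulate_mixture` are assumptions made for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw psi ~ Dir(alpha) by normalising independent Gamma(alpha_c, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def simulate_mixture(n, alpha, lam, rng):
    """Simulate from (3.3): psi ~ Dir(alpha), c_i ~ Cat(psi), X_i ~ f_{c_i}."""
    psi = sample_dirichlet(alpha, rng)
    # Cluster labels are 0-based here, unlike the text.
    cs = rng.choices(range(len(alpha)), weights=psi, k=n)
    # Assumed component family: normal with lambda_c = (mean, standard deviation).
    xs = [rng.gauss(*lam[c]) for c in cs]
    return psi, cs, xs

rng = random.Random(1)
alpha = [1.0, 1.0, 1.0]  # Dir(1): uniform over the simplex
lam = [(-2.0, 1.0), (0.0, 1.0), (3.0, 0.5)]  # illustrative values only
psi, cs, xs = simulate_mixture(5, alpha, lam, rng)
print(psi, cs, xs)
```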
Naturally we could choose to form an $N$-component mixture of Plackett–Luce models by letting the component distributions be of Plackett–Luce form, with $X_i \mid \Lambda, c_i \sim \text{PL}(\lambda_{c_i})$ where $\lambda_c = (\lambda_{c1}, \dots, \lambda_{cK})$ is the parameter vector associated with component (cluster) $c$ and $\Lambda = \{\lambda_c\}_{c=1}^{N}$ is the collection of all such parameter vectors. Indeed Gormley and Murphy (2008a,b, 2009) and Mollica and Tardella (2014) propose finite mixtures of Plackett–Luce and related models to allow for differing preferences between rankers. This approach was also taken by Vitelli et al. (2018), but they instead chose a distance-based model, namely that of Mallows (1957). Of course, this approach could be trivially extended to form an $N$-component mixture of Weighted Plackett–Luce models by letting $X_i \mid \Lambda, c_i \sim \text{PLW}(\lambda_{c_i}, w)$. Under this setting the ranker weights $w$ would be common across all components, with only the parameter vector $\lambda$ being cluster specific. In some sense the models described in Chapter 2 could be considered a trivial case within the (finite) mixture model framework, with $N = 1$ mixture component, that is, a single homogeneous subgroup which contains the entire population of rankers.
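A rough sketch of a mixture of Plackett–Luce models for complete rankings follows. It assumes the standard Plackett–Luce form, in which the entity in each position is chosen with probability proportional to its parameter among the entities not yet ranked; the exact definition used in Chapter 2 is not reproduced here, so this form, the zero-based entity labels, and the names `plackett_luce_prob` and `pl_mixture_prob` are assumptions.

```python
import itertools
import math

def plackett_luce_prob(ranking, lam):
    """Probability of a complete ranking (most preferred first) under PL(lam).

    ranking holds 0-based entity labels; lam[e] > 0 is the parameter of entity e.
    """
    prob = 1.0
    for j, entity in enumerate(ranking):
        # Normalise over the entities that have not yet been ranked.
        denom = sum(lam[e] for e in ranking[j:])
        prob *= lam[entity] / denom
    return prob

def pl_mixture_prob(ranking, psi, Lambda):
    """Mixture probability sum_c psi_c * PL(ranking | lambda_c)."""
    return sum(w * plackett_luce_prob(ranking, lam_c) for w, lam_c in zip(psi, Lambda))

# Illustrative values only: N = 2 clusters with opposing preferences over K = 3 entities.
psi = [0.6, 0.4]
Lambda = [[3.0, 2.0, 1.0], [1.0, 2.0, 3.0]]

for ranking in itertools.permutations(range(3)):
    print(ranking, pl_mixture_prob(ranking, psi, Lambda))
```

Since each component is a probability distribution over the $K!$ possible rankings, the mixture probabilities printed above sum to one.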
Although finite mixture models give the flexibility to model heterogeneous data, specifying an appropriate form of such a model is a non-trivial task. One of the main issues that arises when fitting finite mixture models is the constraint that a fixed number of mixture components must be chosen a priori. This requires the analyst to decide how many subgroups are contained within a population before performing their analysis. In an attempt to overcome this issue many choose instead to fit numerous models, each with a different number of components, and then appeal to model selection techniques (such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC)) to determine which model best fits the data. This solution, however, comes at the cost of performing numerous analyses, and the analyst is still required to choose which numbers of components to consider. Ideally the mixture model would be defined so that the number of components is not fixed a priori and can instead be inferred using, for example, reversible jump methods (Richardson and Green, 1997).
Alternatively we can appeal to a more flexible class of models, namely infinite mixture models. As the name suggests, infinite mixture models contain an “infinite” number of components and thus the underlying density $f(x \mid \psi, \lambda)$ can be thought of as the limiting case as $N \to \infty$ of a finite mixture (3.1). Note that an “infinite” number of components only exists in theory; in practice the number of non-empty components can be at most the number of observations. Given the form of a finite mixture model (3.3) it is clear that we require an infinite-dimensional Dirichlet distribution in order to define an infinite mixture model. The generalisation of the Dirichlet distribution to infinite dimension is the Dirichlet process, which is the topic of the next section.