For each individual we observe genotypes $g = (g_1, \ldots, g_M)$ at $M$ SNPs along a
chromosome. For generality, let $N$ denote the number of chromosome copies per
individual (so that $N = 2$ for diploid organisms). The genotype of an individual
at marker $m$ is thought of as an unordered list of alleles,
$g_m = \{g_{m1}, \ldots, g_{mN}\}$, and each allele is assumed to have been derived
from an ancestral haplotype, drawn from a fixed pool of $z$ distinct ancestral
haplotypes. The unordered list of ancestral haplotypes at marker $m$ is denoted by
$s_m = \{s_{m1}, \ldots, s_{mN}\}$. We also denote by $[s_{m1}, \ldots, s_{mN}]$ the
ordered list of ancestral haplotypes, by $\pi(s_m) = [s_{m\pi(1)}, \ldots, s_{m\pi(N)}]$
a permutation of this ordered list, and by $\Pi(s_m)$ the set of all distinct
permutations. Thus, for example, if $s_m = [1, 2]$ there are two permutations, namely
$[2, 1]$ and $[1, 2]$, whereas if $s_m = [1, 1]$ there is only one. The sequence
$s = (s_1, \ldots, s_M)$ forms a Markov chain on the space of unordered lists of
ancestral haplotypes. In this thesis, we use 'ancestral haplotype', 'cluster' and
'state' interchangeably.
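The state space described above can be enumerated directly. The following is a minimal sketch, assuming clusters are labelled 1 to z; the function names `unordered_states` and `distinct_permutations` are illustrative and do not come from the thesis.

```python
from itertools import combinations_with_replacement, permutations
from math import comb

def unordered_states(z, N):
    """All unordered lists of N ancestral haplotypes drawn from z clusters."""
    return list(combinations_with_replacement(range(1, z + 1), N))

def distinct_permutations(s):
    """Distinct orderings of the list s, playing the role of Pi(s) in the text."""
    return sorted(set(permutations(s)))

states = unordered_states(z=3, N=2)
print(states)                         # [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)]
print(len(states), comb(3 + 2 - 1, 2))  # both equal 6
print(distinct_permutations((1, 2)))  # [(1, 2), (2, 1)]
print(distinct_permutations((1, 1)))  # [(1, 1)]
```

The number of hidden states is the number of multisets of size N over z clusters, which grows quickly with both z and N.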
Transition probability
The hidden states in the HMM are modelled by an initial distribution over hidden
states and a transition distribution between states at two successive markers.
Recombination events are captured by transitions in the HMM, at which an
individual's haplotype can switch the ancestral haplotype from which it is considered
to have descended. Gene conversion events are not modelled explicitly, but can be
accommodated in our model as two proximal recombination events.
Transitions are allowed to occur continuously along the sequence. First we define
the transition probability in a haploid HMM from cluster $k_n$ to cluster $l_n$
between markers $m-1$ and $m$ by

$$
p(s_{mn} = l_n \mid s_{(m-1)n} = k_n) =
\begin{cases}
(1 - J_m) + J_m \alpha_{m l_n}, & l_n = k_n \\
J_m \alpha_{m l_n}, & l_n \neq k_n,
\end{cases}
\tag{2.1}
$$
where $J_m$ is the probability of a jump occurring at marker $m-1$. Given a jump,
the probability $\alpha_{m l_n}$ of landing in cluster $l_n$ does not depend on the
current cluster $k_n$, which means that "transitions" can occur that do not change
the ancestral haplotype. Recall that $k_n$ and $l_n$ are indices of ancestral
haplotypes, and hence $\alpha_m$ is a probability vector with $z$ elements. For
tightly linked markers, $J_m$ is small, so cluster changes occur infrequently but
are allowed between any pair of markers.
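Equation (2.1) can be sketched in a few lines of code. This is an illustration only: clusters are indexed from 0, and the values of `J_m` and `alpha_m` are made up for the example rather than taken from the thesis.

```python
def haploid_transition(k, l, J_m, alpha_m):
    """Equation (2.1): probability of moving from cluster k to cluster l.

    alpha_m is a probability vector with z entries; clusters are indexed from 0.
    """
    stay = (1.0 - J_m) if l == k else 0.0
    return stay + J_m * alpha_m[l]

J_m = 0.05                 # illustrative jump probability
alpha_m = [0.5, 0.3, 0.2]  # illustrative jump destinations, z = 3

# Transition probabilities out of cluster 0; they sum to one.
row = [haploid_transition(0, l, J_m, alpha_m) for l in range(len(alpha_m))]
print(row, sum(row))
```

The diagonal entry contains both the no-jump term and the probability of jumping back to the same cluster, which is why a jump need not change the ancestral haplotype.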
Based on this haploid model and under the assumption of HWE (i.e. haplotypes
making up each polyploid genotype are independent at each marker), the transition
probability $a_{mkl}$ between the unordered lists of clusters $k = \{k_1, \ldots, k_N\}$
and $l = \{l_1, \ldots, l_N\}$ at markers $m-1$ and $m$ is given by

$$
a_{mkl} = p(s_m = \{l_1, \ldots, l_N\} \mid s_{(m-1)} = \{k_1, \ldots, k_N\})
= \sum_{\pi \in \Pi(s_m)} \prod_{n=1}^{N} p(s_{mn} = l_{\pi(n)} \mid s_{(m-1)n} = k_n).
\tag{2.2}
$$
Note that in equation (2.2) we sum over permutations of the destination ("to")
ancestral haplotypes. To understand this intuitively, consider first transitions
between ordered lists of ancestral haplotypes:

$$
p(s_m = [l_1, \ldots, l_N] \mid s_{(m-1)} = [k_1, \ldots, k_N])
= \prod_{n=1}^{N} p(s_{mn} = l_n \mid s_{(m-1)n} = k_n).
\tag{2.3}
$$

An unordered list of ancestral haplotypes can be viewed as the collection of all
ordered lists that are equivalent to each other under permutation (in other words,
the permutation operator defines equivalence classes on the set of ordered lists of
ancestral haplotypes). The transition probability from an ordered list
$[k_1, \ldots, k_N]$ to the unordered list $\{l_1, \ldots, l_N\}$ is therefore the sum
of the transition probabilities from that ordered list to each of the ordered lists
in the equivalence class, which is just the sum over all distinct permutations given
in equation (2.2). Finally, this sum takes the same value for every ordered list
$[k_{\pi(1)}, \ldots, k_{\pi(N)}]$, because reordering the origin list only reorders
the terms of the sum. Hence we are able to use it as the transition probability
from the unordered list of ancestral haplotypes $\{k_1, \ldots, k_N\}$.
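The argument above can be checked numerically. The sketch below implements equation (2.2) on top of the haploid transition of equation (2.1); it assumes 0-based cluster indices and illustrative parameter values, and the function names are hypothetical.

```python
from itertools import combinations_with_replacement, permutations

def haploid_transition(k, l, J_m, alpha_m):
    """Equation (2.1), with clusters indexed from 0."""
    stay = (1.0 - J_m) if l == k else 0.0
    return stay + J_m * alpha_m[l]

def polyploid_transition(k, l, J_m, alpha_m):
    """Equation (2.2): sum over distinct orderings of l, product over copies."""
    total = 0.0
    for perm in set(permutations(l)):
        prob = 1.0
        for k_n, l_n in zip(k, perm):
            prob *= haploid_transition(k_n, l_n, J_m, alpha_m)
        total += prob
    return total

J_m, alpha_m, N = 0.05, [0.5, 0.3, 0.2], 2  # illustrative values
states = list(combinations_with_replacement(range(len(alpha_m)), N))

# The origin list may be given in either order; the result is the same.
row_01 = [polyploid_transition((0, 1), l, J_m, alpha_m) for l in states]
row_10 = [polyploid_transition((1, 0), l, J_m, alpha_m) for l in states]
print(sum(row_01))  # close to 1.0
```

Summing over distinct orderings only (via `set`) avoids double counting destinations with repeated clusters, such as (1, 1), and keeps each row of the transition matrix normalised.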
Emission probability
The relation between hidden clusters and observed genotype data is modelled by
emission probabilities. As for the transition probability, the emission probability
of the polyploid HMM can be derived from a haploid model. For generality, we
assume multiallelic markers with alleles $h \in \{0, \ldots, H\}$. Denote by
$\theta_{m l_n}(h)$ the emission probability of allele $h$ in a haploid model at
marker $m$ given the ancestral haplotype cluster $l_n$. Under the HWE assumption,
we then obtain the emission probability $e_{ml}(g_m)$ of a genotype given an
unordered list of ancestral haplotypes $s_m$ by

$$
e_{ml}(g_m) = p(g_m = \{h_1, \ldots, h_N\} \mid s_m = \{l_1, \ldots, l_N\})
= \sum_{\pi \in \Pi(g_m)} \prod_{n=1}^{N} p(g_{mn} = h_{\pi(n)} \mid s_{mn} = l_n)
= \sum_{\pi \in \Pi(g_m)} \prod_{n=1}^{N} \theta_{m l_n}(h_{\pi(n)}).
\tag{2.4}
$$

If $g_m$ is completely missing for an individual at marker $m$, we set
$\theta_{m l_n}(h)$ to be uniform over all alleles. If $g_m$ is partially missing,
we set $\theta_{m l_n}(h)$ to be uniform over
The haplotype of an observed individual is not an exact copy of the ancestral
haplotype from which it has descended, because of evolutionary processes and
imperfect inferences. Thus the $\theta_{m l_n}(h)$ are generally different from zero
and one, but they should typically be close to one of these values.
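As a small illustration of equation (2.4), the sketch below computes the emission probability of a fully observed genotype. The table `theta_m` holds made-up values for two clusters at a biallelic marker, and missing data are not handled.

```python
from itertools import combinations_with_replacement, permutations

def emission(g, l, theta_m):
    """Equation (2.4) for a fully observed genotype g.

    theta_m[c][h] is the haploid probability of allele h given cluster c.
    """
    total = 0.0
    for perm in set(permutations(g)):   # distinct orderings of the alleles
        prob = 1.0
        for h, l_n in zip(perm, l):
            prob *= theta_m[l_n][h]
        total += prob
    return total

theta_m = [[0.99, 0.01],   # cluster 0 almost always carries allele 0
           [0.02, 0.98]]   # cluster 1 almost always carries allele 1

genotypes = list(combinations_with_replacement(range(2), 2))  # (0,0), (0,1), (1,1)
probs = [emission(g, (0, 1), theta_m) for g in genotypes]
print(dict(zip(genotypes, probs)))
```

With one copy from each cluster, nearly all of the probability mass falls on the heterozygous genotype (0, 1), and the three genotype probabilities sum to one.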
Full probability of an observed genotype sequence
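The probability of an observed genotype sequence $g = (g_1, \ldots, g_M)$ is obtained
by summing over the set $S$ of all possible paths $s = (s_1, \ldots, s_M)$ of hidden
states:

$$
p(g \mid \alpha, J, \theta) = \sum_{s \in S} p(g, s \mid \alpha, J, \theta)
= \sum_{s \in S} \prod_{m=1}^{M} p(s_m \mid J_m, \alpha_m, s_{m-1}) \, p(g_m \mid s_m, \theta_m),
\tag{2.5}
$$
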
where the transition and emission probabilities (the two factors in the product)
are given by equations (2.2) and (2.4), and for $m = 1$ the transition term is the
initial distribution of hidden states. Although the number of possible paths
increases exponentially with the length of the sequence, the Markov property (the
current state is conditionally independent of all earlier states given the previous
state) allows this full probability to be computed efficiently with the
forward-backward algorithm, whose computational cost increases only linearly with
the length of the sequence.
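To make the efficiency argument concrete, the sketch below compares a forward recursion with brute-force enumeration of all paths on a toy HMM. It is a generic illustration, not the thesis implementation: the two-state chain, the `init`, `trans` and `emit` tables and their values are all assumptions made for the example. Only the forward pass is needed to obtain the likelihood.

```python
from itertools import product

def forward_likelihood(init, trans, emit):
    """Likelihood via the forward recursion, linear in the number of markers.

    init[k]        : initial probability of state k
    trans[m][k][l] : probability of moving from k to l between markers m-1 and m
                     (trans[0] is unused)
    emit[m][l]     : probability of the observed genotype at marker m given state l
    """
    K = len(init)
    f = [init[l] * emit[0][l] for l in range(K)]
    for m in range(1, len(emit)):
        f = [emit[m][l] * sum(f[k] * trans[m][k][l] for k in range(K))
             for l in range(K)]
    return sum(f)

def brute_force_likelihood(init, trans, emit):
    """Same quantity by summing over every path, exponential in the number of markers."""
    K, M = len(init), len(emit)
    total = 0.0
    for path in product(range(K), repeat=M):
        p = init[path[0]] * emit[0][path[0]]
        for m in range(1, M):
            p *= trans[m][path[m - 1]][path[m]] * emit[m][path[m]]
        total += p
    return total

# Toy example with 2 hidden states and 3 markers (illustrative numbers).
init = [0.6, 0.4]
T = [[0.9, 0.1], [0.2, 0.8]]
trans = [None, T, T]
emit = [[0.9, 0.1], [0.3, 0.7], [0.5, 0.5]]

fwd = forward_likelihood(init, trans, emit)
brute = brute_force_likelihood(init, trans, emit)
print(fwd, brute)  # the two values agree up to rounding
```

The forward recursion performs on the order of K squared operations per marker for K hidden states, whereas the brute-force sum visits K to the power M paths.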