
For each individual we observe genotypes $g = (g_1, \dots, g_M)$ at $M$ SNPs along a chromosome. For generality, let $N$ denote the number of chromosome copies per individual (so that $N = 2$ for diploid organisms). The genotype of an individual is thought of as an unordered list of alleles, $g_m = \{g_{m1}, \dots, g_{mN}\}$, and each allele is assumed to have been derived from an ancestral haplotype, drawn from a fixed pool of $z$ distinct ancestral haplotypes. The unordered list of ancestral haplotypes at marker $m$ is denoted by $s_m = \{s_{m1}, \dots, s_{mN}\}$. We also denote by $[s_{m1}, \dots, s_{mN}]$ the ordered list of ancestral haplotypes, by $\pi(s_m) = [s_{m\pi(1)}, \dots, s_{m\pi(N)}]$ a permutation of this ordered list, and by $\Pi(s_m)$ the set of all such distinct permutations. Thus, for example, if $s_m = [1, 2]$ there are two permutations, namely $[2, 1]$ and $[1, 2]$, whereas if $s_m = [1, 1]$ there is only one permutation. The sequence $s = (s_1, \dots, s_M)$ forms a Markov chain on the space of unordered lists of ancestral haplotypes. In this thesis, we use 'ancestral haplotype', 'cluster' and 'state' interchangeably.
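The state space and the permutation set $\Pi(s_m)$ described above can be made concrete with a short sketch. The function names `unordered_states` and `distinct_orderings` are hypothetical and introduced only for illustration; clusters are labelled $1, \dots, z$ as in the example in the text.

```python
from itertools import combinations_with_replacement, permutations


def unordered_states(z, N):
    """All unordered lists (multisets) of N cluster labels drawn from 1..z.

    Each state is returned as a sorted tuple, which serves as the canonical
    representative of its equivalence class of ordered lists.
    """
    return list(combinations_with_replacement(range(1, z + 1), N))


def distinct_orderings(state):
    """The set Pi(s_m): distinct ordered lists obtained by permuting `state`."""
    return sorted(set(permutations(state)))


print(distinct_orderings((1, 2)))  # two distinct orderings
print(distinct_orderings((1, 1)))  # a single ordering
print(unordered_states(2, 2))      # hidden states for z = 2, N = 2
```

Representing each unordered list by its sorted tuple is one possible design choice; any canonical representative of the equivalence class would serve equally well.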

Transition probability

The hidden states in the HMM are modelled by two distributions: the given initial distribution of hidden states and the transition distribution of states at two successive markers. Recombination events are captured by transitions in the HMM, at which an individual's haplotype can switch the ancestral haplotype from which it is considered to have descended. Gene conversion events are not modelled explicitly but can be accommodated in our model by two proximal recombination events.

Transitions are allowed to occur continuously along the sequence. First we define the transition probability in a haploid HMM from cluster $k_n$ to cluster $l_n$ between markers $m-1$ and $m$ by

$$
p(s_{mn} = l_n \mid s_{(m-1)n} = k_n) =
\begin{cases}
(1 - J_m) + J_m \alpha_{m l_n}, & l_n = k_n, \\
J_m \alpha_{m l_n}, & l_n \neq k_n,
\end{cases}
\tag{2.1}
$$

where $J_m$ is the probability of a jump occurring at marker $m-1$. The probability $\alpha_{m l_n}$ of landing in cluster $l_n$ after a jump does not depend on the current cluster $k_n$, which means that "transitions" can occur that do not change the ancestral haplotype. Recall that $k_n$ and $l_n$ are indices of ancestral haplotypes, and hence $\alpha_m$ is a probability vector with $z$ elements. For tightly linked markers, $J_m$ is small so that cluster changes occur infrequently, but are allowed between any pair of markers.
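As a minimal sketch of the haploid transition in equation (2.1), the following builds the full $z \times z$ matrix for one marker interval. The function name `haploid_transition`, the 0-based cluster indexing, and the numerical values are assumptions made for illustration only.

```python
import numpy as np


def haploid_transition(J_m, alpha_m):
    """Matrix T with T[k, l] = p(s_mn = l | s_(m-1)n = k), as in equation (2.1).

    J_m     : jump probability for this marker interval (scalar in [0, 1])
    alpha_m : probability vector of length z over ancestral haplotypes
    """
    alpha_m = np.asarray(alpha_m, dtype=float)
    z = alpha_m.size
    # With probability J_m a jump occurs and the new cluster is drawn from
    # alpha_m, regardless of the current cluster: every row gets J_m * alpha_m.
    T = np.tile(J_m * alpha_m, (z, 1))
    # With probability 1 - J_m no jump occurs and the cluster is unchanged.
    T[np.diag_indices(z)] += 1.0 - J_m
    return T


# Illustrative values only (not taken from the text).
T = haploid_transition(0.1, [0.5, 0.3, 0.2])
print(T)
print(T.sum(axis=1))  # each row is a probability distribution
```

The diagonal entries exceed $1 - J_m$ because a jump may land back on the same cluster.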

Based on this haploid model and under the assumption of Hardy-Weinberg equilibrium (HWE; i.e. the haplotypes making up each polyploid genotype are independent at each marker), the transition probability $a_{mkl}$ between the unordered lists of clusters $k = \{k_1, \dots, k_N\}$ and $l = \{l_1, \dots, l_N\}$ at markers $m-1$ and $m$ is given by

$$
a_{mkl} = p(s_m = \{l_1, \dots, l_N\} \mid s_{m-1} = \{k_1, \dots, k_N\}) = \sum_{\pi \in \Pi(s_m)} \prod_{n=1}^{N} p(s_{mn} = l_{\pi(n)} \mid s_{(m-1)n} = k_n). \tag{2.2}
$$

Note that in equation (2.2) we sum over permutations of the destination ("to") ancestral haplotypes. To understand this intuitively, consider first transitions between ordered lists of ancestral haplotypes:

$$
p(s_m = [l_1, \dots, l_N] \mid s_{m-1} = [k_1, \dots, k_N]) = \prod_{n=1}^{N} p(s_{mn} = l_n \mid s_{(m-1)n} = k_n). \tag{2.3}
$$

We can consider an unordered list of ancestral haplotypes as the collection of all ordered lists of ancestral haplotypes that are equivalent to each other under permutation (in other words, we use the permutation operator to define equivalence classes on the set of ordered lists of ancestral haplotypes). The transition probability from an ordered list of ancestral haplotypes $[k_1, \dots, k_N]$ to the unordered list of ancestral haplotypes $\{l_1, \dots, l_N\}$ should then equal the sum of the transition probabilities from that ordered list to all of the ordered lists comprising this equivalence class, which is just the sum over all permutations given in equation (2.2). Finally, this construction yields equal transition probabilities from each ordered list of ancestral haplotypes $[k_{\pi(1)}, \dots, k_{\pi(N)}]$ to the unordered list of ancestral haplotypes $\{l_1, \dots, l_N\}$, and hence we can use it as the transition probability from the unordered list of ancestral haplotypes $\{k_1, \dots, k_N\}$.
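The permutation argument can be checked numerically. The sketch below computes the transition between unordered lists as in equation (2.2) by summing the ordered-list product of equation (2.3) over the distinct orderings of the destination list. The names `haploid_transition` and `polyploid_transition`, the 0-based cluster labels, and the parameter values are hypothetical.

```python
from itertools import combinations_with_replacement, permutations

import numpy as np


def haploid_transition(J_m, alpha_m):
    """T[k, l] = (1 - J_m) * [k == l] + J_m * alpha_m[l], as in equation (2.1)."""
    alpha_m = np.asarray(alpha_m, dtype=float)
    return (1.0 - J_m) * np.eye(alpha_m.size) + J_m * alpha_m[None, :]


def polyploid_transition(k, l, T):
    """Transition probability from list k to unordered list l, as in (2.2).

    k may be given in any order; l is treated as unordered, so we sum the
    ordered-list probability (2.3) over all distinct orderings of l.
    """
    total = 0.0
    for l_perm in set(permutations(l)):
        prob = 1.0
        for k_n, l_n in zip(k, l_perm):
            prob *= T[k_n, l_n]
        total += prob
    return total


# Illustrative check with z = 3 clusters and N = 2 chromosome copies.
T = haploid_transition(0.2, [0.5, 0.3, 0.2])
states = list(combinations_with_replacement(range(3), 2))
for k in states:
    row_sum = sum(polyploid_transition(k, l, T) for l in states)
    print(k, row_sum)  # each row sums to one over the unordered destinations

# The result does not depend on how the source list k is ordered.
print(polyploid_transition((0, 1), (0, 2), T),
      polyploid_transition((1, 0), (0, 2), T))
```

Summing over distinct orderings only (via `set`) matters when the destination list contains repeated clusters, such as $\{1, 1\}$, which has a single ordering.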

Emission probability

The relation between hidden clusters and observed genotype data is modelled by emission probabilities. As for the transition probability, the emission probability of the polyploid HMM can be derived from a haploid model. For generality, we assume multiallelic markers with alleles $h \in \{0, \dots, H\}$. Denote by $\theta_{m l_n}(h)$ the emission probability of allele $h$ in a haploid model at marker $m$ given the ancestral haplotype cluster $l_n$. Under the HWE assumption, we then obtain the emission probability $e_{ml}(g_m)$ of a genotype given an unordered list of ancestral haplotypes $s_m$ by

$$
e_{ml}(g_m) = p(g_m = \{h_1, \dots, h_N\} \mid s_m = \{l_1, \dots, l_N\}) = \sum_{\pi \in \Pi(g_m)} \prod_{n=1}^{N} p(g_{mn} = h_{\pi(n)} \mid s_{mn} = l_n) = \sum_{\pi \in \Pi(g_m)} \prod_{n=1}^{N} \theta_{m l_n}(h_{\pi(n)}). \tag{2.4}
$$

If $g_m$ is completely missing for an individual at marker $m$, we set $\theta_{m l_n}(h)$ to be uniform over all alleles. If $g_m$ is partially missing, we set $\theta_{m l_n}(h)$ to be uniform over

The haplotype of an observed individual is not an exact copy of the ancestral haplotype from which it has descended, because of evolutionary processes and imperfect inferences. Thus the $\theta_{m l_n}(h)$ are generally different from zero and one, but they should typically be close to one of these values.
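A small sketch of the emission probability in equation (2.4) for fully observed genotypes follows. The function name `emission`, the array layout `theta_m[cluster, allele]`, and the numerical values are assumptions for illustration; the handling of missing genotypes is deliberately not covered.

```python
from itertools import permutations

import numpy as np


def emission(g, l, theta_m):
    """Emission probability of unordered genotype g given clusters l, as in (2.4).

    g       : tuple of N observed alleles (unordered; any order may be passed)
    l       : tuple of N cluster indices (0-based)
    theta_m : array with theta_m[c, h] = haploid probability of allele h
              given cluster c at this marker
    """
    total = 0.0
    # Sum over the distinct orderings of the observed alleles, Pi(g_m).
    for g_perm in set(permutations(g)):
        prob = 1.0
        for l_n, h in zip(l, g_perm):
            prob *= theta_m[l_n, h]
        total += prob
    return total


# Illustrative biallelic marker: cluster 0 mostly carries allele 0,
# cluster 1 mostly carries allele 1 (values are made up).
theta_m = np.array([[0.9, 0.1],
                    [0.2, 0.8]])
for g in [(0, 0), (0, 1), (1, 1)]:
    print(g, emission(g, (0, 1), theta_m))
```

For the clusters $\{0, 1\}$ in this example the heterozygous genotype receives $0.9 \cdot 0.8 + 0.1 \cdot 0.2 = 0.74$, and the three genotype probabilities sum to one.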

Full probability of an observed genotype sequence

The probability of an observed genotype sequence $g = (g_1, \dots, g_M)$ is obtained by summing over all possible paths $s \in S$ of hidden states:

$$
p(g \mid \alpha, J, \theta) = \sum_{s \in S} p(g, s \mid \alpha, J, \theta) = \sum_{s \in S} \prod_{m=1}^{M} p(s_m \mid J_m, \alpha_m, s_{m-1})\, p(g_m \mid s_m, \theta_m), \tag{2.5}
$$

where the transition and emission probabilities (the two factors of the product) are determined by equations (2.2) and (2.4). Although the number of possible paths increases exponentially with the length of the sequence, the Markov property (given the state at the previous marker, the current state is conditionally independent of all earlier states) allows this full probability to be computed efficiently using the forward-backward algorithm, whose computational cost increases only linearly with the length of the sequence.
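To illustrate how the sum in equation (2.5) can be evaluated in time linear in $M$, here is a minimal forward-pass sketch. It is an illustration under stated assumptions, not the implementation used in this thesis: the function names, the 0-based cluster labels, the array layouts, and the uniform initial distribution `init` in the demo are all hypothetical.

```python
from itertools import combinations_with_replacement, permutations, product

import numpy as np


def transition(k, l, J_m, alpha_m):
    """Equation (2.2): sum over distinct orderings of l of the haploid products."""
    def hap(k_n, l_n):  # equation (2.1)
        return (1.0 - J_m) * (k_n == l_n) + J_m * alpha_m[l_n]
    return sum(np.prod([hap(k_n, l_n) for k_n, l_n in zip(k, lp)])
               for lp in set(permutations(l)))


def emission(g, l, theta_m):
    """Equation (2.4): sum over distinct orderings of the observed alleles."""
    return sum(np.prod([theta_m[l_n, h] for l_n, h in zip(l, gp)])
               for gp in set(permutations(g)))


def forward_likelihood(genotypes, init, J, alpha, theta, z, N):
    """p(g | alpha, J, theta) via the forward recursion.

    genotypes : list of M genotype tuples
    init      : assumed initial distribution over unordered states
    J, alpha  : per-marker parameters; index 0 is unused because the first
                marker is governed by `init`
    theta     : array of shape (M, z, number of alleles)
    """
    states = list(combinations_with_replacement(range(z), N))
    f = np.array([init[i] * emission(genotypes[0], s, theta[0])
                  for i, s in enumerate(states)])
    for m in range(1, len(genotypes)):
        A = np.array([[transition(k, l, J[m], alpha[m]) for l in states]
                      for k in states])
        e = np.array([emission(genotypes[m], l, theta[m]) for l in states])
        f = (f @ A) * e  # one step: propagate, then weight by the emission
    return f.sum()


# Demo with made-up parameters: z = 2 clusters, N = 2 copies, M = 3 markers.
z, N, M = 2, 2, 3
J = np.array([0.0, 0.1, 0.2])
alpha = np.tile([0.6, 0.4], (M, 1))
theta = np.tile(np.array([[0.9, 0.1], [0.2, 0.8]]), (M, 1, 1))
init = np.ones(3) / 3  # uniform over the 3 unordered states (an assumption)

genos = [(0, 0), (0, 1), (1, 1)]
total = sum(forward_likelihood(list(gs), init, J, alpha, theta, z, N)
            for gs in product(genos, repeat=M))
print(total)  # likelihoods over all genotype sequences sum to one
```

This unscaled recursion is adequate for short sequences; for long sequences the forward variables would typically be rescaled at each marker or kept in log space to avoid numerical underflow.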
