A convenient illustration of a DNA motif based on its PFM is the so-called sequence logo (Schneider and Stephens, 1990); see Fig. 2.4 for an example. The sequence logo shows the preference of the TF for a certain nucleotide at each position. If the preference is very high, the position-specific distribution is a Dirac distribution with all probability weight at the preferred nucleotide. From the point of view of information theory (for an introduction, see Shannon and Weaver, 1949; MacKay, 2003), the information content is then maximal (Shannon, 1948). The more unspecific the preference at a position is, the lower the information content. Therefore, one can use the information content to represent the strength of affinity of the TF for each position $\kappa$, defined by

\[
\log_2 |\mathcal{A}| + \sum_{a \in \mathcal{A}} \pi_{\kappa,a} \log_2 \pi_{\kappa,a}. \tag{2.3}
\]

This formula assumes an equi-probable background distribution. Otherwise, one has to use the relative entropy (or Kullback-Leibler distance) between the position-specific and the background distribution defined by

\[
\sum_{a \in \mathcal{A}} \pi_{\kappa,a} \log_2 \frac{\pi_{\kappa,a}}{\mu(a)}.
\]

For a uniform background distribution, the relative entropy reaches its maximum $\log_2 |\mathcal{A}|$ if $\pi_{\kappa,\cdot}$ is a Dirac distribution; for a general background, a Dirac distribution at nucleotide $a$ yields $-\log_2 \mu(a)$. The minimum 0 occurs if the position-specific distribution is equal to the background distribution. Note that the information content in Eq. (2.3) is recovered by setting $\mu(a) = |\mathcal{A}|^{-1}$ for all $a$. The contribution of each nucleotide to the relative entropy at a position is obtained by multiplying the relative entropy by the position-specific nucleotide frequency $\pi_{\kappa,a}$.
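The two quantities above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the function names and the example columns are assumptions chosen for demonstration, and terms with zero frequency are taken to contribute 0.

```python
import math

def relative_entropy(column, background):
    """Relative entropy (in bits) between one PFM column and a background."""
    # Zero-frequency letters contribute 0 (limit of p * log p as p -> 0).
    return sum(p * math.log2(p / background[a])
               for a, p in column.items() if p > 0)

def information_content(column):
    """Eq. (2.3): log2|A| + sum_a pi_a * log2(pi_a), i.e. uniform background."""
    return math.log2(len(column)) + sum(
        p * math.log2(p) for p in column.values() if p > 0)

# Hypothetical position-specific distributions over the DNA alphabet.
uniform = {a: 0.25 for a in "ACGT"}
dirac = {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0}
mixed = {"A": 0.5, "C": 0.0, "G": 0.0, "T": 0.5}

print(information_content(dirac))        # maximal: log2(4) bits
print(information_content(uniform))      # minimal: 0 bits
print(relative_entropy(mixed, uniform))  # agrees with information_content(mixed)
```

With a uniform background, both functions return the same value, which mirrors the remark that Eq. (2.3) is the special case $\mu(a) = |\mathcal{A}|^{-1}$.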

Based on these considerations, one can draw a sequence logo in which the contribution of each letter at a position is encoded by the height of the nucleotide, such that the summed heights correspond to the relative entropy. Fig. 2.4 shows the PFM from Ex. 2.1 with consensus 'GCCAA', created by the program weblogo (Crooks et al., 2004). The first, third, and fourth positions have a very strong preference for the consensus letters. In contrast, the second and the fifth positions have a weaker affinity, but a preference for 'C' or 'T' and for 'A' or 'T', respectively, still exists. Hence, the sequence logo is a suitable tool to visualize a DNA motif. However, dependencies between positions are not reflected in sequence logos. This would require a more sophisticated approach such as structural logos (Gorodkin et al., 1997).
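As a rough sketch of how the letter heights of a single logo position could be computed, the following Python snippet scales the relative entropy of a column by each nucleotide frequency. The function name `logo_heights` and the example column are hypothetical; the column is not taken from Ex. 2.1, and the snippet does not reproduce what weblogo does internally.

```python
import math

def logo_heights(column, background=None):
    """Height of each letter: frequency times the column's relative entropy."""
    alphabet = sorted(column)
    if background is None:
        # Assumption: fall back to an equi-probable background.
        background = {a: 1.0 / len(alphabet) for a in alphabet}
    total = sum(p * math.log2(p / background[a])
                for a, p in column.items() if p > 0)
    return {a: column[a] * total for a in alphabet}

# Hypothetical column with a preference for 'C' and a weaker one for 'T'.
column = {"A": 0.0, "C": 0.75, "G": 0.0, "T": 0.25}
heights = logo_heights(column)
for letter, height in heights.items():
    print(f"{letter}: {height:.3f} bits")
```

By construction the heights sum to the relative entropy of the column, so a strongly conserved position produces a tall stack and a weakly conserved one a short stack.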

Chapter 3

Word Count Statistics

3.1 Introduction

Rapid sequencing of DNA (Maxam and Gilbert, 1977; Sanger et al., 1977) generated a vast amount of sequences to be analyzed. The first studies focused on protein-coding sequences and analyzed codon usage (Almagor, 1983). Later, interest rose in non-coding sequences and in the detection of exceptional words in sequences hinting at biological function (Pevzner et al., 1989). The occurrence of patterns in random strings is a classical problem (Feller, 1968). The first exact results for the expected number of occurrences were derived from simple probabilistic models (Dayhoff, 1984; Santibanez-Koref, 1987). This chapter reviews different methods for computing the distribution of the number of word occurrences (for other reviews, see Reinert et al., 2005; Robin et al., 2005). We also investigate the distribution of word clusters (clumps). We present a new exact formula to compute the variance and the exact distribution without using generating functions or automata. Computational issues are mainly ignored except for a few remarks about algorithmic complexity (for an overview, consult Gusfield, 1997; Waterman, 2000; Lonardi, 2001).

Our review starts by considering single words (roughly following the exposition in Robin et al., 2005). We present two exact approaches: first, the classical approach based on waiting times (Gentleman and Mullin, 1989; Gentleman, 1994; Robin et al., 2005) and, second, a very recent approach (Zhang et al., 2007), which we call the conditional approach, using optimally spaced seeds motivated by homology search (Ma et al., 2002). Although this approach is specifically designed for PFMs, it takes as input a set of words (for PFMs, the set of compatible words). Hence, we classify it as a word counting approach. Since the exact approaches are infeasible to compute for large sets of words, we also introduce approximations. Initially, we consider an independence model and obtain a binomial and a Poisson approximation. For the Poisson approximation, we derive the Chen-Stein bounds explicitly. We also introduce a normal approximation. Then, we consider clumps instead of occurrences. After presenting the exact distribution, the binomial approximation, and the Poisson approximation, we introduce the compound Poisson distribution. Although the compound Poisson distribution is used to compute the distribution of the number of occurrences, clumps are modelled explicitly. Therefore, we include this approximation in Section 3.3 about clumps. We finish by deriving an asymptotic normal distribution. In the same order, we discuss the statistics for multiple words and clumps of multiple words. The whole chapter serves as a basis for treating bigger sets of words as encoded by PFMs.

Preliminaries We briefly repeat the notation from the last chapter and introduce some new definitions. The random sequence $X = X_1, \dots, X_n \in \mathcal{A}^n$ consists of nucleotides assumed to be i.i.d. over the alphabet $\mathcal{A}$. The alphabet $\mathcal{A}$ is a set $\{0, 1, \dots, |\mathcal{A}| - 1\}$, where $|\mathcal{A}| = 4$ for DNA. For better readability, we sometimes refer to $0 \in \mathcal{A}$ as 'A', $1 \in \mathcal{A}$ as 'C', $2 \in \mathcal{A}$ as 'G', and $3 \in \mathcal{A}$ as 'T'. Each position $X_i$ has the nucleotide distribution $\mu$, which is a map (of the $\sigma$-algebra of) $\mathcal{A} \to [0, 1]$. We also write $\mu(w)$ for the probability of a word $w = w_1, \dots, w_\ell$:

\[
\mu(w) = \prod_{\kappa=1}^{\ell} \mu(w_\kappa).
\]

Note that we use Greek letters for indices within a word.
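As a small illustration of the word probability under the i.i.d. model, the product over the letters of a word can be computed as follows. The function name and the nucleotide distribution are assumptions made for this sketch; nucleotides are written as letters rather than as the integers 0 to 3.

```python
from math import prod

def word_probability(word, mu):
    """mu(w): product of the single-letter probabilities of the word."""
    return prod(mu[letter] for letter in word)

# Hypothetical nucleotide distribution mu (probabilities sum to 1).
mu = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
p = word_probability("GCCAA", mu)
print(p)  # 0.2 * 0.2 * 0.2 * 0.3 * 0.3
```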

An occurrence of w is the event of w starting at any position i in the sequence X. The binary random variables Yi(w) indicate this event:

\[
Y_i(w) :=
\begin{cases}
1 & \text{if } X_i, X_{i+1}, \dots, X_{i+\ell-1} = w_1, w_2, \dots, w_\ell, \\
0 & \text{otherwise.}
\end{cases}
\tag{3.1}
\]

Hence, the number $N_n(w)$ of occurrences of $w$ in a random sequence of length $n$ is given by $N_n(w) = \sum_{i=1}^{n-\ell+1} Y_i(w)$. This is the key random variable. This chapter reviews different approaches to compute the distribution $\mathcal{L}(N_n(w))$ and its properties, such as its first two moments. We always assume the parameters of the sequence model (basically $\mu$) to be given and hence not to be estimated. Otherwise, the asymptotic distributions change (Lundstrom, 1990; Prum et al., 1995; Waterman, 2000; Robin et al., 2005).
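The count $N_n(w)$ and its expectation can be illustrated with a short Python sketch. Overlapping occurrences are counted, as the definition of $Y_i(w)$ implies, and the expectation $(n - \ell + 1)\,\mu(w)$ follows from linearity of expectation. All function names, the uniform distribution, and the simulation parameters are assumptions for this example only.

```python
import random

def count_occurrences(sequence, word):
    """N_n(w): number of (possibly overlapping) occurrences of word."""
    ell = len(word)
    # Y_i(w) = 1 exactly when the window starting at i equals the word.
    return sum(sequence[i:i + ell] == word
               for i in range(len(sequence) - ell + 1))

def expected_count(n, word, mu):
    """E[N_n(w)] = (n - ell + 1) * mu(w); assumes n >= len(word)."""
    prob = 1.0
    for letter in word:
        prob *= mu[letter]
    return (n - len(word) + 1) * prob

def simulated_mean(n, word, mu, trials=2000, seed=0):
    """Monte Carlo estimate of E[N_n(w)] from i.i.d. random sequences."""
    rng = random.Random(seed)
    letters = list(mu)
    weights = [mu[a] for a in letters]
    total = 0
    for _ in range(trials):
        seq = "".join(rng.choices(letters, weights=weights, k=n))
        total += count_occurrences(seq, word)
    return total / trials

uniform = {a: 0.25 for a in "ACGT"}  # hypothetical nucleotide distribution
print(count_occurrences("AAAA", "AA"))    # overlapping windows are counted
print(expected_count(200, "AA", uniform))
print(simulated_mean(200, "AA", uniform))
```

The simulated mean should lie close to the exact expectation, while the variance requires the overlap structure of the word and is therefore not covered by this sketch.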

Chen-Stein Error Bounds For (compound) Poisson distributions, one can bound the approximation error using the Chen-Stein method (Chen, 1975). For a good introduction with many examples, see Arratia et al. (1989, 1990) and Barbour et al. (1992). Furthermore, Barbour and Chryssaphinou (2001) give a guide to using compound Poisson distributions as approximations. The approximation error is quantified in terms of the total variation distance. Let $U$ and $V$ be any two random processes taking values in the same space $E$; then the total variation distance between their distributions is (Barbour et al., 1992)

\[
d_{TV}(\mathcal{L}(U), \mathcal{L}(V)) = \sup_{D \subset E} |P(U \in D) - P(V \in D)|. \tag{3.2}
\]

The subsets D are assumed to be measurable. For E = N all subsets D ⊆ E are measurable and the total variation distance can be written as

\[
d_{TV}(\mathcal{L}(U), \mathcal{L}(V)) = \frac{1}{2} \sum_{i \geq 0} |P(U = i) - P(V = i)|
\]
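For distributions on the non-negative integers, the total variation distance can be evaluated directly from the probability mass functions. The following Python sketch is illustrative only: the helper names are assumptions, the Poisson distribution is truncated at a finite `kmax` (so its tail mass is ignored), and the binomial parameters are arbitrary example values.

```python
import math

def total_variation(p, q):
    """Half the L1 distance between two pmfs given as {value: probability}."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(i, 0.0) - q.get(i, 0.0)) for i in support)

def total_variation_sup(p, q):
    """Supremum form: attained at D = {i : P(U = i) > P(V = i)}."""
    support = set(p) | set(q)
    return sum(max(p.get(i, 0.0) - q.get(i, 0.0), 0.0) for i in support)

def binomial_pmf(n, prob):
    return {k: math.comb(n, k) * prob**k * (1 - prob)**(n - k)
            for k in range(n + 1)}

def poisson_pmf(lam, kmax):
    # Truncated at kmax; the neglected tail mass is tiny for kmax >> lam.
    return {k: math.exp(-lam) * lam**k / math.factorial(k)
            for k in range(kmax + 1)}

# Hypothetical example: binomial counts versus their Poisson approximation.
d = total_variation(binomial_pmf(100, 0.02), poisson_pmf(2.0, 100))
print(f"d_TV(Binomial(100, 0.02), Poisson(2)) is approximately {d:.4f}")
```

For two proper probability mass functions, `total_variation` and `total_variation_sup` return the same value, which reflects the equivalence of the supremum form and the sum form.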