
Association studies offer an alternative strategy to study genetic factors involved in complex psychiatric disorders. Historically, genetic association analyses have been conducted in a population-based setting (Silverman and Palmer, 2000), where the aim of the association study has been to demonstrate a significantly different distribution of allelic variants in affected (case) and unaffected (control) individuals. The basic unit of analysis in association studies is the individual, who can be included regardless of the status of their other family members.
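As an illustration of what "a significantly different distribution of allelic variants" can mean in practice, the following minimal sketch compares allele counts in cases and controls with a Pearson chi-square test on a 2x2 table. The function name and the counts are hypothetical, and the test shown is only one of several that could be used for this comparison.

```python
import math


def allelic_chi_square(case_counts, control_counts):
    """Pearson chi-square test on a 2x2 table of allele counts.

    case_counts and control_counts are (count of allele 1, count of allele 2).
    Returns (chi2, p_value) for 1 degree of freedom.
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one allele")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom the chi-square tail probability
    # equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical data: 200 cases and 200 controls, i.e. 400 alleles per group.
chi2, p_value = allelic_chi_square((240, 160), (200, 200))
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")
```

Counting alleles rather than individuals doubles the nominal sample size, which is only appropriate when genotypes are in Hardy-Weinberg equilibrium; a genotype-based test would be the more cautious choice otherwise.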

Association studies can be conducted using one of two strategies, both reliant on the CD/CV hypothesis (section I.3.2.3). Candidate gene association analyses investigate variants within a particular genomic region, based on physiological, biochemical or pharmacological evidence.

These investigations take advantage of the increased power of association studies to detect genes of moderate effect, whilst taking account of the current biological understanding of the tissues, proteins and genes likely to play a role in the pathogenesis of OCD. An alternative to candidate gene-based analyses, known as genome-wide association analysis, involves screening the entire genome for causal genetic variants. No prior assumptions are made regarding the location of the susceptibility variants, so the procedure represents an unbiased, systematic approach to identifying the causal variants (Hirschhorn and Daly, 2005). Genetic association studies pertaining directly to OCD will be discussed in section I.6.

In both of these association methods, the usefulness of the selected marker depends on its ability to identify the susceptibility allele. This is achieved by exploiting the preferential association between the marker and susceptibility loci, due to a characteristic known as linkage disequilibrium.

I.3.2.5.1. Linkage Disequilibrium (LD)

LD refers to the non-random statistical association of sequence variants along an individual chromosome, which results in the alleles of closely linked loci co-segregating across a population more frequently than would be expected by chance. This represents a powerful tool for investigating population history, human evolution and the genetic aetiology of complex disorders (Jorde et al., 1995; Kidd et al., 1998).

In LD mapping, a group of affected individuals, descended from a single founder, form part of a large multigenerational pedigree of which all initial generations, except the current few, are missing. Numerous meiotic and recombination events have therefore occurred, narrowing the region of DNA that possesses the susceptibility allele. The ability to identify genetic components of complex phenotypic variation depends to a large extent on our knowledge of how different parts of the genome are correlated. Focussing on LD and haplotype analyses (discussed further on) will afford a unique insight into these processes.
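The narrowing effect of repeated recombination can be illustrated with the standard textbook relationship D_t = D_0 (1 - c)^t, where c is the recombination fraction between two loci and t the number of generations. This relationship is not taken from the text above; it is a sketch that assumes random mating, a large population and no selection, drift or new mutation, and the parameter values are hypothetical.

```python
def ld_after_generations(d0, recomb_fraction, generations):
    """Expected disequilibrium coefficient D after a number of generations.

    Assumes random mating and an effectively infinite population, so that
    recombination is the only force eroding the initial value d0.
    """
    if not 0.0 <= recomb_fraction <= 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5]")
    return d0 * (1.0 - recomb_fraction) ** generations


# Hypothetical founder chromosome: maximal D of 0.25, 100 generations ago.
tight = ld_after_generations(0.25, 0.001, 100)   # marker about 0.1 cM away
loose = ld_after_generations(0.25, 0.01, 100)    # marker about 1 cM away
print(f"c = 0.001: D = {tight:.4f}; c = 0.01: D = {loose:.4f}")
```

Under these assumptions the tightly linked marker retains most of its association with the founder allele while the more distant one loses roughly two thirds of it, which is why only a narrow region around the susceptibility allele remains in LD.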

i. Measures of LD

The majority of LD measurements represent the pairwise association between markers (Pritchard and Przeworski, 2001), with the most widely used being the absolute value of the normalised disequilibrium coefficient, |D’| (Hedrick, 1987; Lewontin, 1964), and the square of the correlation coefficient, r² (Hill and Robertson, 1968). Both of these measurements are derived from the pairwise LD coefficient, D, but have slightly different interpretations (Wall and Pritchard, 2003).

A value of |D’| = 1 indicates the lack of recombination between the two loci under investigation (complete LD), whereas |D’| < 1 represents a disruption in LD sometime in the past. However, because D’ is normalised for allele frequencies, |D’| = 1 whenever no more than three of the four possible haplotypes are present (i.e. the alleles in LD with one another do not have to possess the same allele frequencies) (Weiss and Clark, 2002). Although |D’| does not depend on allele frequencies per se, it does depend on the size of the sample under study: values of |D’| have been found to be inflated in small samples, even when the loci are in linkage equilibrium (Gabriel et al., 2002). Moreover, intermediate values of |D’| are difficult to interpret, and have been found to vary in simulations for pairs of sites at a given distance (Wall and Pritchard, 2003).

The value of r² represents the statistical correlation between two sites. The value of r² = 1 if, and only if, no historical recombination has occurred and the markers have the same allele frequencies (i.e. only two out of the four possible haplotypes are observed in the sample). The value of r² is useful in that it is indicative of the power of the LD study: the inverse of r² represents the factor by which the sample size should be increased to detect statistically significant association between the marker locus and disease, compared with genotyping the susceptibility variant itself, providing a rough guide as to the usefulness of a given level of LD (Ardlie et al., 2002; Weiss and Clark, 2002). Higher values of r² are indicative of a greater ability of one SNP to predict the behaviour of the SNP in LD with it.
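The difference between the two measures can be made concrete with a short calculation. The sketch below uses the standard definitions D = f11 - pA·pB, D’ = D/Dmax and r² = D²/(pA(1 - pA)pB(1 - pB)); the function name and the haplotype frequencies are hypothetical and serve only as an illustration.

```python
def pairwise_ld(f11, f10, f01, f00):
    """Return (D, |D'|, r2) for two biallelic loci.

    f11, f10, f01, f00 are the frequencies of the four possible haplotypes,
    where the first digit is the allele at locus A and the second at locus B.
    """
    if abs(f11 + f10 + f01 + f00 - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p_a = f11 + f10          # frequency of allele 1 at locus A
    p_b = f11 + f01          # frequency of allele 1 at locus B
    d = f11 - p_a * p_b
    # D is normalised by the largest value it could take given p_a and p_b.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Three of the four haplotypes present: |D'| = 1 although r2 is only 0.25.
d3, dprime3, r2_3 = pairwise_ld(0.5, 0.0, 0.3, 0.2)
# Two haplotypes present and equal allele frequencies: |D'| = 1 and r2 = 1.
d2, dprime2, r2_2 = pairwise_ld(0.6, 0.0, 0.0, 0.4)
# Approximate factor by which the sample size would need to grow (1 / r2).
inflation = 1.0 / r2_3
print(f"|D'| = {dprime3:.2f}, r2 = {r2_3:.2f}, sample size factor = {inflation:.1f}")
```

In the first hypothetical example a marker in complete LD by the |D’| criterion would still require roughly four times the sample size that direct genotyping of the susceptibility variant would need.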

Another advantage of using r² is that it is related to the average recombination fraction in the population, which summarises LD over a particular genomic region and not just between pairs of markers (Pritchard and Przeworski, 2001). A further advantage of using r² to measure LD is that it is comparable across studies. However, because r² is sensitive to allele frequency, two markers that are adjacent to one another may yield different r² values with a third marker (Ardlie et al., 2002). It is therefore at the discretion of the investigator to decide on the most appropriate measurement of LD to use in the study conducted. In the present study, both D’ and r² values are reported, in order to allow the reader a comprehensive view of pairwise LD between the markers investigated.

ii. LD in genetic association studies

LD patterns are useful in association studies because they impart knowledge regarding the genetic distance over which signals of causation may be generated in case-control studies.

Furthermore, they facilitate the identification of the susceptibility allele by identifying the neighbourhood surrounding the variant (Risch and Merikangas, 1996). However, it should be noted that there are various factors that disturb the relationship between LD and distance, both evolutionary (e.g. population dynamics and natural selection [Nordborg et al., 2002; Reich and Lander, 2001; Kruglyak, 1999[a]; Pritchard and Przeworski, 2001]) and genetic (recombination, inversion [Pritchard and Przeworski, 2001] and conversion polymorphisms [Langley et al., 2000; Ardlie et al., 2002; Frisse et al., 2001], genetic drift and mutation rate [Terwilliger et al., 1998; Ardlie et al., 2002]). Indeed, markers that are closely linked have been found to exhibit low levels of LD, or none at all (Clark, 1998; Moffat et al., 2000; Ardlie et al., 2002; Kidd et al., 2000; Rieder et al., 1999; Templeton et al., 2000), whilst relatively high levels of LD have been observed between markers that are comparatively far apart from one another (Collins et al., 1999; Abecasis et al., 2001; Reich et al., 2001; Stephens et al., 2001; Gordon et al., 2000).

iii. The importance of demographic history in LD association studies

It is clear that patterns of LD in the human genome are strongly shaped by evolutionary history; in turn, each disease has its own genetic architecture, shaped by aspects of population dynamics and history. The demographic history of any population is complex, with each population experiencing differing degrees of isolation, migration, admixture, expansion and bottlenecks (Ardlie et al., 2002), aspects of which will inevitably remain unknown. This underscores the critical importance of characterising the LD landscape in the region of interest in the population under investigation.

Significantly higher levels of LD have been noted in “younger”, more recently founded populations (Jorde et al., 2000; Peltonen et al., 2000; Puffenberger et al., 1994; Hall et al., 2002), implying a large degree of LD over longer stretches of the genome. For example, LD has been found to extend over much longer regions in younger, non-African populations, probably reflecting the loss of genetic variation caused by the bottleneck that occurred when modern humans migrated out of Africa (Weiss and Clark, 2002; Frisse et al., 2001; Reich et al., 2001; Tishkoff et al., 1996; Wall, 2001). Such populations may be useful in the coarse-mapping of disease susceptibility alleles (i.e. identifying the region in which the allele may be situated), but will not be amenable to fine-mapping procedures. Older populations, on the other hand, exhibit less LD and larger amounts of recombination over shorter genomic regions, thus facilitating fine-mapping procedures (Jorde et al., 2000; Wilson and Goldstein, 2000).

I.3.2.5.2. Haplotype association analysis

Single-marker investigations may provide little information regarding an association: although the markers may be situated in the candidate gene, they may not be in LD with the susceptibility allele. A haplotype refers to a specific combination of alleles that co-occur on an individual chromosome, and therefore share a common evolutionary history.

Single nucleotide polymorphisms (SNPs), genotyped sequentially over the length of a chromosome, can be ordered into haplotypes. Such haplotype scanning may provide more information regarding variation within specific genomic fragments and interrelationships between polymorphisms in the surrounding regions, thereby imparting a greater amount of power to the study (Akey et al., 2001). This is because the historical crossover points can be analysed with greater accuracy from preserved and non-preserved portions of the mutation-bearing chromosome, which will, in turn facilitate localisation of the disease allele.

It is also possible that alleles at several SNPs jointly influence susceptibility to a disease by influencing regulation and/or functioning of the susceptibility variants, or that the alleles may act in combination with one another (much like a “super-allele”) to precipitate the phenotype, or certain aspects of the phenotype. Indeed, haplotype association studies have allowed the successful localisation of susceptibility genes in Hirschsprung disease (Puffenberger et al., 2004) and Crohn’s disease (Hugot et al., 2001; Rioux et al., 2001), and in locating candidate susceptibility regions in schizophrenia (Shifman et al., 2002) and cerebral malaria (Burgner et al., 2003). In addition, the merits of haplotype analyses in association studies have been illustrated using known associations between the apolipoprotein E locus and Alzheimer’s disease (Fallin et al., 2001), and adenine phosphoribosyltransferase gene and adenine phosphoribosyltransferase deficiency (Kuno et al., 2004).

i. Haplotype inference

The investigation and subsequent analysis of haplotype data rests on the assumption that the haplotypic phase information pertaining to the individuals in the study is available.

Ambiguous haplotypes can be resolved using data from relatives or genealogical information, allowing one to infer ancestral haplotype compositions. However, these methods are often costly (due to extra genotyping efforts) and impractical in population-based case-control association studies, where there is usually limited access to any kind of family genetic data.

An alternative would be to employ laboratory-based molecular haplotyping methodology, such as chromosomal localisation, single-molecule dilution or allele-specific polymerase techniques (Ruano and Kidd, 1989; Clark et al., 1998; Ruano et al., 1990; Michalatos-Beloin et al., 1996), which are also expensive and technologically demanding (Niu et al., 2002).

The solution therefore seems to be to predict the haplotype phase of unrelated, diploid individuals probabilistically, based on estimated allele frequencies from the population.

Several assumption- and likelihood-based methods have been created. These can be roughly divided into three major categories, based on the algorithm employed: Clark’s algorithm (Clark, 1990); the expectation-maximisation (EM) algorithm (Dempster et al., 1977; Hawley and Kidd, 1995; Long et al., 1995; Excoffier and Slatkin, 1995); and the coalescent-based algorithm (implemented in the program “Phase” [Stephens et al., 2001]). Recently, several other methods, mostly based on the three aforementioned ones, have been created and successfully used to infer haplotype phase in genetic association studies (Zollner and Pritchard, 2005; Niu et al., 2002; Qin et al., 2002; Gusfield, 2001).

Clark’s algorithm assigns haplotypes to phase-unambiguous (i.e. homozygotes or single-site heterozygotes) individuals first. For each unresolved, ambiguous haplotype, the aim is to determine whether the known haplotype can be formed from some combination of the ambiguous sites (hence the “subtraction method” as the alternative name for this method).

Each time a haplotype is inferred in this way, it is viewed as another potential unambiguous haplotype from which the ambiguous haplotypes can be inferred. This chain of inference continues until all haplotypes have been recovered, or until one identifies a sequence that cannot be derived from any of the known haplotypes (Clark, 1990; Clark et al., 1998).
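The chain of inference described above can be sketched in a few lines. This is a simplified illustration of the subtraction idea rather than Clark's published implementation: genotypes are coded per site as 0 or 2 (homozygous) and 1 (heterozygous), haplotypes as tuples of 0/1, and all function names and data are hypothetical.

```python
def resolve_unambiguous(genotype):
    """Return the haplotype pair if at most one site is heterozygous."""
    if sum(1 for g in genotype if g == 1) > 1:
        return None
    hap1 = tuple(g // 2 for g in genotype)                      # het site -> 0
    hap2 = tuple(1 if g == 1 else g // 2 for g in genotype)     # het site -> 1
    return hap1, hap2


def complement(genotype, hap):
    """Return the partner of hap within genotype, or None if incompatible."""
    partner = []
    for g, h in zip(genotype, hap):
        if g == 1:
            partner.append(1 - h)      # heterozygous: partner carries the other allele
        elif g // 2 == h:
            partner.append(h)          # homozygous: both haplotypes must match
        else:
            return None
    return tuple(partner)


def clark_phase(genotypes):
    """Resolve genotypes by repeated subtraction of known haplotypes."""
    resolved, known = {}, []
    for idx, genotype in enumerate(genotypes):
        pair = resolve_unambiguous(genotype)
        if pair is not None:
            resolved[idx] = pair
            known.extend(h for h in pair if h not in known)
    progress = True
    while progress:
        progress = False
        for idx, genotype in enumerate(genotypes):
            if idx in resolved:
                continue
            for hap in known:
                partner = complement(genotype, hap)
                if partner is not None:
                    resolved[idx] = (hap, partner)
                    if partner not in known:
                        known.append(partner)   # newly inferred haplotype joins the pool
                    progress = True
                    break
    unresolved = [i for i in range(len(genotypes)) if i not in resolved]
    return resolved, unresolved


# Hypothetical sample of four individuals typed at three SNPs.
sample = [(0, 0, 0), (2, 2, 0), (1, 1, 0), (1, 1, 1)]
resolved, unresolved = clark_phase(sample)
print(resolved, unresolved)
```

In this sketch the result can depend on the order in which individuals and known haplotypes are visited, and a sample with no phase-unambiguous individuals cannot be started at all.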

The EM algorithm obtains maximum likelihood estimates of haplotype frequencies within the sample, and uses the initial set of frequencies to calculate conditional distributions for haplotype pairs that an individual carries (the expectation step). In the maximisation step, the haplotype frequencies are updated based on haplotypes inferred in the previous step. The EM algorithm iterates between the two steps until the frequency estimates converge. This method may, however, not be viable when analysing a large number of markers, due to the computational burden involved (Fallin and Schork, 2000).
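The two steps can be illustrated with a minimal sketch for a handful of biallelic markers. It is not the implementation used by any of the cited programs: it assumes Hardy-Weinberg proportions, starts from uniform frequencies, codes genotypes per site as 0, 1 or 2 copies of one allele, and uses hypothetical function names and data.

```python
from itertools import product


def compatible_pairs(genotype):
    """All unordered haplotype pairs consistent with a 0/1/2-coded genotype."""
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    base = [g // 2 for g in genotype]
    pairs = set()
    for bits in product((0, 1), repeat=len(het_sites)):
        hap1, hap2 = list(base), list(base)
        for site, bit in zip(het_sites, bits):
            hap1[site], hap2[site] = bit, 1 - bit
        pairs.add(tuple(sorted((tuple(hap1), tuple(hap2)))))
    return sorted(pairs)


def em_haplotype_frequencies(genotypes, max_iter=200, tol=1e-9):
    """Estimate haplotype frequencies from unphased genotypes by EM."""
    n_sites = len(genotypes[0])
    haplotypes = list(product((0, 1), repeat=n_sites))   # 2**n_sites candidates
    freqs = {h: 1.0 / len(haplotypes) for h in haplotypes}
    pair_lists = [compatible_pairs(g) for g in genotypes]
    for _ in range(max_iter):
        counts = {h: 0.0 for h in haplotypes}
        for pairs in pair_lists:
            # Expectation step: probability of each compatible pair
            # given the current frequency estimates.
            weights = [(1 if h1 == h2 else 2) * freqs[h1] * freqs[h2]
                       for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), weight in zip(pairs, weights):
                counts[h1] += weight / total
                counts[h2] += weight / total
        # Maximisation step: expected counts become the new frequencies.
        new_freqs = {h: c / (2 * len(genotypes)) for h, c in counts.items()}
        change = max(abs(new_freqs[h] - freqs[h]) for h in haplotypes)
        freqs = new_freqs
        if change < tol:
            break
    return freqs


# Hypothetical sample: eight double homozygotes and two double heterozygotes.
sample = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
estimates = em_haplotype_frequencies(sample)
print({h: round(f, 4) for h, f in estimates.items()})
```

The list of candidate haplotypes grows as 2 to the power of the number of markers, which gives a sense of the computational burden when many markers are analysed together.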

The Phase algorithm uses a combination of the coalescent-based ancestry model and Bayesian-based algorithms to assign phases to the linked loci and estimate haplotype frequencies accordingly (Stephens et al., 2001). The method regards unknown haplotypes as random quantities and aims to evaluate their distribution, given the genotype data. The program also confronts certain population genetics features of haplotype inference by incorporating prior knowledge that the unresolved haplotypes will be more similar to commonly observed, resolved haplotypes. Phase can be applied to both SNP and multiallelic data.

In studies comparing the EM and Phase methods, Zhang et al. (2001) observed no major differences in accuracy between the two methods. Xu et al. (2002) incorporated levels of LD between markers into their study and found that, when LD between the markers was maintained, all three methods performed equally well. However, if LD between the markers was not maintained, Clark’s algorithm did not perform as well as Phase or the EM algorithms.

In a more recent study, Adkins (2004) compared the efficacies of leading computational methods in haplotype inference (including the Phase and EM algorithms) and found that all performed with high accuracy, even when identifying rare haplotypes. The study also observed that haplotype assignment remained accurate among subjects for up to five sites. It is thus clear that there is no agreement as to which algorithm may be best in estimating haplotype frequencies in population-based association studies; it may therefore be more productive for investigators to focus on parameters that decrease the estimation error in computational inference of haplotypes.

According to Fallin and Schork (2000), error in haplotype estimation can be reduced by following a few pointers. Firstly, they advocate using the appropriate set of markers, taking note of LD between them. Some algorithms may not be able to handle the large number of haplotypes if the markers in the study are in linkage equilibrium (Zhao et al., 2003). They also suggest increasing the sample size and decreasing the haplotype ambiguity (i.e. fewer individuals with haplotypes that cannot be resolved) where possible. Finally, increasing the dispersion of haplotype frequency values in a sample can also result in fewer haplotype estimation errors: as haplotype values become less uniform, the difference between the most and least common haplotypes becomes more extreme. The null-frequency haplotypes can thus be accurately predicted as zero, since there will be little evidence from the data for their non-zero frequencies, resulting in more accurate estimation of the commoner haplotypes.

ii. Haplotype blocks and recombination hotspots

Recently, studies investigating several genomic regions have indicated the presence of long chromosomal tracts in which the markers exhibit strong LD and limited haplotype diversity (known as haplotype blocks), separated by areas in which the recombination rate is relatively high (recombination hotspots) (Daly et al., 2001; Johnson et al., 2001; Patil et al., 2001; Gabriel et al., 2002). Due to the limited haplotype diversity within the haplotype blocks, a small number of haplotypes represent the variation in most of the chromosomes within the population. Furthermore, high levels of LD within the blocks signify that some of the markers contain redundant information, allowing the variation within the haplotype blocks to be distinguished by one, or a few, SNPs. Therefore, theoretically, a disease susceptibility gene could be mapped to one of the haplotype blocks using a so-called tag SNP, which would improve the chances of detecting association when only a fraction of the markers are genotyped, saving significantly on time and money (Gabriel et al., 2002; Patil et al., 2001; Johnson et al., 2001).
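The redundancy argument can be illustrated with a simple greedy selection of tag SNPs from phased haplotypes. This is one possible heuristic, not the procedure used in the studies cited above: a SNP is treated as captured when its r² with a chosen tag reaches a threshold, and the threshold of 0.8, the function names and the haplotype data are all assumptions made for the example.

```python
def r_squared(haplotypes, i, j):
    """Squared correlation between the alleles at SNPs i and j."""
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n
    p_j = sum(h[j] for h in haplotypes) / n
    p_ij = sum(h[i] * h[j] for h in haplotypes) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return 0.0                      # monomorphic SNP: correlation undefined
    return (p_ij - p_i * p_j) ** 2 / denom


def greedy_tag_snps(haplotypes, threshold=0.8):
    """Pick tags until every SNP is a tag or is captured by one."""
    n_snps = len(haplotypes[0])
    # captures[i]: SNPs that SNP i would represent (always including itself).
    captures = {
        i: {j for j in range(n_snps)
            if j == i or r_squared(haplotypes, i, j) >= threshold}
        for i in range(n_snps)
    }
    untagged = set(range(n_snps))
    tags = []
    while untagged:
        # Choose the SNP capturing the most still-untagged SNPs;
        # ties go to the lowest index.
        best = max(range(n_snps), key=lambda i: len(captures[i] & untagged))
        tags.append(best)
        untagged -= captures[best]
    return tags


# Hypothetical phased haplotypes at four SNPs; SNPs 0 and 1 are fully redundant.
block = [
    (0, 0, 0, 0),
    (0, 0, 0, 1),
    (1, 1, 1, 0),
    (1, 1, 1, 1),
    (0, 0, 1, 0),
    (1, 1, 0, 1),
]
tags = greedy_tag_snps(block)
print(tags)
```

Here SNP 1 need not be genotyped because SNP 0 carries the same information, whereas SNPs 2 and 3 are only weakly correlated with the others and must each be typed directly.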

Efforts are presently underway, involving the United States National Human Genome Research Institute, in the form of an international initiative, the HapMap project, which aims to delineate the structure and boundaries of common haplotype and LD blocks in the genome, using populations from Africa, Asia and Europe (The International HapMap Consortium, 2003).

However, although the idea of haplotype blocks and recombination hotspots affords an insight into the distribution of LD within the human genome, even they seem to have an erratic distribution (Stephens et al., 2001; Pritchard and Przeworski, 2001), underscoring the importance of investigating the LD landscape for each region of interest, rather than applying a general LD value for that particular region.

I.4. DETERMINING THE VALIDITY OF AN ASSOCIATION: STATISTICAL