
Traits are often described as being either Mendelian or complex (Lander and Schork, 1994). Mendelian traits follow the classical pattern of dominant or recessive inheritance and are typically caused by mutation at a single genetic locus, resulting in impaired function or loss of function of that gene. Although rare, high myopia, in the presence or absence of other ocular or systemic pathology, can sometimes be classed as a Mendelian trait (Tang, Yap and Yip, 2008). Complex traits, on the other hand, can be under the influence of multiple genetic loci, gene-gene and gene-environment interactions, and non-genetic effects (Lander and Schork, 1994). Examples of complex traits include common trait variations within the population such as eye colour, height and refractive error. The following are the primary genetic features that need to be considered when attempting to discover the genetic causes of complex traits.

1.2.1 Genetic variants

The majority of phenotypic differences between individuals arise as a result of differences in our genomes. These genetic differences can occur in the form of single nucleotide substitutions in the genome, or more complex insertions/deletions of one or more nucleotides (Frazer et al., 2009).

The simplest and most frequently occurring type of genetic variant is a single nucleotide polymorphism (SNP) (Wang et al., 1998). This is where the commonly occurring nucleotide at a specific locus in the genetic sequence is substituted for an alternative nucleotide. For example, at a specific position in the human genome, most of the population may possess the ‘A’ nucleotide but some individuals may instead possess an alternative nucleotide at that position (e.g. ‘T’). The different variant nucleotides are referred to as alleles. Often, SNPs do not cause changes to the amino acid sequence (so-called “synonymous” variants). However, there are many occasions where a SNP does alter the amino acid sequence (“non-synonymous” variants) (Hunt et al., 2009). This occurs if the variant alters a codon, a sequence of three nucleotides, to one that codes for an alternative amino acid. Whether synonymous or not, either type of SNP can result in subtle variations to an

known to exist throughout the human genome with many more yet to be discovered. SNPs are defined as being common if the rarer allele is present in at least 1% of the population (Wang et al., 1998).
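The distinction between synonymous and non-synonymous SNPs can be illustrated with a minimal Python sketch. It assumes the standard genetic code and a substitution that falls within a single known codon; the function name `classify_snp` and the example codons are hypothetical and chosen only for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code; amino acids are listed in TCAG codon order ("*" = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(codon, position, alt_base):
    """Classify a single-base substitution inside one codon (hypothetical helper).

    codon:    three-letter reference codon, e.g. "GAA"
    position: 0-based index of the substituted base within the codon
    alt_base: the alternative nucleotide
    """
    mutated = codon[:position] + alt_base + codon[position + 1:]
    if CODON_TABLE[codon] == CODON_TABLE[mutated]:
        return "synonymous"
    return "non-synonymous"


# GAA and GAG both encode glutamate, so the amino acid is unchanged.
print(classify_snp("GAA", 2, "G"))
# GAA (glutamate) to GTA (valine) changes the encoded amino acid.
print(classify_snp("GAA", 1, "T"))
```

The sketch ignores strand orientation, splicing and stop-codon effects, all of which matter when annotating real variants.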

Other more complex variations occurring in the genome include INDELs and copy number variations (CNVs). These sources account for up to 20% of all variations in the human genome (Frazer et al., 2009). INDELs refer to insertions and/or deletions of nucleotides in the genetic sequence. These modifications can be as small as a single nucleotide or cover several hundred nucleotides (Mullaney et al., 2010). Unlike SNPs, INDELs have the potential to cause frame-shift mutations. Frame-shift mutations occur when the number of additional (or deleted) nucleotides is not a multiple of three (the length of a codon), resulting in a change to the codon sequence starting from the site of the INDEL, thus altering the amino acid sequence from this point and ultimately the functionality of the affected gene (Mullaney et al., 2010). Copy number variants (CNVs) are structural variations whereby there is either a deletion or replication of an extensive section of genomic sequence, usually of at least one kilobase (kb) in length (Redon et al., 2006). Such alterations in copy number can result in functional effects due to altered gene expression resulting from loss of one or both copies of a gene, or the presence of additional copies of that gene.
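The frame-shift rule described above (an INDEL length that is not a multiple of three) can be sketched in a few lines of Python. The helper names `is_frameshift` and `codons`, and the toy sequence, are assumptions made purely for illustration.

```python
def is_frameshift(ref_allele, alt_allele):
    """Return True when the net inserted or deleted length is not a multiple of three."""
    return abs(len(alt_allele) - len(ref_allele)) % 3 != 0


def codons(sequence):
    """Split a coding sequence into complete codons, dropping any trailing bases."""
    usable = len(sequence) - len(sequence) % 3
    return [sequence[i:i + 3] for i in range(0, usable, 3)]


reference = "ATGGCCGAAACC"
# Hypothetical single-base insertion of "T" after the first codon.
with_insertion = reference[:3] + "T" + reference[3:]

print(codons(reference))       # reading frame before the insertion
print(codons(with_insertion))  # every codon after the insertion site is altered
print(is_frameshift("A", "AT"))    # 1 extra base: frame-shift
print(is_frameshift("A", "ATGC"))  # 3 extra bases: reading frame preserved
```

Comparing the two codon lists shows that only the first codon survives the insertion, which is why frame-shifts tend to disrupt the whole downstream amino acid sequence.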

1.2.2 Genotyping and Imputation

Ascertaining the genotype of a specific individual at a particular genetic locus can be performed either directly (“genotyped”) or by inference (“imputed”). Large-scale direct genotyping is usually achieved by using SNP arrays, which are able to genotype individuals at hundreds of thousands, if not millions, of SNP loci simultaneously (Rabbee and Speed, 2006). SNPs that “tag” complex variants such as INDELs are increasingly being included in genotyping arrays. Genotyping is not the same as reading the entire DNA sequence of an individual: only known locations of genetic variation are assessed during genotyping, and it therefore will not detect all SNP variants (LaFramboise, 2009). Sequencing of the whole genome is still, to date, more costly, and therefore array-based genotyping is still widely used for larger-scale studies in order to collect data about common variants spread throughout the human genome.

As current genotyping methods cover only a small fraction of all known variants, imputation methods can be applied in order to increase the number of variants available for phenotypic association testing (Howie, Donnelly and Marchini, 2009). Imputation is the inference of unknown genomic variants based on other known variants (Marchini and Howie, 2010). Reference panels are required for imputation, with the most commonly used panels being those from the International HapMap Consortium and the 1000 Genomes Project (International HapMap et al., 2010; The 1000 Genomes Project Consortium et al., 2012). These reference panels each contain genomic maps for hundreds, and now thousands, of individuals, with 2,504 genomes included in Phase 3 of the 1000 Genomes Project (Sudmant et al., 2015). These genomic maps span multiple population ancestries and were obtained through genotyping and sequencing methods, respectively.

For each variant, imputation aims to identify which one of two alleles would most likely be present in a particular individual at that locus. Imputation software such as IMPUTE2 and MaCH essentially match haplotypes between the individuals requiring imputation and the reference panels. The alleles chosen at each untyped variant are those present on the best fitting reference haplotype (Howie et al., 2009; Li et al., 2010). One of the major limitations of imputation relates to variants with low minor allele frequency (MAF). These rarer variants occur less frequently in the reference panels and therefore, there is reduced accuracy when imputing these SNPs due to incomplete haplotype matching between the individual and the reference panels. To aid users, imputation software also provides metrics relating to the certainty of imputation for each variant. This enables researchers to exclude variants that are of poor imputation quality, as poorly imputed variants can affect the accuracy of subsequent association analyses.
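The idea of filling untyped variants from the best-fitting reference haplotype can be illustrated with a deliberately simplified Python sketch. It is not the algorithm used by IMPUTE2 or MaCH, which rely on probabilistic models rather than a single best match; the function `impute_haplotype`, the 0/1 allele coding and the toy panel are all assumptions made for illustration.

```python
def impute_haplotype(target, reference_panel):
    """Fill untyped sites (None) in `target` from the closest reference haplotype.

    target:          list of alleles coded 0/1, with None at untyped sites
    reference_panel: list of fully typed haplotypes of the same length
    """
    def mismatches(reference):
        # Count disagreements at the sites that were actually genotyped.
        return sum(
            1 for t, r in zip(target, reference) if t is not None and t != r
        )

    best = min(reference_panel, key=mismatches)
    return [r if t is None else t for t, r in zip(target, best)]


# Toy reference panel of three haplotypes across five variant sites.
panel = [
    [0, 0, 1, 0, 1],
    [1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1],
]
# Sites 2 and 4 (indices 1 and 3) were not on the genotyping array.
observed = [0, None, 1, None, 1]

print(impute_haplotype(observed, panel))
```

In this toy case the first reference haplotype matches every typed site, so its alleles are copied into the untyped positions. A rare allele carried by few or no reference haplotypes would have little chance of being copied, which mirrors the loss of accuracy at low MAF described above.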

1.2.3 Linkage Disequilibrium

As mentioned above, during imputation, untyped variants are inferred through knowledge of haplotypes (Marchini and Howie, 2010). These, in turn, arise through patterns of linkage disequilibrium (LD) throughout the genome (Wall and Pritchard, 2003).

Linkage disequilibrium refers to the non-random statistical correlation between alleles at two loci (Goldstein and Weale, 2001; Slatkin, 2008). This means that if two variants are in strong LD with each other, and one of these variants has demonstrated significant association with a particular trait, association of the other variant with the trait can be inferred with near certainty, and it may be that this alternative locus is the actual causal variant (Hirschhorn and Daly, 2005; Wang et al., 2010; Hormozdiari et al., 2015). This is because variants in strong LD are usually inherited together as part of the same “haplotype block” (Goldstein and Weale, 2001; Wall and Pritchard, 2003). Complete linkage equilibrium, on the other hand, occurs when the presence of an allele at one locus is completely independent of that at another locus.

The magnitude of LD between two loci can be determined by examining the relationship between the frequencies of alleles at these loci and the frequencies of the possible haplotypes (Wall and Pritchard, 2003). One commonly used measure of LD is the r² value, which represents the square of the statistical correlation between the two loci (Devlin and Risch, 1995). Values range between one (complete linkage disequilibrium) and zero (complete linkage equilibrium). The r² value is calculated by applying Equation 1.1.

$$ r^2 = \frac{(x_{AB}\,x_{ab} - x_{Ab}\,x_{aB})^2}{p_A\,p_a\,q_B\,q_b} $$

Equation 1.1: Calculating LD between two biallelic loci. r² = linkage disequilibrium between loci p and q; p_A = frequency of allele A at the first locus (p); p_a = frequency of the alternative allele (a) at the first locus (p); q_B = frequency of allele B at the second locus (q); q_b = frequency of the alternative allele (b) at the second locus (q); x_AB, x_ab, x_Ab and x_aB refer to the probabilities of the four possible haplotypes (x). Adapted from Devlin and Risch (1995).
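As a worked illustration of Equation 1.1, the following Python sketch computes r² from the four haplotype frequencies. The function name `r_squared` and the example frequencies are hypothetical; the sketch assumes the four frequencies sum to one and that both loci are polymorphic, so that the denominator is non-zero.

```python
def r_squared(x_AB, x_Ab, x_aB, x_ab):
    """Compute r^2 between two biallelic loci from haplotype frequencies."""
    # Allele frequencies are the marginal sums of the haplotype frequencies.
    p_A = x_AB + x_Ab
    p_a = x_aB + x_ab
    q_B = x_AB + x_aB
    q_b = x_Ab + x_ab
    numerator = (x_AB * x_ab - x_Ab * x_aB) ** 2
    return numerator / (p_A * p_a * q_B * q_b)


# Only haplotypes AB and ab exist: the alleles are perfectly correlated.
print(r_squared(0.5, 0.0, 0.0, 0.5))
# All four haplotypes equally frequent: the loci are independent.
print(r_squared(0.25, 0.25, 0.25, 0.25))
# An intermediate, made-up case.
print(r_squared(0.4, 0.1, 0.1, 0.4))
```

The first two calls reproduce the extremes described in the text, giving one for complete linkage disequilibrium and zero for complete linkage equilibrium.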
