The strategies used to identify a disease gene fall into four main categories: functional cloning, positional cloning, the candidate gene approach and a combination of the latter two, the positional candidate approach (Collins, 1995).
The choice of strategy for identifying a disease gene depends on the nature of the disease and the resources available. In general, each strategy aims to identify one or more candidate genes, which must then be tested further for evidence implicating them as the disease gene.
1.4.1 Functional cloning
Functional cloning relies on an understanding of the biochemical basis of a particular disease. If the gene product is known, characterization of the deficient protein or enzyme can lead to cloning of the responsible gene via its cDNA. Partially degenerate oligonucleotides coding for the protein can be prepared and used to screen a cDNA library. The cDNA clone(s) is then used to screen a genomic library to enable characterization of the gene from a genomic clone. Alternatively, a functional assay can be used. In this case, antibodies are raised to the protein of interest and used to screen an expression cDNA library.
Thus a gene can be cloned with no knowledge of its chromosomal position. This approach has been successful in only a few cases, as there are a limited number of human diseases for which the biochemical defect could be directly determined. The genes underlying phenylketonuria, due to deficiency of the enzyme phenylalanine hydroxylase (DiLella et al, 1986), and haemophilia A, due to deficiency of Factor VIII (Gitschier et al, 1984), were successfully identified using the functional cloning approach.
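The design of partially degenerate oligonucleotide probes described above can be illustrated with a short sketch. Because most amino acids are encoded by more than one codon, a stretch of protein sequence corresponds to many possible nucleotide sequences, and probes are usually designed against the stretch with the fewest possibilities. The function names, the window length and the example peptide below are illustrative assumptions, not taken from the studies cited.

```python
# Number of synonymous codons per amino acid in the standard genetic code.
CODON_COUNTS = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def degeneracy(peptide):
    """Return how many distinct nucleotide sequences could encode the peptide."""
    total = 1
    for amino_acid in peptide.upper():
        total *= CODON_COUNTS[amino_acid]
    return total


def least_degenerate_window(protein, length=6):
    """Return (start, peptide, fold) for the window needing the fewest probes.

    `length` is the probe length in amino acids (hypothetical default).
    Ties are resolved in favour of the earliest window.
    """
    if len(protein) < length:
        raise ValueError("protein is shorter than the requested window")
    windows = [
        (degeneracy(protein[i:i + length]), i)
        for i in range(len(protein) - length + 1)
    ]
    fold, start = min(windows)
    return start, protein[start:start + length], fold


# Hypothetical peptide: Met and Trp have a single codon each, so the
# window containing them is the least degenerate.
print(least_degenerate_window("LSRMWKDL", length=3))
```

In this made-up example the window MWK is selected because it can be encoded by only two nucleotide sequences, whereas LSR would require a 216-fold degenerate probe mixture.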
1.4.2 Candidate gene approach
The candidate gene approach relies on some knowledge of the type of gene anticipated to be involved in a particular disorder, but no prior information regarding its chromosomal position. The major limitation of this approach is that predictions of the biochemical functions of an unknown gene are often inaccurate in view of the complexity of disease and its molecular pathology. However, there are several ways in which ‘educated guesses’ for good candidates can be made.
For some disorders, observations on the pathogenesis may immediately suggest candidate genes with an appropriate expression pattern or function. If an animal phenotype shows a striking similarity to a human disorder, then it might result from mutations in the animal ortholog of the disease gene. This approach has been less useful than might be hoped, as mutations in orthologous genes in humans and mice not infrequently produce considerably different phenotypes (Wynshaw-Boris, 1996). New candidate genes may also be suggested where a phenotypically similar disease has previously been mapped to a particular gene. Further investigation of the biochemical pathway, or determining whether the original gene is a member of a multigene family, may provide candidates for phenotypically similar conditions.
The causative gene for a novel form of spinocerebellar ataxia (SCA8) was identified in a completely position-independent way (Koob et al, 1999). Neurodegenerative triplet repeat conditions often display anticipation. Utilizing this knowledge, the authors isolated and cloned a novel CAG expansion on chromosome 13q21 present in a patient with ataxia.
1.4.3 Positional cloning
Positional cloning is the method whereby a disease gene can be isolated knowing only its subchromosomal localization, without using any information about its biochemical function or pathogenesis (Collins, 1995).
The initial localization of the disease gene may come from the presence of visible cytogenetic abnormalities, including translocations, duplications or deletions, or from linkage analysis. Linkage analysis (section 1.3) is the most common means of initial chromosomal localization for inherited diseases. An initial genome-wide search for linkage uses markers spaced at relatively wide intervals (often 10-15 cM) and so will define only a broad candidate region. When a candidate region has been identified, it must be saturated with more closely spaced polymorphic markers. To place the disease locus within the marker framework, crossovers are then identified by analysing haplotypes: disease haplotypes are established in individual pedigrees and recombinants are identified in order to define proximal and distal flanking markers. This defines a candidate region which cannot be narrowed further except by finding new recombinants in other affected families. Linkage disequilibrium (section 1.3.6), when present, can sometimes help to narrow the candidate region further, but it rarely pinpoints the location of a gene to a single clone within a contig.
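The way recombinants define the flanking markers can be sketched in a few lines. The example below is a simplified illustration under stated assumptions: markers are listed in proximal-to-distal order, each affected individual is represented by a list of booleans recording whether the disease haplotype allele is carried at each marker, and each individual's shared segment is a single contiguous run. The marker names and haplotypes are invented.

```python
def critical_interval(markers, affected_haplotypes):
    """Return (proximal_flank, distal_flank, markers_in_interval).

    markers: marker names ordered from proximal to distal.
    affected_haplotypes: one list of booleans per affected individual,
        True where that individual carries the disease haplotype allele.
    Assumes each individual's shared segment is one contiguous run.
    """
    lo, hi = 0, len(markers) - 1
    for haplotype in affected_haplotypes:
        shared = [i for i, carries in enumerate(haplotype) if carries]
        if not shared:
            raise ValueError("individual shares no disease haplotype alleles")
        # A crossover excludes everything outside the shared segment.
        lo = max(lo, shared[0])
        hi = min(hi, shared[-1])
    if lo > hi:
        raise ValueError("recombinants are inconsistent with a single locus")
    proximal = markers[lo - 1] if lo > 0 else None
    distal = markers[hi + 1] if hi < len(markers) - 1 else None
    return proximal, distal, markers[lo:hi + 1]


markers = ["M1", "M2", "M3", "M4", "M5", "M6"]  # hypothetical marker names
haplotypes = [
    [False, True, True, True, True, True],   # proximal recombinant
    [True, True, True, True, False, False],  # distal recombinant
    [True, True, True, True, True, True],    # non-recombinant
]
print(critical_interval(markers, haplotypes))
```

Here the first individual excludes M1 and the second excludes M5 and M6, so the locus lies between the flanking markers M1 and M5. Adding further non-recombinant individuals does not change the result, which mirrors the point that only new recombinants can narrow the region.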
The next step is to construct an ordered contig of clones (usually yeast artificial chromosomes, YACs) across the critical region. With the progress of the Human Genome Mapping Project, ordered contigs of clones are available for most regions of the genome. If these are not available, it is possible to identify YACs by screening publicly available libraries. As YACs contain large inserts (often up to 1 Mb), they are often unstable and prone to rearrangements, so other vectors such as PACs (P1 artificial chromosomes) or BACs (bacterial artificial chromosomes), with smaller insert sizes of 100-200 kb, may be used to define the contig. When a physical map of the interval with overlapping DNA clones has been built, these clones can serve as substrates to identify novel candidate genes.
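Once clone positions are known, choosing a set of overlapping clones that spans the critical region is an interval-covering problem. The sketch below uses a simple greedy choice (always extend with the overlapping clone that reaches furthest) and is only an illustration: the clone names and coordinates are invented, and in practice clone overlaps are established experimentally rather than read from known coordinates.

```python
def tiling_path(clones, start, end):
    """Return a list of clone names covering [start, end] with few clones.

    clones: dict mapping clone name -> (start_kb, end_kb) on a common map.
    Raises ValueError if the clones leave a gap in the interval.
    """
    path = []
    covered = start
    while covered < end:
        best_name, best_end = None, covered
        for name, (clone_start, clone_end) in clones.items():
            # Candidate clones must overlap the region covered so far.
            if clone_start <= covered and clone_end > best_end:
                best_name, best_end = name, clone_end
        if best_name is None:
            raise ValueError(f"gap in contig at {covered} kb")
        path.append(best_name)
        covered = best_end
    return path


# Hypothetical BAC/PAC clones, coordinates in kb.
clones = {"A": (0, 150), "B": (100, 250), "C": (120, 300), "D": (280, 450)}
print(tiling_path(clones, 0, 400))
```

In this example clone B is redundant because clone C overlaps clone A and extends further, so the path is A, C, D.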
Several methods can then be used to identify expressed sequences within these genomic clones. These methods include cDNA library screening using genomic clones from the candidate region as probes, cDNA selection, zoo blotting to seek evolutionarily conserved sequences, CpG island identification to seek regions of undermethylated DNA lying close to genes, exon trapping to find genomic sequences flanked by functional splice signals, and sequence analysis looking for genomic sequences having characteristics of exons. A transcript map of all expressed sequences within the candidate region is then generated, and the most likely candidate genes are screened for the presence of pathogenic mutations in affected individuals.
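Where genomic sequence is available, CpG islands can also be sought computationally. The sketch below scores a sequence by its G+C fraction and by the ratio of observed to expected CpG dinucleotides, where the expected number is (number of C × number of G) / sequence length. The default thresholds (at least 200 bp, G+C above 50%, observed/expected above 0.6) are commonly quoted values and are treated here as adjustable assumptions; the function names are illustrative.

```python
def cpg_stats(seq):
    """Return (gc_fraction, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    c_count, g_count = seq.count("C"), seq.count("G")
    cpg_count = seq.count("CG")
    gc_fraction = (c_count + g_count) / n if n else 0.0
    # Expected CpG count is (C * G) / n, so obs/exp = CpG * n / (C * G).
    if c_count and g_count:
        obs_exp = (cpg_count * n) / (c_count * g_count)
    else:
        obs_exp = 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the threshold test described above (thresholds are assumptions)."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction > min_gc and obs_exp > min_obs_exp


# Artificial sequences used purely to exercise the functions.
print(looks_like_cpg_island("CG" * 100))  # CpG-rich, 200 bp
print(looks_like_cpg_island("AT" * 100))  # no C or G at all
```

A real scan would slide such a test along the genomic clone sequence in windows rather than score one sequence as a whole.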
A positional cloning strategy was used in the identification of the PAX6 gene for aniridia, discussed in section 1.2.1 (Jordan et al, 1992; Hanson et al, 1993). The OA1 gene for ocular albinism was also identified by this method. A novel transcript highly expressed in the retinal pigment epithelium was isolated from the OA1 critical region. A number of mutations were subsequently identified within this novel gene (Bassi et al, 1995).
Positional cloning is laborious, but recently, with the wide availability of transcript maps and the gradual increase in the number of annotated genes (due to the success of the Human Genome Mapping Project), it is increasingly being superseded by the positional candidate approach.
The strongest candidate genes are those which map to the same chromosomal location as the disease gene. This is known as the positional candidate approach, and is the most widely used method of gene identification today.
1.4.4 Positional candidate approach
The positional candidate approach combines the processes used in pure positional cloning and the position-independent candidate approach. It is the most commonly used method of gene identification today (Collins, 1995) and has been facilitated by the rapid expansion of information available to researchers on the public databases, generated in part by the Human Genome Mapping Project. However, as candidate regions usually contain dozens of genes, many of which are identified only as cDNAs or ESTs, it is essential to prioritise candidate genes. This can be achieved by analysing the available data on their spatial and temporal expression patterns (determined by Northern blotting or in situ hybridisation of mRNA in tissue sections), functional domains and homology to relevant genes implicated in other model organisms or similar human phenotypes.
Nonsense-mediated decay can also aid in the prioritization of candidate genes. An mRNA carrying a premature stop codon is often rapidly degraded in vivo. Therefore, a finding of decreased mRNA expression in cultured fibroblasts from affected individuals can suggest a suitable candidate (Hentze and Kulozik, 1999).
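The idea of a premature stop codon can be made concrete with a small sketch that compares a patient coding sequence with a reference and reports whether an in-frame stop codon appears earlier than in the reference. It is a minimal illustration only: it assumes both sequences are already in frame from the start codon, the sequences shown are invented, and it does not model whether a given transcript is actually degraded.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_stop_codon(cds):
    """Return the 0-based codon index of the first in-frame stop, or None."""
    cds = cds.upper()
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return i // 3
    return None


def has_premature_stop(reference_cds, patient_cds):
    """True if the patient sequence stops earlier than the reference."""
    ref_stop = first_stop_codon(reference_cds)
    pat_stop = first_stop_codon(patient_cds)
    return pat_stop is not None and (ref_stop is None or pat_stop < ref_stop)


reference = "ATGAAACGATAA"  # hypothetical: Met-Lys-Arg-stop
patient = "ATGAAATGATAA"    # hypothetical C>T change turns CGA (Arg) into TGA
print(has_premature_stop(reference, patient))
```

A transcript flagged in this way would be a natural one to check for reduced mRNA levels in patient cells.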
A recent example of the use of the positional candidate approach is the identification of the ECM1 gene for lipoid proteinosis (Hamada et al, 2002). Linkage mapping narrowed the candidate region to a 2.3 cM interval at 1q21. A candidate gene approach comparing control and affected gene expression in cultured fibroblasts, followed by direct sequencing of genomic DNA, identified the ECM1 gene.