
4.2 Development of the trees

4.2.5 CHAID tree

After the above strategies to identify SCA11 had proved unsuccessful, the remaining ESTs in the SCA11 candidate interval were determined with reference to Genemap ’99. As mentioned earlier, there are 274 genes and EST clusters in the D15S146-D15S161 interval. The Genemap database was created by different investigators mapping large numbers of existing ESTs, with different database identities and accession numbers, to their respective locations. During this process, many of these ESTs were found to represent the same full-length transcript, as a result of their sequence homology, and hence some of the ‘redundant’ or ‘duplicated’ ESTs were grouped together into the UniGene database. However, some groups of ESTs will have been mapped to the same broad region of the genome and have no sequence homology to each other, but may still represent the same full-length transcript and therefore the same gene; this is especially likely if the ESTs were originally identified by a combination of 5’ and 3’ reads. Hence the first task when considering the ESTs in the SCA11 candidate region was to attempt to identify as many of these redundancies as possible, in order to shorten the list of possible transcripts to be evaluated.

In order to carry out this redundancy search, the list of EST clusters mapping between D15S146 and D15S161 was studied. For each of the clusters, the EST with the longest read was selected. This EST was entered into the HGI reports section of the TIGR database (www.tigr.org/tdb/hgi/searching/reports.html). This facility enables so-called tentative human consensus sequences (THCs) to be identified. These THCs are created by assembling ESTs into virtual transcripts. In some cases, THCs contain full or partial cDNA sequences obtained by classical methods. THCs contain information on the source library and abundance of ESTs and in many cases represent full-length transcripts.

Alternative splice forms are built into separate THCs. THCs are actual assemblies, with a consensus sequence, and not simply clusters of overlapping sequences. Despite this very useful facility, there are a number of important caveats:

1. THCs are only as ‘good’ as the ESTs underlying them; there may be unspliced or chimaeric ESTs and thus THCs.

2. There is still redundancy in the THC set because sequences must match end to end and at a certain percent identity to be combined.

3. Directionality of the THCs should not be assumed.

4. Not all THCs contain protein-coding regions.

In addition to the identification of a THC corresponding to each EST mapping to the SCA11 region, the HGI reports section also gives THCs corresponding to a read from the opposite end of the putative transcript. For example, EST ID AA086045 is part of the 3’-read THC 378691. This THC has information for the corresponding opposite-end THC 336132, which is a 5’-read EST. By following this procedure for each of the EST clusters, a significant number of redundancies were identified, as different ESTs pulled out the same 5’ or 3’ THC. Further, each THC identified was BLASTed against the non-redundant, SwissProt and HTGS portions of GenBank in order to identify homologies with existing nucleotide or protein sequences, and to identify the RPCI-11 human BAC clone which contains the particular THC and therefore the EST sequence. Genemap already lists a number of the EST clusters which do not represent genes of known function as being ‘highly’ or ‘weakly similar’ to other nucleotides (e.g. known genes of unknown function or full-length transcripts) or proteins. These homologies were confirmed and further useful homologies were identified. In this way an additional number of ESTs likely to represent the same transcript were identified. As a result of the above analyses, 120 EST clusters were identified as likely to be redundant. Therefore, the number of non-redundant ESTs potentially mapping to the candidate region was 130, as the 24 genes of known function mapping to the region are not included in this figure.
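The redundancy search described above amounts to grouping ESTs that lead to the same THC, or to a pair of THCs reported as opposite ends of one putative transcript. The following is only a minimal sketch of that bookkeeping, not the procedure actually used: the function name, the data layout and the identifiers EST_B, EST_C and 999999 are hypothetical, and only AA086045, THC 378691 and THC 336132 come from the text. The final lines check the count reported in the text.

```python
from collections import defaultdict


def group_ests_by_thc(est_to_thcs, thc_pairs):
    """Group ESTs that share a THC, treating paired 5'/3' THCs as one transcript.

    est_to_thcs: dict mapping an EST ID to the THC IDs it was found in.
    thc_pairs:   iterable of (thc_a, thc_b) reported as opposite ends
                 of the same putative transcript.
    """
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Opposite-end THCs belong to the same putative transcript.
    for thc_a, thc_b in thc_pairs:
        union(("thc", thc_a), ("thc", thc_b))
    # Each EST belongs to the transcript of every THC that contains it.
    for est, thcs in est_to_thcs.items():
        for thc in thcs:
            union(("est", est), ("thc", thc))

    groups = defaultdict(set)
    for est in est_to_thcs:
        groups[find(("est", est))].add(est)
    return sorted(sorted(group) for group in groups.values())


# AA086045 and its two THCs are from the text; the rest is illustrative.
est_to_thcs = {
    "AA086045": ["378691"],  # 3'-read THC
    "EST_B": ["336132"],     # hypothetical EST in the paired 5'-read THC
    "EST_C": ["999999"],     # hypothetical unrelated EST and THC
}
thc_pairs = [("378691", "336132")]
groups = group_ests_by_thc(est_to_thcs, thc_pairs)

# Count check using the figures given in the text.
total_entries = 274    # genes and EST clusters in the interval
known_genes = 24       # genes of known function, counted separately
redundant = 120        # EST clusters judged likely to be redundant
non_redundant = total_entries - known_genes - redundant
```

In this toy input, AA086045 and EST_B fall into one group because their THCs are reported as opposite ends of the same transcript, while EST_C stays on its own. Homology found by BLAST could be fed in as further pairs in the same way.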

During the above homology searches, particular care was taken to note potential sequence homology between a THC and the cDNA sequences of the SCA1, SCA2, SCA3, SCA6 and SCA7 genes. Importantly, no BLAST search identified any of these cDNAs and therefore no such homology has been detected.

Ten of the 130 ESTs identified in this way are of particular interest. They represent 10 large genes of unknown function, which were identified from a foetal brain cDNA library. All these genes have the notation KIAA followed by a number. There are many more of these genes distributed throughout the genome. Their functions are likely to be many and varied but currently remain unknown. In view of their expression and localisation, the consideration of these genes will be prioritised during future investigation.

5.8 DISCUSSION

This chapter has described the efforts to identify and clone the SCA11 gene. Thus far, the causative mutation has not been found. The SCA11 candidate region is a large (~7 Mb) interval, and therefore attempts to identify the gene initially centred on the likely nature of the mutation in order to eliminate the need for detailed physical mapping.

The first strategy employed was based on the hypothesis that an expanded CAG/CTG repeat was the disease-causing mutation in SCA11. This was a reasonable possibility given the number of SCAs caused by expanded trinucleotide repeats. As described earlier, the RED method depends on the presence of a CAG or the complementary CTG trinucleotide repeat, and the sizes of the RED products correspond to the length of the target genomic repeat. The largest product obtained in the SCA11 family was 150 bp, but this was found in 5 unaffected subjects without the disease haplotype as well as in 3 affected subjects who had the disease haplotype. Hence, RED analysis was uninformative.
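The reasoning in this paragraph can be made explicit with a small worked example. This is a sketch under two assumptions that are not spelled out in the text: that a RED product length in base pairs divides directly by three to give a number of trinucleotide repeat units, and that a product is only informative if it appears in disease-haplotype carriers and in no one else. The function names are hypothetical.

```python
TRINUCLEOTIDE_BP = 3  # one CAG/CTG unit is three base pairs


def repeat_units(product_bp, unit_bp=TRINUCLEOTIDE_BP):
    """Convert a RED product length in bp to a number of repeat units."""
    if product_bp % unit_bp != 0:
        raise ValueError("product length is not a whole number of repeat units")
    return product_bp // unit_bp


def segregates_with_disease(carriers_with_product, non_carriers_with_product):
    """True only if the product is seen in haplotype carriers and nobody else."""
    return carriers_with_product > 0 and non_carriers_with_product == 0


# Figures from the text: a 150 bp product, seen in 3 affected subjects
# with the disease haplotype and in 5 unaffected subjects without it.
largest_repeats = repeat_units(150)
informative = segregates_with_disease(
    carriers_with_product=3,
    non_carriers_with_product=5,
)
```

Under these assumptions the 150 bp product corresponds to 50 repeat units, and because it also appears in subjects without the disease haplotype it does not segregate with the disease, which is why the result is treated as uninformative.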

The RED technique can be a useful method for determining the presence of an expanded CAG/CTG repeat as the mutational mechanism in diseases where this is suspected. The method was originally described by Schalling and co-workers (Schalling et al., 1993). Its usefulness was evaluated in a number of studies, including one where 30 samples from SCA1, SCA3 and HD patients with known repeat lengths were analysed with RED (Martorell et al., 1997). There was a good correlation between the number of repeats detected by RED and the number as determined by sequencing. However, in 17% of cases, RED gave additional ligation products of a different size from the CAG/CTG repeat expansion detected by sequencing. The same was observed in a group of 78 control subjects, in which products of more than 40 repeats were detected in 27%. Another study (Sirugo et al., 1997) has surveyed the maximum CTG/CAG repeat lengths in humans from different ethnic populations; the results showed that repeats of at least 51 were found in 27% of all individuals studied, with repeats of up to 85-102 found in one fifth to one third of individuals from east Asian populations.

Some insight into the origin of the large RED products has been gained from studies which have identified two loci where expanded CAG/CTG repeats are common and presumably non-pathogenic (Breschel et al., 1997; Nakamoto et al., 1997). 32 of 75 Japanese control DNA samples had alleles containing more than 51 repeats at the so-called ‘expanded repeat domain CAG/CTG 1’ (ERDA1) locus on chromosome 17q21.3, with a range of repeat sizes of 10-92. Similarly, 4 of 30 Caucasians had alleles of >51 repeats. The flanking sequences of this locus show no sequence similarity with other reported sequences and there is no evidence that this locus is transcribed. In addition, a highly polymorphic CTG repeat with a range of allele sizes of 11 to ~2100 has been identified in an intron of the SEF2-1 gene on chromosome 18q21.1. It was shown by PCR analysis, however, that the 150 bp product obtained in the SCA11 family did not result from expansions at the SEF2-1 or ERDA1 loci. The origin of this product in the SCA11 family remains unknown.
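For comparison with the percentages quoted from the other surveys, the ERDA1 counts above can be turned into proportions. This is a short illustrative calculation only; the helper name is hypothetical and the counts are those given in the text.

```python
def percent(count, total):
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


# Control samples with ERDA1 alleles of more than 51 repeats (from the text).
japanese_pct = percent(32, 75)
caucasian_pct = percent(4, 30)
```

These work out at roughly 43% of the Japanese and 13% of the Caucasian control samples, which illustrates how often a large, presumably non-pathogenic repeat could be detected by RED.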

The RED result in the SCA11 family was not wholly unexpected; owing to the similarity between the phenotypes of SCA11 and SCA6, and the lack of dramatic anticipation in either, it had been hypothesised that the mechanism of mutation in SCA11 might also be a moderately expanded repeat. In light of the frequency of large non-pathological trinucleotide repeats with more than 33 repeats, it is likely that RED analysis in SCA6 families would often be uninformative; indeed, normal alleles at the SCA1, SCA2, SCA3 and SCA7 loci, as well as those of the other triplet repeat disorders, can be larger than pathological alleles at the SCA6 locus. A similar argument may apply in the case of SCA11; in view of the absence of significant clinical anticipation in the Devon Ataxia family, it is possible that a putative pathological repeat might be only moderately expanded in affected individuals.

The possibility that such a small or moderately expanded repeat is the cause of SCA11 was further investigated by attempts to identify CTG/CAG repeat tracts specifically known to map to the candidate interval. Four such repeats had been identified previously, and these were analysed in members of the SCA11 family. However, all sequences appeared to be monomorphic and all individuals were homozygous for the same amplified fragment lengths in all four STSs. Clearly, these data do not exclude point mutations in any gene which may contain these repeat sequences, and this presents a further avenue of investigation.

An important and perhaps increasingly likely possibility is that SCA11 is caused by a mechanism other than an expanded CAG/CTG repeat. Point mutations in regions of the SCA6 gene positionally remote from the CAG repeat have been associated with episodic ataxia type 2 (EA-2) (Tournier-Lasserve, 1999). Recently, it has been shown that an unusual expanded pentanucleotide repeat (ATTCT) is found in the SCA10 gene of three Mexican families with spinocerebellar ataxia (Matsuura et al., 2000). Unsurprisingly, RED analysis had been unhelpful in these families, and 1C2 monoclonal antibodies had failed to detect an expanded polyglutamine tract in lymphoblastoid cell lines from SCA10 patients.

On review of the SCA11 candidate interval on the Genemap database, only one gene, MAP1A, was identified whose function and expression pattern made it a good candidate for SCA11. However, sequencing of all five exons and exon-intron boundaries revealed no potentially pathogenic sequence alterations. This finding does not, however, exclude mutations in the MAP1A introns or in the promoter region of the gene.
