Comparing a novel sequence to all sequences previously identified is obviously very important. Searching public databases using genomic sequences will identify (1) exact matches to previously unmapped human genes, (2) high-scoring matches to orthologues and close homologues, (3) weaker homologies to other genes, suggestive of coding regions rather than indicative of exactly which bases are coding, (4) near-exact matches to human EST sequences (the corresponding cDNAs can then be obtained and sequenced to determine intron-exon structure), (5) non-exact matches to ESTs from orthologues or homologues, which in the absence of near-exact matches may provide information about gene structure, and (6) miscellaneous other features (e.g. STS sequences, non-annotated interspersed repeats and sequenced CpG island clones).
Programs
The most commonly used sequence database searching algorithm is BLAST (basic local alignment search tool, Altschul et al. 1990), which has several different implementations (BLASTN, BLASTP, BLASTX, TBLASTN and TBLASTX) depending on whether a nucleotide or amino acid query sequence is used to search a nucleotide or amino acid database. BLAST is a program which compares all ‘words’ of a query sequence with all words in each sequence in the target database. A word is a small subsequence contained within the main sequence and has length w, typically 12 for nucleic acid sequences, so that the first word in the sequence consists of bases 1-12, the second word of bases 2-13, etc. The match between query and target words is scored using a standard scoring system (for nucleic acid alignments, a correct match scores +5 and a mismatch scores -4) and the total score is compared with a user-defined threshold score, T. For protein sequences there is a range of scoring matrices available (each derived from groups of related proteins of varying evolutionary distance) where identical amino acids score highly, conservative substitutions (chemically similar amino acids) score positively, and non-conservative changes score negatively. A match of a rare amino acid also scores more highly than that of a more common amino acid, as it is less likely to happen by chance and more likely to be functionally significant. Low T values give a very sensitive searching system but increase computing time, and a default value of T has been empirically determined which balances these considerations. To increase the speed of the calculations, all possible words which would score above T against a word in the query sequence are first determined, and the database is searched for identities with words in this list. Attempts are then made to extend the word match into a longer alignment, or ‘maximal segment pair’ (MSP), until the score cannot be raised by extending the alignment further. If the score of the extended match is more than a second threshold value, S, it is reported. A query sequence may find a match in a large database by chance, and a statistical method has been developed to assess the probability (P or E value) of finding a match of a given score for that query sequence by chance. A low E value indicates that a match is unlikely to have arisen by chance, and so is more likely to be biologically significant. The threshold values S and T may be set by the user (as well as other BLAST parameters like w) if only very high-scoring matches are desired, or if many matches would be useful however weak. Biological databases do not follow the random models used to determine these statistics, and biases in the databases (nucleic acid sequences containing dinucleotide repeats, proline-rich and overrepresented sequences like some EST sequences) may mean that a match will get a high score without necessarily being biologically significant (Altschul et al. 1994).
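The word-matching and extension steps described above can be illustrated with a minimal Python sketch for nucleotide sequences. It is not the BLAST implementation: it uses only exact word identities (the list of neighbouring words scoring above T is omitted), it extends without gaps, and the function names, the short default word length and the `drop_off` stopping rule are assumptions made for illustration. Only the +5/-4 scores and the reporting threshold S come from the text.

```python
from collections import defaultdict

MATCH, MISMATCH = 5, -4  # nucleotide scores quoted in the text


def index_words(seq, w):
    """Map every length-w word in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - w + 1):
        index[seq[i:i + w]].append(i)
    return index


def extend(query, target, qi, ti, w, drop_off=20):
    """Extend an exact word hit, without gaps, in both directions.

    Extension in one direction stops at the end of either sequence, or once
    the running score falls more than drop_off below the best score seen
    (drop_off is an assumed heuristic, not a value from the text).
    Returns (score, query_start, target_start, length).
    """
    def walk(q, t, step):
        best = run = best_len = n = 0
        while 0 <= q < len(query) and 0 <= t < len(target):
            run += MATCH if query[q] == target[t] else MISMATCH
            n += 1
            if run > best:
                best, best_len = run, n
            elif best - run > drop_off:
                break
            q += step
            t += step
        return best, best_len

    right_score, right_len = walk(qi + w, ti + w, 1)
    left_score, left_len = walk(qi - 1, ti - 1, -1)
    score = w * MATCH + left_score + right_score
    return score, qi - left_len, ti - left_len, left_len + w + right_len


def search(query, target, w=4, S=30):
    """Report extended word hits whose score is more than S, best first."""
    index = index_words(target, w)
    hits = set()  # the same segment pair can be reached from several words
    for qi in range(len(query) - w + 1):
        for ti in index.get(query[qi:qi + w], []):
            hit = extend(query, target, qi, ti, w)
            if hit[0] > S:
                hits.add(hit)
    return sorted(hits, reverse=True)
```

Calling `search("ACGTACGTAC", "TTTTACGTACGTACTTTT")` returns a list of `(score, query_start, target_start, length)` tuples, with the full-length match at target position 4 ranked first. Raising S or w makes the search stricter and faster; lowering them makes it more sensitive, which mirrors the trade-off described for T.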
Another algorithm for database searching, FASTA (Pearson & Lipman 1988), is more sensitive than BLAST, but is much slower despite its name. It uses roughly the same method to find initial alignments between a pair of sequences, and then looks at all the high-scoring local alignments between that pair of sequences (some may be part of an optimal alignment, and some may not) and determines which could be joined together to make one alignment. It then calculates an optimal alignment between the two sequences, allowing gaps, by joining the local alignments together. It is clearly more biologically meaningful to have one long alignment (possibly weakly matching in places) than several unordered or overlapping high-scoring segment pairs. FASTA is a good method for analysing shorter sequences in great detail to find even very weak similarities, but for large pieces of genomic sequence it is too slow and may show a drop in selectivity (i.e. lots of false positive matches which, over the length of a cosmid, could easily overwhelm real matches).
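The idea of joining compatible local alignments into one longer alignment can be sketched as a small dynamic programme over segments. This is an illustration only, not the FASTA algorithm itself: the segment tuple layout, the function name and the fixed `join_penalty` are assumptions, and the final gapped re-alignment that FASTA performs is not shown.

```python
def chain_segments(segments, join_penalty=20):
    """Pick the best-scoring colinear chain of local alignments.

    Each segment is (q_start, q_end, t_start, t_end, score), with end
    coordinates exclusive. A segment may follow another only if it starts
    at or after the end of the previous one in BOTH sequences, so
    unordered or overlapping segments are never joined. Every join costs
    join_penalty (an assumed, fixed value).
    Returns (total_score, segments_in_chain_order).
    """
    segs = sorted(segments)  # sorted by query start
    if not segs:
        return 0, []

    best = []  # best[i]: best chain score ending with segs[i]
    prev = []  # prev[i]: index of the previous segment in that chain
    for i, (qs, _qe, ts, _te, score) in enumerate(segs):
        best_i, prev_i = score, None
        for j in range(i):
            _, qe_j, _, te_j, _ = segs[j]
            if qe_j <= qs and te_j <= ts:
                candidate = best[j] + score - join_penalty
                if candidate > best_i:
                    best_i, prev_i = candidate, j
        best.append(best_i)
        prev.append(prev_i)

    # Trace back from the highest-scoring chain end.
    end = max(range(len(segs)), key=best.__getitem__)
    total = best[end]
    chain = []
    while end is not None:
        chain.append(segs[end])
        end = prev[end]
    return total, chain[::-1]
```

Given four segments of which only two are in the same order in both sequences, the function keeps those two as a single chain and leaves the others out, which is the behaviour the paragraph above argues is more meaningful than reporting every segment separately.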
Recently, several improved versions of BLAST have been introduced: gapped BLAST (BLAST2), which allows alignments to include gaps (Altschul et al. 1997); PSI-BLAST (position-specific iterated BLAST) for finding weak protein matches, which first finds matches using gapped BLAST, then builds a position-specific scoring matrix based on a multiple alignment of the query sequence with its matches, and then searches the database with that matrix, repeating the process iteratively (Altschul et al. 1997); and WU-BLAST, a sensitive gapped BLAST (unpublished, http://blast.wustl.edu/). There are also several other database searching programs available, including BLAZE (Brutlag et al. 1993), MPsrch, and BLITZ, but none are as flexible (allowing searches on many databases), fast, or well-supported as BLAST and FASTA.
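To make the notion of a position-specific scoring matrix concrete, the following Python sketch builds one from a small ungapped multiple alignment and uses it to score windows of another sequence. It is a simplification under stated assumptions: a uniform background frequency and add-one pseudocounts are used here for brevity, and the function names are hypothetical. PSI-BLAST's own sequence weighting, pseudocount scheme and iteration are not reproduced.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column.

    aligned is a list of equal-length sequences. Scores are log2 of the
    (pseudocount-smoothed) column frequency over a uniform background;
    both choices are assumptions made for illustration.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            a: math.log2((counts[a] + pseudocount) / total / background)
            for a in AMINO_ACIDS
        })
    return pssm


def score_window(pssm, window):
    """Sum the column scores for a window as long as the matrix."""
    # Residues outside the 20 standard amino acids score 0 (assumption).
    return sum(col.get(res, 0.0) for col, res in zip(pssm, window))


def best_window(pssm, sequence):
    """Return (score, start) of the best-scoring window in sequence."""
    n = len(pssm)
    return max(
        (score_window(pssm, sequence[i:i + n]), i)
        for i in range(len(sequence) - n + 1)
    )
```

For example, a matrix built from `["MKVL", "MKIL", "MRVL"]` scores M positively in the first column, and `best_window` locates the window starting at `MKVL` in `"GGMKVLGG"`. Because each column has its own scores, a residue that is tolerated at one position can be penalised at another, which a single substitution matrix cannot express.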
However good a database searching program is, a lot of human interpretation of the results is required to identify meaningful matches. Some computerised help is available in the form of post-BLAST processing programs: MSPcrunch (Sonnhammer & Durbin 1994), which adjusts the results to simplify matches to large protein families and high-scoring matches due to unusual nucleotide or amino acid composition, and works out which MSPs are adjacent to one another; Blixem, which can display MSPcrunch output in easily read form, showing where along the query sequence the matches occur (Sonnhammer & Durbin 1994); and BEAUTY, which incorporates functional information from database entries into its output and provides links to many other databases (Worley et al. 1995). MSPcrunch output can also be imported into AceDB, a database which can be used to store many different forms of genome project information and which can also compare BLAST results with gene prediction results from the program GeneFinder.