
To clarify the function of software tools used in bioinformatics and the original need that motivated their development, we can separate them into two categories: prediction tools and analysis tools.

Prediction tools are used to find new biological information with the application of particular prediction models, based on specific biological knowledge. This is the case, for example, with gene prediction tools, which provide information on gene location based on genetic patterns for coding regions in DNA sequences.

On the other hand, analysis tools use existing information stored in biological databases to infer knowledge about novel molecular sequences. This is the case with search engine tools, which search sequences of unknown structure and function against biological databases to find similarities with sequences whose structure and function are already known.

Prediction and analysis tools are largely used in the process of genome annotation, which consists of attaching relevant biological knowledge to unknown DNA sequences, by combining genomic information from different databases. In proteomics, analysis tools are used as part of the process of protein identification. In the next sections we describe prediction and analysis tools in more detail, and discuss the characteristics of the biological databases used by these tools.

3.3.1 Prediction Tools

Once scientists have determined the genome sequence of an organism, they need to find where genes¹ are located, as well as the sequences within genes that code for proteins (the open reading frames, ORFs). In most organisms this process is not trivial, because the DNA sequences are composed of both coding sequences (exons) and noncoding sequences (introns). The introns are spliced out before the sequence is mapped into amino acids, so the exons are the segments of DNA that actually end up coding for a protein². To find the regions of the gene that code for proteins, scientists look for a variety of signals in the genetic code that indicate where the coding regions begin (indicated by a start codon) and end (indicated by a stop codon), and where splices should occur. Many software tools have been developed to help in gene prediction including, for example, GeneMark³ (Borodovsky and McIninch, 1993), Glimmer⁴ (Salzberg et al., 1998), and GenScan⁵ (Burge and Karlin, 1997).
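
The start and stop codon signals described above can be illustrated with a minimal sketch. It is not how GeneMark, Glimmer or GenScan work: it scans only the three forward reading frames, ignores introns and splice sites, and assumes the standard start codon ATG and stop codons TAA, TAG and TGA. The function name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
START = "ATG"                   # assumed standard start codon
STOPS = {"TAA", "TAG", "TGA"}   # assumed standard stop codons


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of candidate ORFs on the forward strand.

    `end` is exclusive and includes the stop codon. Introns and the
    reverse strand are deliberately ignored in this sketch.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon within this reading frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                # Count codons from the start codon up to (not including) the stop.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


print(find_orfs("CCATGAAATGACC"))  # [(2, 11)], i.e. ATG AAA TGA
```

Real gene predictors add statistical models of coding regions on top of such signals, which is why this sketch should be read only as an illustration of the start and stop codon idea.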

3.3.2 Analysis Tools

When scientists isolate a new molecular sequence through laboratory experiments, they want to know all relevant biological information about that sequence. The first thing to do is to determine whether a similar sequence has already been discovered and annotated (Phizicky et al., 2003). This is achieved by analysis tools.

Analysis tools operate as search engines, comparing a target sequence against one or more biological databases to find similarities in homology, structure or function. They operate under the premise that if two DNA sequences have a similar combination of nucleotides, they probably have similar function and structure, even if they come from different organisms or different cells. The same premise is applied to protein sequences, in this case for similar combinations of peptide sequences. Probably the most widely used computational tool for this task is BLAST (Basic Local Alignment Search Tool)⁶ (Altschul et al., 1990), which searches databases such as GenBank and Swiss-Prot for all sequences similar to the target sequence. In the case of search engines for protein identification, known as ms/ms search engines, the input data is a set of peptide sequences (the ms/ms spectrum)⁷, and the result is a list of candidate proteins matching the peptides, each associated with a score and the number of peptides that matched the protein (the higher the number of matching peptides, the higher the confidence that the protein is present in the unknown protein mixture).
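
The idea of ranking candidate proteins by the number of matching peptides can be sketched as below. This is a toy illustration under strong assumptions: peptides are matched by exact substring lookup, whereas real ms/ms search engines match mass spectra and compute statistical scores. The function `rank_candidates`, the protein names and the sequences are all hypothetical.

```python
def rank_candidates(query_peptides, database):
    """Rank database proteins by how many query peptides they contain.

    `database` maps a protein name to its amino acid sequence.
    Returns a list of (name, match_count, matched_peptides) tuples.
    """
    results = []
    for name, sequence in database.items():
        matched = [p for p in query_peptides if p in sequence]
        if matched:
            results.append((name, len(matched), matched))
    # More matching peptides first; ties are broken by name for determinism.
    results.sort(key=lambda r: (-r[1], r[0]))
    return results


# Hypothetical toy database and query peptides.
database = {
    "protA": "MKTAYIAKQRQISFVK",
    "protB": "MKTAGGGQISFVK",
    "protC": "PPPPP",
}
for name, count, matched in rank_candidates(["TAYIAK", "QISFVK", "MK"], database):
    print(name, count, matched)
```

Here `protA` is ranked first because all three peptides match it, `protB` matches two, and `protC` is not returned at all.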

Different search engine services have different properties relating to, for example, the databases that are being searched, the search algorithm used, the scoring system that calculates a measure of how well the sequence matches the database, and the returned information, which can be in many different formats. These different properties can lead to differences in the quality of results since, for example, a specific scoring system can return a more precise evaluation of the candidate proteins, and one search algorithm can be faster than another.

² According to The Wellcome Trust (2001), less than 2% of the human genome actually contains protein coding regions.

³ Available at the EBI website http://www.ebi.ac.uk/genemark/

⁴ Available at the NCBI website http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi

⁵ Available at http://genes.mit.edu/GENSCAN.html

⁶ Available at the NCBI website http://130.14.29.110/BLAST/

⁷ An ms/ms spectrum is a specific type of input received by search engines, from mass spectrometry

Search engines also have several configuration parameters, which are fairly similar across engines with similar functionality and type of input data, and which define the search space for the query sequence. In particular, for ms/ms search engines, these parameters include: taxonomy (organism classification), peptide tolerance (the error window for experimental peptide mass values), and number of missed cleavages (proteins are digested with an enzyme that breaks peptide bonds at specific sites, and this parameter indicates the number of missed breaks allowed during digestion).

These parameters matter because they influence the search results. For example, if the peptide tolerance is set to a high value, this can result in a higher number of false matches, since the comparison window for the peptide mass is bigger. Conversely, if the peptide tolerance is set to a low value, it can result in the loss of true matches. In practice, bioinformatics experts adjust these parameters according to their individual preferences, or to the quality of the data produced by the mass spectrometer: if the peptide masses are accurate, parameters are set to narrow the search space, while if they are approximations, parameters are set to widen it.
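
The effect of the missed cleavages and peptide tolerance parameters can be sketched as follows. The sketch assumes a simplified trypsin-like rule (cleave after every K or R), which the text does not specify, and uses invented peptide masses assumed to be in daltons. The functions `digest` and `match_masses` are hypothetical and do not reproduce any particular search engine.

```python
def digest(protein, missed_cleavages=0):
    """Return the set of peptides from a simplified enzymatic digestion.

    Assumption: the enzyme cleaves after every K or R (trypsin-like rule).
    Up to `missed_cleavages` consecutive cleavage sites may be skipped.
    """
    fragments, current = [], ""
    for residue in protein:
        current += residue
        if residue in "KR":
            fragments.append(current)
            current = ""
    if current:
        fragments.append(current)

    peptides = set()
    for i in range(len(fragments)):
        # Join fragment i with up to `missed_cleavages` following fragments.
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptides.add("".join(fragments[i:j + 1]))
    return peptides


def match_masses(observed, theoretical, tolerance):
    """Return (observed_mass, peptide_name) pairs within `tolerance`."""
    return [(m, name)
            for m in observed
            for name, t in theoretical.items()
            if abs(m - t) <= tolerance]


print(sorted(digest("MKTAYRGG")))                      # no missed cleavages
print(sorted(digest("MKTAYRGG", missed_cleavages=1)))  # larger search space

# Invented theoretical masses for three hypothetical peptides.
theoretical = {"pepA": 1000.50, "pepB": 1001.20, "pepC": 1500.00}
for tol in (0.05, 0.2, 1.0):
    print(tol, match_masses([1000.60], theoretical, tol))
```

With these invented values, a tolerance of 0.05 returns no match, 0.2 returns only `pepA`, and 1.0 returns both `pepA` and `pepB`, mirroring the trade-off between lost true matches and additional false matches.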

Alternative ms/ms search engines are publicly available, and generally differ from each other in the implementation of the matching algorithm. Examples of such search engines are Mascot (Perkins et al., 1999), Tandem (Craig and Beavis, 2003), and OMSSA (Geer et al., 2004), some of which run on remote servers, while others run as local services. Although these are alternative ms/ms search engines with the same functionality, they can yield heterogeneous results for the same input data. As a consequence, some services may be more suitable than others for data of a certain quality or for a particular configuration setting. This means that, even if one search engine performs better with a particular configuration setting, its performance may differ under another setting.

3.3.3 Biological Databases

Bioinformatics databases store information related to the genes and proteins of organisms, and are tailored to particular types of information or organisms. For example, they can store DNA sequences, protein sequences or entire genomes, or they can store biological information that is related to particular organisms such as humans, mice, fruit flies, viruses and so on. Most data that is stored in bioinformatics databases is annotated, with relevant biological information attached to it, for example indicating to which organism the sequence belongs, or which gene or protein that sequence is related to, and so on.

When existing bioinformatics databases do not contain any proteins matching an unknown protein sequence, a new database (known as a six-frame database) can be created by translating DNA sequences from the organism associated with the protein mixture directly into protein sequences, in all six reading frames (three on each strand). Although the protein sequences in a six-frame database are not known proteins, a match indicates that the protein exists, but has not yet been annotated, or at least not in publicly available databases.
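
A six-frame translation can be sketched as below, assuming the standard genetic code and a DNA sequence over the alphabet A, C, G, T. The function names and the frame labels (`+1` to `+3` for the forward strand, `-1` to `-3` for the reverse complement) are choices made for this illustration, not part of any specific tool.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order (assumption:
# the organism uses the standard code; "*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    usable = len(dna) - len(dna) % 3  # drop a trailing partial codon
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, usable, 3))


def six_frame_translation(dna):
    """Return the six translations of `dna`, keyed by frame label."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


for label, protein in sorted(six_frame_translation("ATGGCCTAA").items()):
    print(label, protein)
```

In a six-frame database, each of these translations (typically split at the stop codons) would become a searchable entry, so that peptides from unannotated proteins can still find a match.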
