To clarify the function of software tools used in bioinformatics and the original need
that motivated their development, we can separate them into two categories: prediction
tools and analysis tools.
Prediction tools are used to find new biological information with the application of particular prediction models, based on specific biological knowledge. This is the case,
for example, with gene prediction tools, which provide information on gene location
based on genetic patterns for coding regions in DNA sequences.
On the other hand, analysis tools use existing information stored in biological databases
to infer knowledge about novel molecular sequences. This is the case with search engine tools, which search sequences of unknown structure and function against biological databases to find similarities with sequences whose structure and function are already known.
Prediction and analysis tools are largely used in the process of genome annotation, which consists of attaching relevant biological knowledge to unknown DNA sequences, by combining genomic information from different databases. In proteomics, analysis
tools are used as part of the process of protein identification. In the next sections we
describe prediction and analysis tools in more detail, and discuss the characteristics of the biological databases used by these tools.
3.3.1 Prediction Tools
Once scientists have determined the genome sequence of an organism, they need to find where genes¹ are located, as well as the sequences within genes that code for proteins (the open reading frames, ORFs). In many organisms, particularly eukaryotes, this process is not trivial, because the DNA sequences are composed of both coding sequences (exons) and noncoding sequences (introns). The introns are spliced out of the transcript before it is translated into amino acids, so the exons are the segments of DNA that actually end up coding for a protein². To find the regions of the gene that code for proteins, scientists look for a variety of signals in the genetic code that indicate where the coding regions begin (indicated by a start codon) and end (indicated by a stop codon), and where splices should occur. Many software tools have been developed to help in gene prediction including, for example, GeneMark³ (Borodovsky and McIninch, 1993), Glimmer⁴ (Salzberg et al., 1998), and GenScan⁵ (Burge and Karlin, 1997).
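The start and stop codon signals described above can be illustrated with a minimal sketch. It is not how GeneMark, Glimmer or GenScan work (those rely on statistical models); it only scans the forward strand for stretches that begin with ATG and end at the first in-frame stop codon. The function name `find_orfs` and the `min_codons` parameter are hypothetical, chosen for illustration, and splicing is ignored entirely.

```python
# Simplified illustration only: real gene finders use statistical models
# and handle both strands and splice sites. Names here are hypothetical.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for each ATG...stop stretch on the forward strand.

    `end` is exclusive and includes the stop codon. `min_codons` counts the
    codons before the stop codon, including the start codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs

if __name__ == "__main__":
    print(find_orfs("CCATGAAATAGGG"))
```

Each reading frame is scanned independently because a codon boundary shifted by one or two nucleotides yields a completely different sequence of codons.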
3.3.2 Analysis Tools
When scientists isolate a new molecular sequence through laboratory experiments, they want to know all relevant biological information about that sequence. The first step is to determine whether a similar sequence has already been discovered and annotated (Phizicky et al., 2003). This is achieved by analysis tools.
Analysis tools operate as search engines by comparing a target sequence against one or more biological databases to find similarities that suggest homology, or shared structure or function. They operate under the premise that if two DNA sequences have a similar combination of nucleotides, they probably have similar function and structure, even if they come from different organisms or different cells (the same premise is applied to protein sequences, but for similar combinations of amino acids). Probably the most widely used computational tool for this task is BLAST (Basic Local Alignment Search Tool)⁶ (Altschul et al., 1990), which searches databases like GenBank and Swiss-Prot for all sequences similar to the target sequence. In the case of search engines for protein identification, known as ms/ms search engines, the input data is a set of peptide sequences (the ms/ms spectrum)⁷, and the result is a list of candidate proteins matching the peptides, each associated with a score and the number of peptides that matched the protein (the higher the number of matching peptides, the higher the confidence that the protein is present in the unknown protein mixture).
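The idea of ranking candidate proteins by the number of matching peptides can be sketched as follows. This is a toy illustration under strong assumptions: peptides are matched by exact substring search, and the count of matched peptides stands in for the score, whereas real ms/ms search engines use their own scoring systems. The function name `rank_candidates` and the example database are hypothetical.

```python
def rank_candidates(observed_peptides, protein_db):
    """Rank proteins by how many observed peptides occur in their sequence.

    protein_db maps a protein name to its amino acid sequence.
    Returns a list of (name, number_of_matches, matched_peptides).
    """
    observed = set(observed_peptides)
    hits = []
    for name, sequence in protein_db.items():
        matched = sorted(p for p in observed if p in sequence)
        if matched:
            hits.append((name, len(matched), matched))
    # Most matched peptides first; ties are broken alphabetically by name.
    hits.sort(key=lambda hit: (-hit[1], hit[0]))
    return hits

if __name__ == "__main__":
    # Hypothetical sequences, invented for illustration.
    db = {"P1": "MKTAYIAKQRQISFVK", "P2": "MKTAYGGGR", "P3": "AAAA"}
    for hit in rank_candidates(["MKTAY", "QISFVK", "AKQR"], db):
        print(hit)
```

In this sketch a protein supported by three peptides ranks above one supported by a single peptide, mirroring the confidence argument in the text.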
Different search engine services have different properties relating to, for example, the databases that are being searched, the search algorithm used, the scoring system that calculates a measure of how well the sequence matches the database, and the returned information, which can be in many different formats. These different properties can lead to differences in the quality of results since, for example, a specific scoring system can return a more precise evaluation of the candidate proteins, and one search algorithm can be faster than another.

² According to The Wellcome Trust (2001), less than 2% of the human genome actually consists of protein-coding regions.
³ Available at the EBI website http://www.ebi.ac.uk/genemark/
⁴ Available at the NCBI website http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi
⁵ Available at http://genes.mit.edu/GENSCAN.html
⁶ Available at the NCBI website http://130.14.29.110/BLAST/
⁷ ms/ms spectrum is a specific type of input received by search engines, from mass spectrometry
Search engines also have several configuration parameters, which are fairly similar for those engines with similar functionality and type of input data, and define the search space for the query sequence. In particular, for ms/ms search engines, these parameters include: taxonomy (organism classification), peptide tolerance (the error window for experimental peptide mass values), and number of missed cleavages (proteins are digested into peptides by an enzyme which breaks peptide bonds at specific sites, and this measure indicates the number of allowed missed breaks during digestion).

Configuration matters because search results can be influenced by different parameter values. For example, if the peptide tolerance is set to a high value, this can result in a higher number of false matches, since the comparison window for the peptide mass is bigger. Conversely, if the peptide tolerance is set to a low value, it can result in the loss of true matches. In practice, bioinformatics experts adjust these parameters according to their individual preferences, or to the quality of the data produced by the mass spectrometer: if peptide masses are accurate, parameters are set to narrow the search space, while if they are approximations, parameters are set to widen it.
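The effect of the missed cleavages and peptide tolerance parameters can be sketched with two small functions. The cleavage rule used here (cut after K or R) is a simplified trypsin-like assumption, not something the text specifies, and the names `digest` and `match_masses`, as well as the mass values in the usage example, are hypothetical.

```python
def digest(sequence, missed_cleavages=0):
    """Cut after every K or R (simplified trypsin-like rule, an assumption)
    and also return peptides spanning up to `missed_cleavages` uncut sites."""
    pieces, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            pieces.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        pieces.append(sequence[start:])  # trailing piece without a cut site
    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides

def match_masses(observed, theoretical, tolerance):
    """Pair each observed mass with every theoretical mass within +/- tolerance."""
    return [(o, t) for o in observed for t in theoretical
            if abs(o - t) <= tolerance]

if __name__ == "__main__":
    print(digest("AAKGGRCC", missed_cleavages=1))
    theoretical = [999.6, 1000.3, 1001.5]  # invented values for illustration
    for tolerance in (0.1, 0.5, 2.0):
        print(tolerance, match_masses([1000.0], theoretical, tolerance))
```

With the invented masses in the usage example, widening the tolerance from 0.1 to 2.0 increases the number of candidate matches from none to three, which is the trade-off between lost true matches and added false matches described above.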
Alternative ms/ms search engines are publicly available, and generally differ from each other in the implementation of the matching algorithm. Examples of such search engines are Mascot (Perkins et al., 1999), Tandem (Craig and Beavis, 2003), and OMSSA (Geer et al., 2004), some of which run on remote servers, while others run as local services. Although these are alternative ms/ms search engines with the same functionality, they can yield different results for the same input data. As a consequence, some services may be more suitable than others for data of a certain quality or for a particular configuration setting. This means that, even if one search engine performs better with a particular configuration setting, its performance may differ under another.
3.3.3 Biological Databases
Bioinformatics databases store information related to the genes and proteins of organisms, and are tailored to particular types of information or organisms. For example, they can store DNA sequences, protein sequences or entire genomes, or they can store biological information that is related to particular organisms such as humans, mice, fruit flies, viruses and so on. Most data stored in bioinformatics databases is annotated, that is, it has relevant biological information attached to it, indicating for example which organism the sequence belongs to, or which gene or protein the sequence is related to.
When existing bioinformatics databases do not contain any proteins matching an unknown protein sequence, a new database (known as a six-frame database) can be created by translating DNA sequences from the organism associated with the protein mixture directly into protein sequences, in all six reading frames (three on each strand). Although the protein sequences in a six-frame database do not correspond to known proteins, a match indicates that the protein exists, but has not yet been annotated, or at least not in publicly available databases.
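A six-frame translation can be sketched in a few lines. The sketch below assumes the standard genetic code and plain A/C/G/T input; the function names and the frame labels ("+1" to "-3") are hypothetical conventions chosen for illustration, and a real six-frame database would typically also split the translations at stop codons (shown here as "*").

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def reverse_complement(dna):
    """Return the reverse complement of an uppercase A/C/G/T sequence."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )

def six_frame_translation(dna):
    """Translate three frames on the forward strand and three on the reverse."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

Translating all six frames is needed because, without annotation, neither the coding strand nor the codon boundary of an unknown gene is known in advance.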