Databases need to be regularly updated to reflect new information, not just by including new entries but also by updating and correcting existing ones. In fields that are actively evolving, both major and minor releases may be required, with a few major updates each year but minor updates on a weekly or even daily basis. Individual database entries will change less often, but it should be easy to recognize whether the current entry differs from a copy made earlier. This can be achieved by using version numbers for entries, or alternatively by reporting the most recent date on which changes were made, as can be seen in the "Entry information" section of the database shown in Figure 3.8. However, this is only of use if there is also a unique identifier for the entry that is fixed and can be used to ensure the two versions are indeed of the same entry. In Figure 3.8 this is called the "Primary accession number."
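The check described above can be sketched in a few lines of Python. This is only an illustration: the `EntryStamp` class, its field names, and the accession value `X00001` are hypothetical and do not correspond to any particular database's record format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EntryStamp:
    """Hypothetical minimal metadata kept alongside a copy of an entry."""
    accession: str        # fixed unique identifier (never changes)
    version: int          # incremented whenever the entry is modified
    last_modified: date   # date of the most recent change


def has_changed(saved: EntryStamp, current: EntryStamp) -> bool:
    """Return True if the current entry differs from the saved copy."""
    # Version numbers and dates are only comparable for the same entry,
    # which is why a fixed accession number is needed.
    if saved.accession != current.accession:
        raise ValueError("stamps belong to different entries")
    return (saved.version, saved.last_modified) != (
        current.version,
        current.last_modified,
    )


saved = EntryStamp("X00001", 3, date(2006, 1, 10))    # copy made earlier
current = EntryStamp("X00001", 4, date(2006, 5, 2))   # entry as it is now
print(has_changed(saved, current))  # True
```

Comparing the accession first mirrors the point made in the text: a version number or modification date says nothing unless both records are known to describe the same entry.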
There are occasions when a decision is taken to make an existing database entry obsolete. In most cases the data from this entry will still be in the database, but now in a different entry with a different unique identifier. It is important that the current database maintains some record of the obsolete entry identifier, the reasons for the decision, and the fate of the data. This information and version records will sometimes be of great value in understanding the reasons why a repeated study produces different results.
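One possible way to keep such a record is sketched below in Python. The `ObsoleteRecord` class, the `resolve` helper, and the identifiers `OLD1`, `OLD2`, and `NEW3` are all hypothetical; real databases each have their own mechanisms for tracking retired identifiers.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ObsoleteRecord:
    """Hypothetical record kept when an entry is made obsolete."""
    old_id: str                 # identifier of the obsolete entry
    reason: str                 # why the entry was retired
    replaced_by: Optional[str]  # entry now holding the data, or None if withdrawn


def resolve(
    identifier: str, obsolete: Dict[str, ObsoleteRecord]
) -> Tuple[Optional[str], List[ObsoleteRecord]]:
    """Follow replacements from an identifier to the entry that is current.

    Returns the current identifier (None if the data was withdrawn) and
    the trail of obsolete records passed through on the way.
    """
    trail: List[ObsoleteRecord] = []
    seen = set()
    current: Optional[str] = identifier
    while current is not None and current in obsolete:
        if current in seen:  # guard against a corrupted, circular history
            raise ValueError("cycle in obsolete-entry records")
        seen.add(current)
        record = obsolete[current]
        trail.append(record)
        current = record.replaced_by
    return current, trail


history = {
    "OLD1": ObsoleteRecord("OLD1", "merged with duplicate entry", "OLD2"),
    "OLD2": ObsoleteRecord("OLD2", "sequence reassembled", "NEW3"),
}
current_id, trail = resolve("OLD1", history)
print(current_id)                 # NEW3
print([r.reason for r in trail])  # ['merged with duplicate entry', 'sequence reassembled']
```

With a trail like this, someone repeating an older study can see both where the data went and why the original identifier no longer resolves directly.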
The eukaryotic genome sequencing projects are an extreme example of the importance of the issues just discussed. The experimental methods used involve breaking the genome into many small overlapping pieces, which can be individually sequenced, and then using the overlaps to assemble these into the complete chromosomes. Many of these projects were, and are, funded subject to intermediate data being made publicly available very soon after it is obtained. As a consequence, a database of the sequence data has to be constructed before the complete chromosomes have been assembled. As the project progresses, the assembly progresses, resulting in ever-longer sequences formed by merging the smaller sequences. This results in some database entries becoming obsolete. Until the assembly has been completed, features such as genes cannot be identified by reliable sequence base numbers, as the true start of the chromosome will not be known. Every time sequences are merged into a larger assembly there is a possibility of the sequence numbers changing. Great care is needed to use suitable methods to identify features in the database such that they can be traced over the development of the assembly to its final state.
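To make the numbering problem concrete, the Python sketch below anchors a feature to the piece on which it was originally found and derives its position in the larger sequence from a placement table. It is a deliberately simplified illustration, not a description of how any sequencing project does this: all names and coordinates are hypothetical, and it assumes every piece is placed in the forward orientation with no trimming of overlaps.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Feature:
    """A feature located relative to the piece it was found on."""
    name: str
    contig: str   # stable identifier of the originally sequenced piece
    start: int    # 1-based start within that piece
    end: int      # 1-based end within that piece


# Maps a piece to (assembled sequence id, 1-based position where the piece begins).
Placement = Dict[str, Tuple[str, int]]


def locate(feature: Feature, placement: Placement) -> Tuple[str, int, int]:
    """Translate piece-relative coordinates into assembly coordinates."""
    seq_id, contig_start = placement[feature.contig]
    shift = contig_start - 1
    return seq_id, feature.start + shift, feature.end + shift


gene = Feature("geneA", "contig_7", 120, 980)

# Early assembly: the piece stands at the start of its own sequence.
early = {"contig_7": ("scaffold_2", 1)}
# Later assembly: the piece has been merged into a longer sequence.
later = {"contig_7": ("scaffold_1", 5401)}

print(locate(gene, early))  # ('scaffold_2', 120, 980)
print(locate(gene, later))  # ('scaffold_1', 5520, 6380)
```

The feature record itself never changes between the two assemblies; only the placement table is updated, which is one way a feature can be traced through successive merges.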
Summary
In this chapter we have introduced the reader to the concept of a database and looked at the wide variety of databases publicly available and easily accessible via the Internet. There are many Web sites that serve locally created, highly specialized databases, as well as large resource centers that integrate many key databases into a unified network that facilitates identifying connections in the data, which is one of the main aspects of bioinformatics analysis.
We have highlighted the importance of accuracy, both in data and in the annotations. Equally important is that the database is kept up to date, as analysis based on outdated or incorrect data will also be outdated and, quite possibly, incorrect. Databases are often the starting point of many types of bioinformatics research that will be described in the following chapters. They are a powerful tool for storing, sharing, and describing data, as well as for extracting information for further understanding and analysis. They can be regarded both as data repositories and online libraries.
Finally, to appreciate the range of data available in the public databases and the numerous ways in which it can be presented, the reader is encouraged to visit the NAR database list, follow the links, explore the possibilities, and let curiosity lead the way.