As previously mentioned, the functional annotations by the Affymetrix array manufacturer and Ensembl stored in the LPD were originally tailored for
[Flowchart: gene expression datasets with redundant array feature/gene identifiers; retrieve UniGene identifiers using Elink and nucleotide sequences using Efetch; identify entries with identical UniGene identifiers and entries with identical sequences; gene expression datasets with nonredundant array feature/gene identifiers.]
Figure 3.4.1. Flowchart showing the combined methodology used for identifying equivalent biological entries across the different LPD expression datasets. The NCBI web services Elink and Efetch were used to retrieve UniGene identifiers and nucleotide sequences for array features using their GenBank identifiers. Equivalent biological entries across the different datasets were identified by means of identical UniGene identifiers and/or identical sequences. The two strategies complemented each other: UniGene mapping allows entries featuring partial sequences of the same gene to be identified, while sequence matches are more appropriate when UniGene identifiers are unknown.
3. A database of gene expression data from animal models of peripheral neuropathy
3.4. Data integration
Affymetrix arrays and hence needed no further integration with the Affymetrix-based expression datasets in the LPD. However, one important aim of the current work was to derive functional annotations for the genes in the various expression datasets by exploiting the family-oriented BioMap annotation framework. Using BioMap, additional functional information for uncharacterised genes may be gained from functionally characterised homologs. This was particularly important because the average functional coverage of the arrays achieved by either annotation source (Affymetrix or Ensembl) was rather limited. Furthermore, functional information derived from BioMap may be assessed by considering the extent of functional variation within individual protein families. Finally, exploiting BioMap provided an opportunity to annotate the LPD expression datasets originating from the literature, which were not based on Affymetrix arrays and needed to be explicitly annotated.
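The combined matching strategy summarised in Figure 3.4.1 can be illustrated with a minimal sketch. It assumes each entry is a dictionary with the hypothetical keys `id`, `unigene` and `sequence`, and it merges entries that share a UniGene identifier and/or an identical sequence using a union-find structure. The field names and the case normalisation are assumptions made for illustration, not details taken from the LPD.

```python
from collections import defaultdict


def group_equivalent_entries(entries):
    """Group entries sharing a UniGene identifier and/or an identical sequence.

    `entries` is a list of dicts with hypothetical keys:
    'id', 'unigene' (may be None) and 'sequence' (may be None).
    """
    parent = {e["id"]: e["id"] for e in entries}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    first_seen = {}
    for e in entries:
        for kind in ("unigene", "sequence"):
            value = e.get(kind)
            if not value:
                continue  # identifier or sequence unknown for this entry
            key = (kind, value.upper())  # assumption: matching ignores case
            if key in first_seen:
                union(first_seen[key], e["id"])
            else:
                first_seen[key] = e["id"]

    groups = defaultdict(list)
    for e in entries:
        groups[find(e["id"])].append(e["id"])
    return sorted(sorted(g) for g in groups.values())
```

Because the two criteria are merged transitively, an entry with a partial sequence can be linked to a full-length entry through its UniGene identifier, and that entry can in turn be linked by sequence identity to a third entry whose UniGene identifier is unknown.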
Initially, the protein sequences for LPD array features/genes were obtained by querying the NCBI Efetch web service with the corresponding GenBank identifiers. To check whether these protein sequences already existed in BioMap, and were hence already classified in the appropriate BioMap sequence clusters, their MD5 digests were matched against BioMap protein identifiers, which are likewise based on MD5 digests of the corresponding sequences. Where no match was found, the
BioMap protocol for assigning new sequences to existing clusters was used. Finally, the updated Cluster Data table from BioMap containing mappings of all BioMap proteins (including LPD array protein sequences) to BioMap cluster numbers was mirrored in the LPD.
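The digest-based lookup described above can be sketched as follows. The sketch builds a request URL for the NCBI Efetch service (without sending it), computes an MD5 digest for each protein sequence and separates sequences already known to BioMap from those that still need cluster assignment. The helper names are hypothetical, and the normalisation applied before hashing (whitespace removal and upper-casing) is an assumption; the convention actually used by BioMap is not stated here.

```python
import hashlib
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_protein_url(genbank_id):
    """Build (but do not send) an Efetch request for a protein FASTA record."""
    params = {"db": "protein", "id": genbank_id,
              "rettype": "fasta", "retmode": "text"}
    return EFETCH_URL + "?" + urlencode(params)


def sequence_digest(sequence):
    """MD5 hex digest of a sequence (assumed normalisation: no whitespace, upper case)."""
    normalised = "".join(sequence.split()).upper()
    return hashlib.md5(normalised.encode("ascii")).hexdigest()


def split_by_biomap_membership(sequences, biomap_digests):
    """Separate sequences whose digest is already in BioMap from new ones.

    `sequences` maps a sequence identifier to its protein sequence;
    `biomap_digests` is a set of MD5 hex digests standing in for BioMap
    protein identifiers.
    """
    known, new = {}, {}
    for seq_id, seq in sequences.items():
        digest = sequence_digest(seq)
        (known if digest in biomap_digests else new)[seq_id] = digest
    return known, new
```

Sequences returned in `new` would then be passed to the BioMap protocol for assigning new sequences to existing clusters, which is not reproduced here.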
To assess the overall efficacy of the BioMap functional annotation of genes performed in this work, we compared the extent of functional coverage achieved for various Affymetrix arrays by BioMap, Ensembl and the Affymetrix array manufacturer. It is worth noting that with BioMap, functional information was inherited from related BioMap sequences at a sequence identity level greater than or equal to 40%, that is, from functionally characterised homologs in S40 clusters.
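As a rough illustration of how such a coverage figure could be computed, the sketch below inherits an annotation for each probeset from characterised homologs at or above an identity threshold and reports the percentage of probesets that end up annotated. The data layout (a list of hits with `identity` and `annotation` keys) and the choice of the most similar eligible homolog are assumptions for illustration; the text only states that annotations were inherited within S40 clusters.

```python
def inherit_annotation(hits, min_identity=40.0):
    """Return the annotation of the most similar characterised homolog, or None.

    `hits` is a list of dicts with hypothetical keys 'identity' (percent)
    and 'annotation' (None or empty when the homolog is uncharacterised).
    """
    eligible = [h for h in hits
                if h["annotation"] and h["identity"] >= min_identity]
    if not eligible:
        return None
    return max(eligible, key=lambda h: h["identity"])["annotation"]


def functional_coverage(probeset_hits, min_identity=40.0):
    """Percentage of probesets that receive an annotation at the threshold."""
    if not probeset_hits:
        return 0.0
    annotated = sum(
        1 for hits in probeset_hits.values()
        if inherit_annotation(hits, min_identity) is not None
    )
    return 100.0 * annotated / len(probeset_hits)
```

Running the same function with annotations from each source would give directly comparable percentages per array.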
The results are shown in Figure 3.4.2. Rather disappointingly, the BioMap-based annotation seems to be only slightly better than that of the array manufacturer. Moreover, the Ensembl annotation appears to be more comprehensive for certain arrays, mainly Rat230_2, RatU34B and RatU34C. The explanation is that these arrays feature a high percentage of EST sequences, meaning that their probesets were mostly derived from short EST sequences instead of full-length genes (Fig 3.4.2). This is rather problematic with the BioMap annotation framework
as EST sequences are usually of unknown gene origin and it is hence difficult to obtain protein sequences for them that may be searched against BioMap protein sequences. By contrast, the annotation strategy used by Ensembl is based on nucleotide instead of protein sequence comparison, whereby probe sequences (including those derived from ESTs) may be mapped to genomic cDNA sequences from the appropriate organism according to well-defined rules.
[Chart: y-axis, % of annotated probesets (0 to 80); x-axis, arrays Moe430_2, Rat230_2, RatU34A, RatU34B, RatU34C.]
Array EST content: Moe430_2 47%; Rat230_2 82%; RatU34A 11%; RatU34B 91%; RatU34C 91%.
Figure 3.4.2. Percentage of functionally characterised probesets from various Affymetrix arrays by the different annotation approaches: BioMap, Ensembl and Affymetrix. Note that the
Figure 3.4.3 shows the extent of functional annotation of Affymetrix arrays by BioMap at varying homology levels. The analysis reveals that about 95% of functional assignments were derived from highly similar BioMap sequences with greater than 95% sequence identity, the majority of which were exact matches. This implies that annotations inferred from homologous sequences at lower levels of sequence identity were not substantial, presumably because the arrays annotated in this work represent the functionally well-characterised genomes of the mouse and rat. This seems to explain why the BioMap annotation pipeline did not perform better than the Ensembl and array manufacturer annotations (Fig 3.4.2), since BioMap relies on exploiting homology to derive functional attributes for genes. However, despite the marginal gain in function assignment, the mappings between individual Affymetrix genes and the BioMap protein families achieved in this work can be used to inherit various other forms of useful information, such as protein-protein interactions. Such data have largely been generated for yeast and are not directly available for the mouse and rat except through family inheritance.
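The breakdown by homology level can be sketched as a cumulative share: for each identity threshold, the percentage of annotated probesets whose annotation came from a sequence at or above that threshold. The thresholds below follow the legend of Figure 3.4.3; the cumulative reading of the figure and the input format (one best identity value per annotated probeset) are assumptions.

```python
def cumulative_annotation_share(best_identities, thresholds=(100, 95, 60, 35)):
    """Share of annotated probesets (in percent) at or above each threshold.

    `best_identities` holds, for each annotated probeset, the percent
    identity of the BioMap sequence its annotation was inherited from.
    """
    total = len(best_identities)
    if total == 0:
        return {t: 0.0 for t in thresholds}
    return {
        t: 100.0 * sum(1 for identity in best_identities if identity >= t) / total
        for t in thresholds
    }
```

A large value at the 100% and 95% thresholds, with little further gain at 60% and 35%, corresponds to the observation that most assignments came from exact or near-exact matches.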
[Chart: y-axis, % of annotated probesets (0% to 100%); x-axis, arrays Moe430_2, Rat230_2, RatU34A, RatU34B, RatU34C; legend, 100% ID, 95% ID, 60% ID, 35% ID.]
Figure 3.4.3. Number of annotated probesets at any given sequence similarity threshold, expressed as a percentage of the total number of annotated probesets per array. Note that ID