As discussed in section 1.2.6.1, proteomic data can be used in several aspects of the genome annotation process, such as validating predicted gene models, detecting novel genes and validating alternative splicing variants [213, 215, 218, 220, 221, 344]. In this study, three examples were presented to demonstrate the potential use of proteomic data in indicating missing exons, alternative frame shifts and alternative exon positioning.

The proteomic data acquired in this study can be used to assist the development of new genome annotation pipelines. Firstly, protein expression data can serve as a valuable training set to improve integrative gene prediction programs such as GLEAN [282] and TwinScan [280]. This application is particularly important for the annotation of microbial genomes such as that of T. gondii, for which few homologues have been characterized in comparison with the human genome. By analysing the composition and statistical properties of the expressed peptides, programs can be tuned to predict novel genes whose homologues have not previously been identified in other organisms.

Secondly, by searching the raw MS data against the latest genome annotation in the pipeline, peptide expression data can be used directly to validate the accuracy of the predictions. This information can then be fed back into the automated pipeline to generate an improved version of the genome annotation, which can in turn be validated against the raw MS data. By repeating this cycle several times, the accuracy of the genome annotation can be improved rapidly. This automated pipeline will also significantly speed up the current proteomic research workflow, in which each successive upgrade of the genome annotation requires the raw data to be re-submitted in a slow, manual fashion, as highlighted in the previous section. The pipeline will also resolve the peptide mapping problem on ToxoDB. For example, the raw MS data used to identify the 220 TgEST sequences and 184 ORF sequences that could not be mapped on ToxoDB would be preserved and searched against the new annotations. Likewise, the peptide identifications from the 163 alternative gene models that are no longer available for querying on ToxoDB could be entered directly into the automated pipeline.
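The bookkeeping behind this re-mapping step can be illustrated with a minimal sketch. It is a simplification under stated assumptions: a real pipeline would re-run a search engine against the raw spectra, whereas here previously identified peptide sequences are simply re-checked by exact substring matching against each annotation release. The function names and the toy sequences are hypothetical and are not taken from ToxoDB.

```python
from typing import Dict, List


def map_peptides(peptides: List[str], proteins: Dict[str, str]) -> Dict[str, List[str]]:
    """For each peptide, list the IDs of the protein sequences that contain it."""
    return {
        pep: [pid for pid, seq in proteins.items() if pep in seq]
        for pep in peptides
    }


def unmapped(peptides: List[str], proteins: Dict[str, str]) -> List[str]:
    """Peptides that match no protein in the given annotation release."""
    hits = map_peptides(peptides, proteins)
    return [pep for pep, ids in hits.items() if not ids]


# Toy data (hypothetical): two releases of an annotation and three peptides.
release_a = {"geneA": "MKTAYIAKQR", "geneB": "MSLLTEVETPIR"}
release_b = {"geneA.2": "MKTAYIAKQRGGSLLTEVETPIR"}  # merged, revised gene model
peptides = ["TAYIAK", "EVETPIR", "QRGGSLL"]

orphans_a = unmapped(peptides, release_a)  # peptides lost under release A
orphans_b = unmapped(peptides, release_b)  # re-checked against release B
```

Because the peptide list is retained independently of any one release, a peptide that is an orphan under one annotation can be recovered automatically when a later gene model contains it.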

To initiate the cycle efficiently, large-scale sampling of peptide identifications across the genome is required. This enables the maximum number of peptide features to be picked up by gene prediction programs and preserved in subsequent annotation cycles. Currently, the best approach to achieving the greatest coverage of peptide identifications in a genome is to search the MS data against all ORFs longer than 50 amino acids from the whole genome sequence database.

Theoretically, the collection of all ORFs covers the entire potential protein coding sequence (CDS) in the genome. However, the current program for marking ORFs processes each region of sequence only once and defines an ORF as running from the first start codon it encounters to the next stop codon [112, 345]. This approach is particularly efficient for marking ORFs in organisms without introns, such as prokaryotes. However, it has several limitations in covering the entire potential coding sequence of a typical eukaryotic gene, which contains multiple exons. Firstly, the algorithm marks every ORF from the first start codon to the first stop codon, regardless of how many other start codons lie within the ORF. This prevents the identification of a second exon that starts within the original ORF. Secondly, in a gene that contains multiple exons, a frame shift between exons means that the coding sequence cannot be contained within a single ORF. In particular, this prevents the identification of intron-spanning peptides, which also contain critical information for genome annotation.
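The ORF marking behaviour described above can be sketched as follows. This is an illustrative reimplementation, not the program cited in [112, 345]: it assumes the standard genetic code, ATG as the only start codon, a six-frame scan, and a hypothetical `min_aa` parameter for the length threshold.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    # Unknown codons (e.g. containing N) are translated as X.
    return "".join(CODON_TABLE.get(seq[j:j + 3], "X") for j in range(0, len(seq) - 2, 3))


def find_orfs(dna, min_aa=50):
    """Mark ORFs from the first ATG in a frame to the next in-frame stop codon.

    Start codons inside an open ORF are ignored, which is the first
    limitation discussed in the text. Coordinates refer to the scanned strand.
    """
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    protein = translate(seq[start:i])
                    if len(protein) > min_aa:
                        orfs.append((strand, frame, start, i + 3, protein))
                    start = None
    return orfs


# Toy sequence with a second, in-frame ATG inside the ORF (threshold lowered for the demo).
toy = "ATG" + "GCT" * 3 + "ATG" + "GCT" * 2 + "TAA"
toy_orfs = find_orfs(toy, min_aa=0)
```

On the toy sequence only one ORF is reported, and the internal methionine appears within its translation rather than as the start of a separate entry. An exon beginning at that second start codon would therefore have no database sequence of its own.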

In fact, the 163 alternative gene models that could not be mapped onto the release 4 gene models and ORFs partly reflect the second limitation of the current ORF marking algorithm: no single ORF could host 100% of the peptides identified to an alternative gene model. A new algorithmic approach to designing ORF databases for MS data searching is under development in the Wastling group in collaboration with Dr. Andy Jones, University of Liverpool.

The new ORF database cannot simply be a collection of direct translations of the genomic sequence from every start codon, as this would produce a giant sequence database that would require tremendous computing power for MS data searching and would increase the false discovery rate. One possibility is to harness the latest development in transcriptomics, RNA-Seq. Using the high-throughput sequencing approach offered by RNA-Seq, a genome-scale transcription map can be obtained rapidly [229]. This information can then be used as a reference map for selecting gene-coding ORFs, thereby reducing the size of the ORF database for MS data searching. An ongoing collaborative project between the Wastling group and Dr. Arnab Pain at the Wellcome Trust Sanger Institute, Cambridge, is developing a method for integrating proteomic and transcriptomic data into the genome annotation process for T. gondii and Neospora caninum.
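One way such a transcription map might be used to reduce the ORF database is sketched below. This is not the method under development in the collaboration, which the text does not describe. It assumes that RNA-Seq coverage has already been summarized as non-overlapping transcribed intervals on one sequence, it ignores strand, and the `min_fraction` threshold is a hypothetical parameter.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open genomic interval [start, end)


def overlap(a: Interval, b: Interval) -> int:
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def select_transcribed_orfs(
    orfs: List[Interval],
    transcribed: List[Interval],
    min_fraction: float = 0.5,
) -> List[Interval]:
    """Keep ORFs with at least min_fraction of their length in transcribed regions.

    Assumes the transcribed regions do not overlap one another, so that
    summing the per-region overlaps does not count any base twice.
    """
    kept = []
    for orf in orfs:
        covered = sum(overlap(orf, region) for region in transcribed)
        if covered / (orf[1] - orf[0]) >= min_fraction:
            kept.append(orf)
    return kept


# Toy coordinates (hypothetical): three ORFs and two transcribed regions.
candidate_orfs = [(0, 300), (500, 800), (1000, 1300)]
transcribed_regions = [(100, 400), (1000, 1200)]
selected = select_transcribed_orfs(candidate_orfs, transcribed_regions)
```

In this toy case the ORF with no transcriptional support is dropped from the search database, while the two partially covered ORFs are retained because two thirds of their length is transcribed.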

4.4.3 Conclusion

In this chapter, the proteomic data acquired in this study have been placed in a broader context. The raw MS data have been stored in the publicly accessible Tranche network. The expression data have been integrated at the peptide level with other genomic resources on ToxoDB. Prompted by the issues raised during peptide mapping, and by the examination of the accuracy of the release 4 genome annotation using these peptide identifications, the incompleteness of the release 4 genome annotation was highlighted and the potential application of a new genome annotation pipeline was discussed.

In a conventional bottom-up proteomic project based on protein identification, peptide identification relies on the predicted gene models. Successive upgrades of the genome annotation mean that the proteomic researcher may have to re-submit the data, which is a time-consuming process. With the new genome annotation pipeline discussed in this chapter, this manual re-submission can be automated and proteomic expression data can be used directly to improve the genome annotation. Together with the information contained within the transcriptome, a near “perfect” genome annotation can be expected in the near future.

In addition to the application of proteomic data in the field of proteogenomics, the integration of proteomic data into ToxoDB also allows an important comparison to be made with the transcriptomic data. This comparison can shed light on important biological processes such as protein degradation and post-transcriptional regulation. This subject is investigated and discussed in chapter 5.