The objective of this dissertation is to develop and integrate tools that enhance the entire spectrum of proteomics analysis by mass spectrometry, from detection of raw data to interpretation of the biological story. These tools alleviate computational bottlenecks at each step of analysis by providing statistically sound software that delivers biologically relevant and meaningful output without distorting information, losing data, or adding artifacts. The intentionally modular design of this toolbox allows each tool to perform its individual function in response to specific bioinformatic queries, and it also sets the framework for all of the tools to interact in a seamless, holistic manner for a more comprehensive understanding of the biological questions under investigation. Viewed through a computational biologist’s lens, the vast array of biological studies implementing mass spectrometry for proteomic analysis reveals a number of interrelated but functionally distinct informatics processes. The development of a tool for each process provides a mechanism for answering biological inquiries ranging from focused, hypothesis-driven questions, such as whether the ratio of the structural cellulase protein CipA increases in condition 1 compared to condition 2, to more global, discovery-based investigations, such as the identification of a core group of proteins expressed across a collection of plant tissues.
Answering these questions requires several points of engagement between informatics and an analytical understanding of the underlying biochemistry of the system under observation. Meaningful information can be derived from analytical data by linking together the answers to more focused, logistical questions. This study focuses on the following aspects of proteomics experiments: spectrum-to-peptide matching (Chapter 3), peptide-to-protein mapping (Chapter 4), and protein quantification and differential expression (Chapter 5). The interaction and usability of these analyses are also described (Chapter 6).
While it is important for informatics tools to be able to handle large datasets, it is becoming increasingly crucial for tools to also handle the biological complexity associated with more intricate experimental designs. Although some existing tools can scale computationally and maintain biological relevance, more often new tools must be developed to appropriately address these concerns. The overwhelming volume and complexity of these experiments require that new and existing tools not only be optimized for speed and interpretation, but also communicate seamlessly with each other in an integrated workflow. By constructing a workflow that allows high-throughput processing of massive datasets, data collected within the past decade can be standardized and updated with the most recent analyses. Once these analyses are complete, meta-analyses can identify global analytical and biological trends.
Technological and informatics improvements are continuously expanding the scope and complexity of biological investigations. As such, defining how a question is answered is becoming just as important as determining what both the question and answer should look like. In fact, clearly identifying appropriate analytical and informatics methods is half of the work in solving these biological problems. Method optimization, versatility, and specialization become ends worthy of research in themselves. Although collections of measurements are motivated by biological inquiries and ultimately exist to reveal biological significance, the data points that act as intermediary, empirical evidence of interactions between genotypic and phenotypic information could arguably be considered more “real” and reproducible than their initial biological drivers and final interpreted conclusions. However, data cannot be useful until it is contextualized as information and interpreted as knowledge. In these processes, the truth or value of the data may be altered by misinterpretations of newly annotated data, such as drawing causal instead of correlative conclusions, or by tendencies to over-fit, normalize, or filter results in an effort to arrive at preconceived outcomes. Therefore, in order to continue the iterative feedback loop of inspiring and answering biological questions, it is becoming ever more important to ensure that the informatics methods that validate, analyze, and interpret collected data preserve and reflect the integrity of the analytical measurements.
CHAPTER 3: Spectrum to Peptide Matching
Data presented in Section 3.1 has been adapted from the following journal article ready for submission to the Journal of Proteome Research:
Rachel M. Adams, Richard J. Giannone, Paul Abraham, Robert L. Hettich. “Protease-Optimized Spectral Indexing Enhances Protein Identification and Quantification in Shotgun Proteomics Datasets.” Sample preparation and experiments were performed by Richard J. Giannone. Data analysis was performed by Rachel M. Adams.
Data presented in Section 3.2 has been adapted from the following journal article:
Paul Abraham*, Rachel M. Adams*, Richard J. Giannone, Robert L. Hettich. “Defining the Boundaries and Characterizing the Landscape of Genome Expression in Vascular Tissues of Populus using Shotgun Proteomics.” * Authors contributed equally to this work. Sample preparation and mass spectrometry experiments were performed by Paul Abraham. The bioinformatic workflow for evaluating sequence redundancy was developed by Paul Abraham, Rachel Adams, and Richard Giannone and was implemented by Rachel Adams. The supplemental database for single nucleotide polymorphism detection was created by Rachel Adams. The quality of spectra was evaluated using software written by Brian Erickson. Biological data analysis was performed by Paul Abraham.
Data presented in Section 3.3 has been adapted from the following journal article:
Paul Abraham, Rachel Adams, Gerald Tuskan, Robert Hettich. “Moving Away from the Reference Genome: Evaluating Single Amino Acid Polymorphism Identifications from a Peptide Sequencing Tagging Approach for the Genus Populus.” Journal of Proteome Research (In review). Sample preparation, mass spectrometry experiments, and manuscript preparation were led by Paul Abraham. In-house scripts for matching ion intensity information and evaluating the site-determining ions of modified amino acids were developed by Rachel Adams.