Shotgun proteomic experiments provide qualitative and quantitative analytical information from biological samples ranging in complexity from simple bacterial isolates to higher eukaryotes such as plants and humans, and even to communities of microbial organisms. Improvements to instrument performance, sample preparation, and informatic tools are increasing the scope and volume of data that can be analyzed by mass spectrometry (MS). To accommodate these advances, it is becoming increasingly essential to choose or create tools that not only scale well but also make more informed decisions using additional features within the data. Incorporating novel and existing tools into a scalable, modular workflow not only provides more accurate, contextualized perspectives of processed data but also generates detailed, standardized outputs that can be used for future studies dedicated to mining general analytical or biological trends.
Tightly coupled to the advancements in sample preparation and instrument technology is an increasing demand for software improvements that make sense of and report the collected data in a meaningful way. Each new parameter that can be adjusted in the experimental protocol or in the tuning of the instrument enlarges the space of settings from which an optimized combination for deep, accurate measurements can be chosen. Therefore, data is collected on instrument statistics (voltage, % salt, DE window, etc.) as well as spectral-level statistics (elution time, precursor ion selection, spectral counts, measured MS1 peak intensity, calculated peak area, etc.). Once the data has been collected, interpreting it requires algorithms to perform peptide-to-spectrum matching (PSM scores/likelihoods) and protein-to-peptide matching (FDRs). The existing algorithms that provide scores and suggest assignments have a mixture of competing and complementary benefits and disadvantages, so one may well want to compare the results of multiple software algorithms in order to reach the most comprehensive understanding of the components collected in the biological sample.
Several software programs use index-based information retrieval so that not every piece of data has to be moved with each analysis. For example, protein assembly software generally does not maintain information about individual ion series distributions for each PSM; the data is usually linked in some way so that the user can explore to his or her desired level of detail, or the data is represented by an aggregate measure. However, with a centralized repository of raw input data as well as processed results, it is much more straightforward to extract cross-referenced information, transform or filter it, and share it with others. Some of the most beneficial steps taken by other research groups along the way include standardizing the input and output formats of their informatics software. Making data results portable not only increases the speed and efficiency with which a new tool can be evaluated and adopted into an existing workflow, but also helps standardize vocabularies, establish quality control, and move the community closer to diagnostic and deterministic assessments of datasets’ behaviors. Therefore, in our implementation of a bioinformatic workflow, TORPEDO (Tools and Omnibus of Resources for Proteomic Experimental Datasets Online), we have endeavored to accept and generate the common standardized formats.
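The comparison of assignments from multiple search algorithms described above can be sketched as follows. This is only an illustrative sketch, not the TORPEDO implementation: the function name `compare_psms`, the engine labels, and the simplified input shape (one peptide sequence per spectrum ID per engine) are all assumptions made for the example.

```python
from collections import defaultdict


def compare_psms(results_by_engine):
    """Cross-reference peptide assignments from several search engines.

    results_by_engine maps a (hypothetical) engine name to a dict of
    {spectrum_id: peptide_sequence}. Returns, per spectrum, the individual
    calls, how many engines reported it, and whether they all agree.
    """
    by_spectrum = defaultdict(dict)
    for engine, assignments in results_by_engine.items():
        for spectrum_id, peptide in assignments.items():
            by_spectrum[spectrum_id][engine] = peptide

    summary = {}
    for spectrum_id, calls in by_spectrum.items():
        distinct_peptides = set(calls.values())
        summary[spectrum_id] = {
            "calls": calls,
            "engines": len(calls),
            # Agreement requires at least two engines reporting the same peptide.
            "agree": len(calls) > 1 and len(distinct_peptides) == 1,
        }
    return summary
```

Keying every result on a shared spectrum identifier is what makes the cross-referencing cheap; it presumes that all tools read from and write to the same standardized formats, which is the motivation given above.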
With so many tasks to accomplish, and with multiple users working at the same time, such a workflow requires adequate hardware infrastructure. Successfully integrating this workflow within the existing computing architectures was not possible. The distinct computational resources currently available require multiple data transfers between users, adaptation of analysis scripts to accommodate different operating systems, numerous transformations of the data into various input and output file formats, and non-linear documentation of the analyses performed on the data. Therefore, we proposed developing cyber-infrastructure that would allow a user to seamlessly run multiple analyses, store the results, and share processed data with other users.
Specifically, we built a web-based front-end to facilitate data exploration. Users must sign in with an account to run analyses, but they can choose either to make their results public or to keep them private. In addition, users can upload data for one-time analyses, or they can create a persistent project with longer-term data storage and invite other users to view their results. In short, this project offers an easy-to-use interface for running multiple proteomics analysis tools, backed by sizable computing resources, and a platform for sharing data with other researchers for enriched collaborations.
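The access rules described above (sign-in required to run analyses, public or private results, invited viewers on persistent projects) could be modeled roughly as below. This is a hedged sketch only: the `Project` class, its field names, and the rule that only the owner may run analyses are assumptions for illustration, not a description of the actual TORPEDO data model.

```python
from dataclasses import dataclass, field


@dataclass
class Project:
    """Hypothetical record for one uploaded analysis or project."""
    owner: str
    public: bool = False        # results visible to everyone when True
    persistent: bool = False    # False models a one-time analysis
    invited: set = field(default_factory=set)

    def invite(self, user):
        """Grant another account permission to view the results."""
        self.invited.add(user)

    def can_view(self, user=None):
        """user is None for an anonymous (not signed-in) visitor."""
        if self.public:
            return True
        return user is not None and (user == self.owner or user in self.invited)

    def can_run(self, user=None):
        # Assumption: running analyses needs a signed-in account,
        # and here only the project owner is allowed to do so.
        return user is not None and user == self.owner
```

For example, a private project owned by `"alice"` is hidden from `"bob"` until `invite("bob")` is called, and even then `"bob"` can view but not run analyses.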
The alternative hardware solutions we explored involved complicated communication between two existing resources: a host computer and a compute cluster. For this setup to work, the host computer handled user interactions and constantly polled the remote compute cluster, both to check for completed jobs and to retrieve their results. Checks had to be made at both the user end and the processing end to ensure all parameters were in place, even though the software resided only on the compute cluster. Security requirements posed a further significant hurdle, since the computing cluster had to be protected from attacks originating through our website.
Simplifying this architecture into one machine minimized redundant validation steps, mitigated communication errors, allowed real-time job status updates, and simplified the overall design so that programs could be developed as truly modular components. Because the computer system handles both hosting and computing, computing tasks can later be distributed onto other machines as the infrastructure expands. Thus, the single computer will not become obsolete in the future; instead, it can easily be repurposed as a login node that facilitates load balancing and job execution on compute nodes.
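To illustrate why a single machine makes real-time status updates simple, the sketch below runs jobs in a local worker pool and reads their state directly, with no remote polling. It is a minimal sketch under stated assumptions: `LocalJobRunner` and its method names are hypothetical, and Python's standard `concurrent.futures` thread pool stands in for whatever scheduler is actually used.

```python
from concurrent.futures import ThreadPoolExecutor


class LocalJobRunner:
    """Hypothetical in-process job runner for a combined host/compute machine."""

    def __init__(self, max_workers=4):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs = {}
        self._next_id = 0

    def submit(self, func, *args):
        """Queue an analysis function and return its job ID."""
        job_id = self._next_id
        self._next_id += 1
        self._jobs[job_id] = self._pool.submit(func, *args)
        return job_id

    def status(self, job_id):
        """Report the job state from the local future, without network calls."""
        future = self._jobs[job_id]
        if future.running():
            return "running"
        if not future.done():
            return "queued"
        return "failed" if future.exception() else "finished"

    def result(self, job_id, timeout=None):
        """Block until the job finishes and return (or re-raise) its outcome."""
        return self._jobs[job_id].result(timeout=timeout)

    def shutdown(self):
        """Wait for outstanding jobs and release the worker threads."""
        self._pool.shutdown(wait=True)
```

Because callers only see `submit`, `status`, and `result`, the same interface could later be backed by a dispatcher that forwards jobs to separate compute nodes, in line with the login-node role described above.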
This proposed solution is not just for large labs on the cutting edge of large-scale, data-centric experiments; it also provides a gateway for small labs doing one-off experiments that do not necessitate a dedicated informatics solution. This solution fits well with the NSF-recognized need for national cyber-infrastructure for research and provides a starting framework for future projects to expand capabilities. Specifically, we will leverage this project to apply for the Annual Research Cluster Grant from Silicon Mechanics. By providing centralized, pre-built options for analysis, we engage an audience that otherwise may not participate in data-centric biological experiments and provide a functional education in best practices for experimental design and data analysis.
To ensure the usability, performance, and integrity of the data in these analyses, it is critical to have efficient ways to store, access, and interpret information. These needs translate into tangible computational specifications. For example, current state-of-the-art mass spectrometry instruments are generating twice as many spectra as their predecessors, which means algorithms that are optimized for multi-threading and MPI communication are becoming increasingly essential to efficiently deconvoluting spectra into protein identifications. In addition, filtering true protein identifications is far more effective when a user can dynamically score matches, but re-evaluating the large amount of multidimensional data is a highly memory-intensive, user-interactive process. Once the data is properly filtered, it is common to normalize datasets against technical and biological replicates and compare the results between biological conditions or experimental methods. Since each experiment can easily scale to 10-20 GB, having adequate data storage is especially important to obtaining proper perspectives on the analytical quality and biological significance of these proteomics experiments.
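The storage requirement implied by the 10-20 GB per-experiment figure can be estimated with simple arithmetic. The helper below is a rough sketch: the function name and the example count of 100 experiments are hypothetical, and decimal units (1 TB = 1000 GB) are assumed.

```python
def storage_range_tb(n_experiments, gb_low=10, gb_high=20):
    """Return (low, high) storage estimates in TB for n_experiments.

    Uses the 10-20 GB per-experiment range quoted in the text and
    assumes decimal units (1 TB = 1000 GB).
    """
    return n_experiments * gb_low / 1000, n_experiments * gb_high / 1000


# Hypothetical example: 100 experiments need roughly 1.0 to 2.0 TB.
low_tb, high_tb = storage_range_tb(100)
print(f"{low_tb} to {high_tb} TB")
```

The estimate covers only the per-experiment data volume; any additional copies, such as backups or reprocessed results, would add to it.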
While it is important for informatic tools to be able to handle large datasets, it is becoming increasingly crucial for tools to also handle the biological complexity associated with more intricate experimental designs. The overwhelming volume and complexity of these experiments require that new and existing tools not only be optimized for speed and interpretation but also communicate seamlessly with each other in an integrated workflow. By constructing a workflow that allows high-throughput processing of massive datasets, data collected within the past decade can be standardized and updated with the most recent analyses. Once these analyses are complete, meta-analyses can identify global analytical and biological trends.