As already illustrated in Chapter 4.1.2, components used in a scientific workflow may implement the same algorithm or function in different ways, and scientists want to evaluate which implementation performs best for a specific problem. This section shows results of optimizing workflow components by reusing the implemented parameter optimization plugin. To this end, each parameter of each application is encoded as a single parameter in the Genetic Algorithm (GA). Additionally, a flag is encoded as a GA parameter so that only one application is enabled for a specific set of parameters. In the following, the proteomics workflow (see Section 5.3.1) was further extended and optimized by component optimization.
Proteomics Workflow:
The use case proposed in Chapter 3.3 and optimized in Chapter 5.3.1 was extended to accommodate component optimization. The original workflow finished with a list of peptides that could be tentatively matched for identification. This list still includes false positives, and additional information can be incorporated to identify or remove them. One way to do this is to train a so-called retention time predictor. The retention time can be extracted from the tandem mass spectrometry process and characterizes the time a particular peptide needs to pass the chromatograph under certain conditions, that is, the time between injection and detection. This characteristic can be used to remove peptides that do not fit the predicted chromatographic behavior from the list of peptide matches [Palmblad2002].
The results of this investigation have been prepared and submitted in [Holl2013e].
Motivation: "There are a number of algorithms to predict peptide retention times, for instance one developed by Palmblad et al. [Palmblad2002] or two of which are already included in the RTCalc utility in TPP [Keller2005]: one based on SSRCalc [Krokhin2006] and one based on the artificial neural network (ANN) method by Petritis and co-workers [Petritis2006]. To demonstrate how a scientific workflow can select an optimal path for proteomics data analysis, a workflow to balance the quality (FDR or minimum PeptideProphet probability) of the training set and the prediction model was designed. It should find the model that can best predict the retention times of peptides within the same dataset and a specific quality measure. The rationale is that the simpler retention time predictors have fewer free parameters and will be more robustly fit by smaller training sets than the potentially better but more complex models, requiring much more training data." (Modified from [Holl2013e].)
Workflow: The workflow is shown as a stand-alone workflow in Figure 5.12. During one execution, only one path of the workflow is executed, depending on the flag, so no unnecessary runs are created. The probabilities parameter takes as input a list of peptide probabilities from PeptideProphet, one for each identified peptide.
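The flag-controlled switch can be illustrated with a minimal sketch. The function names, the mapping of flag values to sub-workflows, and the stand-in sub-workflow bodies are assumptions made for illustration only; the actual workflow is built from workflow components, not from Python functions.

```python
from typing import Callable, Dict, List, Tuple

# A peptide is represented here as (sequence, PeptideProphet probability).
Peptide = Tuple[str, float]


# Stand-ins for the three sub-workflows; each only reports its name and the
# size of the training set it received (hypothetical placeholders).
def run_rtcalc(training: List[Peptide]) -> Tuple[str, int]:
    return "RTCalc", len(training)


def run_rt(training: List[Peptide]) -> Tuple[str, int]:
    return "rt", len(training)


def run_ann(training: List[Peptide]) -> Tuple[str, int]:
    return "ANN", len(training)


# Assumed mapping; the text does not state which flag value selects which path.
SUB_WORKFLOWS: Dict[int, Callable[[List[Peptide]], Tuple[str, int]]] = {
    1: run_rtcalc,
    2: run_rt,
    3: run_ann,
}


def run_switch_workflow(
    peptides: List[Peptide], min_probability: float, flag: int
) -> Tuple[str, int]:
    """Filter peptides by probability and execute exactly one sub-workflow."""
    if flag not in SUB_WORKFLOWS:
        raise ValueError(f"flag must be one of {sorted(SUB_WORKFLOWS)}")
    training = [p for p in peptides if p[1] >= min_probability]
    # Only the selected path runs; the other two are never invoked.
    return SUB_WORKFLOWS[flag](training)
```

Because the dispatch happens before any predictor is called, a single evaluation costs one sub-workflow run rather than three.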
Figure 5.12: The workflow to perform retention time prediction with i) RTCalc (right-hand sub-workflow), ii) rt (middle sub-workflow), or iii) ANN (left-hand sub-workflow). The workflow is available at myExperiment: http://www.myexperiment.org/workflows/3691.html.
One sub-workflow executes the retention time predictor developed by Palmblad et al. [Palmblad2002] (named rt). The other two sub-workflows apply algorithms from the TPP toolbox, namely an ANN-based algorithm (named ANN) and the standard RTCalc algorithm (named RTCalc). "As RTCalc has its own hard-coded internal quality checks that generate an error message and abort rather than produce a poor or overfitted model, these checks were disabled in the RTCalc source code to level the playing field and allow the optimization framework to independently find the right combination of parameters and algorithm.
Fitness: The quality of the retention time prediction was evaluated as a root mean square deviation (RMSD) for 10% of the peptides held back as a validation set, using the remaining 90% of the peptides to train the model. These 10% were then chosen at random 10 times so that each peptide was used exactly once for validation." (Modified from [Holl2013e].) The RMSD was minimized during the optimization.
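The fitness evaluation described above corresponds to a 10-fold cross-validation. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the mean-value model is only a stand-in for a real retention time predictor, and the squared errors are pooled over all folds before taking the root, which the source does not specify.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

# A sample is (peptide features, observed retention time).
Sample = Tuple[object, float]
Predictor = Callable[[object], float]


def cross_validated_rmsd(
    samples: Sequence[Sample],
    train: Callable[[List[Sample]], Predictor],
    k: int = 10,
    seed: int = 0,
) -> float:
    """Pooled RMSD over k folds; every sample is validated exactly once."""
    if len(samples) < k:
        raise ValueError("need at least k samples")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Striding through the shuffled list yields k disjoint random folds.
    folds = [shuffled[i::k] for i in range(k)]
    squared_errors: List[float] = []
    for i, validation in enumerate(folds):
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        predict = train(training)
        squared_errors.extend((predict(x) - rt) ** 2 for x, rt in validation)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def train_mean_model(training: List[Sample]) -> Predictor:
    """Stand-in model (assumption): always predicts the mean training time."""
    mean_rt = sum(rt for _, rt in training) / len(training)
    return lambda features: mean_rt
```

A GA would call `cross_validated_rmsd` once per candidate and prefer candidates with a lower return value.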
Data input: The optimized output file of PeptideProphet produced in Chapter 5.3.1 was used as input for this workflow.
Optimization:
Optimized parameters: min_probability, flag
Fixed parameters: peptides, probabilities, numbers
User constraints for parameters: min_probability ∈ [0.5, 1] (double), flag ∈ {1, 2, 3} (integer)
Used data sets: hybrid E. coli
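The user constraints listed above define a small mixed search space. As a rough sketch, a candidate solution could be represented and kept within bounds as follows. The class, the function names, and the mutation operator (Gaussian step with clamping, occasional flag resampling) are assumptions for illustration and are not taken from the optimization plugin.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    min_probability: float  # double constrained to [0.5, 1]
    flag: int               # integer constrained to {1, 2, 3}


def is_valid(candidate: Candidate) -> bool:
    """Check the user constraints from the parameter listing."""
    return 0.5 <= candidate.min_probability <= 1.0 and candidate.flag in (1, 2, 3)


def random_candidate(rng: random.Random) -> Candidate:
    """Draw a candidate uniformly from the constrained search space."""
    return Candidate(rng.uniform(0.5, 1.0), rng.randint(1, 3))


def mutate(candidate: Candidate, rng: random.Random, sigma: float = 0.05) -> Candidate:
    """Hypothetical mutation: perturb the probability and clamp it to the
    bounds; resample the flag with a probability of 0.2."""
    stepped = candidate.min_probability + rng.gauss(0.0, sigma)
    probability = min(1.0, max(0.5, stepped))
    flag = rng.randint(1, 3) if rng.random() < 0.2 else candidate.flag
    return Candidate(probability, flag)
```

Clamping keeps every mutated candidate feasible, so no evaluation is spent on parameter sets that violate the constraints.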
[Plot: RMSD (vertical axis, 5300 to 6400) against minimum probability (horizontal axis, 0.5 to 1) for the ANN, RT, and RTCalc models.]
Figure 5.13: RMSD values of the optimization of i) RTCalc, ii) rt, and iii) ANN.
Scientific Use: The example use case shows that component optimization is useful when comparing the performance of different applications. Because the data formats remained the same, the design of the example switch workflow was simplified.
"As expected, the number of peptides in the training sets used here were not sufficient to produce an accurate model using the artificial neural network algorithm. For more than 70-80 peptides in the training set, the RTCalc coefficient model seemed to perform best. For 52-70 peptides, the rt performed better, and for 21-52 peptides in the training set, only rt produced a model at all. No model was returned when having 21 peptides or fewer in the training set. In absence of quality checks, the minimum number of peptides required to produce a model is solely determined by the number of free parameters (terms) in the model. Figure 5.13 shows the retention time prediction accuracy as a function of model and minimum probability/training set size." [Holl2013e]
5.4.2 Discussion
The last section illustrated a real-world use case for the optimization of workflow components. The results showed that the different applications achieve different performance measures depending on the probability value. This finding led to new insights into the data and the algorithms themselves [Holl2013e].
Because there is no ontology describing life science applications, their functions, and their input and output types, a fully automated optimization of workflow components is currently not possible. Instead, the developed parameter optimization plugin was used to find the optimal component. For this purpose, a workflow was manually created to switch between the three different sub-workflows. Each sub-workflow was encoded by one flag value, which switches the sub-workflow on or off (only one path is executed). This simulated set-up of component optimization would be challenging in practice because building a switchable workflow is difficult. However, a dedicated plugin targeting component optimization could use the same mechanism and create the switching workflow automatically.
During the optimization of components, parameter optimization has to be taken into account as well. Thus, component optimization does not notably differ from parameter optimization, except that the number of input parameters is much larger. One approach would be to use one gene on the chromosome as a flag that indicates the selected application. During the execution of one component, only the block of genes encoding that component would be considered for the specific run. However, the chromosome size would grow, which would require a larger population size, which would in turn increase the execution time. A rigorous pre-validator would be needed to predict reasonable exploration areas in the search space and discard misguided solutions. Such pre-validators will be discussed in more detail in Chapter 6.
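The chromosome layout suggested above, with one flag gene followed by one block of genes per application, can be sketched as follows. The block layout and all parameter names (param_a to param_f) are hypothetical placeholders and do not correspond to real parameters of RTCalc, rt, or the ANN predictor.

```python
from typing import Dict, List, Sequence, Tuple

# (application name, parameter names of its gene block); names are invented.
Block = Tuple[str, List[str]]

BLOCKS: List[Block] = [
    ("RTCalc", ["param_a", "param_b"]),
    ("rt", ["param_c"]),
    ("ANN", ["param_d", "param_e", "param_f"]),
]


def chromosome_length(blocks: Sequence[Block] = BLOCKS) -> int:
    """One flag gene plus one gene per parameter of every application."""
    return 1 + sum(len(names) for _, names in blocks)


def decode(
    chromosome: Sequence[float], blocks: Sequence[Block] = BLOCKS
) -> Tuple[str, Dict[str, float]]:
    """Return the selected application and only the genes of its block."""
    if len(chromosome) != chromosome_length(blocks):
        raise ValueError("chromosome length does not match the block layout")
    flag = int(chromosome[0])
    if not 1 <= flag <= len(blocks):
        raise ValueError("flag gene is out of range")
    offset = 1
    for index, (application, names) in enumerate(blocks, start=1):
        genes = chromosome[offset:offset + len(names)]
        if index == flag:
            # Genes of the inactive blocks are ignored for this run.
            return application, dict(zip(names, genes))
        offset += len(names)
    raise AssertionError("unreachable: flag was validated above")
```

The sketch also makes the cost visible: every additional application adds its whole parameter block to `chromosome_length`, although only one block influences the fitness of a given individual.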