CHAPTER 5. CONCLUSIONS
5.2 RECOMMENDATIONS
Parallelizing bsapply(), or apply()-like functions more generally, is simple and yields good performance improvements in most cases. A speedup approximately proportional to the number of available nodes is achievable. However, if a lot of data has to be distributed and the communication time is large relative to the computation time, the overall computation time does not improve. NetWorkSpaces and the parallel Sleigh environment, which use a central server to store all data, could be a workable basis for parallel implementations for next-generation sequence data, but this has not yet been tested.
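The trade-off between computation and communication described above can be made concrete with a deliberately simplified cost model. The sketch below is an assumption for illustration, not a model fitted to the measurements in this chapter: it supposes that the computation divides evenly over the nodes and that the master needs a fixed total time to ship the data, independent of the number of nodes. The function and parameter names are hypothetical.

```python
def parallel_time(t_compute: float, t_comm: float, nodes: int) -> float:
    """Estimated wall time under the simplified model.

    t_compute: sequential computation time
    t_comm:    total time the master spends distributing data and
               collecting results (assumed independent of `nodes`)
    """
    if nodes < 1:
        raise ValueError("nodes must be >= 1")
    return t_compute / nodes + t_comm


def speedup(t_compute: float, t_comm: float, nodes: int) -> float:
    """Sequential time divided by the estimated parallel time."""
    return t_compute / parallel_time(t_compute, t_comm, nodes)


# Three regimes for 10 nodes and 100 s of sequential computation:
no_comm = speedup(100.0, 0.0, 10)      # communication is free
equal_comm = speedup(100.0, 10.0, 10)  # communication equals per-node compute
heavy_comm = speedup(100.0, 100.0, 10) # communication equals total compute

print(no_comm, equal_comm, heavy_comm)
```

Under these assumptions, free communication gives a speedup equal to the number of nodes, communication equal to the per-node computation time halves it, and communication as expensive as the whole computation makes the parallel run slower than the sequential one.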
For the parallelization of the bsapply() function, the communication costs are high, and only a speedup proportional to half the number of available nodes is realistic. Several aspects make the parallel bsapply() function difficult to use and require the user's attention:
We do not know which kind of FUN functions will be executed or how much data will be sent to the nodes or back to the master.
[Figure 6.4 (a): Sequence length of chromosomes in the human BSgenome object. x-axis: Chromosomes; y-axis: Sequence length.]
[Figure 6.4 (b): Computation time for counting alphabetic frequencies. x-axis: Runs; y-axis: Computation time in sec. Series: bsapply parallel with equal distribution; bsapply parallel with optimized distribution.]
Figure 6.4: Sequence length of chromosomes in the human genome and computation time for counting alphabetic frequencies in the full human genome with improved, load-balanced data distribution (black) and equal distribution (red).
The smaller the data to send and the longer the computation time, the better the performance.
The genome library with the Biostrings object has to be available on all nodes, but not loaded into the R session.
Even simple examples can have slow parallel computation times.
Therefore, only users familiar with parallel programming standards can achieve good improvements. The parallel implementation was not added to the BSgenome package.
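The difference between an equal and a load-balanced distribution of chromosomes, as compared in Figure 6.4, can be illustrated with a small sketch. The balancing strategy shown is the greedy longest-processing-time heuristic, which is one common choice and not necessarily the one used for the measurements here. The "equal" strategy is assumed to mean contiguous chunks with the same number of chromosomes per node. The lengths are made-up illustrative values, not real chromosome lengths.

```python
import heapq


def equal_distribution(lengths, n_nodes):
    """Contiguous chunks with (nearly) the same item count per node."""
    chunk = -(-len(lengths) // n_nodes)  # ceiling division
    return [list(range(s, min(s + chunk, len(lengths))))
            for s in range(0, len(lengths), chunk)]


def balanced_distribution(lengths, n_nodes):
    """Greedy LPT: give the longest remaining item to the least-loaded node."""
    heap = [(0, node) for node in range(n_nodes)]  # (current load, node id)
    heapq.heapify(heap)
    bins = [[] for _ in range(n_nodes)]
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for idx in order:
        load, node = heapq.heappop(heap)
        bins[node].append(idx)
        heapq.heappush(heap, (load + lengths[idx], node))
    return bins


def max_load(lengths, bins):
    """Load of the busiest node; it determines the overall run time."""
    return max(sum(lengths[i] for i in b) for b in bins)


# Illustrative sequence lengths (arbitrary units), sorted like chromosomes
# that become shorter with increasing index.
demo_lengths = [250, 240, 200, 190, 180, 170, 160, 150, 60, 50, 40, 30]

equal_bins = equal_distribution(demo_lengths, 4)
balanced_bins = balanced_distribution(demo_lengths, 4)
print(max_load(demo_lengths, equal_bins), max_load(demo_lengths, balanced_bins))
```

With these example values the busiest node carries 690 units under the equal distribution and 440 units under the balanced one, against an ideal of 430 units per node.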
Useful Applications for the Parallel bsapply() Function
As demonstrated in this chapter, communication costs in particular limit the advantage of parallel computing. Because of their low communication costs, the following functions are well suited for use with the parallel bsapply() function: alphabetFrequency(), consensusString(), matchPattern(), countPattern(), . . . . Unsuitable for parallel calculations are the functions countPDict(), matchPDict(), pairwiseAlignment(), consensusMatrix(), . . . . As described, creating the objects at the workers and reducing the output object reduces the communication costs and improves the parallel performance.
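The effect of creating objects at the workers and reducing the output can be estimated by comparing the number of bytes that would cross the network under each strategy. The following sketch only measures serialized sizes and runs no actual parallel job. The synthetic library, the chromosome name, and all function names are hypothetical stand-ins, and pickle is used merely as a convenient proxy for the wire format.

```python
import pickle
from collections import Counter

# Hypothetical stand-in for a genome library installed on every worker:
# maps a chromosome name to the length of a synthetic sequence.
LOCAL_LIBRARY = {"chrDemo": 100_000}


def load_local(name: str) -> str:
    """Build the sequence locally instead of receiving it over the network."""
    length = LOCAL_LIBRARY[name]
    return ("ACGT" * (length // 4 + 1))[:length]


def payload_bytes(obj) -> int:
    """Serialized size of an object, used as a proxy for network traffic."""
    return len(pickle.dumps(obj))


def traffic_by_value(name: str) -> int:
    """Master ships the whole sequence; worker returns an unreduced result."""
    sequence = load_local(name)
    positions = [i for i, base in enumerate(sequence) if base == "A"]
    return payload_bytes(sequence) + payload_bytes(positions)


def traffic_by_name(name: str) -> int:
    """Master ships only the name; worker returns a small frequency table."""
    frequencies = Counter(load_local(name))
    return payload_bytes(name) + payload_bytes(frequencies)


print(traffic_by_value("chrDemo"), traffic_by_name("chrDemo"))
```

Sending only the name and returning a frequency table keeps the traffic at a few dozen bytes, whereas shipping the sequence and returning every match position scales with the sequence length.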
6.4 Summary
The huge amount of data in particular limits parallelization for next-generation sequence data. In existing parallel implementations, all data have to be available at all processors. Because of the amount of data, it is not possible to load all data at the master and distribute them over the network. Specifically, the raw data have to be accessible through the file system (e.g., a Samba or NFS share), which limits deployment on general computer clusters or grid environments. The srapply() function in the ShortRead package demonstrates a working parallel solution; in contrast, the bsapply() function in the BSgenome package is a negative example for parallel computing with next-generation sequence data.
New protocols and the intrinsic curiosity of biologists are expanding the range of questions being addressed and creating a concomitant need for flexible software analysis tools. The increasing affordability of high-throughput sequencing technologies means that multi-sample studies with non-trivial experimental designs are just around the corner. Therefore, innovative new computational tools have to be developed to manage the amount of data and to avoid long computation times. As demonstrated, existing parallel computing techniques show promising outcomes, but they also have obvious limitations.
Chapter 7
Large Cancer Study
Publicly available data sets were collected from public microarray databases, preprocessed, and analyzed together. Data from more than 60 experiments and eight different cancer entities were used to demonstrate the power of parallel computing with the affyPara package, to discuss the difficulties of data management, and to analyze correlation between genes. Furthermore, the demand for new meta-analyses and the unused potential of the existing, publicly available data sets are outlined.
This is one of the first projects to analyze several publicly available data sets together. Therefore, this chapter focuses on the technical details and problems of performing microarray analyses with very large amounts of data. Detailed biological interpretations of the results are not yet available. In the future, single experiments with more than 2000 arrays are expected to become available. First ideas exist for genome-wide association studies, prognosis studies, or randomized studies to evaluate biological signatures. For these projects, well-functioning data management, processing, and analysis tools have to be available. This chapter presents a proof of principle for workable data management and data processing.
Section 7.1 describes the biological idea and some biological background. Section 7.2 contains a critical comment on publicly available databases and data quality, as well as details about the selected data and the data management. Section 7.3 discusses basic steps of the analysis process and problems occurring with standard analysis procedures. The chapter ends with several figures and tables of results and a short biological interpretation.