

Implementing this pipeline would require four main bioinformatics tools: a quality control tool; an aligner; a transcript constructor; and an abundance estimator. In addition, ribosomal RNA reads would have to be removed, either before or after alignment. Multiple tools have been published to accomplish each of these tasks, and so it was necessary to compare those available and choose the most appropriate for this project.

3.1.1.1 Quality Control

It is common practice to check raw reads for various quality metrics, such as Phred score, adapter contamination, and GC content. (The Phred score measures the likelihood that a particular base in a read is correct; higher scores mean the base is more likely to be correct.) If poor-quality or contaminated reads are found, adapters and bases with low Phred scores can be trimmed to produce shorter reads with higher quality scores. This can improve the results of subsequent analysis steps.
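To make the Phred scale concrete: a score Q corresponds to an error probability of 10^(-Q/10), so Q = 20 means a 1 in 100 chance that the base call is wrong. The sketch below decodes a FASTQ quality string and converts each score to an error probability. It assumes the Phred+33 ASCII encoding; the function names and the example quality string are illustrative only.

```python
def decode_quality_string(qual: str, offset: int = 33) -> list[int]:
    """Decode an ASCII-encoded FASTQ quality string into Phred scores.

    Assumption: Phred+33 encoding (offset=33). Older data sets may use
    a different offset.
    """
    return [ord(ch) - offset for ch in qual]


def phred_to_error_prob(q: int) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    # Hypothetical quality string for a five-base read.
    for q in decode_quality_string("II5+!"):
        print(f"Q={q:2d}  P(error)={phred_to_error_prob(q):.4f}")
```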

The choice of quality control tool does not directly affect the results of RNA-seq analysis, but it is important that the tool provides a comprehensive set of checks with easy-to-interpret results. I therefore chose FastQC [209] to check read quality. FastQC is a popular tool that performs a wide range of quality checks and outputs the results as an easy-to-read HTML file.

Based on the results of FastQC, it appeared that some samples might benefit from read trimming. Many tools are available, each employing a different algorithm, which impacts both results and run time. There was no evidence of adapter contamination from FastQC, so I focused on choosing a tool to trim reads based on Phred score. I ran preliminary tests on cutadapt [210], ERNE-FILTER [211], and Trimmomatic [212] to check their impact on FastQC output and alignment score. I found that trimming reads appeared to have little impact on either of these outcomes. In addition, the literature on this subject suggested that trimming should be used with caution in RNA-seq analyses, lest useful information be lost [213]. Many recent aligners automatically take quality score into account during alignment, trimming reads if and when necessary to achieve a better alignment. I therefore decided not to carry out an explicit read trimming step, and instead to choose an aligner that would handle this internally.
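For illustration, quality-based trimming can be sketched as a sliding-window scan that cuts the read where the mean quality first drops below a threshold. This is a generic sketch of the idea, not the exact algorithm of cutadapt, ERNE-FILTER, or Trimmomatic; the function name, window size, and threshold are assumptions chosen for the example.

```python
def sliding_window_trim(seq, quals, window=4, min_mean_q=20):
    """Trim a read at the first window whose mean Phred score is too low.

    seq   -- read sequence (string)
    quals -- per-base Phred scores (list of ints, same length as seq)
    Returns the retained (sequence, qualities) prefix. Reads shorter
    than the window are returned unchanged.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    for start in range(len(quals) - window + 1):
        win = quals[start:start + window]
        if sum(win) / window < min_mean_q:
            # Discard everything from the start of the failing window.
            return seq[:start], quals[:start]
    return seq, quals


if __name__ == "__main__":
    # Hypothetical read whose quality decays towards the 3' end.
    seq = "ACGTACGTAC"
    quals = [38, 38, 37, 36, 35, 30, 12, 10, 8, 5]
    print(sliding_window_trim(seq, quals))
```

The example also shows the cost of trimming: half of the read is discarded, including one base with a score of 30. This is the kind of information loss that motivates caution in RNA-seq analyses.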

3.1.1.2 Alignment and rRNA Removal

There are many alignment tools available for RNA-seq reads, and so I relied on published comparison and benchmarking studies to narrow the field to a smaller number of choices. I then made a final decision based on suitability for this project, ease of use, and speed. The comparison published by Engström et al. [214] showed that STAR [215], GSNAP [216], and RUM [217] all produce high-quality results in comparison to other popular aligners. The paper that presented STAR demonstrated similar results [215]. A more recent comparison by Sahraeian et al. has demonstrated that HISAT2 [218] may perform better than STAR [219], although the HISAT2 documentation states that it is not suitable when reads mapping to many loci need to be retained, as is the case in this project.

I carried out informal testing with STAR, GSNAP, and RUM, and investigated their capabilities. STAR stood out from the others in terms of ease of use and documentation quality. In addition, it automatically trims reads based on alignment quality. Most importantly, a maximum number of alignments per read can be set, and each alignment is preserved in the results. Finally, STAR runs significantly faster than any other published aligner. I therefore decided to use STAR for the alignment step.
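Because each alignment of a multimapping read is kept as its own record, multimappers can be identified downstream by counting records per read name. The sketch below does this for SAM-formatted text lines. It assumes single-end reads (paired-end data would need mates to be counted together) and uses made-up example records; the function name is hypothetical.

```python
from collections import Counter


def alignments_per_read(sam_lines):
    """Count mapped alignment records per read name in SAM text.

    Assumes single-end reads, so each record for a read name is one
    distinct alignment of that read.
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):       # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4:                 # SAM flag bit 0x4: read unmapped
            continue
        counts[fields[0]] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical records: read2 aligns to two loci, read3 is unmapped.
    sam = [
        "@HD\tVN:1.4",
        "read1\t0\tchr1\t100\t255\t50M\t*\t0\t0\t*\t*",
        "read2\t0\tchr1\t200\t3\t50M\t*\t0\t0\t*\t*",
        "read2\t256\tchr2\t300\t3\t50M\t*\t0\t0\t*\t*",
        "read3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    ]
    counts = alignments_per_read(sam)
    print(sorted(name for name, n in counts.items() if n > 1))
```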

I decided to use the RSeQC software package [220] to remove reads mapping to known rRNA regions, based on its fast runtime, ease of use, and good documentation.
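Conceptually, this step discards any alignment that overlaps an annotated rRNA interval. The sketch below shows that idea with a simple linear scan over half-open intervals. It is not RSeQC's implementation; the data layout, function names, and coordinates are assumptions made for the example.

```python
def overlaps_any(chrom, start, end, regions):
    """True if the half-open interval [start, end) on chrom overlaps
    any interval listed for that chromosome in regions."""
    return any(s < end and start < e for s, e in regions.get(chrom, []))


def remove_rrna(alignments, rrna_regions):
    """Keep only alignments that do not overlap a known rRNA region.

    alignments   -- iterable of (read_name, chrom, start, end) tuples
    rrna_regions -- dict mapping chrom to a list of (start, end) tuples
    """
    return [a for a in alignments
            if not overlaps_any(a[1], a[2], a[3], rrna_regions)]


if __name__ == "__main__":
    # Hypothetical rRNA annotation and alignments.
    rrna = {"chr1": [(1000, 2000)]}
    alignments = [
        ("r1", "chr1", 900, 950),    # upstream of the rRNA region
        ("r2", "chr1", 1990, 2040),  # overlaps the rRNA region
        ("r3", "chr2", 1500, 1550),  # different chromosome
    ]
    print([a[0] for a in remove_rrna(alignments, rrna)])
```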

3.1.1.3 Transcript Reconstruction

There are significantly fewer transcript reconstruction tools available than there are aligners. The most popular available are Scripture [221], Cufflinks [222], and, more recently, StringTie [223] (the successor to Cufflinks). Scripture is poorly documented and maintained, and is difficult to use, unlike Cufflinks and StringTie.

StringTie was presented as an improved version of Cufflinks, and informal testing with both demonstrated that StringTie runs significantly faster and is easier to use. In addition, the more recent versions of StringTie handle reads from stranded RNA-seq protocols. A comparison between StringTie and other tools in the context of a complete pipeline demonstrates that StringTie does indeed produce better results [219]. I therefore decided to use StringTie as the transcript reconstruction tool.

3.1.1.4 Abundance Estimation

The key requirement for an abundance estimation tool in this project is that it handles multimapping reads correctly, as this is the stage where they can have the greatest impact. There are several tools that explicitly deal with multimapping reads. Popular choices in the bioinformatics community include RSEM [192] and kallisto [184], both of which perform well according to benchmarking studies [184, 192, 219, 224]. RSEM uses an expectation-maximisation algorithm to assign fractions of reads to different loci based on the number of uniquely mapped reads at each locus. Alternatively, kallisto is one of a recent group of abundance estimators that rely on a “pseudoalignment” rather than an explicit alignment. This method results in remarkable speed-ups without a drop in the quality of results. I decided to use kallisto, based on its ease of use, remarkable speed, and benchmarking results.
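The expectation-maximisation idea for multimapping reads can be illustrated with a toy example. The sketch below is a heavily simplified model, not RSEM's or kallisto's actual implementation: it ignores transcript length, fragment length, and sequencing error, and treats each read only as a set of compatible loci. The function name and the read counts are hypothetical.

```python
def em_abundance(read_compat, n_loci, n_iter=100):
    """Toy EM estimate of relative abundance per locus.

    read_compat -- list of tuples; each tuple holds the indices of the
                   loci that one read is compatible with
    Returns a list of relative abundances that sums to 1.
    """
    theta = [1.0 / n_loci] * n_loci      # start from uniform abundances
    if not read_compat:
        return theta
    for _ in range(n_iter):
        counts = [0.0] * n_loci
        # E-step: split each read across its compatible loci in
        # proportion to the current abundance estimates.
        for loci in read_compat:
            total = sum(theta[j] for j in loci)
            if total == 0:
                continue
            for j in loci:
                counts[j] += theta[j] / total
        # M-step: re-estimate abundances from the fractional counts.
        n = sum(counts)
        theta = [c / n for c in counts]
    return theta


if __name__ == "__main__":
    # Hypothetical data: 8 reads unique to locus 0, 2 unique to
    # locus 1, and 10 reads that map equally well to both.
    reads = [(0,)] * 8 + [(1,)] * 2 + [(0, 1)] * 10
    print([round(t, 3) for t in em_abundance(reads, n_loci=2)])
```

In this example the ten multimapping reads end up divided roughly 8:2 between the two loci, mirroring the ratio of uniquely mapped reads, rather than being split evenly or discarded.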