
The downstream analysis of the sequencing reads generated for each sample followed the pipeline shown in Figure 2.3, using tophat, cufflinks (with cuffmerge and cuffdiff) and cummeRbund (Trapnell et al., 2012). Tophat and cufflinks are free, open-source software tools for gene discovery and expression analysis of high-throughput RNA-seq data. Together they allow the identification of novel genes and splice variants, as well as the comparison of gene expression between disease and healthy states (Trapnell et al., 2012).


Figure 2.3: An overview of the tophat/cufflinks RNA-seq analysis protocol.

The RNA-seq data were aligned to the Human Reference Genome (hg19) with tophat 0.5.9-r16 (http://tophat.cbcb.umd.edu) with default options. Tophat is a fast splice junction mapper for RNA-seq reads: it aligns the reads to the genome using the ultra-high-throughput short read aligner Bowtie and then analyses the mapping results to identify splice sites between exons. The script written to run tophat is shown in Figure 2.4. Tophat generates an output file named “accepted_hits.bam”, which contains all the aligned reads and was used as the input file for cufflinks. After running tophat, the resulting alignment files were provided to cufflinks (http://cufflinks.cbcb.umd.edu), which assembled the individual transcripts from the aligned RNA-seq reads, estimated their abundances, and tested for differential expression. Cufflinks produces an assembled transcriptome fragment for each sample using the “accepted_hits.bam” file as the input. The script written to run cufflinks is shown in Figure 2.5. The resulting transcriptome fragments of each sample were then merged together with the reference transcriptome annotation into one file for further analysis, using cuffmerge. The script written to merge the cufflinks files is shown in Figure 2.6.

Tophat.26.sh
#!/bin/bash
# OGE parameters
#$ -q xe-el6
#$ -N RNAseqANGELA
#$ -e /no_backup/xe/ahobbs/alignment/0026/e26logs
#$ -o /no_backup/xe/ahobbs/alignment/0026/o26logs
#$ -m abe
#$ -M [email protected]
#$ -pe smp 4
#$ -l h_rt=20:00:00
#$ -l virtual_free=20G
# paths
PATH=/users/GD/tools/bowtie/bowtie2-2.1.0:$PATH
export PATH

/software/bi/el6.3/current/tophat/tophat2 \
  --output-dir /no_backup/xe/ahobbs/alignment/0026 \
  --num-threads 4 \
  --rg-id 0026 \
  --rg-library 0026 \
  --rg-sample 0026 \
  --rg-platform illumina \
  --transcriptome-index /no_backup/xe/ahobbs/alignment/index \
  /db/igenomes/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome \
  /no_backup/xe/ahobbs/samples/26_9966_GTCCGC_read1.fastq.gz \
  /no_backup/xe/ahobbs/samples/26_9966_GTCCGC_read2.fastq.gz

Figure 2.4: The script written to align the generated sequences to the reference human genome using Tophat.

#!/bin/bash
# OGE parameters
#$ -q xe-el6
#$ -N RNAseqANGELA
#$ -e /no_backup/xe/ahobbs/alignment/0009/e.cl9.logs
#$ -o /no_backup/xe/ahobbs/alignment/0009/o.cl9.logs
#$ -V
#$ -m abe
#$ -M angela.hobbs@crg.eu
#$ -t 1
#$ -pe smp 8
#$ -l h_rt=20:00:00
#$ -l virtual_free=40G

/users/GD/tools/cufflinks/cufflinks-2.2.1.Linux_x86_64/cufflinks \
  --output-dir /no_backup/xe/ahobbs/alignment/0009 \
  --num-threads 8 \
  --max-bundle-frags 100000000 \
  --GTF /db/igenomes/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf \
  --GTF-guide /db/igenomes/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf \
  /no_backup/xe/ahobbs/alignment/0009/accepted_hits.bam

Figure 2.5: The script written to run cufflinks.

Cuffmerge_blood.sh
#!/bin/bash
# OGE parameters
#$ -q xe-el6
#$ -N CuffMerge_Blood
#$ -e /no_backup/xe/ahobbs/DE/e.cm.logs
#$ -o /no_backup/xe/ahobbs/DE/o.cm.logs
#$ -V
#$ -m abe
#$ -M angela.hobbs@crg.eu
#$ -t 1
#$ -pe smp 8
#$ -l h_rt=30:00:00
#$ -l virtual_free=40G
source /users/xe/ahobbs/.bash_profile
source /users/xe/ahobbs/.bashrc

/users/GD/tools/cufflinks/cufflinks-2.2.1.Linux_x86_64/cuffmerge \
  -o /no_backup/xe/ahobbs/DE \
  --num-threads 8 \
  -g /db/igenomes/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf \
  -s /db/igenomes/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome.fa \
  /no_backup/xe/ahobbs/samples/assemblies_blood.txt

Figure 2.6: The Cuffmerge script written to merge all the resulting transcriptome fragments of each sample together with the reference transcriptome annotation.
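Cuffmerge takes as its final argument a plain text file that lists the path to each per-sample assembly, one per line (assemblies_blood.txt in the script above). The contents of that file are not reproduced in this section. The following is a minimal sketch of how such a list could be generated, assuming that each sample's cufflinks output directory contains the default transcripts.gtf file. The function name make_assembly_list and the sample IDs in the usage line are illustrative assumptions, not part of the original pipeline.

```shell
# Hypothetical helper: print one transcripts.gtf path per sample ID.
# $1 is the alignment root directory; the remaining arguments are sample IDs.
make_assembly_list() {
  root=$1
  shift
  for id in "$@"; do
    printf '%s/%s/transcripts.gtf\n' "$root" "$id"
  done
}

# Illustrative usage; these two IDs are examples, not the full sample set.
make_assembly_list /no_backup/xe/ahobbs/alignment 0009 0026
```

Redirecting the output to a file would give a list in the form that cuffmerge expects.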

The merged file was then quantified by cuffdiff, a separate program included in the cufflinks package. Cuffdiff calculated differential gene expression, i.e. the difference in expression between our case and control groups, and tested the statistical significance of each observed change. The results were given in a set of tabular files. A change in expression was considered significant when its p-value remained below the false discovery rate (FDR) threshold after Benjamini-Hochberg correction for multiple testing (Mutryn et al., 2015). The output file generated by cuffdiff was saved in Excel format for analysis.
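Cuffdiff performs the multiple-testing correction internally, so no separate script was needed for it. To illustrate what the Benjamini-Hochberg procedure does, the sketch below adjusts a short list of p-values: each p-value is multiplied by the number of tests and divided by its rank, and the adjusted values are then forced to be non-decreasing with rank. This is an illustrative sketch and not cuffdiff's own code. The function name bh_adjust and the four p-values are made up for the example.

```shell
# Hypothetical helper: Benjamini-Hochberg adjustment.
# Reads one p-value per line on stdin and prints the adjusted
# p-values (q-values) in the original input order.
bh_adjust() {
  awk '{ print NR "\t" $1 }' |
    sort -t "$(printf '\t')" -k2,2g |
    awk -F '\t' '
      { idx[NR] = $1; p[NR] = $2 }      # rank NR holds input line idx[NR]
      END {
        m = NR
        prev = 1                        # adjusted values are capped at 1
        for (i = m; i >= 1; i--) {      # walk from the largest p-value down
          q = p[i] * m / i
          if (q > prev) q = prev        # enforce monotonicity
          prev = q
          adj[idx[i]] = q
        }
        for (j = 1; j <= m; j++) printf "%.4f\n", adj[j]
      }'
}

# Made-up p-values, for illustration only.
printf '0.01\n0.04\n0.03\n0.005\n' | bh_adjust
```

A gene would then be called significant when its adjusted value falls below the chosen FDR threshold.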

Cuffdiff.placenta.final.sh
#!/bin/bash
# OGE parameters
#$ -q xe-el6
#$ -N cuffdiff_placenta
#$ -e /no_backup/xe/ahobbs/cuffdiffplacentafinal/e.CDplacentafinal.logs
#$ -o /no_backup/xe/ahobbs/cuffdiffplacentafinal/o.CDplacentafinal.logs
#$ -V
#$ -m abe
#$ -M angela.hobbs@crg.eu
#$ -t 1
#$ -pe smp 8
#$ -l h_rt=72:00:00
#$ -l virtual_free=60G
aligndir=/no_backup/xe/ahobbs/alignment

/users/GD/tools/cufflinks/cufflinks-2.2.1.Linux_x86_64/cuffdiff \
  -o /no_backup/xe/ahobbs/cuffdiffplacentafinal \
  -p 8 \
  -L Controls,Cases \
  --library-type fr-firststrand \
  /db/igenomes/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf \
  $aligndir/0002/accepted_hits.bam,$aligndir/1090/accepted_hits.bam,$aligndir/0017/accepted_hits.bam,$aligndir/0018/accepted_hits.bam,$aligndir/0002/accepted_hits.bam,$aligndir/1090/accepted_hits.bam \
  $aligndir/0006/accepted_hits.bam,$aligndir/0007/accepted_hits.bam,$aligndir/0013/accepted_hits.bam,$aligndir/0006/accepted_hits.bam,$aligndir/0007/accepted_hits.bam,$aligndir/0013/accepted_hits.bam

Figure 2.7: The Cuffdiff script. This script was written to extract differential gene expression sequences from the blood and placenta dataset.

The number of RNA-seq reads generated from a transcript is directly proportional to the relative abundance of that transcript in the sample. However, because cDNA fragments are generally size-selected as part of library construction, longer transcripts produce more sequencing fragments than shorter transcripts. To determine the correct expression level of each transcript, cufflinks must therefore count the reads that map to each transcript and then normalize this count by the transcript's length. The commonly used fragments per kilobase of transcript per million mapped fragments (FPKM, known as RPKM in single-end sequencing experiments) was used to normalize expression levels for different genes and transcripts (Trapnell et al., 2012). Figure 2.7 shows the script written to run the cuffdiff command to extract differentially expressed gene sequences.
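The arithmetic behind FPKM can be sketched as follows: the fragment count is divided by the transcript length in kilobases and by the total number of mapped fragments in millions. This is a simplified illustration only. Cufflinks estimates abundances with a statistical model rather than a direct count, so its reported values will not in general equal this naive calculation. The function name fpkm and the numbers in the usage lines are made up for the example.

```shell
# Hypothetical helper: naive FPKM from raw numbers.
# Usage: fpkm FRAGMENT_COUNT TRANSCRIPT_LENGTH_BP TOTAL_MAPPED_FRAGMENTS
fpkm() {
  awk -v count="$1" -v len="$2" -v total="$3" \
    'BEGIN { printf "%.2f\n", (count * 1e9) / (len * total) }'
}

# Made-up numbers: a 2 kb transcript with 500 fragments and a 4 kb
# transcript with 1000 fragments, both in a library of 10 million
# mapped fragments, receive the same FPKM of 25.00.
fpkm 500 2000 10000000
fpkm 1000 4000 10000000
```

The second transcript collects twice as many fragments only because it is twice as long, and the length normalization removes that difference.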

CummeRbund (http://compbio.mit.edu/cummeRbund) is a powerful plotting tool that was used to create commonly used expression plots such as volcano, scatter and box plots. CummeRbund transforms cufflinks output files into R objects suitable for analysis with a wide variety of other packages available within the R environment. The cuffdiff output was used as the input for cummeRbund.
