Computational analysis and characterization of alternative splicing and its impact on transcriptional diversity


Departamento de Biología Molecular


Alberto Gatto

Madrid, 2016


Departamento de Biología Molecular Facultad de Ciencias

Universidad Autónoma de Madrid


Alberto Gatto, M.Sc.

Enrique Lara-Pezzi, Ph.D.

Fátima Sánchez-Cabo, Ph.D.

Fundación Centro Nacional de

Investigaciones Cardiovasculares


Summary


In all eukaryotes, pre-mRNA splicing is a requirement for protein synthesis as most coding sequences are interspersed with non-coding ones that need to be removed from the primary transcript. The basic steps of this process are stereotyped and essentially universal but, as nucleotide sequences can vary greatly and in an unpredictable fashion, a complex mechanism of regulation ensures the efficient and flexible processing of myriads of different substrates. By virtue of this flexibility, the same pre-mRNA can be spliced in different ways and alternative isoforms are thereby generated from a single genomic template. This alternative splicing is essentially a by-product of the process, but can profoundly affect the transcriptional and protein-coding potential of eukaryotic genes. Its global impact remains however elusive and the functional relevance of alternatively spliced transcripts is a controversial and open-ended question.

The advent of RNA sequencing enabled the investigation of splicing patterns at a genome-wide scale. Best analysis practices are nonetheless difficult to establish, as the alignment of short sequence fragments is a complex computational task. This task is made even more challenging by splicing itself, as different parts of these fragments originate from separate genomic locations and are harder to map unambiguously. In order to tackle this issue, simulated sequencing data was generated to evaluate systematically the effectiveness of different alignment algorithms in terms of splice site mapping, detection and quantification. As none of these algorithms provided an integrated solution to the three problems, a novel computational strategy was devised to achieve a better trade-off in performance. The tool that achieved best mapping and quantification accuracy was coupled to a newly implemented method for efficient splice site detection. This pipeline was tested in both simulated and real data, proving to achieve consistently better results.

Following implementation and testing, the established strategy was used to analyse comparative transcriptomics data in different tissues. Tissue-dependent splicing signatures were identified and one-to-one orthologs were found to share similar patterns in primates and mouse. Though rare, instances of tissue-specific splicing were predominantly observed in genes expressed at constitutively high levels and with more strongly constrained, CpG-rich promoters. Despite being ubiquitously expressed, these genes exhibited enhanced mRNA levels in the specific tissue where alternative splicing occurs. This was further associated with an increase in protein content whenever the splicing change was found to be common to both human and mouse. Modulation of tissue-specific abundances in proteins encoded by constitutively highly expressed genes provides a rationale for conservation of alternative splicing patterns.


Resumen


In all eukaryotic organisms, pre-mRNA splicing is a requirement for protein synthesis, since most coding sequences are separated by non-coding ones that need to be removed from the primary transcript.

The basic steps of this process are essentially universal but, as nucleotide sequences can vary enormously and unpredictably, a complex regulatory mechanism ensures the efficient and flexible processing of thousands of different substrates. Through this flexibility, the same pre-mRNA can be processed in different ways, generating alternative isoforms from a single genomic template. This alternative splicing is essentially a by-product of splicing itself, but it can profoundly affect the transcriptional and coding potential of eukaryotic genes. However, its global impact is not yet fully understood, and the functional relevance of alternative transcripts is controversial and remains an open question.

The development of massive RNA sequencing potentially allows the exhaustive study of all isoforms expressed in a sample. However, it is difficult to establish an optimal analysis procedure, since the alignment of short sequences is a computationally complex task. Splicing adds complexity to this challenge because different parts of a read can map to distant regions of the genome around the so-called splice site, making them harder to map accurately. To study systematically the alignment capability of the available algorithms, the main methods were tested on simulated data. The algorithms were compared not only on their ability to align around splice sites but also on their ability to quantify correctly, given that quantification, and not only detection, is one of the goals of RNA-Seq. As none of the tested algorithms offered an integrated solution to the three problems, a new computational strategy was designed to improve performance. The tool with the highest accuracy in read mapping and quantification was identified and coupled to a new method for the efficient detection of splice sites. This integrated method was tested on both simulated and real data and was shown to achieve consistently better results than any of the algorithms on their own.

Following its implementation and testing, the new strategy was used to analyse transcriptomics data from different tissues and species, with the aim of studying the contribution of splicing to the tissue specificity of the transcriptome across different species. Tissue-specific splicing patterns were identified, and orthologous genes were found to share similar splicing patterns in primates and mouse.

Although infrequent, tissue-specific splicing changes were preferentially observed in genes that are constitutively highly expressed. These genes tend to have more strongly conserved promoters with more CpG islands. Despite being ubiquitously expressed, these genes showed higher mRNA levels in the specific tissue in which alternative splicing occurs. This increase in mRNA was also associated with an increase in protein abundance for those splicing events that were common to mouse and human. The tissue-specific modulation of protein abundance for ubiquitously expressed genes provides a sound rationale for the conservation of splicing patterns.


Index of contents

Summary

Resumen

Abbreviations
    Formats
    Symbols

Introduction
    pre-mRNA splicing
    Splicing catalysis and regulation
    Genome-wide splicing analysis
    Computational methodologies
    Challenges and perspectives

Objectives

Materials and methods
    RNA-Seq simulation
    Alignment benchmarking
    The FineSplice pipeline
    Comparison with experimental data
    Comparative transcriptomics data
    Tissue-specific splicing analysis
    Characterization of splicing signatures

Results
    Simulation and alignment benchmarking
    Enhanced detection and quantification with FineSplice
    Detection performance in experimental data
    Comparative analysis of tissue-specific splicing
    Characterization at the gene expression level
    Functional implications of tissue-specific splicing

Discussion

Conclusions

Conclusiones

Bibliography

Annex I Supplementary

Annex II Publications

Abbreviations

BWT Burrows–Wheeler transform

cDNA complementary DNA

CLIP-Seq cross-linking and immunoprecipitation coupled to sequencing

FM-index Ferragina-Manzini index

HPC High Performance Computing

miRNA-Seq microRNA sequencing

polyA(+) RNA polyadenylated RNA

ppm parts per million (protein abundance unit)

qPCR quantitative real-time PCR

RBP RNA-binding protein

RNA-Seq RNA sequencing

rRNA ribosomal RNA

snRNP small nuclear ribonucleoproteins

TMM trimmed mean of M-values normalization

Formats

GFF gene/transcript annotation format

GTF gene/transcript annotation format

SAM text-based sequence alignment format

BAM binary compressed sequence alignment format

FASTQ text-based sequence format


Symbols

8M 8 million

20M 20 million

40M 40 million

SE single-end library

PE paired-end library

Sn sensitivity

PPV positive predictive value or precision

Ψ splicing rate

ΔΨ difference in splicing rates

μ per-tissue mean gene/protein expression

μ mean gene/protein expression over all tissues

τ tissue-specificity index


Introduction


pre-mRNA splicing

Most eukaryotic proteins are encoded by separate genomic elements interrupted by non-coding sequences that need to be excised from the precursor messenger RNA (pre-mRNA) to yield an open reading frame for translation. Splicing is the coordinated process of cleavage and ligation reactions whereby parts of the primary transcript (introns) are displaced from the nascent pre-mRNA and the flanking regions (exons) are joined into a continuous sequence that defines the final composition of the mature mRNA. In genes where the reading frame is interrupted by non-coding sequences, splicing is critical to ensure translation of a functional protein. The corresponding coding exons need to be ligated in an orderly fashion, and intervening non-coding sequences must be recognized as introns and spliced out of the mRNA precursor. Splicing fidelity is therefore required for protein synthesis, as primary transcripts of an intron-containing gene must be, to some extent, spliced in a stereotyped manner to ensure production of the encoded protein. On the other hand, variations in the splicing pattern (alternative splicing) allow for flexibility in terms of the number of mature isoforms that can be generated from a single DNA template through removal of different combinations of introns.
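As a toy illustration of how removing different combinations of introns yields different mature isoforms from one template, the following minimal sketch splices a made-up pre-mRNA string. The sequences, the `splice` helper and the coordinate convention (0-based, end-exclusive) are assumptions chosen for the example and do not come from any real gene or tool.

```python
from typing import List, Tuple


def splice(pre_mrna: str, introns: List[Tuple[int, int]]) -> str:
    """Remove introns given as 0-based, end-exclusive (start, end) intervals."""
    mature = []
    position = 0
    for start, end in sorted(introns):
        if start < position:
            raise ValueError("introns must not overlap")
        mature.append(pre_mrna[position:start])  # keep the exon before this intron
        position = end                           # skip over the intron
    mature.append(pre_mrna[position:])           # keep the last exon
    return "".join(mature)


# Hypothetical pre-mRNA: exons in upper case, introns in lower case (GU...AG).
EXON1, INTRON1, EXON2, INTRON2, EXON3 = (
    "AUGGCC", "guaaguuuuuucag", "GAUCCA", "guacguccccuuag", "UUUUAA",
)
PRE_MRNA = EXON1 + INTRON1 + EXON2 + INTRON2 + EXON3

# Intron coordinates derived from the piece lengths.
i1_start = len(EXON1)
i1_end = i1_start + len(INTRON1)
i2_start = i1_end + len(EXON2)
i2_end = i2_start + len(INTRON2)

# Constitutive pattern: both introns are removed, all three exons are joined.
constitutive = splice(PRE_MRNA, [(i1_start, i1_end), (i2_start, i2_end)])

# Exon skipping: the donor of intron 1 is paired with the acceptor of intron 2,
# so the middle exon is excised together with both introns.
exon_skipping = splice(PRE_MRNA, [(i1_start, i2_end)])

print(constitutive)
print(exon_skipping)
```

The two calls differ only in which donor and acceptor sites are paired, which is the sense in which one template can give rise to several isoforms.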

Intron cleavage occurs at conserved elements called splice sites (Lee and Rio, 2015; Wongpalee and Sharma, 2014), almost invariably defined by a GU dinucleotide at the 5' end (donor site) and an AG dinucleotide, preceded by a pyrimidine-rich sequence, at the 3' end (acceptor site). Donor and acceptor sites demarcate the exon-intron boundaries, or junctions, and constitute the core splicing signals together with the branch site, a more loosely conserved element located upstream of the polypyrimidine tract that comprises a conserved adenine nucleotide. GU-AG dinucleotides at the intron termini are virtually invariant, accounting for ~99% of naturally-occurring splice sites in eukaryotic genomes (Burset et al., 2000; Sheth et al., 2006). Non-canonical sites represent a rather exceptional case, commonly processed through unconventional pathways (Parada et al., 2014; Patel and Steitz, 2003; Sibley et al., 2016; Turunen et al., 2013).
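The terminal dinucleotide rule described above can be checked mechanically. The sketch below is only an illustration: the function names and the 10-nt window used to summarise the pyrimidine-rich region are assumptions, and real splice site models score much longer sequence contexts.

```python
def classify_intron(intron: str) -> str:
    """Label an intron by its terminal dinucleotides (GU...AG is canonical)."""
    seq = intron.upper().replace("T", "U")  # accept DNA or RNA alphabets
    if len(seq) < 4:
        raise ValueError("intron too short to carry both splice sites")
    donor, acceptor = seq[:2], seq[-2:]
    return "canonical" if (donor, acceptor) == ("GU", "AG") else "non-canonical"


def pyrimidine_fraction(intron: str, window: int = 10) -> float:
    """Fraction of C/U in the `window` nucleotides preceding the terminal AG.

    The window size is an arbitrary choice made for this example.
    """
    seq = intron.upper().replace("T", "U")
    if len(seq) < 4:
        raise ValueError("intron too short to carry both splice sites")
    region = seq[max(0, len(seq) - 2 - window):len(seq) - 2]
    return sum(base in "CU" for base in region) / len(region)


print(classify_intron("guaaguuuuuucag"))
print(classify_intron("AUAUCCUUAC"))
print(pyrimidine_fraction("guaaguuuuuucag"))
```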

The core signals are highly conserved in yeast, and likely sufficient to unambiguously define exon-intron boundaries in multi-exon pre-mRNAs (Irimia and Roy, 2008; Roy and Irimia, 2014). Auxiliary cis-regulatory elements are instead required for accurate splice site recognition in multicellular eukaryotes, as the equivalent sequences are more degenerate (Ast, 2004; Burge et al., 1999; Coelho and Smith, 2014). Nevertheless, the basic splicing mechanism is essentially the same: neighboring exons are ligated following displacement of the intervening intron in two isoenergetic transesterification steps. In the first step, the cleavage of the phosphodiester bond at the 5' exon-intron junction follows the formation of an intermediate loop structure, the intron lariat, resulting from the nucleophilic attack of the 2' hydroxyl group of the branch-site adenosine on the phosphate of the 5' guanine at the donor site. Exon ligation proceeds by a second transesterification where the 3' hydroxyl group of the detached 5' exon makes a nucleophilic attack at the acceptor site, thereby releasing the intron lariat (Brow, 2002; Will and Lührmann, 2011). The process is illustrated in Figure 1.

Figure 1. The three core splicing signals comprise the donor and acceptor sites at the 5' and 3' ends of an intron (respectively defined by a GU and AG dinucleotide) and the branch site (that comprises an A nucleotide). The splicing reaction occurs in two transesterification steps. In the first step, an intron lariat is formed by the nucleophilic attack of the branch site adenosine to the guanine phosphate at the donor site. In the second step, the two exons are joined and the lariat released by the nucleophilic attack of the detached 5' exon end at the acceptor site.

In eukaryotic cells, these reactions are orchestrated in a highly coordinated, and yet flexible, manner in order to ensure accurate processing of coding mRNA precursors as well as the regulated production of alternative isoforms. Splicing is catalyzed by a large macromolecular machinery, the spliceosome, that is dynamically assembled on the pre-mRNA substrate through the step-wise recruitment of several trans-acting factors on the core splicing signals and neighboring cis-regulatory elements (Chiou and Lynch, 2014; Sperling et al., 2008; Staley and Woolford, 2009).


Splicing catalysis and regulation

The spliceosome is a multi-megadalton complex, highly variable in composition, that consists of 5 core ribonucleoprotein subunits (snRNPs) and more than 100 associated proteins (Hang et al., 2015; Lara-Pezzi et al., 2013; Yan et al., 2016; C. Yan et al., 2015). A complex network of RNA-RNA, RNA-protein and protein-protein interactions underlies the formation of a catalytically active spliceosome. At canonical sites, spliceosome assembly is initiated by the binding of U1 and U2 snRNPs onto the nascent pre-mRNA through recognition of the core splicing signals at the 5' and 3' intron ends. The pre-spliceosomal intermediates are then formed sequentially by binding of other snRNPs and associated cofactors to adjunct reactive elements or by interaction between newly assembled subunits.

These intermediate complexes undergo several ATP-dependent conformational and compositional rearrangements throughout the catalytic process, in a dynamic interplay of factors that, depending on the context, act in a synergistic or antagonistic fashion (Matera and Wang, 2014; Wahl et al., 2009). The intermediates formed between the substrate and pre-spliceosomal components are proofread by dedicated helicases to monitor the fidelity of the assembly (Koodathingal and Staley, 2013; Semlow and Staley, 2012).

Although the key steps are well conserved, spliceosome assembly is a remarkably flexible process, and multiple factors can modulate its outcome and kinetics (Abelson et al., 2010; Kotlajich et al., 2009; Shcherbakova et al., 2013). Spliceosomal catalysis entails an ordered but variable progression of cis- and trans-directed interactions that redundantly regulate substrate recognition and excision of intervening sequences in the pre-mRNA.

Accurate intron removal is ensured through combinatorial control of the assembly process, which grants at the same time enough flexibility to accommodate a wide variety of different substrates (Fu and Ares Jr, 2014; Licatalosi and Darnell, 2010; Papasaikas et al., 2015).

The initial steps are indeed largely stereotyped and driven by the recognition of highly conserved signals, but subsequent steps proceed by different combinatorial interactions between spliceosome-associated proteins and degenerate reactive sites on the substrate.

Regulatory sequences in the pre-mRNA (Blencowe, 2006; Wang and Burge, 2008) can either facilitate the recruitment of spliceosomal components and stabilizing cofactors, thereby promoting the inclusion of an exon (enhancers), or prevent spliceosome assembly through the binding of inhibitors and cause its skipping (silencers). Splicing enhancers and silencers are commonly classified as exonic or intronic, depending on their location (Barash et al., 2010; Goren et al., 2006; Wang et al., 2006). In many instances this distinction is not straightforward, because the fate of a given region might change by virtue of alternative splicing and regulatory sequences can exhibit different activities depending on the protein, or proteins, that bind them (Coelho and Smith, 2014; Gerstberger et al., 2014). As most cis-regulatory elements act by recruiting trans-acting factors, their activity ultimately reflects the properties of the protein that binds them, which can have opposite effects depending on the presence of auxiliary elements, competing factors or interacting partners, as well as their post-translational modification status and cellular availability (Fu and Ares Jr, 2014; Matera and Wang, 2014; Ray et al., 2013). RNA-binding proteins often interact non-specifically with a given target, and their effect is inherently combinatorial. Many trans-regulatory factors have indeed been shown to act as an enhancer or a silencer depending on position and context, and others to act as antagonistic pairs (Erkelenz et al., 2013; Li et al., 2015; Llorian et al., 2010; Motta-Mena et al., 2010; Wang et al., 2012).

Figure 2. Splicing regulation is mediated by cis-regulatory sequences found in the spliced exon and in neighboring introns. Cis-regulatory sequences can either facilitate inclusion of an exon (splicing enhancers), or they can cause exon skipping (splicing silencers). Depending on their exonic or intronic position, splicing enhancers are known as ESE or ISE, and splicing silencers as ESS or ISS. Most auxiliary cis-regulatory sequences act by recruitment of trans-regulatory factors. In general, splicing enhancers bind SR proteins (Ser/Arg-rich SRSF factors), which facilitate spliceosome assembly, whereas splicing silencers recruit proteins of the hnRNP family, which can interfere with recruitment of the spliceosome or SR proteins (Lara-Pezzi et al., 2013).

In addition to elements that act as binding sites for trans-regulatory factors (Figure 2), genes can contain cryptic splice sites with weaker core signals that resemble the canonical ones, and act in a negative fashion by competing with functional sites and interfering with the site recognition by the core subunits (Ke and Chasin, 2010; Roca et al., 2013, 2003).


Spliceosomal assembly can also be affected by gene architecture, e.g. intron/exon number and length, or structural properties of the pre-mRNA, e.g. the potential to form secondary structures. Longer introns and shorter exons can hinder recognition of the substrate and delay binding of protein cofactors, or disfavor the interaction between newly assembled intermediates (Barrass et al., 2015; De Conti et al., 2013; Fox-Walsh et al., 2005; Hertel, 2008; Hollander et al., 2016). Secondary structures can act both positively and negatively by bringing distant elements into proximity, masking sequences that are only recognized in single-stranded form, or forming structural motifs that are bound by either enhancer or silencer proteins (Buratti and Baralle, 2004; Hiller et al., 2007; Jin et al., 2011; Shepard and Hertel, 2008; Yang et al., 2011).

Overall, a large and dynamic set of variables, which depend on both sequence and cellular context, is involved in determining the fate and splicing efficiency of a given intron in the pre-mRNA (Barash et al., 2010; Leung et al., 2014; Rosenberg et al., 2015; Vaquero-Garcia et al., 2016; Wang and Burge, 2008; Xiong et al., 2015). The number of alternative fates an immature transcript can undergo, and hence the composition of the mature mRNA fraction, ultimately reflects the combinatorial nature of the splicing process and of the transcriptional state of a cell. Lack of splicing fidelity and alternative splicing are inherently two sides of the same coin, as production of different isoforms can represent a regulated and functional process as much as the inevitable consequence of pervasive transcription and noisy processing of the many available substrates, or lack of control over indifferent products transcribed at low basal rates (Melamud and Moult, 2009; Wang et al., 2014). In principle, alternative splicing can be exploited to expand the coding potential of a gene (Kim and Hahn, 2012; Leoni et al., 2011; Nilsen and Graveley, 2010; Roy et al., 2013; Yang et al., 2016) or to regulate protein synthesis post-transcriptionally via controlled production of nonfunctional products (Bitton et al., 2015; Braunschweig et al., 2014; Q. Yan et al., 2015; Zhang et al., 2009). Evidence for both has been found, but the global picture is still unclear as functional studies are few and limited to individual isoforms.

Genome-wide splicing analysis

The advent of high-throughput technologies boosted the investigation of splicing at a genome-wide scale, by making it possible to interrogate the transcriptome at unprecedented depth and in an unbiased manner (Djebali et al., 2012; Pan et al., 2008; Wang et al., 2008). Comparative studies used the similarity of tissue-specific splicing patterns across different species as a proxy for conservation and functionality (Barbosa-Morais et al., 2012; Merkin et al., 2012; Reyes et al., 2013). The role and impact of alternative splicing is however still debated, as results from genome-wide studies are, in themselves, rarely conclusive and their interpretation often remains controversial. In the case of alternative splicing, this is further complicated by the complexity of the underlying biological problem and by the many computational challenges that its analysis brings along.

The analysis of splicing is, in its most basic sense, the characterization of differences in the number of times a given sequence is spliced in or out of the primary transcript. This can be measured as percentage of inclusion of an exon or retention of an intron, usage of a given splice site over an alternative one, or relative abundance of processed transcripts containing an exon/intron as opposed to alternative isoforms lacking it (Alamancos et al., 2014; Katz et al., 2010; Zhang and Stamm, 2012). In all these instances, the quantification itself is a simple percentage measure, but establishing a priori a good unit of comparison, particularly on a genome-wide scale, is complicated by the multitude of overlapping and opposite fates that any pre-mRNA region can undergo by virtue of noise or alternative splicing (Lee and Rio, 2015; Sibley et al., 2016). For instance, measuring the inclusion levels of an exon can be complicated by the presence of alternative 5' and 3' ends, and/or by the number of different splicing events that determine its presence or absence in a mature transcript, e.g. alternative splicing of the flanking exons, skipping by failure to recognize the acceptor site at its 5' boundary or selection of a mutually exclusive splice site (Figure 3).
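To make the "simple percentage measure" concrete, the sketch below computes an inclusion rate (Ψ) for a cassette exon from junction read counts, and the difference between two samples (ΔΨ). It is one possible convention among several, not necessarily the one adopted later in this work: the function names are hypothetical, and averaging the two inclusion junctions so that they weigh the same as the single skipping junction is an assumption of this example.

```python
from typing import Optional


def psi(inclusion_reads: float, exclusion_reads: float) -> Optional[float]:
    """Inclusion rate as a fraction in [0, 1]; None when there is no coverage."""
    total = inclusion_reads + exclusion_reads
    if total == 0:
        return None
    return inclusion_reads / total


def cassette_exon_psi(upstream_junction: int,
                      downstream_junction: int,
                      skipping_junction: int) -> Optional[float]:
    """Psi of a cassette exon from its three junction read counts.

    Inclusion is supported by two junctions (upstream exon to cassette exon,
    cassette exon to downstream exon) and skipping by one, so the inclusion
    counts are averaged here (an assumption made for this sketch).
    """
    inclusion = (upstream_junction + downstream_junction) / 2
    return psi(inclusion, skipping_junction)


# Hypothetical junction counts for the same exon in two tissues.
psi_tissue_a = cassette_exon_psi(30, 50, 10)
psi_tissue_b = cassette_exon_psi(10, 10, 30)
delta_psi = psi_tissue_a - psi_tissue_b

print(psi_tissue_a, psi_tissue_b, delta_psi)
```

The example also shows why the unit of comparison matters: the same exon would receive a different Ψ if, say, one of its flanking junctions were shared with another event.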

On the one hand, pooling information at the exon level fails to capture specific variations in the splicing pattern underlying the change in its inclusion rates; on the other hand, event-based metrics tend to produce redundant and correlated (if not biased) results, due to the overlap between combinations of excised introns for different possible events occurring at the same location (Carrara et al., 2015; Conesa et al., 2016; Kakaradov et al., 2012). A similar problem affects metrics based on relative transcript abundance (Hayer et al., 2015; Kanitz et al., 2015; Liu et al., 2014), while measuring a change at the intron level makes the subsequent interpretation of the results more challenging as the final outcome on the mature transcript remains essentially unknown. These difficulties are mainly rooted in the complexity and combinatorial nature of the biological process, but also in the relatively narrow resolution of the experimental information, which derives from hybridization probes or sequencing reads that are shorter than the full-length transcript.

RNA sequencing (RNA-Seq) has established itself as the leading strategy in global analyses of the transcriptome, becoming the de facto standard tool for characterization of splicing patterns and post-transcriptional changes at genome-wide resolution (Costa et al., 2010; Mutz et al., 2013; Ozsolak and Milos, 2011; Wang et al., 2009). Compared to hybridization-based technologies, which rely on oligonucleotide probes to detect and quantify complementary sequences in the RNA population of interest, sequencing doesn't require any prior knowledge of the targets and provides a wider dynamic range for detection and quantification. Unlike qPCR and microarrays, sequencing can yield a global, unbiased snapshot of all molecules in a given sample, making it possible to query the relative abundance of known products, including lowly expressed ones, characterize previously unknown ones and resolve differences at low and high concentration ranges (Mantione et al., 2014; Mutz et al., 2013; SEQC/MAQC-III Consortium, 2014; Zhao et al., 2014). Besides providing a better view of changes in gene expression, RNA-Seq hence makes it possible to identify novel isoforms, detect low-abundance mRNA isoforms, and assess alternative usage of exons or splice sites (Chen, 2012; Mortazavi et al., 2008).

RNA-Seq entails the high-throughput sequencing of a library of cDNA fragments obtained from any given RNA sample by retrotranscription, fragmentation and ligation of ad hoc adaptors (Ozsolak and Milos, 2011; Wang et al., 2009). Total or fractionated RNA (e.g. the polyadenylated or rRNA-depleted fraction) can be used to resolve the make-up and composition of protein-coding mRNAs in the population of interest, and strand-specific tags can be added to discriminate antisense transcription (Oshlack et al., 2010; Young et al., 2012). cDNA libraries enriched in small RNAs can be obtained via size fractionation, and employed to better assess the expression of microRNAs and other small non-coding species (miRNA-Seq, cf. Eminaga et al., 2013; Hackenberg, 2012; Pritchard et al., 2012; Stokowy et al., 2014). Targets of RNA-binding proteins (RBPs) can also be determined by sequencing the bound RNA fraction after cross-linking and immunoprecipitation of the RNA-protein complexes (CLIP-Seq, cf. Ascano et al., 2012; Darnell, 2010; Kishore et al., 2011; König et al., 2012; Sugimoto et al., 2012). Following library preparation, the cDNA fragments, with or without amplification, are analyzed by massively parallel sequencing to yield short nucleotide reads from one or both ends of the fragment (single-end or paired-end sequencing).

Computational methodologies

The output of an RNA-Seq experiment consists of millions of reads derived from a random sampling of the input library, which ultimately reflects the throughput of the sequencing (depth) and the relative proportion of RNA molecules in the analyzed sample. In order to determine the composition of the input library, reads must be either traced back to their genomic source of origin, assigned to known transcripts, or used to infer previously uncharacterized products. This is achieved via a string similarity search that commonly relies on sequence alignment, the comparison and approximate matching of nucleotide sequences. Pairwise similarity criteria and/or additional heuristics are used in order to determine an optimal match for each read (Flicek and Birney, 2009; Trapnell and Salzberg, 2009). Reads can be either compared to a reference genome or transcriptome, with the best matches used to infer the source location (mapping), or aligned to other reads in order to reconstruct the original sequence ab initio (de novo assembly) based on the overlap between matching ends. Read mapping is a knowledge-driven problem, as it uses a reference sequence (or a set of known sequences) to assign a location to each read (Fonseca et al., 2012; Li and Homer, 2010; Reinert et al., 2015; Trapnell and Salzberg, 2009). De novo assembly is instead a largely data-driven problem, as it depends on read-to-read alignment and requires sufficient numbers of overlapping reads to reliably infer a larger consensus sequence (contig) through merging of juxtaposed ends (El-Metwally et al., 2013; Huang et al., 2014; Martin and Wang, 2011; Miller et al., 2010; Robertson et al., 2010). This allows a reference to be reconstructed bottom-up in a context where the genome is unknown, but demands higher sequencing depths, manual curation, and preferably the use of paired-end libraries, where aligned ends of the same fragment can be exploited to guide and refine the assembly. In many cases, where the genome or transcriptome are partially known but likely incomplete, a combination of annotation-guided and ab initio approaches is employed in order to take advantage of known sequences while further incorporating newly assembled ones (Fonseca et al., 2012; Garber et al., 2011).
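The read-to-read step behind de novo assembly, merging two reads whose ends overlap, can be sketched in a few lines. This is a deliberately naive illustration: the helper names and the `min_overlap` threshold are assumptions, only exact suffix-prefix overlaps are considered, and actual assemblers rely on graph-based data structures and error-tolerant overlaps.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    longest = min(len(left), len(right))
    for length in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Merge two reads into a contig if their ends overlap, else return None."""
    length = overlap_length(left, right, min_overlap)
    if length == 0:
        return None
    return left + right[length:]  # append only the non-overlapping part


# Hypothetical reads sampled from the same underlying sequence.
contig = merge_reads("ACGTTGCA", "TGCAGGTC")
print(contig)
```

Requiring a minimum overlap is what ties assembly quality to sequencing depth: with too few reads, neighboring fragments do not overlap enough to be merged.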

Given a reference, either defined a priori or learned from the data, read mapping ultimately provides the basic information for any subsequent analysis. Aligning the reads makes it possible to ascertain their source location, while the number of reads overlapping a given sequence can be used as a proxy for its relative abundance in the input library.

Pairwise alignment is a long-studied problem in bioinformatics, and many well-established solutions have been proposed to find optimal matches between two biological sequences (Durbin et al., 1998; Pevsner, 2009). In the case of high-throughput data, finding the best match for all reads is however complicated for a number of reasons, first and foremost the practical challenge of processing, in a time- and space-efficient manner, millions of short queries against a target sequence of up to gigabase size (Trapnell and Salzberg, 2009).


Besides the computational burden, the number of reads and their length compared to the size of the reference are inherent limits to the definition of an optimal match: the shorter a sequence is, the higher the likelihood that it is represented multiple times in the reference, either by chance or due to the presence of low-complexity or repetitive genomic elements (Li and Freudenberg, 2014a, 2014b; Sims et al., 2014). In some instances this is unavoidable and no unique solution exists, making it impossible to unambiguously assign a read to a single location. Besides the length, finding an optimal alignment is complicated by the fact that a read doesn't necessarily represent an exact match of the reference sequence, for either technical (sequencing errors) or biological (genetic variation) reasons.

Multi-mapping reads may hence arise because the number of alignment solutions multiplies as more differences are allowed between the query and the reference sequence, i.e. mismatches for single-nucleotide variants or erroneous nucleotide calls, and gaps for insertions/deletions or homopolymer-length errors (Derrien et al., 2012; Lee and Schatz, 2012; Ribeca, 2012). When equivalent solutions exist, an algorithm can either prioritize matches based on heuristic criteria (Langmead et al., 2009; Li and Durbin, 2009) or exhaustively report all possible locations up to a given dissimilarity threshold (Hach et al., 2010; Marco-Sola et al., 2012).
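The effect of the mismatch allowance on mapping ambiguity can be illustrated with a brute-force scan. The reference and read below are invented, the function names are hypothetical, and only substitutions (Hamming distance) are considered; gaps and any form of indexing are left out of this sketch.

```python
from typing import List


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_hits(read: str, reference: str, max_mismatches: int) -> List[int]:
    """All 0-based reference positions where the read aligns without gaps
    and with at most `max_mismatches` substitutions."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        if hamming(read, window) <= max_mismatches:
            hits.append(pos)
    return hits


# Toy reference containing one exact and two near-identical copies of the read.
REFERENCE = "GATTACAGGATTGCAAAGATTACT"
READ = "GATTACA"

exact_hits = find_hits(READ, REFERENCE, max_mismatches=0)
relaxed_hits = find_hits(READ, REFERENCE, max_mismatches=1)
print(exact_hits, relaxed_hits)
```

With no mismatches allowed the read maps uniquely, whereas allowing a single mismatch turns it into a multi-mapping read with three candidate locations.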

In order to cope efficiently with the burdensome task of searching millions of short strings in a sequence that is orders of magnitude longer, mapping algorithms take advantage of the fact that the latter is fixed ahead of the search. Since the reference is defined a priori, it can be pre-processed via ad hoc strategies and stored in a convenient format that allows for fast retrieval of the shorter embedded sequences that would match a given read (or parts of it). Substrings of a given length indeed have a finite set of fixed positions in the reference, hence the query time of millions of searches can be greatly reduced by the one-time construction of an indexed data structure that keeps track of these positions.

Reference indexes can be implemented in the form of lookup tables (hash tables) with location records for all substrings of fixed length (k-mers), or as ordered tree structures (tries) storing the positions of all suffixes, such that suffixes sharing a common prefix are descendants of the same node and each corresponds to a single path from the root (Horner et al., 2010; Li and Homer, 2010; Ribeca, 2012).
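
As a minimal illustration of the hash-table variant, the following Python sketch indexes every k-mer of a toy reference and retrieves the positions at which a seed matches exactly. The function names `build_kmer_index` and `lookup_seed`, the toy sequence and the choice of k = 4 are illustrative assumptions and do not correspond to any specific mapper.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of its 0-based start positions."""
    index = defaultdict(list)
    for start in range(len(reference) - k + 1):
        index[reference[start:start + k]].append(start)
    return index


def lookup_seed(index, read, k, offset=0):
    """Return the reference positions where the k-mer starting at `offset` in the read matches exactly."""
    seed = read[offset:offset + k]
    return index.get(seed, [])


reference = "ACGTACGTGACCA"  # toy reference (assumption)
index = build_kmer_index(reference, k=4)
hits = lookup_seed(index, "ACGTG", k=4)  # seed "ACGT" taken from the start of the read
```

In this toy case the seed occurs at two positions, which is the situation a mapper has to resolve during seed extension or report as a multiple mapping.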

Hash tables enable fast access to the location of a k-long query sequence, and mapping is usually performed through a seeding strategy where the position of k-mers within each read is looked up in the index and exact matches (seeds) are used as a scaffold to reconstruct the full-length alignment (seed extension). This seed-and-extend approach (Altschul et al., 1997, 1990) strongly relies on the choice of appropriate k-mers, and hence on the index construction itself, since which k-mers are indexed and their size determine the space and minimum granularity of the search. Ad hoc seeding strategies, which ultimately depend on the index built, need to be implemented in order to accommodate approximate matching of reads with nucleotide variants or wrong base-calls (Jiang and Wong, 2008; H. Li et al., 2008; R. Li et al., 2008; Lin et al., 2008; Rumble et al., 2009; Smith et al., 2008; Wu and Nacu, 2010). In suffix tries all suffixes are instead represented and identical substrings collapse on a single path, which enables fast extraction of exact matches but also efficient traversal of alternative paths through search algorithms that allow for inexact matching. Different alignment requirements can hence be accommodated by adjusting the search strategy without the need to predefine the data structure, in contrast to hash tables where index construction is an integral part of the seeding strategy (Li and Homer, 2010; Ribeca, 2012). Differences aside, both approaches enable substantially faster alignment compared to repeatedly scanning the reference sequence for an optimal match. The gain in speed is however counterbalanced by their memory footprint, since both hash tables and suffix tries can reach large to prohibitive sizes depending on the length of the reference sequence. Mappers exploit different trade-offs between speed and accuracy in order to optimize running time and memory usage. For hash tables these depend on ad hoc indexing and seeding solutions, while for suffix tries they rely on a memory-efficient implementation of the data structure itself, through compact representations of the trie, e.g. suffix trees or arrays (Dobin et al., 2013; Hoffmann et al., 2009; Kurtz et al., 2004; Meek et al., 2003), or data compression algorithms that can be coupled to the index construction, e.g. BWT in FM-indexes (Lam et al., 2008; Langmead et al., 2009; Li and Durbin, 2010, 2009; R. Li et al., 2009).

In RNA-seq, splicing introduces an additional layer of complexity to the search for an optimal match, as any read might span an exon-exon junction at the boundary where an intron was excised from the primary transcript. Similar to insertions or deletions, the alignment of junction-spanning reads (split-reads) to a genome reference must allow for gaps of variable size at any point of the query sequence, to account for the fact that the read might represent a combination of multiple elements separated by introns in genomic space. This makes the alignment of RNA libraries an exceptionally difficult task, as any given sequence might arise from the ligation of two, or more, distant exons and may need to be split into shorter pieces. Since these pieces can be as short as one nucleotide and occur in variable number and combinations, additional strategies must be envisaged to map spliced reads to a genome reference (Horner et al., 2010; Li and Homer, 2010; Ribeca, 2012; Trapnell and Salzberg, 2009). If transcripts are known, this can however be avoided by directly mapping the reads to the set of annotated sequences. This approach makes it possible to capitalize on established methods to align spliced reads in an ungapped fashion, but inference of a mapping location is limited by prior knowledge and annotation status. In order to incorporate novel transcripts, or refine the reference annotation, transcriptome alignment can be coupled to ab initio reconstruction methods or used in combination with genome alignment to reduce the number of reads that need to be mapped with gaps and/or guide de novo splice site prediction.

Different gapped alignment strategies have been developed to reliably infer splice site locations from RNA sequencing reads (Alamancos et al., 2014; Garber et al., 2011; Zhang and Stamm, 2012). Exon-first methods rely on prior ungapped alignment of exonic reads to the genome (or transcriptome-first mapping) to find read clusters distributed over separate genomic locations (Dimon et al., 2010; Li et al., 2013; Trapnell et al., 2009; Wang et al., 2010; Zhang et al., 2012). The remaining unmapped reads are then split into shorter pieces, which are individually mapped against the read clusters. Separate parts are then merged into a full-length spliced alignment via a subsequent extension step. An alternative strategy is the seed-and-extend approach, which exploits hash tables or lookup structures to directly map consecutive k-mers, or spaced k-mers, within each read (Bryant et al., 2010; Dobin et al., 2013; Jean et al., 2010; Wu and Nacu, 2010). Similarly to exon-first methods, candidate matches for shorter pieces are then extended into a larger gapped alignment. Multi-seed methods capitalize on alignment seeding to map separately different parts of each read, and likewise connect subread matches in the final extension step (Liao et al., 2013; Philippe et al., 2013; Wu et al., 2013).

Given the inherent difficulty of finding unambiguous matches for short read subsequences, mappers commonly exploit further heuristics to prioritize candidate splice site locations and/or filter out unrealistic intron predictions (reviewed in Alamancos et al., 2014). These range from criteria based on prior biological knowledge, e.g. match of predicted intron termini to canonical dinucleotides or length restrictions, to alignment reliability criteria, e.g. base-call quality or read coverage at a given site. In some cases, more sophisticated models trained on features from annotated splice sites, or available mapping data, are used to prioritize spliced alignments and guide the discovery of novel exon-exon junctions.
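
The simplest of these heuristics can be sketched in a few lines of Python. The example below checks whether a candidate intron on the forward strand starts with GT and ends with AG and falls within a length range. The function name `intron_is_plausible` and the length bounds are hypothetical and are not taken from any of the cited mappers.

```python
def intron_is_plausible(genome, start, end, min_len=20, max_len=500000):
    """Check a candidate intron given as 0-based half-open [start, end) on the forward strand.

    The intron passes if its length lies within [min_len, max_len] (assumed bounds)
    and its termini match the canonical GT...AG dinucleotides.
    """
    length = end - start
    if not (min_len <= length <= max_len):
        return False
    return genome[start:start + 2] == "GT" and genome[end - 2:end] == "AG"


# Toy sequence: 4 exonic bases, a 34 bp intron (GT + 30 A + AG), 4 exonic bases.
genome = "AAAC" + "GT" + "A" * 30 + "AG" + "CCTT"
canonical = intron_is_plausible(genome, 4, 38)
shifted = intron_is_plausible(genome, 5, 38)   # donor shifted by one base
```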

Following the mapping, the read density across different locations can be used to estimate gene expression and splicing levels or identify the binding sites of cross-linked proteins in immunoprecipitation experiments. Splice site usage can be assessed from the ratio between the number of reads spanning a given exon-exon junction and the total number of reads supporting alternative donor and/or acceptor sites, either pooled or by pairwise comparison (Pervouchine et al., 2013; Shen et al., 2014, 2012). Exon inclusion levels may be similarly estimated from read counts (Anders et al., 2012; Griffith et al., 2010; Wang et al., 2013) or based on pooled abundance of transcripts (Alamancos et al., 2015; Katz et al., 2010; Merkin et al., 2012). Due to the presence of overlapping regions, transcript-level quantification relies on additional strategies to infer relative abundances (Trapnell et al., 2013, 2012, 2010). Reads spanning an element common to multiple isoforms are assigned probabilistically to a single one based on counts from unambiguous mappings (Li and Dewey, 2011; Roberts and Pachter, 2013). Some methods adopt this strategy within an alignment-free framework where expected counts are directly inferred from compatible k-mer matches to known reference transcripts. While limited to well-annotated transcriptomes, this approach makes it possible to quantify isoform levels without the need to first assign a location to each read, and thus provides comparable results at substantially reduced running times (Bray et al., 2016; Patro et al., 2015, 2014). In general, different approaches come with different strengths and limitations. Their suitability largely depends on the underlying biological question, but the extent to which a given method may provide better insights or more appropriate solutions is unclear and often difficult to establish a priori.
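
As a rough sketch of the junction-level ratio described above, the following Python example computes, for every junction, the fraction of reads it receives among all junctions sharing the same donor (pooled comparison of alternative acceptors). The function name `acceptor_usage`, the coordinate convention and the toy counts are assumptions made for illustration and do not reproduce any of the cited methods.

```python
from collections import defaultdict


def acceptor_usage(junction_counts):
    """Return, for each (donor, acceptor) junction, its read count divided by the
    total read count of all junctions that share the same donor."""
    donor_totals = defaultdict(int)
    for (donor, _acceptor), count in junction_counts.items():
        donor_totals[donor] += count
    return {
        junction: count / donor_totals[junction[0]]
        for junction, count in junction_counts.items()
        if donor_totals[junction[0]] > 0
    }


# Toy counts: donor 100 is spliced to two alternative acceptors, donor 500 to one.
counts = {(100, 200): 30, (100, 260): 10, (500, 900): 7}
usage = acceptor_usage(counts)
```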

Challenges and perspectives

Whereas methods to study gene expression from RNA-Seq data are relatively well established, best practices in the analysis of alternative splicing remain a largely open-ended issue (Carrara et al., 2015; Conesa et al., 2016; Kakaradov et al., 2012). A vast, and increasing, number of algorithms (Figure 3) adopt diverse strategies that provide various, often disparate, solutions to the many challenges posed by splicing in terms of mapping, quantification and statistical analysis (Alamancos et al., 2014). The lack of an integrated framework and standardized guidelines constitutes a major bottleneck in genome-wide analyses.

Even at the most basic level, represented by read mapping, critical issues emerged from the benchmarking of popular software in terms of splice junction false discovery rates, annotation usage and allocation of multiple mapping reads (Engström et al., 2013; Grant et al., 2011). These limitations, together with those more inherently related to the limits of the technology and to the complexity of the biological problem, make it hard to define a common framework to study splicing at a global level and streamline the analysis of high-throughput sequencing data.

Figure 3. Overview of available tools for splicing analysis. These comprise mapping algorithms and methods for reconstruction and assembly of splicing events and transcript isoforms, as well as tools for splicing quantification and statistical comparison at different levels (junctions, exons, splicing events or isoforms). Illustration retrieved from figshare, Methods to Study Splicing from RNA-Seq (Alamancos et al., 2013).

On the biological side, the role and impact of alternative splicing still remain elusive and difficult to distinguish from noise and lack of fidelity. Large-scale genomics studies in human and mouse showed that transcription is pervasive across the genome, and that production of multiple isoforms occurs in most multi-exon genes (Djebali et al., 2012; Yue et al., 2014). Whether this is a regulated and functional process or the unavoidable consequence of spliceosomal catalysis and neutral change is still unclear. Different interpretations were proposed based on comparative analyses of splicing across tissues and species. A study in different vertebrate lineages found splicing patterns in orthologous genes to be largely species-specific, and thus likely to evolve neutrally, in contrast to expression levels that clustered predominantly by tissue type (Barbosa-Morais et al., 2012). The authors did not, however, account for biological variability, which is notably higher for splicing than for transcription. When biological replicates are included, tissue-specific patterns emerge in both mammals and primates (Merkin et al., 2012; Reyes et al., 2013). Most orthologous exons that undergo alternative splicing in a tissue-dependent fashion are found to change in brain, muscle and heart (Merkin et al., 2012). Though relatively uncommon, differences in splicing rates between tissues would appear to be conserved, at least based on inter-species similarity.

Similar patterns in human were observed in large cohorts with sequencing data from multiple tissues, where tissue-specific changes across individuals were mostly found in neural tissues, skeletal muscle and heart (Ardlie et al., 2015; Melé et al., 2015). In heart and brain, distinct RNA-binding proteins would appear to regulate the inclusion of a highly conserved set of smaller exons (Irimia et al., 2014; Li et al., 2015; Raj and Blencowe, 2015) and, in contrast to other tissues, alternative splicing was found to be more frequently coupled to translation of specific protein isoforms (Abascal et al., 2015; Ezkurdia et al., 2015). Conserved cassette exons have been linked to post-transcriptional regulation of protein synthesis through nonsense-mediated decay (Braunschweig et al., 2014; Lareau and Brenner, 2015; Q. Yan et al., 2015), and alternative splicing has further been suggested to play a role in expanding the interaction capabilities and phosphorylation potential of encoded proteins (Merkin et al., 2012, 2015; Yang et al., 2016). Numerous examples show individual isoforms to be implicated in a variety of cellular and developmental processes (Braunschweig et al., 2013; Kalsotra and Cooper, 2011) or linked to disease (Cooper et al., 2009; Tazi et al., 2009; Wang and Cooper, 2007; Xiong et al., 2015). Functional studies are however rare and, despite the technological advancements, genome-wide analyses still suffer from limitations that range from methodological to interpretation issues. This is particularly true in the case of splicing, and the lack of consolidated pipelines and common frameworks makes it hard to reach conclusive insights into the biology of this remarkably complex mechanism.

Objectives

The broad aim of this doctoral project is to establish an integrated methodological and theoretical framework to study splicing on a genome-wide scale, by evaluating current solutions to process high-throughput sequencing data, developing ad hoc strategies to streamline the analysis and applying these methods to available transcriptomics data in order to provide new insights into the impact and functional role of alternative splicing.

Specific objectives:

1. To compare selected approaches and benchmark their effectiveness in terms of mapping, detection and quantification capability for known and non-annotated splice sites

2. To assess their suitability and behavior in different conditions and experimental settings, using synthetic data for multiple libraries, base-call error models and annotations

3. To systematically evaluate the performance in all simulation set-ups and compare results from simulations to experimental data from publicly available sources

4. To establish an integrated solution and develop an ad hoc bioinformatics pipeline to improve on existing methods and streamline the analysis from RNA sequencing data

5. To apply implemented tools to comparative transcriptomics data and investigate tissue-dependent usage of alternative acceptors in mouse and primates

6. To identify and characterize potentially conserved tissue-specific signatures and provide additional insights into the biological impact of alternative splicing

Materials and methods

RNA-Seq simulation

A total of 10 random data sets for 12 different experimental set-ups were generated using the Flux Simulator pipeline version 1.2 (Griebel et al., 2012), based on the GRCh37.p8 assembly of the human genome and Ensembl genebuild (release 69) annotation (Flicek et al., 2013). Each combination of the following parameters was employed to generate a data set: 50 and 76 bp read length, 8M, 20M and 40M reads sequencing depth, single-end and paired-end library. Following the procedure in the documentation, a custom error model at 50 bp read length was produced using in-house RNA-Seq data. The sequencing run was deposited in the NCBI Sequence Read Archive, with accession number SRR1105576. Default parameters were used for all other options.

For each simulated data set, 10% of the exons at each expression decile were randomly removed from the original annotation in order to evaluate the de novo splice site detection capability and the impact of novel and misannotated features.
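
The annotation pruning step can be sketched as follows. This is only an illustration under stated assumptions: exons are ranked by a single expression value, deciles are taken as ten equally sized rank bins, and the names `remove_exons_by_decile`, `fraction` and `seed` are hypothetical. The actual procedure used for the simulations may differ in how deciles and sampling were defined.

```python
import random


def remove_exons_by_decile(exon_expression, fraction=0.1, seed=0):
    """Randomly select a fraction of exons within each expression decile.

    exon_expression: dict mapping exon identifier -> expression value.
    Returns the set of exon identifiers to drop from the annotation.
    """
    rng = random.Random(seed)
    ranked = sorted(exon_expression, key=exon_expression.get)
    n = len(ranked)
    removed = set()
    for decile in range(10):
        # Rank-based bin covering one tenth of the exons (assumed decile definition).
        members = ranked[decile * n // 10:(decile + 1) * n // 10]
        k = round(len(members) * fraction)
        removed.update(rng.sample(members, k))
    return removed


# Toy annotation: 200 exons with increasing expression.
expression = {f"exon{i}": float(i) for i in range(200)}
removed = remove_exons_by_decile(expression)
```

Sampling within each decile, rather than over the whole annotation, keeps the removed exons spread evenly across the expression range.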

Alignment benchmarking

The following alignment algorithms have been tested: TopHat2 version 2.0.6 (Kim et al., 2013), GSNAP version 2012-12-20 (Wu and Nacu, 2010; Wu and Watanabe, 2005), STAR version 2.2.0 (Dobin et al., 2013), OLego version 1.08 (Wu et al., 2013) and SOAPsplice version 1.9 (Huang et al., 2011). All aligners were run with default parameters.

For paired-end data, where required, the insert size was empirically determined from uniquely mapping, perfectly matching pairs via a preliminary alignment with Bowtie version 0.12.9 (Langmead et al., 2009) and supplied to the algorithm. Except for SOAPsplice, which is an ab initio alignment method, input annotations were constructed to comply with the format required by each aligner. Site-level input files were produced for GSNAP, as suggested in the documentation.

For both known and novel exon junctions, the mapping performance was evaluated in terms of percentage of uniquely mapped reads and positive predictive value over unique alignments at base pair resolution. Nucleotides mapped to the wrong genomic location were regarded as false positives, correctly aligned nucleotides as true positives. The junction detection performance was assessed in terms of sensitivity and positive predictive value based on unique gapped alignments reported with an N operation in the CIGAR string (H. Li et al., 2009). Expressed junctions in the simulated data spanned by at least one read in the alignment were considered true positives. Expressed junctions with no unique hits were regarded as false negatives, and junctions spanned by at least one read in the alignment but not present in the simulated data as false positives. Over all detectable junctions, quantification accuracy was assessed in terms of absolute difference between true counts and alignment counts (number of uniquely mapped reads spanning a junction). For true positive hits, the absolute difference was also computed relative to the true expression value. Plots were generated in R using ggplot2 (version 0.9.3.1).
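
The junction-level evaluation can be illustrated with a short Python sketch that extracts introns from the N operations of a CIGAR string and compares detected junctions with the simulated ground truth. The function names, the 0-based half-open coordinate convention and the toy junction sets are assumptions made for this example, which parses plain CIGAR strings rather than BAM records and does not reproduce the benchmarking scripts.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def junctions_from_cigar(start, cigar):
    """Return (intron_start, intron_end) pairs, 0-based half-open, one per N operation."""
    junctions = []
    pos = start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            junctions.append((pos, pos + length))
        if op in "MDN=X":  # operations that consume reference bases
            pos += length
    return junctions


def detection_metrics(detected, expressed):
    """Sensitivity and positive predictive value of detected vs. truly expressed junctions."""
    tp = len(detected & expressed)
    fp = len(detected - expressed)
    fn = len(expressed - detected)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, ppv


detected = set(junctions_from_cigar(100, "20M500N30M"))
detected |= set(junctions_from_cigar(0, "10M2D5M100N10M"))
expressed = {(120, 620), (300, 400)}  # toy ground truth
sensitivity, ppv = detection_metrics(detected, expressed)
```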

The FineSplice pipeline

Based on benchmarking results, an integrated pipeline was implemented to achieve superior mapping and quantification accuracy while optimizing the reliability of splice site detection from junction-spanning alignments. The proposed strategy couples TopHat2 to a novel wrapper, named FineSplice, that identifies unreliable gapped matches and filters out false positive junctions via semi-supervised logistic regression. FineSplice is written in Python and depends on the pysam (version 0.7.4), scipy (version 0.7.2) and numpy (version 1.7.1) modules for BAM file parsing and scientific computing. Logistic regression was implemented with scikit-learn (version 0.13.1). The latest version of FineSplice is available at https://sourceforge.net/projects/finesplice/.

Figure 4. Overview of FineSplice pipeline. Following alignment with TopHat2, multiple mapping junction-spanning reads are temporarily filtered out (1). The set of split-read overhangs across the junction are then computed (2). A subset of potential false positives is defined (3) based on the probability of observing, at the given read count, at least one overhang greater than the first mismatching position, if none is found. Feature vectors are constructed based on the deviation at each position between observed and expected read counts under uniformity assumptions (4), after trimming mismatching overhangs at the first mismatch position. A logistic regression model is fitted on the subset of potential false-positive junctions against the remaining total (5). After discarding junctions with a higher posterior probability of belonging to the false-positive set (6), multiple mapping reads with a unique location after filtering are reassigned to the junction that passed the filtering.

FineSplice takes as input a valid TopHat2 alignment in BAM format, which must be specified with the mandatory command-line argument -i <path/to/file> and the read length, specified with the -l <length> option. An example file with 50 bp long reads is available in the repository website and can be run with the command:

python FineSplice.py -i example.bam -l 50

The proposed strategy is best used in a context where transcript annotations are, at least partially, available. The procedure is illustrated in Figure 4 and described below in further detail.

1. Align with TopHat2

Transcriptome-guided alignment is first performed with TopHat2, using available annotations for known transcripts and allowing for de novo discovery of non-annotated splice sites. Default parameters should be used otherwise, but the effectiveness of the filtering strategy does not depend critically on the specified alignment options.

2. Compute the set of split-read overhangs across each junction

For each uniquely mapping read $j$ spanning a given junction $i$, its overhang $O_j^i$ is defined as the shortest overlapping segment of the read across the junction, i.e.

\[ O_j^i = \min\left(L_j^i, R_j^i\right) \]

where $L_j^i$ and $R_j^i$ represent the length of the left and right arm of the read across the splice site. Each junction is hence represented by a set of $n$ split-read overhangs

\[ O^i = \left(O_j^i\right)_{j = 1, \ldots, n} \]

Under the assumption of random cDNA fragmentation, all overhangs are taken to be equally likely, and $O_j^i$ is hence assumed to follow a discrete uniform distribution, i.e.

\[ O_j^i \sim U\left(1, \lfloor \text{read length} / 2 \rfloor\right) \]
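
A minimal Python sketch of this step is given below: the overhang of a split read is the length of its shorter arm, and its assumed distribution is uniform over 1 to ⌊read length/2⌋. The function names are illustrative and are not taken from the FineSplice source code.

```python
def overhang(left_arm, right_arm):
    """Overhang of a split read: length of its shorter arm across the junction."""
    return min(left_arm, right_arm)


def max_overhang(read_length):
    """Largest possible overhang, floor(read length / 2)."""
    return read_length // 2


def uniform_overhang_pmf(read_length):
    """Probability of each overhang value under the discrete uniform assumption."""
    m = max_overhang(read_length)
    return {k: 1.0 / m for k in range(1, m + 1)}


# A 50 bp read split 12 / 38 across the junction has an overhang of 12.
example = overhang(12, 38)
pmf = uniform_overhang_pmf(50)
```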

3. Define a subset of potential false positives

For each junction, if (i) a single mismatch is present and (ii) none out of $n$ reads is found with an overhang greater than the first mismatching position, the probability

\[ P\left(\text{at least one } O_j^i \text{ w/ mismatches} \geq \max\left(O^i\right)\right) = 1 - \left(\frac{\max\left(O^i\right)}{\lfloor \text{read length} / 2 \rfloor}\right)^{n} \]

is considered. If this probability is greater than 0.99, junction $i$ is deemed a potential false positive and labelled $Y_i = 1$. Splice junctions with no matching overhang are likewise labelled as potential false positives. The remaining detectable junctions are assumed to mostly comprise valid spliced alignments and are assigned the class label $Y_i = 0$.
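
The labelling rule can be sketched as follows, reading the probability in this step as 1 − (max(Oⁱ)/⌊read length/2⌋)ⁿ under the uniform assumption. The function names and the `threshold` argument are illustrative. The sketch covers only the probability test and leaves out the mismatch bookkeeping performed on the actual alignments.

```python
def prob_longer_overhang(n_reads, max_observed, read_length):
    """Probability that at least one of n reads shows an overhang beyond max_observed,
    if overhangs are uniform on 1..floor(read_length / 2)."""
    m = read_length // 2
    return 1.0 - (max_observed / m) ** n_reads


def is_potential_false_positive(n_reads, max_observed, read_length, threshold=0.99):
    """Label a junction Y = 1 when longer overhangs were very likely to be seen but were not."""
    return prob_longer_overhang(n_reads, max_observed, read_length) > threshold


# Ten 50 bp reads, none extending more than 5 bases across the junction.
suspicious = is_potential_false_positive(10, 5, 50)
# Two 50 bp reads with a maximum overhang of 20: no evidence against the junction.
supported = is_potential_false_positive(2, 20, 50)
```

With many reads and uniformly short overhangs, the absence of longer overhangs becomes very unlikely under random fragmentation, which is what flags the junction.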

4. Construct feature vectors

For all possible overhang values $k \in \left[1, \lfloor \text{read length} / 2 \rfloor\right]$, let $N_k^i$ be the number of reads with an overhang longer than $k$ after trimming overhangs with mismatches at the first dissimilar position. For each junction $i$ a feature vector

\[ X_i = \left(x_k^i\right)_{k = 1, \ldots, \lfloor \text{read length} / 2 \rfloor} \]

is constructed based on the log2-transformed deviation of observed counts from expected at each position relative to the splice site, i.e.

\[ x_k^i = \log_2\left(\frac{N_k^i}{E_k^i}\right) \]

where $E_k^i = \text{number of reads} \times P\left(O_j^i \geq k\right)$ and

\[ P\left(O_j^i \geq k\right) = 1 - \left(\frac{k - 1}{\lfloor \text{read length} / 2 \rfloor}\right) \]
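
The construction of the feature vector can be sketched as below. Two choices in this example are assumptions rather than details stated in the text: observed counts are taken as the number of reads with an overhang of at least k (to match the definition of the expected counts), and a small pseudocount is added to both terms so that the logarithm is defined when no read reaches position k. The function name `feature_vector` is likewise illustrative.

```python
import math


def feature_vector(overhangs, read_length, pseudocount=0.5):
    """Log2 ratio of observed to expected read counts at each overhang position.

    overhangs: overhang of each split read (already trimmed at the first mismatch).
    pseudocount: assumed smoothing constant, not specified in the text.
    """
    m = read_length // 2
    n = len(overhangs)
    features = []
    for k in range(1, m + 1):
        observed = sum(1 for o in overhangs if o >= k)
        expected = n * (1.0 - (k - 1) / m)  # n * P(O >= k) under uniformity
        features.append(math.log2((observed + pseudocount) / (expected + pseudocount)))
    return features


# Perfectly uniform overhangs give features close to zero at every position.
uniform_like = feature_vector(list(range(1, 26)), 50)
# Reads piling up at short overhangs give negative values at distal positions.
truncated = feature_vector([1, 2, 3] * 5, 50)
```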

5. Fit a logistic regression model

Following steps 3 and 4, each junction is represented by a class label $Y_i$ and a feature vector $X_i$. An L1-regularized (Ng, 2006) logistic regression model is fitted over the whole set of junctions.
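
For reference, a generic formulation of L1-regularized logistic regression is given below. The symbols $\beta_0$, $\beta$, $p_i$ and the penalty weight $\lambda$ are notation introduced here for illustration. The regularization strength used in FineSplice is not stated in this section, and scikit-learn expresses it through the inverse parameter `C` rather than through $\lambda$.

```latex
p_i = P\left(Y_i = 1 \mid X_i\right)
    = \frac{1}{1 + \exp\left(-\left(\beta_0 + \beta^{\top} X_i\right)\right)}

\left(\hat{\beta}_0, \hat{\beta}\right)
    = \arg\min_{\beta_0, \beta}
      \left[ -\sum_{i} \left( Y_i \log p_i + \left(1 - Y_i\right) \log\left(1 - p_i\right) \right)
      + \lambda \lVert \beta \rVert_1 \right]
```

The L1 penalty drives the coefficients of uninformative overhang positions to exactly zero, so that only the positions that discriminate between the two classes contribute to the fitted probability.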

6. Discard spurious alignments based on posterior probability

For each junction $i$ the posterior probability of belonging to the false positive class is computed: if $P\left(Y_i = 1 \mid X_i\right) > 0.5$, the junction is deemed a false positive and discarded.

7. Reallocate multiple mapping reads

Reads mapped to multiple splice sites for which a unique hit is recovered after filtering are assigned to the remaining junction.

FineSplice improvement over TopHat2 was assessed in synthetic data, under all simulation settings, and further allowing for supplementary alignment options. Additional TopHat2 alignments were performed enabling the remapping of ambiguously mapping multi-exon reads, with a cut-off of either 1 or 2 mismatches in the segment alignment step.

FineSplice performance was also compared to TrueSight version 0.06 (Li et al., 2013), a more recently published method that similarly employs logistic regression to enhance junction mapping. TrueSight was run with default parameters. Splice junctions, together with the associated score (posterior probability), were retrieved from the corresponding output. Precision and sensitivity were computed as described above, both in the default setting and at increasing thresholds for the posterior probability estimated by FineSplice and TrueSight.

Comparison with experimental data

The splice junction detection performance was evaluated on real data from publicly available RNA-Seq experiments in human (two data sets, high-quality or low-quality reads) and pig (as an example of more partial transcriptome annotation). The high-quality human data set comprises three high-depth, paired-end sequencing runs at 76 bp read length (SRA Experiment SRX084679). The low-quality human data set comprises two low-depth, paired-end sequencing runs at 45 bp read length, exhibiting a high per base error rate (SRA Experiment SRX011546). The pig data set comprises three sequencing runs at 51 bp read length, single-end (SRA Experiments SRX242929, SRX242930 and SRX242931).

Raw data were downloaded from the NCBI Short Read Archive (SRA) and converted to FASTQ with the SRA Toolkit. The alignment was carried out with the 5 benchmarked algorithms plus TrueSight, as described above, but using the full transcript annotation (where possible). In the absence of ground truth, the detection performance was evaluated by computing pseudo sensitivity and pseudo precision metrics (Dobin et al., 2013; Zhang et al., 2012), along with the corresponding F1 score, and by evaluating the mean read counts and the consensus across all alignments for all the junctions detected by each algorithm. Splice junctions were designated as pseudo true (i.e. effectively expressed) based on read counts and consensus among all methods, by deeming as effectively expressed those with a median read count across all methods greater than 0.
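
A minimal sketch of the consensus criterion is shown below. Junctions missing from the output of a method are counted as zero reads, and pseudo sensitivity and pseudo precision are computed against the consensus set. Both conventions, as well as the function names and toy counts, are assumptions made for this example.

```python
from statistics import median


def pseudo_true_junctions(counts_by_method):
    """Junctions whose median read count across all methods is greater than 0.

    counts_by_method: dict mapping method name -> {junction: read count}.
    """
    methods = list(counts_by_method)
    all_junctions = set().union(*(counts_by_method[m] for m in methods))
    return {
        j for j in all_junctions
        if median(counts_by_method[m].get(j, 0) for m in methods) > 0
    }


def pseudo_metrics(detected, pseudo_true):
    """Pseudo sensitivity, pseudo precision and F1 score for one method."""
    tp = len(detected & pseudo_true)
    precision = tp / len(detected) if detected else 0.0
    sensitivity = tp / len(pseudo_true) if pseudo_true else 0.0
    total = precision + sensitivity
    f1 = 2 * precision * sensitivity / total if total else 0.0
    return sensitivity, precision, f1


counts = {
    "A": {"j1": 5, "j2": 1},
    "B": {"j1": 3},
    "C": {"j1": 4, "j3": 2},
}
truth = pseudo_true_junctions(counts)
sens_a, prec_a, f1_a = pseudo_metrics(set(counts["A"]), truth)
```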

The distribution of read counts at each overhang position over all alignments was further assessed for all junctions accepted and discarded by FineSplice.

Comparative transcriptomics data

TopHat2 and FineSplice were used to analyze multi-tissue transcriptome data from three primate species (Macaca mulatta, Pan troglodytes and Homo sapiens) and mouse (Mus musculus, C57BL/6 strain). Poly(A)+ RNA libraries were generated and single-end sequenced by Bozek et al. (2014) in the context of a comparative study of metabolome evolution. The authors collected 6 samples from three brain regions (cerebellar, prefrontal and primary visual cortex), skeletal muscle and renal cortex in all species. For the purpose of a balanced comparison among the three tissues, only prefrontal cortex samples were included in the analysis.

A total of 144 raw files in FASTQ format (two sequencing runs per sample) were retrieved from the European Nucleotide Archive (ENA). These are publicly available under accession number PRJNA213747. Technical replicates were processed separately: for each sequencing run, reliable splice sites were recovered with FineSplice version 0.2.2 following transcriptome-first alignment with TopHat2 version 2.0.14. Both were run with default options as previously described, using reference genomes and transcript annotations from Ensembl release 79 (Cunningham et al., 2015).

Tissue-specific splicing analysis

Splice sites identified from experimental data with FineSplice were matched across orthologous genes. Pairwise homology relationships were inferred from Ensembl Compara annotations (release 79), and a total of 13358 protein-coding genes were found to be one-to-one orthologs among all species. Sites recovered in any sample from a given species were assigned to an ortholog whenever matching a donor or acceptor site of the canonical transcript. An optimal set of non-overlapping sites was then determined for each species based on pooled read counts over the respective samples. This was carried out using a dynamic programming implementation of the Weighted Interval Scheduling algorithm that maximizes pooled counts for all sites, either known or novel, assigned to a given ortholog (Kleinberg and Tardos, 2006).
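
The textbook dynamic programming solution to weighted interval scheduling can be sketched as follows. Representing each intron as a half-open interval weighted by its pooled read count is an assumption made for illustration, as are the function name and the toy intervals. The thesis implementation may differ in how sites are encoded.

```python
from bisect import bisect_right


def weighted_interval_scheduling(intervals):
    """Select non-overlapping intervals with maximum total weight.

    intervals: list of (start, end, weight) tuples, half-open [start, end).
    Returns (best total weight, chosen intervals in genomic order).
    """
    ordered = sorted(intervals, key=lambda iv: iv[1])
    ends = [iv[1] for iv in ordered]
    # prev[i]: how many of the intervals before i (in end order) finish at or before i starts.
    prev = [bisect_right(ends, iv[0], 0, i) for i, iv in enumerate(ordered)]
    best = [0] * (len(ordered) + 1)
    for i, (_start, _end, weight) in enumerate(ordered, start=1):
        best[i] = max(best[i - 1], weight + best[prev[i - 1]])
    # Backtrack to recover the selected intervals.
    chosen = []
    i = len(ordered)
    while i > 0:
        weight = ordered[i - 1][2]
        if weight + best[prev[i - 1]] > best[i - 1]:
            chosen.append(ordered[i - 1])
            i = prev[i - 1]
        else:
            i -= 1
    return best[-1], chosen[::-1]


# Toy introns as (donor, acceptor, pooled read count).
introns = [(0, 10, 5), (5, 15, 8), (10, 20, 4), (15, 25, 3)]
total, selected = weighted_interval_scheduling(introns)
```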

Donor sites were matched among the optimal set of each species, based on the similarity of the 60 bp long upstream sequence. For all sequences in one species, the most similar one in all other species was determined based on Levenshtein distance. Whenever the set of four sequences with the minimum pairwise distance was the same in all species, a match for the donor was established. After matching donors from the optimal sets, all acceptor sites that were recovered with FineSplice were similarly matched based on the similarity of the 60 bp downstream sequence. For each ortholog, a donor-acceptor pair was selected from the optimal set of non-overlapping introns in mouse. This was chosen among the pairs where both sites could be successfully matched, as the one showing the highest pooled number of alternative reads. The most common acceptor in mouse (i.e. with the highest pooled counts) was taken to represent the inclusion form, and the corresponding read counts were modelled as a binomial variable of the total number of reads in each sequencing run.
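
The sequence matching criterion relies on Levenshtein (edit) distance, which can be computed with the standard dynamic programming recurrence sketched below. The helper `most_similar` and the toy sequences are illustrative. The reciprocal check across all four species described above is not reproduced here.

```python
def levenshtein(a, b):
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def most_similar(query, candidates):
    """Candidate sequence with the smallest edit distance to the query."""
    return min(candidates, key=lambda candidate: levenshtein(query, candidate))


best_match = most_similar("ACGTAC", ["TTTTTT", "ACGTAA", "ACCTTC"])
```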

Inclusion rates were regressed on species and tissues as categorical predictor variables, the latter nested within each species group. A baseline rate was estimated for each species, and differences in inclusion rates were modelled as a function of between-tissue variability in each species. A Bayesian framework was adopted and species- and tissue-dependent estimates were inferred from the posterior distribution after fitting a hierarchical binomial-logit model with non-centered parametrization (Betancourt and Girolami, 2013; Roberts et al., 2003) of the species- and tissue-level predictors. The model was implemented in STAN (Carpenter et al., 2016).
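
One possible way to write down a hierarchical binomial-logit model with non-centered tissue effects is sketched below. All symbols ($y_r$, $n_r$, $\psi_{s,t}$, $\alpha_s$, $\beta_{s,t}$, $\sigma_s$, $z_{s,t}$) are notation assumed for illustration. The priors and the exact way in which Ψ and ΔΨ are derived from the parameters are not specified in this section and are therefore left out.

```latex
y_r \sim \mathrm{Binomial}\left(n_r, \psi_{s,t}\right)
\qquad \text{for each sequencing run } r \text{ of species } s \text{ and tissue } t

\mathrm{logit}\left(\psi_{s,t}\right) = \alpha_s + \beta_{s,t}

\beta_{s,t} = \sigma_s \, z_{s,t}, \qquad z_{s,t} \sim \mathcal{N}\left(0, 1\right)
```

In the non-centered form the sampler explores the standardized effects $z_{s,t}$ and the scale $\sigma_s$ separately, which typically improves sampling when few observations are available per group.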

Model compilation, fitting and posterior sampling were carried out using the pystan version 2.4.0.1 interface. Posterior sampling was performed on 4 Markov chains with 2000 iterations and 1000 warmup iterations per chain. Posterior samples (after warmup iterations) were drawn from all chains and permuted. These were used to compute mean estimates for the species-specific baseline rate Ψ and the tissue-specific difference in each species compared to baseline ΔΨ. The 95% interval of all tissue-specific differences was computed from the posterior, and differences deemed to be credible if the interval did not contain zero.
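
The credibility criterion can be illustrated with a few lines of Python operating on a list of posterior draws. An equal-tailed interval computed with linear interpolation is assumed here, since the type of interval is not specified in the text. The function names and the synthetic draws are illustrative.

```python
def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of an ascending list."""
    rank = (len(sorted_values) - 1) * q / 100.0
    low = int(rank)
    high = min(low + 1, len(sorted_values) - 1)
    fraction = rank - low
    return sorted_values[low] + fraction * (sorted_values[high] - sorted_values[low])


def credible_difference(samples, level=95.0):
    """Return (is_credible, (lower, upper)) for posterior draws of a difference.

    The difference is deemed credible if the equal-tailed interval excludes zero.
    """
    ordered = sorted(samples)
    tail = (100.0 - level) / 2.0
    lower = percentile(ordered, tail)
    upper = percentile(ordered, 100.0 - tail)
    return (lower > 0 or upper < 0), (lower, upper)


# Synthetic draws (not real posterior samples): one shifted away from zero, one centred on it.
shifted, shifted_interval = credible_difference([0.1 + 0.001 * i for i in range(1000)])
centred, centred_interval = credible_difference([-0.5 + 0.001 * i for i in range(1000)])
```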

Characterization of splicing signatures

Expression estimates in all species were computed from normalized gene counts. Alignment files were processed separately with htseq-count from the HTSeq library version 0.6.1p1 (Anders et al., 2015). The script was run with default options, using the annotation in GFF format from the TopHat2 transcriptome index. Gene counts for each sample were pooled across technical replicates and TMM-normalized with edgeR version 3.8.6 (Robinson et al., 2010; Robinson and Oshlack, 2010). Tissue-level means in all species were computed from log10-transformed TMM-normalized counts. These were used to determine the mean over all tissues and the minimum among the three tissue estimates.

The tissue-specificity index τ (Kryuchkova-Mostacci and Robinson-Rechavi, 2016) was further computed from mean expression in the three tissues. As positive values are required, a pseudocount of 1 was added prior to log10-transformation (for those genes where the mean in all tissues was greater than zero).
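
The index can be computed as in the sketch below, which uses the usual definition of τ as the average over tissues of one minus the expression relative to the maximum, divided by the number of tissues minus one. The function names and example values are illustrative.

```python
import math


def tau_index(values):
    """Tissue-specificity index: 0 for uniform expression, 1 for a single-tissue gene.

    values: non-negative (log-transformed) expression, one entry per tissue.
    """
    top = max(values)
    if len(values) < 2 or top <= 0:
        raise ValueError("tau needs at least two tissues and a positive maximum")
    return sum(1.0 - v / top for v in values) / (len(values) - 1)


def tau_from_counts(mean_counts, pseudocount=1.0):
    """Apply the pseudocount and log10 transformation before computing tau."""
    return tau_index([math.log10(c + pseudocount) for c in mean_counts])


# Mean normalized counts of 9, 99 and 999 become 1, 2 and 3 after log10(x + 1).
example_tau = tau_from_counts([9, 99, 999])
```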

phastCons 7-way tracks (Siepel et al., 2005) were retrieved from UCSC for the GRCh38/hg38 human genome assembly (Kent et al., 2002). For all human orthologs with a matched alternative site, conservation was assessed at the donor and acceptor region and at the transcription start site. Potential CpG islands were further predicted based on the observed-to-expected frequency of CpG dimers over consecutive windows of 200 bp, as described in Gardiner-Garden and Frommer (1987).
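
A minimal sketch of the window-based prediction is given below. The observed-to-expected ratio is computed as (number of CpG × window length) / (number of C × number of G). The thresholds (GC content above 50% and ratio above 0.6) are the ones commonly attributed to the cited criteria and are assumed here, as is the use of non-overlapping windows. The cut-offs applied in the thesis are not stated in this section.

```python
def cpg_obs_exp(window):
    """Observed-to-expected CpG ratio of a sequence window."""
    window = window.upper()
    n_c, n_g = window.count("C"), window.count("G")
    if n_c == 0 or n_g == 0:
        return 0.0
    return window.count("CG") * len(window) / (n_c * n_g)


def cpg_island_windows(sequence, size=200, min_gc=0.5, min_ratio=0.6):
    """Start positions of consecutive, non-overlapping windows passing both thresholds."""
    hits = []
    for start in range(0, len(sequence) - size + 1, size):
        window = sequence[start:start + size].upper()
        gc_content = (window.count("C") + window.count("G")) / size
        if gc_content > min_gc and cpg_obs_exp(window) > min_ratio:
            hits.append(start)
    return hits


# Toy sequence: one CpG-rich window followed by one AT-only window.
toy = "CG" * 100 + "AT" * 100
islands = cpg_island_windows(toy)
```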

Per-tissue protein abundances in human and mouse were collected from PaxDb version 4.0 (Wang et al., 2015) for brain and kidney, and from heart as a proxy for skeletal muscle. Integrated tissue-specific abundances from weighted averages across different proteomics data sets were used. These are expressed as parts per million estimates (ppm). The mean over all tissues was computed, as in the case of expression, from log10-transformed abundances and tissue-specificity index τ similarly determined by adding a pseudocount of 1 prior to log10-transformation for all those proteins with an abundance greater than zero in all tissues.

Results
