A recurring theme throughout my thesis has been that dropouts can be a substantial confounder when attempting to detect isoforms in individual cells, but it is unknown how best to correct for this confounder. There are three very different approaches that could be taken to address this problem. The first is to build models of technical and biological dropouts that would enable dropouts to be resolved in individual cells with high confidence. This is essentially isoform-level imputation. Multiple tools have been developed that attempt to impute scRNA-seq data quantified at the gene level (Wagner et al., 2017; Li and Li, 2018; Huang et al., 2018; Gong et al., 2018; van Dijk et al., 2018; Eraslan et al., 2019); however, their poor performance in a recent benchmark illustrates that this is a highly challenging problem that we are some way from solving (Andrews and Hemberg, 2018b). In general, dropout-based bioinformatics methods such as imputation and Andrews and Hemberg’s Michaelis-Menten model have focused on correcting for dropouts for applications such as clustering and feature selection. Different approaches might be required when attempting to resolve dropouts in individual cells; for example, factors such as the cell’s library size and physical size might need to be accounted for. Developing such approaches could enable more accurate splicing analyses to be performed in future. However, as the poor performance of existing imputation tools illustrates, this remains an extremely challenging problem (Andrews and Hemberg, 2018a).
The second approach to addressing the confounding effects of dropouts would be to increase the capture efficiency of scRNA-seq. It is hypothesised that dropouts occur due to inefficiencies in the enzymatic process of reverse transcription (Kharchenko et al., 2014). If this hypothesis is correct, improving the efficiency of the enzymatic reactions that occur during library preparation could reduce the frequency of dropouts in scRNA-seq. Whilst this thesis was being written, a new library preparation protocol called SMART-seq3 was released (Hagemann-Jensen et al., 2019). One of the stated improvements in SMART-seq3 relative to SMART-seq2 was that the efficiency of several enzymatic reactions in library preparation had been improved, which could theoretically improve the capture efficiency of SMART-seq3. Indeed, in their preprint, Hagemann-Jensen et al. showed that SMART-seq3 detected more genes per cell on average than SMART-seq2, which would be consistent with an increased capture efficiency. Hagemann-Jensen et al. also claimed that SMART-seq3 on average detected an estimated 69% of the molecules detected from four moderately expressed genes using smFISH. However, only four genes were reported and it is unclear how these estimates were generated, so these claims should be treated with caution. An independent benchmark of library preparation protocols, including SMART-seq3, is required to determine whether the capture efficiency of SMART-seq3 is genuinely elevated relative to other library preparation methods. If SMART-seq3 truly does have a higher capture efficiency, it could play an important role in enabling the detection of isoforms in individual cells.
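To make the link between capture efficiency and isoform detection concrete, the sketch below assumes a simple binomial capture model in which every mRNA molecule is captured independently with the same probability. This model, the function names and the example numbers are illustrative assumptions of mine; they are not taken from Hagemann-Jensen et al., nor are they the simulation model used elsewhere in this thesis.

```python
from typing import Sequence


def dropout_probability(n_molecules: int, capture_efficiency: float) -> float:
    """Probability that none of `n_molecules` is captured.

    Assumption: each molecule is captured independently with probability
    `capture_efficiency` (a simple binomial capture model).
    """
    if n_molecules < 0:
        raise ValueError("n_molecules must be non-negative")
    if not 0.0 <= capture_efficiency <= 1.0:
        raise ValueError("capture_efficiency must lie in [0, 1]")
    return (1.0 - capture_efficiency) ** n_molecules


def prob_all_isoforms_detected(
    molecules_per_isoform: Sequence[int], capture_efficiency: float
) -> float:
    """Probability that every isoform of a gene has at least one captured molecule.

    Assumption: isoforms are captured independently of one another.
    """
    prob = 1.0
    for n in molecules_per_isoform:
        prob *= 1.0 - dropout_probability(n, capture_efficiency)
    return prob


if __name__ == "__main__":
    # Hypothetical gene expressing two isoforms with 5 and 2 molecules in a cell.
    for efficiency in (0.1, 0.3, 0.7):
        p = prob_all_isoforms_detected([5, 2], efficiency)
        print(f"capture efficiency {efficiency:.1f}: P(both isoforms detected) = {p:.3f}")
```

Under these assumptions, the probability of detecting every isoform a gene expresses falls off quickly for lowly expressed isoforms, which is why even a modest gain in capture efficiency could matter for isoform detection.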
The final approach to addressing the confounding effects of dropouts in scRNA-seq data is to use a different technology to detect isoforms in individual cells. smFISH is the most obvious candidate. Whilst smFISH has traditionally been a low-throughput technology that struggles to distinguish between isoforms, recent and future improvements in throughput (Eng et al., 2019; Moffitt et al., 2016) and techniques to distinguish between similar molecules (Levesque et al., 2013) could make a high-throughput study of how many isoforms are expressed per gene per cell increasingly feasible. An smFISH dataset resolving the number of isoforms detected per gene per cell for a hundred or so genes would be hugely valuable to the scRNA-seq community. This dataset could be used as a ground truth to benchmark scRNA-seq methods for inferring isoform number. Additionally, such an smFISH dataset could be used to train and test machine learning approaches, which are currently impossible due to a lack of training data.
An important point to recognise is that my simulation-based approach only focussed on isoform detection. Establishing the relative magnitude of expression of isoforms is likely to be of interest to many researchers; however, simply detecting isoforms accurately is currently problematic. Therefore, accurately inferring the relative magnitude of expression of isoforms in individual cells is not yet feasible in my view. Furthermore, the two library preparation protocols which performed well in my benchmarking study (SMARTer and SMART-seq2) do not add UMIs to transcripts and so suffer from PCR amplification bias. This is likely to substantially confound attempts to infer magnitude of expression. SMART-seq3 does add UMIs to some reads; however, Hagemann-Jensen et al. demonstrate that the UMI-containing reads have a substantial bias towards the 5’ end of the transcript (Hagemann-Jensen et al., 2019). This 5’ bias is likely to make the detection and quantification of isoforms that differ at their 3’ end challenging.
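The way PCR amplification bias distorts read counts, and the way UMIs mitigate it, can be illustrated with a toy example. The reads, isoform labels and UMI sequences below are hypothetical, and collapsing on exact UMI matches is a simplification: it ignores sequencing errors in the UMI, which real pipelines typically have to handle.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def count_reads_and_umis(
    reads: Iterable[Tuple[str, str]],
) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Return (reads per isoform, distinct UMIs per isoform).

    Each read is an (isoform, umi) pair. Reads sharing an isoform and a UMI
    are treated as PCR copies of one original molecule (exact-match
    collapsing only; an illustrative simplification).
    """
    read_counts: Dict[str, int] = defaultdict(int)
    umis_seen: Dict[str, set] = defaultdict(set)
    for isoform, umi in reads:
        read_counts[isoform] += 1
        umis_seen[isoform].add(umi)
    umi_counts = {isoform: len(umis) for isoform, umis in umis_seen.items()}
    return dict(read_counts), umi_counts


# Hypothetical cell: each isoform contributed two captured molecules, but one
# isoform_A molecule was amplified far more efficiently than the others.
reads = (
    [("isoform_A", "AACG")] * 8
    + [("isoform_A", "TTGC")] * 2
    + [("isoform_B", "GGAT"), ("isoform_B", "CCTA")]
)
read_counts, umi_counts = count_reads_and_umis(reads)

if __name__ == "__main__":
    print("reads per isoform:", read_counts)
    print("UMIs per isoform: ", umi_counts)
```

In this toy case, read counts suggest a fivefold difference between the two isoforms, whereas the UMI counts recover the equal number of captured molecules. Without UMIs, as in SMARTer and SMART-seq2, the two scenarios cannot be told apart from the reads alone.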
scRNA-seq is a very dynamic field, and many researchers are actively working on new technological developments. In the next section, I consider whether up-and-coming developments in scRNA-seq technologies could improve the feasibility of studying splicing in the future.