Inference of tumor subclonal composition and evolution from single or multiple bulk samples has attracted considerable attention in the computational community. Since 2012, dozens of computational methods for solving these problems have been introduced. Given the large number of methods and the wide spectrum of methodologies used, it is beyond the scope of this thesis to discuss each method separately and in detail. Instead, we discuss only a selected subset of methods that, in our opinion, represent some of the most important advancements in the field. In addition, we list some of the other existing methods.
TrAp [159] was one of the first proposed methods that infers tumor phylogenies without clustering mutations into subclones. Because this method employs exhaustive search, its applicability is limited to small sets of mutations (usually up to 30). rec-BTP [60] is an approximation algorithm that infers a binary tree whose nodes correspond to subclones. Both methods work only with single bulk sample data and, unlike the other methods, both rely on a strong parsimony assumption.
AncesTree [38], CITUP [97], LICHeE [128] and Cloe [98] are methods designed for multiple bulk sample data. CITUP and AncesTree are combinatorial optimization algorithms employing Quadratic Integer Programming (CITUP) and Mixed Integer Linear Programming (AncesTree), LICHeE is a heuristic algorithm, and Cloe uses Metropolis-coupled MCMC search [81]. Each of these methods is restricted to mutations from diploid regions. This limits their applicability to largely aneuploid tumors, unless cellular prevalence values of individual mutations are provided. If provided with these values, CITUP can use them to infer clonal trees. We have seen this practice in several biological studies of cancer, in which PyClone, which adjusts cellular prevalence values when a mutation falls into a region affected by a CNA, was run first and the cellular prevalences it reported were then used as input to CITUP to infer tumor phylogenies [1, 75, 79, 15, 167]. In Chapter 3 we introduce CTPsingle, a method similar to the methods discussed above. The main advantage of CTPsingle over these methods is its performance on sequencing data of lower coverage. On the other hand, CTPsingle uses only mutations from diploid regions of the genome and is currently constrained to single bulk sample input.
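To illustrate why restricting the analysis to diploid regions is convenient, the following minimal sketch converts a variant allele frequency (VAF) into a cellular prevalence. It assumes a heterozygous SNV at a locus that is diploid in every cell, with exactly one mutated copy per carrier cell. The function names and the `purity` parameter are hypothetical and are not taken from any of the tools discussed above.

```python
def cellular_prevalence(vaf: float) -> float:
    """Fraction of all cells in the sample that carry a heterozygous SNV.

    Assumption: the locus is diploid in every cell and each carrier cell
    has exactly one mutated copy, so VAF = prevalence / 2.
    """
    if not 0.0 <= vaf <= 1.0:
        raise ValueError("vaf must be in [0, 1]")
    # Sampling noise can push 2 * VAF slightly above 1, so cap the result.
    return min(1.0, 2.0 * vaf)


def cancer_cell_fraction(vaf: float, purity: float) -> float:
    """Fraction of cancer cells carrying the SNV, given tumor purity.

    Assumption: normal cells do not carry the mutation, so the cellular
    prevalence is diluted by the fraction of cancer cells in the sample.
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    return min(1.0, cellular_prevalence(vaf) / purity)
```

Under these assumptions, a VAF of 0.25 corresponds to a cellular prevalence of 0.5. Once a CNA changes the number of copies at the locus, this simple relation no longer holds, which is where a tool such as PyClone is typically run first.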
PhyloWGS [33] extends the above methods by modelling CNA events that overlap with SNVs. The list of CNAs is expected to be provided as part of the input and is used to adjust cellular prevalence values for mutations affected by these events. PhyloWGS makes several simplifying assumptions about the relative timing of CNA and SNV events. Because it allows deletions, it only partially relies on ISA. PhyloWGS is an extension of PhyloSub [70]; it supports multiple bulk samples, and its inference procedure is based on tree-structured stick breaking and MCMC. SPRUCE, a method based on exhaustive enumeration [81], extends previous methods by allowing violations of ISA. More precisely, SPRUCE uses the infinite alleles assumption, under which a character may change state multiple times on the tree but cannot reach the same state more than once [39].
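The difference between the two assumptions can be sketched as a pair of checks over the state changes placed on the edges of a tree. This is an illustrative sketch only, not SPRUCE's algorithm: the representation of changes as `(character, new_state)` pairs, the function names, and the convention that every character starts in state `0` at the root are all assumptions made for the example.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

# One state change on a tree edge: (character, state the character changes to).
Change = Tuple[Hashable, Hashable]


def satisfies_infinite_sites(changes: Iterable[Change]) -> bool:
    """True if every character changes state at most once on the tree."""
    seen: Set[Hashable] = set()
    for character, _new_state in changes:
        if character in seen:
            return False
        seen.add(character)
    return True


def satisfies_infinite_alleles(changes: Iterable[Change],
                               root_state: Hashable = 0) -> bool:
    """True if no character reaches the same state more than once.

    Assumption: every character is in `root_state` at the root, so a
    change back to `root_state` counts as reaching that state again.
    """
    reached: Dict[Hashable, Set[Hashable]] = {}
    for character, new_state in changes:
        states = reached.setdefault(character, {root_state})
        if new_state in states:
            return False
        states.add(new_state)
    return True
```

For example, the changes `[("A", 1), ("A", 2)]` violate the first check but pass the second, whereas `[("A", 1), ("A", 0)]` fail both because character `A` returns to its root state.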
SubcloneSeeker [130], SCHISM [120], Canopy [69], PASTRI [145], MIPUP [66], BAMSE [164] and CALDER [114] can also be classified as belonging to this group.
2.2.7 Methods based on the use of CNAs and other types of mutations
Studying ITH and evolution by using CNA events as the main identifiers of tumor evolutionary history and subclonal composition is a very challenging task. Genomic regions affected by CNAs are usually identified by comparing depth of coverage across regions (e.g., regions with copy number gains are expected to have increased coverage). Sometimes, the ratio of coverage at heterozygous SNP sites is considered in order to discern between gains and losses of the copies of two homologous chromosome regions. However, identification of CNAs is an arduous task, especially for CNA events present in a small fraction of cells, since in these cases the change in coverage for the affected genomic region is minor.
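The weakness of the coverage signal for CNAs present in few cells can be made concrete with a small calculation. The sketch below assumes an idealized model in which read depth is proportional to the average number of copies per cell and all cells without the CNA are diploid; the function names are hypothetical.

```python
import math


def expected_depth_ratio(copy_number: int, cell_fraction: float) -> float:
    """Expected depth of a region relative to a fully diploid sample.

    Assumption: a fraction `cell_fraction` of all cells carries the CNA
    with total copy number `copy_number`; the remaining cells have 2 copies.
    """
    if copy_number < 0:
        raise ValueError("copy_number must be non-negative")
    if not 0.0 <= cell_fraction <= 1.0:
        raise ValueError("cell_fraction must be in [0, 1]")
    mean_copies = cell_fraction * copy_number + (1.0 - cell_fraction) * 2.0
    return mean_copies / 2.0


def expected_log2_ratio(copy_number: int, cell_fraction: float) -> float:
    """log2 of the expected depth ratio (-inf for a complete loss)."""
    ratio = expected_depth_ratio(copy_number, cell_fraction)
    return math.log2(ratio) if ratio > 0.0 else float("-inf")
```

Under this model, a single-copy gain present in all cells raises the expected depth by 50% (ratio 1.5), while the same gain present in 10% of cells raises it by only 5% (ratio 1.05), a shift that is easily masked by coverage noise.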
Another complication arises from the fact that many CNA events affect large genomic regions, which implies that overlaps of different CNAs cannot be excluded as unlikely events. Incorporating such events into the tree of tumor evolution is very difficult, and there are usually multiple equally likely and biologically plausible descriptions of the input signals. These complexities are most likely among the main reasons that methods for inferring trees of tumor evolution from CNA data are still underdeveloped (one method was recently introduced in [185]).
ASCAT [168] and ABSOLUTE [14] are methods primarily designed for SNP array data, although both can be applied to NGS data (with some deterioration in performance). When used for NGS data, ASCAT and ABSOLUTE are mostly suitable for detecting clonal CNAs (i.e., CNAs present in all cancerous cells). Recent research suggests that in some cases most of the CNAs occur during the early stages of tumor evolution [172, 88]. In these cases ASCAT and ABSOLUTE can give potentially interesting findings and help in identifying clonal CNA events. Although ASCAT was developed almost a decade ago, it is still used in practice (e.g., a recent study [166] used ASCAT to generate copy number calls from bulk NGS data).
THetA [123] and THetA2 [124] are methods specifically designed for bulk NGS data. While ASCAT and ABSOLUTE only infer tumor purity and the copy number of clonal CNAs, THetA and THetA2 also identify tumor subclonal composition. THetA relies solely on read depth and does not distinguish between different homologous chromosomes. THetA2 extends THetA by improving computational efficiency and by allowing the inclusion of B-allele frequencies (computed at heterozygous SNP sites) in order to prioritize among solutions that are equally likely when only read depth information is used. APOLLOH [59] and Control-FREEC [6] work on WGS data without accounting for multiple subclones, whereas OncoSNP [182] is primarily designed for SNP array data and also assumes that the tumor sample contains only one population of cancerous cells. OncoSNP-seq [181], an extension of OncoSNP to WGS data, allows for multiple different cancer cell genotypes but does not model clustering of cells with similar genotypes (i.e., cells from the same subclone). TITAN [58] is a probabilistic approach that works on WGS data and takes B-allele frequencies into account. In principle, it solves the same set of problems as THetA2.
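A small example shows why B-allele frequencies can separate solutions that read depth alone cannot. The sketch below is not the model used by THetA2 or TITAN; it assumes an idealized mixture in which a fraction of cells carries allele-specific copy numbers `(copies_a, copies_b)` and all other cells are heterozygous diploid, and all function names are hypothetical.

```python
def observed_baf(a_reads: int, b_reads: int) -> float:
    """B-allele frequency at a heterozygous SNP site from read counts."""
    total = a_reads + b_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return b_reads / total


def mean_total_copies(copies_a: int, copies_b: int, cell_fraction: float) -> float:
    """Average number of copies per cell, which is what read depth reflects.

    Assumption: cells without the CNA have one copy of each allele.
    """
    return cell_fraction * (copies_a + copies_b) + (1.0 - cell_fraction) * 2.0


def expected_baf(copies_a: int, copies_b: int, cell_fraction: float) -> float:
    """Expected B-allele frequency under the same mixture assumption."""
    total = mean_total_copies(copies_a, copies_b, cell_fraction)
    if total == 0.0:
        raise ValueError("region is completely deleted in all cells")
    b_copies = cell_fraction * copies_b + (1.0 - cell_fraction) * 1.0
    return b_copies / total
```

In this model, a copy-neutral loss of heterozygosity with allele-specific copy numbers (2, 0) leaves the average total copy number at 2, exactly as in the unaltered (1, 1) state, so read depth does not change. The expected B-allele frequency, however, drops from 0.5 to 0.25 when half of the cells carry the event.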
Potential directions for future work include the development of more advanced algorithms for jointly inferring both the tree of tumor evolution and subclonal proportions. Also, the above methods, with the exception of the method for solving the CNTMD problem introduced in [185], are limited to the analysis of data from a single sample. Using the shared topology and data from multiple bulk samples can potentially lead to more accurate inference of the history of tumor evolution, as is the case for methods working with SNV data.
The design of methods based on other types of structural variants is still in its infancy, with two recently introduced methods, SVclone [21] and TUSV [37]. SVclone infers the cancer cell fraction of variant breakpoints from whole-genome sequencing data and is still in the preprint phase. TUSV is a more recent method that reconstructs tumor phylogenies and subclonal composition from whole-genome sequencing data. As input, TUSV uses variant calls obtained by Weaver [89], and its algorithm is mostly based on Integer Linear Programming. Nevertheless, there is still considerable room for improvement of these methods [37], and we hope that SVclone and TUSV will inspire the development of more sophisticated methods in the future.