A nonlinear systemic approach to genome analysis

Doctoral Thesis

Supervised by Dr. Hilario Ramírez Rodrigo

This work was carried out within the doctoral program Biochemistry and Molecular Biology (113.89.1)

University of Granada, November 2015

Publisher: Universidad de Granada. Tesis Doctorales

Author: Jens Warfsmann

ISBN: 978-84-9125-816-2

URI: http://hdl.handle.net/10481/43537

Commitment to respect authors' rights

The doctoral candidate Jens Warfsmann and the thesis supervisor Hilario Ramírez Rodrigo guarantee, by signing this doctoral thesis, that the work was carried out by the doctoral candidate under the direction of the thesis supervisor and that, to the best of our knowledge, the rights of other authors to be cited were respected in carrying out the work whenever their results or publications were used.

Granada, 15.11.2015

Thesis supervisor: Hilario Ramírez Rodrigo (signed)

Doctoral candidate: Jens Warfsmann (signed)

"[...] biological systems are a product of evolution. If mathematics is the art of the perfect and physics is the art of the optimal, biology is nothing more than the art of the satisfactory: anything will do, as long as it works. [...]"

Sidney Brenner, El País, 12-01-1999

List of Figures

2.1 Research fields associated with complexity science.
2.2 Robust reactions of the system: to stay or to change.
2.3 Global depiction of epigenomic alterations during oncogenesis.
2.4 Nucleotide composition continuum of completely sequenced mitochondrial and plastid DNA sequences.
2.5 Support vector machines: Finding hyperplanes.
2.6 Support vector machines: Search for the maximum margin hyperplane.
2.7 Support vector machines: Finding the “convex hull”.
2.8 Support vector machines: Mapping into higher-dimensional spaces.
5.1 Workflow of a nonlinear systemic approach for genome analysis.
5.2 Influence of the threshold ε on the RQA measures shown for the standard Rössler system.
5.3 Standard Rössler attractor for a = 0.2; b = 0.2; c = 5.7.
5.4 Time series built from the y-coordinates of a standard Rössler system and its reconstruction.
5.5 Distance contour plot and recurrence plot (ε = 0.22) for the reconstruction shown in 5.4.
5.6 Visualization of the directed acyclic graph describing the first entry from a magetab document.
5.7 Structure of an analysis working directory.

5.8 DNA methylation density time series from a normal head and neck sample.
5.9 Graphical representation of reconstruction parameters.
5.10 Multi scatter plot of a reconstructed phase space.
5.11 Projection of a reconstructed phase space to the first three dimensions.
5.12 a) Distance contour plot and b) recurrence plot (ε = 0.126) for the reconstruction shown in 5.10.
5.13 Recurrence plots of phase spaces representing a) colon and rectal adenocarcinoma, b) normal colon tissue and c) a difference plot (AJRP) of both.
5.14 Cluster analysis of RQA measures and lacunarity.
5.15 Pairwise comparison of RQA measures and lacunarity.
5.16 ROCs and AUCs used to evaluate the binary classifications of complete data sets.
5.17 Performance of a marker-based and a marker-less head and neck cancer versus normal tissue SVM binary classification.
5.18 ROC curves and AUCs for head and neck tumor versus normal tissue SVM binary classification based on DNA-methylation data from TCGA.
5.19 Boxplots for β-value sums.
5.20 Boxplots of β-value sums including CpG sites located only in BLOCKS.
5.21 Boxplots of β-value sums including CpG sites located only in regions belonging to cancer related gene symbols, BLOCKS and cDMR.
5.22 Boxplots of β-value sums including CpG sites located only in regions belonging to cancer related gene symbols and cDMR.
5.23 Boxplots of β-value sums including CpG sites located only in cDMR.
5.24 The ITAG annotation pipeline¹ and the activity diagram of the PCB gene ontology annotation protocol.

5.25 A casual drawing of mated distance patterns of the tomato genome.
5.26 Casual drawing of plausible and implausible chimpanzee migrations.

List of Tables

5.1 The head of a TCGA 3rd level HumanMethylation450 data file.
5.2 Description of the barcode metadata.
5.3 Statistical summary of all delays calculated for the marker-less cancer classification.
5.4 Statistical summary of all embedding dimensions calculated for the marker-less cancer classification.
5.5 Sample number distribution for the different data sets used in the tumor versus normal tissue classification on systemic features and complete CpG sets.
5.6 Areas under the curve (AUC) in % for marker-less classifications.
5.7 Areas under the curve (AUC) in % for marker-less classifications. Continued.
5.8 Wilcoxon rank sum test with continuity correction on β-value sums for complete and reduced DNA methylation data sets of different cancer types.
5.9 Areas under the curve (AUCs) in % for the β-value sum classifications.
5.10 Functional distances as a function of inter-gene distances of the + strand for the tomato chromosomes 1 to 12.

5.11 Gene base position distance time series, their reconstructed phase space distance plots and the corresponding recurrence plots.
5.12 Sample number distribution for Mt genomes of the genus Pan.
5.13 Graphs from the recurrence analysis of time series based on the CG content of Pan Mt-genome sequences.
5.14 ROCs and AUCs for systemic classifications on CG content of Mt-genome sequences from the genus Pan.
5.15 Sample number distribution for Mt genomes of the suborder Caniformia.
5.16 Graphs from the recurrence analysis of time series based on the CG content of Caniformia Mt-genome sequences.
