TFBSs play critical roles in regulating the transcription rates and expression levels of their target genes. Knowledge of genome-scale TFBSs can greatly aid the elucidation of gene regulatory mechanisms in a cell. Hence, de novo motif identification and associated computational analyses (e.g., motif scanning and comparison) play an important role in regulatory network construction in all organisms.
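To make "motif scanning" concrete, the following is a minimal sketch of scoring every window of a sequence against a log-odds position weight matrix. The count matrix, the uniform background, the pseudocount of 1, and the function names `log_odds_matrix` and `scan` are illustrative assumptions and are not taken from DMINDA 2.0 or DESSO.

```python
import math

# Hypothetical count matrix for a 4-bp motif (one dict per motif position).
COUNTS = [
    {"A": 8, "C": 0, "G": 1, "T": 1},
    {"A": 0, "C": 9, "G": 0, "T": 1},
    {"A": 1, "C": 0, "G": 9, "T": 0},
    {"A": 0, "C": 1, "G": 0, "T": 9},
]
# Assumed uniform background nucleotide frequencies.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def log_odds_matrix(counts, background, pseudocount=1.0):
    """Convert per-position counts into log2 odds against the background."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column[base] + pseudocount) / total) / background[base])
            for base in "ACGT"
        })
    return pwm


def scan(sequence, pwm):
    """Return (start, score) for every window as wide as the motif."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        hits.append((start, score))
    return hits


pwm = log_odds_matrix(COUNTS, BACKGROUND)
hits = scan("TTACGTGG", pwm)
best_start, best_score = max(hits, key=lambda hit: hit[1])
print(best_start, round(best_score, 2))
```

In practice a score cutoff decides which windows are reported as putative binding sites, and the reverse strand is usually scanned as well; both are omitted here for brevity.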

Although many algorithms and tools have been developed over the past few decades, phylogenetic footprinting-based algorithms have not been fully explored. In addition, no previous work has considered the relationships among identified motifs (or their corresponding genes) or how to visualize them. Existing deep learning (DL) models use a subjective threshold in motif identification, which leads to severe false-positive and false-negative problems.

Furthermore, DNA shape has not been integrated with DL for shape motif identification. To overcome these limitations, we have made two main contributions. First, we have developed an integrated web server, DMINDA 2.0, which contains: (i) five motif prediction and analysis algorithms, including a phylogenetic footprinting framework; (ii) 2,125 species with complete genomes to support the above five functions, covering animals, plants, and bacteria; and (iii) bacterial regulon prediction and visualization. Compared to other existing web servers, DMINDA 2.0 provides a more comprehensive set of motif analysis functions. DMINDA 2.0 is freely available at http://bmbl.sdstate.edu/DMINDA2.

Second, we have proposed a DL framework, DESSO, which identifies both sequence and shape motifs from ChIP-seq data. To optimize the threshold in motif identification, we introduced a binomial distribution to select the best threshold based on a P-value strategy. Results show that DNA shape also has strong predictive power in TF-DNA binding specificity prediction. In addition, shape motifs are prevalent and can help explain why TFs with the same sequence motif bind to distinct genomic regions.
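One way to read the binomial P-value strategy is sketched below. The text does not spell out the exact procedure, so this is an assumption: for each candidate cutoff, the hit rate in a background set is taken as the success probability, and the cutoff whose hit count in the positive set is least likely under that rate is kept. The function names, the add-one smoothing, and the toy scores are all hypothetical.

```python
import math


def binomial_tail(k, n, p):
    """Upper tail P(X >= k) for X ~ Binomial(n, p)."""
    return sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i)
        for i in range(k, n + 1)
    )


def select_threshold(pos_scores, neg_scores, candidates):
    """Return (cutoff, p_value) for the candidate with the smallest P-value."""
    n = len(pos_scores)
    best = None
    for cutoff in candidates:
        # Number of positive sequences whose best score reaches the cutoff.
        k = sum(score >= cutoff for score in pos_scores)
        # Background hit rate, smoothed so it is never exactly 0 or 1.
        p = (sum(score >= cutoff for score in neg_scores) + 1) / (len(neg_scores) + 2)
        p_value = binomial_tail(k, n, p)
        if best is None or p_value < best[1]:
            best = (cutoff, p_value)
    return best


# Toy per-sequence maximum activation scores (illustrative only).
pos_scores = [0.9, 0.8, 0.85, 0.7, 0.95, 0.3, 0.75, 0.88]
neg_scores = [0.2, 0.4, 0.1, 0.65, 0.3, 0.5, 0.35, 0.72]
best_cutoff, best_p = select_threshold(pos_scores, neg_scores, [0.25, 0.5, 0.6, 0.9])
print(best_cutoff, best_p)
```

The point of the design is that the cutoff is chosen by a statistical criterion rather than fixed by hand: a cutoff that is too low is penalized by a high background rate, and one that is too high is penalized by a low hit count in the positive set.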

Compared to a CNN, the GCNN model proposed in this study can improve TF-DNA binding specificity prediction on long DNA sequences. This improvement stems mainly from GCNN’s ability to capture motif dependencies.
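The text does not expand the abbreviation GCNN. Assuming it denotes a gated convolutional network built from gated linear units, where one convolution produces values and a second, sigmoid-activated convolution decides how much of each value passes through, a single-channel layer can be sketched as follows. The kernels, signal, and function names are hypothetical and do not reproduce the architecture used in this study.

```python
import math


def conv1d(signal, kernel, bias=0.0):
    """Valid 1-D cross-correlation of a single-channel signal."""
    width = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(width)) + bias
        for i in range(len(signal) - width + 1)
    ]


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_conv1d(signal, value_kernel, gate_kernel, value_bias=0.0, gate_bias=0.0):
    """Gated linear unit: conv(signal, W) multiplied by sigmoid(conv(signal, V))."""
    values = conv1d(signal, value_kernel, value_bias)
    gates = conv1d(signal, gate_kernel, gate_bias)
    return [value * sigmoid(gate) for value, gate in zip(values, gates)]


signal = [0.0, 1.0, 0.0, 0.0, 1.0, 1.0]
# With an all-zero gate kernel every gate equals sigmoid(0) = 0.5,
# so the output is exactly half of the plain convolution.
output = gated_conv1d(signal, value_kernel=[1.0, 1.0], gate_kernel=[0.0, 0.0])
print(output)
```

Because the gate is computed from the same input window, a trained layer can pass or suppress a motif detector's response depending on the surrounding context, which is one plausible way such a model could capture dependencies between motifs.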
