1. EL PALIMPSESTO TRANSMEDIÁTICO
1.2 LA TRANSMEDIA
1.2.2 LAS NUEVAS ESCRITURAS
TFBSs play critical roles in regulating transcription rates and expression levels of their target genes. The knowledge of genome-scale TFBSs can greatly help the
elucidation of gene regulatory mechanisms in a cell. Hence, de-novo motif identification and associated computational analyses (e.g., motif scanning and comparison) play an important role in regulatory network construction in all organisms.
Although substantial algorithms and tools have been developed in the past few decades, phylogenetic footprinting-based algorithms have not been fully considered. Additionally, no such work has considered the relationship and visualization among identified motifs (or corresponding genes). Existing DL models use a subjective threshold in motif identification, which incurs severe false positive and false negative issues.
Furthermore, DNA shape has not been integrated with DL in shape motif identification. To overcome these limitations, we have made two main contributions as follows: We have developed an integrated web server, DMINDA 2.0, which contains: (i) five motif prediction and analysis algorithms, including a phylogenetic footprinting
framework; (ii) 2,125 species with complete genomes to support the above five functions, covering animals, plants, and bacteria; and (iii) bacterial regulon prediction and
visualization. Compared to other existing web servers, DMINDA 2.0 provides comprehensive motif analysis functions. DMINDA 2.0 is freely available at http://bmbl.sdstate.edu/DMINDA2.
We have proposed a DL framework, DESSO, which is used to identify both sequence and shape motifs from ChIP-seq data. To optimize the threshold in motif identification, we introduced a binomial distribution to select the best threshold based on
a P-value strategy. Results show that DNA shape also has strong predictive power in TF- DNA binding specificity prediction. In addition, shape motifs are prevalent and can help interpret why TFs with the same sequence motif bind to distinct genome regions.
Compared to CNN, the GCNN model proposed in this study can improve TF-DNA binding specificity prediction on long DNA sequences. This performance mainly benefits from GCNN’s capability in capturing motif dependencies.
REFERENCES
1. D'Haeseleer, P., What are DNA sequence motifs? Nature Biotechnology, 2006.
24(4): p. 423-5.
2. Brohee, S., et al., Unraveling networks of co-regulated genes on the sole basis of
genome sequences. Nucleic Acids Res, 2011. 39(15): p. 6340-58.
3. Davidson, E. and M. Levin, Gene regulatory networks. Proc Natl Acad Sci U S A, 2005. 102(14): p. 4935.
4. Liu, B., et al., Bacterial regulon modeling and prediction based on systematic cis
regulatory motif analyses. Scientific reports, 2016. 6.
5. Baumbach, J., On the power and limits of evolutionary conservation--unraveling
bacterial gene regulatory networks. Nucleic Acids Res, 2010. 38(22): p. 7877-84.
6. Liu, B., et al., An algorithmic perspective of de novo cis-regulatory motif finding
based on ChIP-seq data. Briefings in Bioinformatics, 2017: p. bbx026.
7. Stormo, G.D., DNA binding sites: representation and discovery. Bioinformatics, 2000. 16(1): p. 16-23.
8. Zambelli, F., G. Pesole, and G. Pavesi, Motif discovery and transcription factor
binding sites before and after the next-generation sequencing era. Brief
Bioinform, 2013. 14(2): p. 225-37.
9. Nishida, K., M.C. Frith, and K. Nakai, Pseudocounts for transcription factor
binding sites. Nucleic Acids Res, 2009. 37(3): p. 939-44.
10. Nishida, K., M.C. Frith, and K. Nakai, Pseudocounts for transcription factor
11. Weirauch, M.T., et al., Evaluation of methods for modeling transcription factor
sequence specificity. Nat Biotechnol, 2013. 31(2): p. 126-34.
12. Ma, Q., et al., An integrated toolkit for accurate prediction and analysis of cis-
regulatory motifs at a genome scale. Bioinformatics, 2013. 29(18): p. 2261-8.
13. Das, M.K. and H.K. Dai, A survey of DNA motif finding algorithms. BMC
Bioinformatics, 2007. 8 Suppl 7: p. S21.
14. Bailey, T.L. and C. Elkan. Fitting a mixture model by expectation maximization to
discover motifs in biopolymers. in International Conference on Intelligent Systems for Molecular Biology. 1994.
15. Ikebata, H. and R. Yoshida, Repulsive parallel MCMC algorithm for discovering
diverse motifs from large sequence sets. Bioinformatics, 2015. 31(10): p. 1561-8.
16. Park, P.J., ChIP-seq: advantages and challenges of a maturing technology. Nature Reviews Genetics, 2009. 10(10): p. 669-680.
17. Lingyun Song, G.E.C., DNase-seq: a high-resolution technique for mapping
active gene regulatory elements across the genome from mammalian cells. Cold
Spring Harbor Protocols, 2010. 2010(2).
18. Wang, Z., et al., RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics, 2008. 10(1): p. 57-63.
19. Tsankov, A.M., et al., Transcription factor binding dynamics during human ES
cell differentiation. Nature, 2015. 518(7539): p. 344-9.
20. Wu, F., B.G. Olson, and J. Yao, DamID-seq: Genome-wide Mapping of Protein-
DNA Interactions by High Throughput Sequencing of Adenine-methylated DNA Fragments. Journal of Visualized Experiments Jove, 2015(107).
21. Maragkakis, M., et al., CLIPSeqTools-a novel bioinformatics CLIP-seq analysis
suite. RNA (New York, N.Y.), 2015. 22(1).
22. Hafner, M., et al., PAR-CliP - A Method to Identify Transcriptome-wide the
Binding Sites of RNA Binding Proteins. Journal of Visualized Experiments, 2010.
41(41): p. e2034-e2034.
23. Ingolia, N.T., Ribosome profiling: new views of translation, from single codons to
genome scale. Nature Reviews Genetics, 2014. 15(3): p. 205-13.
24. Giresi, P.G., et al., FAIRE (Formaldehyde-Assisted Isolation of Regulatory
Elements) isolates active regulatory elements from human chromatin. Genome
Research, 2007. 17(6): p. 877-85.
25. Nutiu, R., et al., Direct measurement of DNA affinity landscapes on a high-
throughput sequencing instrument. Nature Biotechnology, 2011. 29(7): p. 659-64.
26. Collas, P. and J.A. Dahl, Chop it, ChIP it, check it: the current status of
chromatin immunoprecipitation. Frontiers in Bioscience A Journal & Virtual
Library, 2008. 13(4): p. 929-943.
27. Kimura, H. and Y. Sato, DNA Replication and Histone Modification. 2016: Springer Japan.
28. Suganuma, T. and J.L. Workman, Histone modification as a reflection of
metabolism. Cell Cycle, 2016.
29. Qu, H. and X. Fang, A brief review on the Human Encyclopedia of DNA Elements
(ENCODE) project. Genomics Proteomics Bioinformatics, 2013. 11(3): p. 135-
30. Consortium, E.P., The ENCODE (ENCyclopedia Of DNA Elements) Project. Science, 2004. 306(5696): p. 636-40.
31. Langmead, B., et al., Ultrafast and memory-efficient alignment of short DNA
sequences to the human genome. Genome Biology, 2009. 10(3): p. 1-10.
32. Li, H. and R. Durbin, Fast and accurate long-read alignment with Burrows–
Wheeler transform. Bioinformatics, 2010. 26(5): p. 589-595.
33. Szalkowski, A.M. and C.D. Schmid, Rapid innovation in ChIP-seq peak-calling
algorithms is outdistancing benchmarking efforts. Briefings in Bioinformatics,
2011. 12(6): p. 626-633(8).
34. Kharchenko, P.V., M.Y. Tolstorukov, and P.J. Park, Design and analysis of ChIP-
seq experiments for DNA-binding proteins. Nature biotechnology, 2008. 26(12):
p. 1351-1359.
35. Zhang, Y., et al., Model-based analysis of ChIP-Seq (MACS). Genome Biol, 2008. 9(9): p. R137.
36. Jiang, H., et al., CisGenome Browser: a flexible tool for genomic data
visualization. Bioinformatics, 2010. 26(14): p. 1781-2.
37. Fejes, A.P., et al., FindPeaks 3.1: a tool for identifying areas of enrichment from
massively parallel short-read sequencing technology. Bioinformatics, 2008.
24(15): p. 1729-30.
38. Valouev, A., et al., Genome-wide analysis of transcription factor binding sites
based on ChIP-Seq data. Nature Methods, 2008. 5(9): p. 829-834.
39. Xin, F., R. Grossman, and L. Stein, PeakRanger: a cloud-enabled peak caller for
40. Zhou, T., et al., Quantitative modeling of transcription factor binding specificities
using DNA shape. Proceedings of the National Academy of Sciences, 2015.
112(15): p. 4654-4659.
41. Yang, L., et al., Transcription factor family‐specific DNA shape readout revealed
by quantitative specificity models. Molecular Systems Biology, 2017. 13(2): p.
910.
42. Mathelier, A., et al., DNA shape features improve transcription factor binding site
predictions in vivo. Cell systems, 2016. 3(3): p. 278-286. e4.
43. Zentner, G.E., et al., ChEC-seq kinetics discriminates transcription factor binding
sites by DNA sequence and shape in vivo. Nature communications, 2015. 6.
44. Abe, N., et al., Deconvolving the recognition of DNA shape from sequence. Cell, 2015. 161(2): p. 307-318.
45. Samee, M.A.H., B. Bruneau, and K. Pollard, Transcription Factors Recognize
DNA Shape Without Nucleotide Recognition. bioRxiv, 2017: p. 143677.
46. LeCun, Y., Y. Bengio, and G. Hinton, Deep learning. Nature, 2015. 521(7553): p. 436-444.
47. Wang, H., B. Raj, and E.P. Xing, On the Origin of Deep Learning. arXiv preprint arXiv:1702.07800, 2017.
48. Lawrence, C.E., et al., Detecting subtle sequence signals: a Gibbs sampling
strategy for multiple alignment. Science, 1993. 262(5131): p. 208-14.
49. Pevzner, P.A. and S.H. Sze, Combinatorial approaches to finding subtle signals
50. Nakaki, R., J. Kang, and M. Tateno, A novel ab initio identification system of
transcriptional regulation motifs in genome DNA sequences based on direct comparison scheme of signal/noise distributions. Nucleic Acids Res, 2012.
40(18): p. 8835-48.
51. Li, G., et al., A new framework for identifying cis-regulatory motifs in
prokaryotes. Nucleic Acids Res, 2011. 39(7): p. e42.
52. Chen, X., et al., W-AlignACE: an improved Gibbs sampling algorithm based on
more accurate position weight matrices learned from sequence and gene expression/ChIP-chip data. Bioinformatics, 2008. 24(9): p. 1121-8.
53. Sinha, S., PhyME: a software tool for finding motifs in sets of orthologous
sequences. Methods Mol Biol, 2007. 395: p. 309-18.
54. Wang, T. and G.D. Stormo, Combining phylogenetic data with co-regulated
genes to identify regulatory motifs. Bioinformatics, 2003. 19(18): p. 2369-80.
55. Liu, X., D.L. Brutlag, and J.S. Liu, BioProspector: discovering conserved DNA
motifs in upstream regulatory regions of co-expressed genes. Pac Symp
Biocomput, 2001: p. 127-38.
56. Hertz, G.Z. and G.D. Stormo, Identifying DNA and protein patterns with
statistically significant alignments of multiple sequences. Bioinformatics, 1999.
15(7-8): p. 563-77.
57. Liu, X.S., D.L. Brutlag, and J.S. Liu, An algorithm for finding protein-DNA
binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat Biotechnol, 2002. 20(8): p. 835-9.
58. Bailey, T.L., et al., MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res, 2009. 37(Web Server issue): p. W202-8.
59. Olman, V., D. Xu, and Y. Xu, CUBIC: identification of regulatory binding sites
through data clustering. J Bioinform Comput Biol, 2003. 1(1): p. 21-40.
60. Li, X. and W.H. Wong, Sampling motifs on phylogenetic trees. Proc Natl Acad Sci U S A, 2005. 102(27): p. 9481-6.
61. Blanchette, M. and M. Tompa, Discovery of regulatory elements by a
computational method for phylogenetic footprinting. Genome Res, 2002. 12(5): p.
739-48.
62. Blanchette, M. and M. Tompa, FootPrinter: A program designed for phylogenetic
footprinting. Nucleic Acids Res, 2003. 31(13): p. 3840-2.
63. Li, G., B. Liu, and Y. Xu, Accurate recognition of cis-regulatory motifs with the
correct lengths in prokaryotic genomes. Nucleic Acids Res, 2010. 38(2): p. e12.
64. Tompa, M., et al., Assessing computational tools for the discovery of transcription
factor binding sites. Nat Biotechnol, 2005. 23(1): p. 137-44.
65. McCue, L.A., et al., Factors influencing the identification of transcription factor
binding sites by cross-species comparison. Genome Res, 2002. 12(10): p. 1523-
32.
66. Simcha, D., N.D. Price, and D. Geman, The limits of de novo DNA motif
discovery. PLoS One, 2012. 7(11): p. e47836.
67. Siddharthan, R., E.D. Siggia, and E. van Nimwegen, PhyloGibbs: a Gibbs
sampling motif finder that incorporates phylogeny. PLoS Comput Biol, 2005.
68. Blanchette, M., B. Schwikowski, and M. Tompa, Algorithms for phylogenetic
footprinting. J Comput Biol, 2002. 9(2): p. 211-23.
69. Neph, S. and M. Tompa, MicroFootPrinter: a tool for phylogenetic footprinting
in prokaryotic genomes. Nucleic Acids Res, 2006. 34(Web Server issue): p.
W366-8.
70. Carmack, C.S., et al., PhyloScan: identification of transcription factor binding
sites using cross-species evidence. Algorithms Mol Biol, 2007. 2: p. 1.
71. Zhang, S., et al., Genome-wide de novo prediction of cis-regulatory binding sites
in prokaryotes. Nucleic Acids Res, 2009. 37(10): p. e72.
72. Katara, P., A. Grover, and V. Sharma, Phylogenetic footprinting: a boost for
microbial regulatory genomics. Protoplasma, 2012. 249(4): p. 901-7.
73. Tagle, D.A., et al., Embryonic epsilon and gamma globin genes of a prosimian
primate (Galago crassicaudatus). Nucleotide and amino acid sequences,
developmental regulation and phylogenetic footprints. J Mol Biol, 1988. 203(2):
p. 439-55.
74. Borneman, A.R., et al., Divergence of transcription factor binding sites across
related yeast species. Science, 2007. 317(5839): p. 815-819.
75. Odom, D.T., et al., Tissue-specific transcriptional regulation has diverged
significantly between human and mouse. Nature genetics, 2007. 39(6): p. 730-
732.
76. Boyle, A.P., et al., Comparative analysis of regulatory information and circuits
77. Bingqiang Liu, C.Z., Hanyuan Zhang, Guojun Li, Guanghui Wang, Anne Fennell, Yu Kang, Qi Liu and Qin Ma, An integrative and applicable phylogenetic
footprinting framework for cis-regulatory motifs identification in prokaryotic genomes. BMC Genomics, 2016.
78. Kuan, P.F., et al., A statistical framework for the analysis of ChIP-Seq data. Journal of the American Statistical Association, 2011. 106(495): p. 891-903. 79. Mathelier, A. and W.W. Wasserman, The next generation of transcription factor
binding site prediction. PLoS Comput Biol, 2013. 9(9): p. e1003214.
80. Cheng, C., R. Min, and M. Gerstein, TIP: a probabilistic method for identifying
transcription factor target genes from ChIP-seq binding profiles. Bioinformatics,
2011. 27(23): p. 3221-3227.
81. Wu, S., et al., ChIP-PaM: an algorithm to identify protein-DNA interaction using
ChIP-Seq data. Theoretical biology and medical modelling, 2010. 7(1): p. 1.
82. van Heeringen, S.J. and G.J.C. Veenstra, GimmeMotifs: a de novo motif
prediction pipeline for ChIP-sequencing experiments. Bioinformatics, 2011.
27(2): p. 270-271.
83. Machanick, P. and T.L. Bailey, MEME-ChIP: motif analysis of large DNA
datasets. Bioinformatics, 2011. 27(12): p. 1696-7.
84. Kulakovskiy, I.V., et al., Deep and wide digging for binding motifs in ChIP-Seq
data. Bioinformatics, 2010. 26(20): p. 2622-3.
85. Jothi, R., et al., Genome-wide identification of in vivo protein–DNA binding sites
86. Mercier, E., et al., An integrated pipeline for the genome-wide analysis of
transcription factor binding sites from ChIP-Seq. PLoS One, 2011. 6(2): p.
e16432.
87. Hu, M., et al., On the detection and refinement of transcription factor binding
sites using ChIP-Seq data. Nucleic Acids Res, 2010. 38(7): p. 2154-67.
88. Bailey, T.L., DREME: motif discovery in transcription factor ChIP-seq data. Bioinformatics, 2011. 27(12): p. 1653-9.
89. Jia, C., et al., A new exhaustive method and strategy for finding motifs in ChIP-
enriched regions. Plos One, 2014. 9(9): p. e86044.
90. Pavesi, G., et al., Weeder Web: discovery of transcription factor binding sites in a
set of sequences from co-regulated genes. Nucleic Acids Res, 2004. 32(Web
Server issue): p. W199-203.
91. Thomas-Chollier, M., et al., RSAT peak-motifs: motif analysis in full-size ChIP-
seq datasets. Nucleic Acids Res, 2012. 40(4): p. e31.
92. Tran, N.T. and C.H. Huang, A survey of motif finding Web tools for detecting
binding site motifs in ChIP-Seq data. Biol Direct, 2014. 9: p. 4.
93. Jun Ding, H.H., Xiaoman Li, SIOMICS: a novel approach for systematic
identification of motifs in ChIP-seq data. Nucleic Acids Research, 2014. 42(5): p.
1635-1645.
94. Boeva, V., et al., De novo motif identification improves the accuracy of predicting
transcription factor binding sites in ChIP-Seq data analysis. Nucleic Acids Res,
95. Bailey, T.L., et al., MEME: discovering and analyzing DNA and protein sequence
motifs. Nucleic Acids Research, 2006. 34(2): p. 369-73.
96. Holger, H., et al., P-value-based regulatory motif discovery using positional
weight matrices. Genome Research, 2013. 23(1): p. 181-194.
97. Niu and Meng, De novo prediction of cis-regulatory modules in eukaryotic
organisms. Dissertations & Theses - Gradworks, 2014.
98. Bolouri, H. and W.L. Ruzzo, Integration of 198 ChIP-seq datasets reveals human
cis-regulatory regions. J Comput Biol, 2012. 19(9): p. 989-97.
99. Sun, H., et al., Unveiling combinatorial regulation through the combination of
ChIP information and in silico cis-regulatory module detection. Nucleic Acids
Res, 2012. 40(12): p. e90.
100. Alipanahi, B., et al., Predicting the sequence specificities of DNA-and RNA-
binding proteins by deep learning. Nature biotechnology, 2015. 33(8): p. 831-
838.
101. Kelley, D.R., J. Snoek, and J.L. Rinn, Basset: learning the regulatory code of the
accessible genome with deep convolutional neural networks. Genome research,
2016. 26(7): p. 990-999.
102. Zhou, J. and O.G. Troyanskaya, Predicting effects of noncoding variants with
deep learning-based sequence model. Nature methods, 2015. 12(10): p. 931-934.
103. Lanchantin, J., et al., Deep Motif Dashboard: Visualizing and Understanding
Genomic Sequences Using Deep Neural Networks. arXiv preprint
104. Quang, D. and X. Xie, DanQ: a hybrid convolutional and recurrent deep neural
network for quantifying the function of DNA sequences. Nucleic acids research,
2016. 44(11): p. e107-e107.
105. Dauphin, Y.N., et al., Language modeling with gated convolutional networks. arXiv preprint arXiv:1612.08083, 2016.
106. Angermueller, C., et al., DeepCpG: accurate prediction of single-cell DNA
methylation states using deep learning. Genome Biology, 2017. 18(1): p. 67.
107. Li, G., et al., A new framework for identifying cis-regulatory motifs in
prokaryotes. Nucleic acids research, 2011. 39(7): p. e42-e42.
108. Li, G., et al., A new framework for identifying cis-regulatory motifs in
prokaryotes. Nucleic acids research, 2010. 39(7): p. e42-e42.
109. Ma, Q., et al., DMINDA: an integrated web server for DNA motif identification
and analyses. Nucleic acids research, 2014: p. gku315.
110. Liu, B., et al., An integrative and applicable phylogenetic footprinting framework
for cis-regulatory motifs identification in prokaryotic genomes. BMC genomics,
2016. 17(1): p. 578.
111. Gama-Castro, S., et al., RegulonDB (version 6.0): gene regulation model of
Escherichia coli K-12 beyond transcription, active (experimental) annotated promoters and Textpresso navigation. Nucleic Acids Res, 2008. 36(Database
issue): p. D120-4.
112. Tompa, M., et al., Assessing computational tools for the discovery of transcription
113. Bailey, T.L., et al., MEME SUITE: tools for motif discovery and searching. Nucleic acids research, 2009: p. gkp335.
114. Mrázek, J. and S. Xie, Pattern locator: a new tool for finding local sequence
patterns in genomic DNA sequences. Bioinformatics, 2006. 22(24): p. 3099-3100.
115. Mrázek, J., et al., AIMIE: a web-based environment for detection and
interpretation of significant sequence motifs in prokaryotic genomes.
Bioinformatics, 2008. 24(8): p. 1041-1048.
116. Okumura, T., et al., Melina II: a web tool for comparisons among several
predictive algorithms to find potential motifs from promoter regions. Nucleic
acids research, 2007. 35(suppl 2): p. W227-W231.
117. Thijs, G., et al., A Gibbs sampling method to detect overrepresented motifs in the
upstream regions of coexpressed genes. Journal of Computational Biology, 2002.
9(2): p. 447-464.
118. Mahony, S. and P.V. Benos, STAMP: a web tool for exploring DNA-binding motif
similarities. Nucleic acids research, 2007. 35(suppl 2): p. W253-W258.
119. Yang, J., et al., DMINDA 2.0: integrated and systematic views of regulatory DNA
motif identification and analyses. Bioinformatics, 2017.
120. Ma, Q., et al., An integrated toolkit for accurate prediction and analysis of cis-
regulatory motifs at a genome scale. Bioinformatics, 2013. 29(18): p. 2261-2268.
121. Mao, X., et al., DOOR 2.0: presenting operons and their functions through
dynamic and integrated views. Nucleic acids research, 2014. 42(D1): p. D654-
122. Li, G., et al., Integration of sequence-similarity and functional association
information can overcome intrinsic problems in orthology mapping across bacterial genomes. Nucleic acids research, 2011. 39(22): p. e150-e150.
123. Angermueller, C., et al., Deep learning for computational biology. Molecular systems biology, 2016. 12(7): p. 878.
124. Park, Y. and M. Kellis, Deep learning for regulatory genomics. Nat Biotechnol, 2015. 33(8): p. 825-6.
125. Ching, T., et al., Opportunities And Obstacles For Deep Learning In Biology And
Medicine. bioRxiv, 2017: p. 142760.
126. Wang, J., et al., Sequence features and chromatin structure around the genomic
regions bound by 119 human transcription factors. Genome research, 2012.
22(9): p. 1798-1812.
127. Consortium, E.P., An integrated encyclopedia of DNA elements in the human
genome. Nature, 2012. 489(7414): p. 57.
128. Li, Q., et al., Measuring reproducibility of high-throughput experiments. The annals of applied statistics, 2011. 5(3): p. 1752-1779.
129. Chiu, T.-P., et al., DNAshapeR: an R/Bioconductor package for DNA shape
prediction and feature encoding. Bioinformatics, 2015. 32(8): p. 1211-1213.
130. Crooks, G.E., et al., WebLogo: a sequence logo generator. Genome research, 2004. 14(6): p. 1188-1190.
131. Mathelier, A., et al., JASPAR 2016: a major expansion and update of the open-
access database of transcription factor binding profiles. Nucleic acids research,
132. Matys, V., et al., TRANSFAC®: transcriptional regulation, from patterns to
profiles. Nucleic acids research, 2003. 31(1): p. 374-378.
133. Kulakovskiy, I.V., et al., HOCOMOCO: expansion and enhancement of the
collection of transcription factor binding sites models. Nucleic acids research,