5. RESULTADOS Y DISCUSIÓN
5.8. Resultados de las variables agronómicas
5.8.2. Variables agromorfológicas
data from bisulfite sequencing
2C-2.1 Motivation
The most accurate and probably the most widely used experimental protocol for analyzing DNA methylation makes use of selective conversion of unmethylated cytosines to uracils, in- duced by bisulfite treatment (Frommer et al. 1992; Hajkova et al. 2002). Subsequent amplifi- cation, cloning, sequencing, and comparison with the genomic sequence allows for identifica- tion of unmethylated cytosines, which appear as thymines in a multiple sequence alignment. Although bisulfite sequencing protocol is generally reliable, the necessary data processing steps are tedious to perform manually, and several potential error sources have to be ad- dressed. We developed BiQ Analyzer, an interactive software tool that provides start-to-end support for this process. In an easy-to-use manner, the tool helps the user to import the se-
1 Quoted after: http://www.edge.org/q2007/q07_14.html
2 This chapter describes published work conducted in collaboration with Sabine Reither, Thomas Mikeska, Martina Paulsen and Jörn Walter
(Bock et al. 2005). Martina Paulsen suggested to develop a software for automated analysis of DNA methylation data from bisulfite sequenc- ing. All collaboration partners contributed their expertise with manual data analysis and acted as pilot users of BiQ Analyzer.
quence files from the sequencer, to align them, to exclude or correct critical sequences, to document the experiment, to perform basic statistical analysis and to produce publication- quality diagrams.
C-2.2 Methods
Potential error sources in bisulfite sequencing arise from three phases of the experimental pro- tocol: bisulfite conversion, PCR, and sequencing. Each of these steps can give rise to charac- teristic errors in the sequences, which the experimenter must address before deriving DNA methylation profiles. Here we describe these error types, their impact on methylation data, and the measures of quality control that BiQ Analyzer applies to identify the critical se- quences.
(1) Incomplete conversion. In bisulfite sequencing we assume that all unconverted Cs were originally methylated. Therefore, when the bisulfite treatment fails to convert un- methylated Cs, DNA methylation will be overestimated. Fortunately, for vertebrates it is possible to identify those sequences with low conversion rates, assuming that Cs outside a CpG context are always unmethylated (Reik et al. 2003). BiQ Analyzer calculates the conversion rate of a sequence as the ratio between the number of correctly converted Cs outside a CpG context divided by the sum of converted and unconverted Cs outside a CpG context. By default, BiQ Analyzer marks all sequences with a conversion rate be- low 90% as critical (the default values for all parameters were selected based on expert opinion by researchers with extensive experience in bisulfite sequencing and can be modified according to user preferences).
(2) Clone sequences. PCR amplification bias can result in vast over-representation of se- quences from a single cell or from a small number of cells. Regarding the resulting identical clones as independent sources of DNA methylation data would result in biased estimation of the overall variability of DNA methylation within a sample. BiQ Analyzer thus implements a heuristic clone detection method. It marks those sequences as critical that are identical in all correctly aligned C positions. The advantage of this method over simple sequence comparison is that it is insensitive to sequence truncations and se- quencing errors at non-C positions. However, it can lead to discarding of a significant number of valid clones when conversion rates are close to 100% and all cells in a sam- ple are highly similar in their DNA methylation patterns.
(3) Sequencing errors. Sequencing errors changing C to T and vice versa can lead to errors in the DNA methylation patterns derived from the sequences. Therefore, BiQ Analyzer suggests to exclude all sequences that fall below a local sequence identity level of 80% as compared to the genome sequence (C-to-T conversions and truncations are ignored). Furthermore, in our experiments we regularly observe ambiguous base insertions within a CpG context (i.e. CG CTG or CG TCG). In these cases, BiQ Analyzer reports the methylation state of the CpG dinucleotide as unknown.
As a Java application, BiQ Analyzer runs on almost any platform, requiring only a recent ver- sion of the Java virtual machine (which can be downloaded from http://www.javasoft.com/) and a screen resolution of at least 1024 · 768 pixels. The multiple sequence alignment makes use of a local version of ClustalW (Thompson et al. 1994), which is included in the standard download package. The alignment step is computationally expensive and can be slow on older
C-2. BiQ Analyzer: Visualization and quality control for DNA methylation data from bisulfite sequencing 77
computers. Therefore, the program provides an option to calculate the alignment over the internet on a high-performance computer at the Max Planck Institute for Informatics.
C-2.3 Results and Discussion
BiQ Analyzer is a software tool designed to mimic the manual process of DNA methylation analysis (XFigure 19X). In several steps, the user is guided from the import of sequences, across
several phases of quality control and multiple sequence alignment, to a questionnaire docu- menting the experiment. In each of the quality control steps, the program makes suggestions on how to handle critical sequences, but the ultimate decision to include or exclude a se- quence always stays with the user. Based on the user decisions during that process, the pro- gram finally generates a single-file HTML documentation (including publication-quality me- thylation diagrams in the widely-used “lollipop” style) and saves the derived methylation data to the system clipboard, ready for subsequent analysis with spreadsheet software or a statistics package.
Figure 19. BiQ Analyzer provides a user-friendly interface and automatic expert advice to support analysis and quality control of DNA methylation data obtained by bisulfite sequencing
This figure shows a typical BiQ Analyzer screenshot. The text boxes on the left contain the raw sequences. The main window on the right displays a ClustalW multiple sequence alignment with CpGs, unconverted Cs, and critical sequences being highlighted. The text box below guides the user through the program and provides hints regarding sequences with quality problems. The bar at the top displays the current status and accepts some general settings. Finally, the button bar at the bottom enables the user to navigate through the different steps.
In summary, BiQ Analyzer provides start-to-end support for visualization and quality control of DNA methylation data from bisulfite sequencing. For the frequent user of bisulfite sequencing it will lead to significant speed-up of the data analysis process. The occasional
user will benefit from extensive hints that help to perform rigorous quality control. Beyond that, BiQ Analyzer promises to be a first step toward standardization in quality control and documentation. Non-commercial users can download BiQ Analyzer free of charge from
http://biq-analyzer.bioinf.mpi-inf.mpg.de/, commercial licenses are available through Max-
Planck-Innovation (http://www.max-planck-innovation.de/de/industrie/technologieangebote/
software/article.php?id=2662).