From motif discovery to binding sites: a framework
The jPREdictor motif evolution was first applied to all 76 human data sets. The human data set was chosen first because it is much larger than the fly one and thus gives a general overview of how the jPREdictor performs. A framework was created around the motif discovery approach. The first step in the framework was the evolution: for each of the 76 human data sets, the motif evolution “RewardedSelection” was run 10 times. The command-line call for every run was:
-m model -b background --motifEvolutionRewardedSelection -d NNNNNNNN -d NNNNNNN -v -p single -c 1000
In the assessment, the motif discovery tools were not informed about the sequence type. For comparability, the same condition is assumed here, and the evolutions were run with only one background sequence set. This negative training set (background) consisted of 10 randomly chosen promoters from the human genome, each of length 10 kb.
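The run plan described above (76 data sets, 10 repetitions each, one shared background set) can be sketched as follows. Only the argument vector is assembled, using the flags from the call quoted above. The file names and the helpers `evolution_args` and `plan_runs` are hypothetical, and how the jPREdictor itself is launched is deliberately left out.

```python
from typing import List, Tuple


def evolution_args(model_path: str, background_path: str) -> List[str]:
    """Arguments for one evolution run, copied from the call quoted in the text."""
    return [
        "-m", model_path,
        "-b", background_path,
        "--motifEvolutionRewardedSelection",
        "-d", "NNNNNNNN",
        "-d", "NNNNNNN",
        "-v",
        "-p", "single",
        "-c", "1000",
    ]


def plan_runs(data_sets: List[str], background_path: str,
              repeats: int = 10) -> List[Tuple[str, int, List[str]]]:
    """One (data set, run number, arguments) entry per repetition."""
    plan = []
    for name in data_sets:
        # Hypothetical naming scheme: one FASTA file per data set.
        model_path = f"{name}.fasta"
        for run in range(1, repeats + 1):
            plan.append((name, run, evolution_args(model_path, background_path)))
    return plan
```

With 76 data set names and the default of 10 repetitions, `plan_runs` yields 760 entries, all sharing the same background file.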
The evolution was repeated 10 times and yielded up to 200 unique motifs for every data set. Afterwards, the motifs were clustered using the “Likelihood” measure and the “Forward” algorithm. The cut-off was set to 0.1. On average, this reduced the number of motifs to 14 for each data set. The motifs resulting from the clustering were reweighted using model and background sequences, and for each set, the highest-weighting motif was chosen.
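The bookkeeping around these steps can be sketched as follows. This is a minimal sketch under stated assumptions: motifs are represented as plain strings, the clustering and reweighting are not reproduced (the weights are taken as given), and the helper names `pool_unique` and `highest_weighting` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def pool_unique(runs: Iterable[Iterable[str]]) -> List[str]:
    """Merge the motifs of several evolution runs, keeping first-seen order."""
    seen: Dict[str, None] = {}
    for run in runs:
        for motif in run:
            seen.setdefault(motif, None)
    return list(seen)


def highest_weighting(weights: Dict[str, float]) -> Tuple[str, float]:
    """Return the motif with the largest weight and that weight."""
    if not weights:
        raise ValueError("no motifs to choose from")
    motif = max(weights, key=weights.get)
    return motif, weights[motif]
```

For example, `pool_unique([["ACGTN", "GAGAG"], ["GAGAG", "TTTCA"]])` keeps three distinct motifs, and `highest_weighting` then selects the one motif whose sites are reported.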
Each highest-weighting motif was then matched to the sequences of its corresponding data set (model) and the obtained positions together with the extracted binding sites were reported to the webpage (http://bio.cs.washington.edu/assessment/).
Results of the motif discovery in the human data set
For each human data set a highest-weighting motif existed, and no motif was rejected. Thus, binding sites were reported to the website for every data set. Rejecting motifs is very difficult without ample knowledge about real biological binding sites. One possible way to support a rejection decision would be to mark low-weighting motifs. This, however, merely shifts the task to deciding what “low” means in the context of weights. The maximal weight for a motif is unknown, both within one data set and from one data set to the next. In addition, even motif discovery in randomly generated sequences yields motifs with high weights. For these two reasons, no rejection scheme was used. Like the jPREdictor, Improbizer, SeSiMCMC, MITRA, and MotifSampler also predicted binding sites for each human data set [102].
For the jPREdictor, the mean positive predictive value over all 76 human data sets was 6.8% on both nucleotide and site level (Table 5.2). The sensitivity was 2.9% on nucleotide level and 5.3% on binding site level. This means that only very few binding sites were predicted correctly. The results for the other tools Tompa assessed are also listed in Table 5.2; they were obtained from the website. Consensus is missing from the list because it was not applied to the human data set. The best tools in terms of the performance coefficient are ANN-Spec and Weeder. They found the most
5.4 Single motif discovery: continuing the Tompa assessment
Table 5.2: Assessment results for the human data sets. Abbreviations used for the statistical parameters: PC performance coefficient, PPV positive predictive value. The abbreviations for the motif discovery tools are: E expectation maximization based, G Gibbs sampling based, W word-based. On the human data set, the jPREdictor motif discovery shows a mediocre performance in comparison to the other tools.
                    Nucleotide level                       Site level
Tool          Type  PC    Sensitivity  PPV    Specificity  Sensitivity  PPV
AlignACE      G     2.92  3.93         10.26  99.38        7.38         12.36
ANN-Spec      G     5.06  9.03         10.32  98.59        16.44        9.84
GLAM          G     1.46  2.36         3.68   98.88        4.03         6.00
Improbizer    E     2.27  4.16         4.76   98.50        7.05         4.84
MEME          E     2.39  3.81         6.04   98.93        6.04         8.11
MEME3         E     2.27  4.20         4.71   98.47        6.38         7.88
MITRA         W     1.63  2.44         4.71   99.11        4.03         4.69
MotifSampler  G     1.59  2.50         4.17   98.96        4.70         4.31
oligodyad     W     3.27  3.71         21.40  99.75        6.04         15.00
QuickScore    W     0.34  0.51         0.99   99.09        0.00         0.00
SeSiMCMC      G     1.77  4.59         2.80   97.13        6.71         6.31
Weeder        W     4.75  5.43         27.47  99.74        10.74        25.81
YMF           W     2.97  4.10         9.67   99.31        7.38         8.03
jPREdictor    W     2.06  2.87         6.83   99.29        5.26         6.79
true positives while keeping the number of false negatives and false positives low. AlignACE and ANN-Spec have almost the same PPV (nucleotide level), but differ in their sensitivity. AlignACE thus missed many more planted binding sites, even though its ratio of true positives among all positives is comparably high. This means that ANN-Spec predicted many more binding sites than AlignACE. In fact, AlignACE reported no binding sites for 17 out of 26 human binding factors (for each of the three sequence types, thus 51 out of 76 data sets), whereas ANN-Spec did so for only one [102]. The best tools in terms of true positive ratio (PPV) seem to be Weeder and oligodyad, as their predicted binding sites cover planted sites with a probability higher than 20% (nucleotide level). This is also reflected in a high performance coefficient. Both tools are word-based and (almost) exhaustively enumerate sequences.
The jPREdictor motif discovery shows a low sensitivity and a moderate PPV compared to the other tools. In this analysis, the tools with the highest PPV (Weeder and oligodyad) are word-based. In terms of sensitivity, however, all tools perform comparably poorly, with ANN-Spec being the most sensitive because its number of predicted sites was very high. The specificity was higher than 95% for every data set, mostly around 99%. This means that the number of predicted binding sites was always low enough not to cover large areas of the sequences.
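The nucleotide-level statistics discussed here can be computed from the planted and predicted sites of one sequence as sketched below. The formulas are the standard confusion-matrix definitions (sensitivity TP/(TP+FN), PPV TP/(TP+FP), specificity TN/(TN+FP), performance coefficient TP/(TP+FN+FP)) and are assumed to match those used in the assessment. The half-open interval representation and the function names are choices made for this illustration.

```python
from typing import Dict, Iterable, Set, Tuple

Site = Tuple[int, int]  # half-open interval [start, end) on one sequence


def positions(sites: Iterable[Site]) -> Set[int]:
    """All nucleotide positions covered by at least one site."""
    covered: Set[int] = set()
    for start, end in sites:
        covered.update(range(start, end))
    return covered


def _ratio(numerator: int, denominator: int) -> float:
    # Convention chosen here: an undefined ratio is reported as 0.0.
    return numerator / denominator if denominator else 0.0


def nucleotide_stats(planted: Iterable[Site], predicted: Iterable[Site],
                     seq_len: int) -> Dict[str, float]:
    known = positions(planted)
    pred = positions(predicted)
    tp = len(known & pred)        # planted and predicted
    fp = len(pred - known)        # predicted only
    fn = len(known - pred)        # planted only
    tn = seq_len - tp - fp - fn   # neither
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "ppv": _ratio(tp, tp + fp),
        "specificity": _ratio(tn, tn + fp),
        "pc": _ratio(tp, tp + fn + fp),
    }
```

Because TN dominates whenever only a small part of a sequence is predicted, the specificity stays close to 1 even for poor predictions, which is consistent with the values around 99% in Table 5.2.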
Two examples
Applying the jPREdictor motif discovery to the human data sets produced a mix of very accurately predicted binding sites for some sets and, for other sets, an utter lack of conformance between predicted and planted sites. A very good prediction result is shown in Figure 5.7. In the example, the PPV was 73.5% for nucleotide positions, and 80% for site positions. The latter means that four out of five
Figure 5.7: The prediction of binding sites in the data set hm08r, consisting of 15 real human promoter sequences, each of length 500. The first seven sequences are shown, depicted as black horizontal lines. The blue bars are the planted transcription factor binding sites, the green bars are the predicted binding sites using the jPREdictor’s motif discovery. For this data set, the positive predictive value on nucleotide level was over 73%.
Figure 5.8: The prediction of binding sites in the data set hm22r, consisting of 6 real human promoter sequences, each of length 500. All six sequences are shown, depicted as black horizontal lines. The blue bars are the planted transcription factor binding sites, the green bars are the predicted binding sites using the jPREdictor’s motif discovery. For this data set, the positive predictive value on both nucleotide and site level was zero percent.
predicted binding sites correctly shared two thirds of their nucleotides with the planted binding site; on nucleotide level, the PPV drops a little due to less overlap. The sensitivity in the example was 39% for nucleotide coverage, and 61% for site coverage. This means that three out of five planted binding sites were correctly covered by the predicted ones and that almost 40% of the known nucleotides from the planted sites had their counterpart in the predicted sites.
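The site-level reading used in this example can be sketched as follows. This is an illustration under assumptions rather than the assessment's exact procedure: a predicted site counts as correct if it overlaps some planted site by at least a given fraction of the planted site's length, and that fraction is left as an explicit parameter (`min_fraction`, a hypothetical name) because the precise criterion is not restated here.

```python
from typing import Dict, List, Tuple

Site = Tuple[int, int]  # half-open interval [start, end) on one sequence


def overlap(a: Site, b: Site) -> int:
    """Number of positions shared by two intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def site_stats(planted: List[Site], predicted: List[Site],
               min_fraction: float) -> Dict[str, float]:
    def hit(p: Site, q: Site) -> bool:
        # Overlap is measured relative to the planted site's length (assumption).
        return overlap(p, q) >= min_fraction * (p[1] - p[0])

    # Planted sites covered by at least one predicted site.
    covered = sum(any(hit(p, q) for q in predicted) for p in planted)
    # Predicted sites that cover at least one planted site.
    correct = sum(any(hit(p, q) for p in planted) for q in predicted)
    return {
        "sensitivity": covered / len(planted) if planted else 0.0,
        "ppv": correct / len(predicted) if predicted else 0.0,
    }
```

Raising `min_fraction` can only turn hits into misses, so both values are non-increasing in the threshold.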
An example of a very bad prediction is shown in Figure 5.8. The positive predictive value as well as the sensitivity were zero on both nucleotide and site level. This means that none of the predicted binding sites covered the planted sites, not even partially. A zero PPV together with a zero sensitivity occurred very often: on nucleotide level, this was the case for 54 out of 76 human data sets, and on site level, no overlap existed in 59 cases (Table A.1 in the appendix). Zero PPV and sensitivity were not limited to the jPREdictor motif discovery; all assessed tools performed this way for the majority of the data sets ([102], supplementary material).
However, performing with zero sensitivity does not mean that no real transcription factor binding sites were predicted. It simply means that the planted binding sites were missed. Some of the sequences the binding sites were planted into are real promoter sequences. As Tompa et al. [102] already pointed out, these sequences may contain other true transcription factor binding sites that were predicted by the motif discovery tools instead of the planted ones.
Discussion
Tompa et al. [102] gave many reasons why all motif discovery tools performed so badly. One of these reasons is that the planted binding sites for one transcription factor differ in length. The sites are cataloged in TRANSFAC and reflect the resolution of experimental approaches. The true binding site may actually be a shorter subsequence. Tompa et al. [102] used sites up to a length of 71, and 35 of the planted binding sites were longer than 31 nucleotides. In Figure 5.7 the smallest binding site has a length of 5, whereas the largest has a length of 34. Like the other tools, the jPREdictor has no means to evolve a motif based on binding sites of different lengths. It is able to evolve motifs of lengths between 5 and 10, but each motif itself is of fixed size.
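The effect of this length mismatch on nucleotide-level sensitivity can be illustrated with a small calculation. The sketch assumes that a fixed-width motif matches a planted site exactly once, so that it can cover at most min(width, site length) nucleotides of that site. The function name is hypothetical.

```python
def max_site_coverage(site_length: int, motif_width: int) -> float:
    """Upper bound on the fraction of one planted site covered by a single match."""
    if site_length <= 0 or motif_width <= 0:
        raise ValueError("lengths must be positive")
    return min(site_length, motif_width) / site_length
```

Under this assumption, a motif of the maximal width 10 covers at most 10/34, roughly 29%, of the longest site in Figure 5.7, and at most 10/71, roughly 14%, of a site of length 71, whereas the shortest site of length 5 can be covered completely.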
Tompa et al. [102] gave another reason for the low performance of the tools: each tool was allowed to report the sites for only one motif. Tompa et al. [102] argue that the choice of this one motif is subjective and error-prone. In the framework, the highest-weighting motif was chosen as the one motif, which is an objective decision. Nevertheless, it remains error-prone. In practice, a reasonable approach would be to pursue the top several motifs discovered by any given tool. Allowing for more than one motif, and therefore for more reported sites, would have a drastic positive effect on the sensitivity. This effect is shown in the next chapter, when the jPREdictor motif discovery is applied to the fly data set. The assessment designed by Tompa et al. [102] is based on binding sites and not on motifs and therefore supports reporting sites for multiple motifs.
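Why reporting sites for several motifs cannot lower the sensitivity can be seen in a minimal sketch. It assumes that the motifs are already ranked by weight and that the predicted sites of each motif are given as half-open intervals on a single sequence. The function names and the example data are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Site = Tuple[int, int]  # half-open interval [start, end) on one sequence


def covered(sites: Iterable[Site]) -> Set[int]:
    """All nucleotide positions covered by at least one site."""
    out: Set[int] = set()
    for start, end in sites:
        out.update(range(start, end))
    return out


def sensitivity_top_k(planted: List[Site],
                      ranked_predictions: List[List[Site]],
                      k: int) -> float:
    """Nucleotide-level sensitivity when the sites of the top k motifs are pooled."""
    known = covered(planted)
    pooled: Set[int] = set()
    for sites in ranked_predictions[:k]:
        pooled |= covered(sites)
    return len(known & pooled) / len(known) if known else 0.0
```

Since the pooled set of predicted positions only grows with k, the sensitivity is non-decreasing in k. The PPV carries no such guarantee, because additional motifs can also add false positives.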