
These new matrices, Hmatrix and Btmatrix, now need to be integrated into the Smith-Waterman implementation that uses SSE2 instructions. The Hmatrix simply stores the H-values, which requires only a single store instruction. The pseudocode for filling the Btmatrix with the backtrace encoding of the H-matrix is shown in algorithm 3. In the inner loop, the new H register is set by maximizing over H, E and F. Then we compare the segments in the H and E registers (line (1) in algorithm 3). This command sets all bits in a segment of the register Bttmp to 1 if the two corresponding segments in H and E are equal, and to 0 otherwise. A segment of Bttmp in which all bits equal 1 therefore indicates that the H-value of this segment comes from the corresponding E-value. The _mm_and_si128 command (line (2)) performs a bitwise "AND" between all bits of two registers and thereby sets all bits to 0 except the last half-byte of each segment, which holds this encoding. Lines (3)-(5) perform the same steps for the H and F registers and store the encoding in the second-to-last half-byte. The encodings for the E- and F-matrices are generated in a similar way.
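The encoding steps described above can be illustrated with a scalar emulation of the register operations. This is only a sketch, not algorithm 3 itself: it assumes eight 16-bit segments per 128-bit register, hypothetical mask values 0x000F and 0x00F0 for the last and second-to-last half-byte, and a bitwise OR to combine the two encodings. The helper names mirror the SSE2 intrinsics _mm_max_epi16, _mm_cmpeq_epi16, _mm_and_si128 and _mm_or_si128, but operate on plain Python lists.

```python
LANES = 8  # assumption: eight 16-bit segments per 128-bit register

def max_epi16(a, b):
    """Segment-wise maximum (emulates _mm_max_epi16)."""
    return [max(x, y) for x, y in zip(a, b)]

def cmpeq_epi16(a, b):
    """All bits of a segment are set to 1 if the segments are equal, else 0."""
    return [0xFFFF if x == y else 0x0000 for x, y in zip(a, b)]

def and_si128(a, b):
    """Bitwise AND over the whole register (emulates _mm_and_si128)."""
    return [x & y for x, y in zip(a, b)]

def or_si128(a, b):
    """Bitwise OR over the whole register (emulates _mm_or_si128)."""
    return [x | y for x, y in zip(a, b)]

# Hypothetical masks: keep only one half-byte of every segment.
MASK_FROM_E = [0x000F] * LANES  # last half-byte: H equals E
MASK_FROM_F = [0x00F0] * LANES  # second-to-last half-byte: H equals F

def encode_backtrace(h, e, f):
    """Return the new H register and the backtrace encoding for H."""
    h_new = max_epi16(max_epi16(h, e), f)
    bt_e = and_si128(cmpeq_epi16(h_new, e), MASK_FROM_E)
    bt_f = and_si128(cmpeq_epi16(h_new, f), MASK_FROM_F)
    return h_new, or_si128(bt_e, bt_f)
```

A segment whose encoding has neither half-byte set took its value from the incoming H register, that is, neither from E nor from F.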

After the matrices are filled by this algorithm, the maximum score in the Hmatrix is searched and a backtrace is started from this point. The backtrace ends when the algorithm reaches a cell with a score of 0, and the start and end positions of this alignment are saved. To eliminate this alignment from further searches, we inactivate the cells in the Hmatrix around this alignment by looping over all cells contained in the alignment and setting the cells in the same row and column within a distance of 150 to zero. This distance is approximately 2/3 of the distance parameter for the tube shading (see section 3.4.2) to ensure that no short domains will be missed in the HMM-HMM comparison by inactivating too many cells. Then, a suboptimal alignment is searched by identifying the next maximum score. If this score is also above the given threshold, a backtrace for this suboptimal alignment is performed and the new positions are stored. This is repeated until no further suboptimal alignment with a score above the threshold is found or until a maximum of 10 suboptimal alignments has been extracted. We use this approach instead of the Waterman-Eggert algorithm (Waterman and Eggert, 1987), which computes suboptimal and non-overlapping alignments, because in the Waterman-Eggert algorithm the complete dynamic programming matrix has to be recalculated after each suboptimal alignment, which is too slow for our prefiltering.
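The extraction loop described above can be sketched as follows. The function names, the list-of-lists matrix and the `backtrace` callback are hypothetical and chosen for illustration. In particular, `diagonal_backtrace` is only a stand-in that walks up the diagonal until it reaches a score of 0, whereas the actual method follows the encoding stored in the Btmatrix.

```python
def mask_around_alignment(H, path, dist):
    """Zero all cells in the same row or column within `dist` of a path cell."""
    n_rows, n_cols = len(H), len(H[0])
    for i, j in path:
        for jj in range(max(0, j - dist), min(n_cols, j + dist + 1)):
            H[i][jj] = 0
        for ii in range(max(0, i - dist), min(n_rows, i + dist + 1)):
            H[ii][j] = 0

def diagonal_backtrace(H, i, j):
    """Stand-in backtrace: follow the diagonal until a cell with score 0."""
    path = []
    while i >= 0 and j >= 0 and H[i][j] > 0:
        path.append((i, j))
        i, j = i - 1, j - 1
    return path[::-1]  # ordered from start to end

def extract_alignments(H, backtrace, threshold, dist=150, max_suboptimal=10):
    """Return (score, start, end) for the best and the suboptimal alignments."""
    found = []
    while len(found) < 1 + max_suboptimal:
        score, i, j = max((H[i][j], i, j)
                          for i in range(len(H)) for j in range(len(H[0])))
        if score <= threshold:
            break  # no remaining alignment scores above the threshold
        path = backtrace(H, i, j)
        found.append((score, path[0], path[-1]))
        mask_around_alignment(H, path, dist)  # H is modified in place
    return found
```

Because only the cells near an extracted alignment are zeroed, the matrix is never recomputed between extractions, which is the property that makes this approach cheaper than recalculating the dynamic programming matrix.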

3.4. Additional filter steps

The main reduction in the runtime of HHblits, which made an iterative HMM-HMM comparison feasible in the first place, was achieved by the fast prefilter explained in section 3.3. To further reduce the runtime of HHblits and to reach runtimes similar to PSI-BLAST, additional filter steps are integrated in the HMM-HMM comparison step of our method. In this section we give a detailed overview of these additional filters: a filter for skipping further HMM-HMM comparisons ("early stopping") if there are no good matches among the 200 previously aligned HMMs, a filter for reducing the search space in the dynamic programming matrix by searching only in a tube around the prefilter matches, and a filter that prevents realigning matches that have already been identified in a previous iteration. Each of these additional filters can be disabled separately by command-line parameters, but this is not recommended, as it results in a longer runtime with only a very slight increase in sensitivity.

3.4.1. Early stopping

The early stopping filter further reduces the number of HMMs that are compared in the time-consuming HMM-HMM comparison. This filter is based on the assumption that the prefilter separates homologous from non-homologous proteins well. The HMM-HMM comparison runs over the list of prefiltered HMMs sorted by their prefilter scores. Therefore, most of the homologous matches should be found at the beginning of this comparison. If we identify only non-homologous matches over a long range, the probability of finding a further homologous match is very low, and the possible gain in sensitivity from finding a few more homologous matches no longer justifies the runtime cost of searching the remaining HMMs. Furthermore, homologs that are missed due to the early stopping might still be found in the following search iterations.

The scheme of this early stopping filter is shown in figure 3.10. A coarse estimate of the probability that a match is a true homolog is 1/(1 + E) for a Viterbi E-value E. Before starting the HMM-HMM comparison between a new database HMM and the query HMM, we average 1/(1 + E) over the last 200 processed Viterbi alignments:

\[
\frac{1}{200} \sum_{k \in \{\text{last 200 matches}\}} \frac{1}{1 + E(k)} \tag{3.9}
\]

If this average drops below a given threshold (default is 0.01), the HMM-HMM comparison stops and all further HMMs in the prefilter list are discarded. For an efficient calculation, the E-values of the last 200 matches are stored in an array; after each HMM-HMM comparison, the oldest entry in this array is replaced with the new E-value and the sum is updated. At the beginning, this array is initialized with artificial E-values of 0 to guarantee that at least 200 HMMs will be compared. Note that, when running HHblits with more than one thread, the results of several runs can differ slightly, because the order in which alignments are processed by different CPU cores can change and hence the threshold may be reached at a different position in the list.

Figure 3.10: Scheme of the early stopping filter step. The HMM-HMM comparison of HHblits runs over a list of all HMMs that have passed the prefilter, sorted by their prefilter scores. Before starting the HMM-HMM comparison of the query and the current HMM (red box), the average $\frac{1}{200} \sum_{n=1}^{200} \frac{1}{1 + E(n)}$ over the E-values of the last 200 matches is calculated. If this average drops below a given threshold, the HMM-HMM comparison stops and discards all further HMMs. For the first 200 HMMs the sum is filled up with artificial E-values of 0.
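The running average of 1/(1 + E) described above can be maintained with a fixed-size ring buffer, so that each update costs constant time. The following is a minimal sketch under that assumption. The class and method names are hypothetical, while the window of 200 and the threshold of 0.01 are the defaults given in the text.

```python
class EarlyStopping:
    """Running average of 1/(1 + E) over the last `window` E-values."""

    def __init__(self, window=200, threshold=0.01):
        # Artificial E-values of 0 contribute 1/(1 + 0) = 1 each.
        self.evalues = [0.0] * window
        self.total = float(window)
        self.oldest = 0  # index of the entry to overwrite next
        self.threshold = threshold

    def add(self, evalue):
        """Replace the oldest E-value with the new one and update the sum."""
        old = self.evalues[self.oldest]
        self.total += 1.0 / (1.0 + evalue) - 1.0 / (1.0 + old)
        self.evalues[self.oldest] = evalue
        self.oldest = (self.oldest + 1) % len(self.evalues)

    def should_stop(self):
        """True if the average has dropped below the threshold."""
        return self.total / len(self.evalues) < self.threshold
```

In a comparison loop, `should_stop()` would be checked before each new database HMM is aligned, and `add()` would be called with the Viterbi E-value once that alignment is finished.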

3.4.2. Tube shading

The tube shading filter reduces the search space in the dynamic programming matrix of the HMM-HMM comparison. In this main comparison, a Viterbi alignment is calculated for each database HMM (see section 2.4). For the Viterbi algorithm, an m × n dynamic programming matrix (m is the query length, n is the length of the database HMM) must be filled by running over all cells. However, from the last prefilter step we already know in which regions of this matrix the best matches are likely to be located according to the profile-profile Smith-Waterman algorithm. In most cases, the final alignment is located in the same region. Therefore, we constrain the Viterbi search to the regions around the prefilter matches.
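The idea of restricting the search to regions around the prefilter matches can be sketched as a boolean mask over the m × n matrix. This sketch rests on several assumptions that are not taken from the text: each prefilter alignment is represented only by its start and end cell as a tuple (i0, j0, i1, j1) with i0 ≤ i1 and j0 ≤ j1, the tube follows the straight line between these two cells, and the hypothetical parameter `width` plays the role of the shading space.

```python
def tube_mask(m, n, alignments, width):
    """Mark as active all cells within `width` columns of each alignment line.

    `alignments` holds tuples (i0, j0, i1, j1) with i0 <= i1 and j0 <= j1.
    The tube is extended by `width` rows before the start and after the end.
    """
    active = [[False] * n for _ in range(m)]  # all cells start crossed out
    for i0, j0, i1, j1 in alignments:
        rows = max(i1 - i0, 1)  # avoid division by zero for one-row matches
        for i in range(max(0, i0 - width), min(m, i1 + width + 1)):
            t = min(max(i - i0, 0), i1 - i0)  # clamp the row to the segment
            j_center = j0 + (j1 - j0) * t // rows
            for j in range(max(0, j_center - width),
                           min(n, j_center + width + 1)):
                active[i][j] = True
    return active
```

A constrained Viterbi loop would then evaluate only the cells where `active[i][j]` is true and treat all other cells as unreachable.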

The scheme of this tube shading is given in figure 3.11. At first, all cells in the dynamic

Figure 3.11: Scheme of the tube filter step. This filter reduces the search space in the dynamic programming matrix of the Viterbi algorithm when comparing the query and the database HMM. This is done by crossing out all cells in the matrix and activating only the cells in a tube around the alignments identified in the last prefiltering step. There can be more than one alignment if some of the suboptimal alignments also have a score above the prefilter threshold. The size of the tube can be specified by a shading space parameter. The Viterbi algorithm then runs only on the activated cells.