
the 6 groups named “translation”, as well as groups 1367, 1196 and 1371, all share 113 of their 195 distinct genes, with only 7 genes unique to a single group. The remaining four groups (1069, 1142, 1348 and 1156) share 111 of their 123 genes, and only 5 genes are found in only one group.

These findings suggest that the FuSiGroups algorithm may not be rigorous enough in avoiding duplication, as there is considerable overlap between groups, both in content and in definitions. In fact, the only step of the algorithm that directly addresses duplication is the removal of groups whose definition is a subset of another group’s definition. No similar step is applied to the group content. This is because the algorithm was originally designed on the assumption that the semantic and functional thresholds would be sufficient to create fairly discrete groups. Some overlap in content was also expected, as related gene products are often related on multiple features, either in the same sub-ontology or in different ontologies: two gene products that are related through their function are likely to also be related through their location, or through the process they take part in. It is therefore reasonable to expect a number of groups with roughly the same gene products but different definitions. The fact that there is also a lot of overlap among group definitions suggests that a more stringent grouping process may be required.

In order to visualise the extent of the overlaps, a matrix of groups against genes and one of groups against GO terms were created using Microsoft Excel. A sample screenshot of the groups/gene matrix is shown in Figure 7.2. Unfortunately, the matrix is too large to reproduce here. The screenshot does, however, demonstrate the extent of the overlap even on a small section of the matrix.

7.2 Supergroups

In order to address this overlap issue, an algorithm for creating supergroups was designed.

Definition 9. Supergroup - a group that is created through the merging of two or more groups. A supergroup is not subject to the maximum-completeness rule.

This algorithm merges groups with a high level of overlap into supergroups. This may lead to supergroups that violate the original maximum-completeness rule for group definitions or group content, as the GO terms or genes in the supergroup may no longer all have the required level of similarity with each other. In cases where a number of groups have a suitable level of overlap, this compromise was however deemed acceptable in order to reduce excessive overlap. A suitable level of overlap was defined as a cosine similarity equal to or greater than 0.5 for both definition and content. Cosine similarity was discussed in the context of the phenotype dataset, in Section 3.2.1.

Figure 7.2: Screenshot of group alignment matrix for ST28-FT17. Note that the image has been rotated 90° anti-clockwise. The image shows 30% of the total width of the matrix and 8% of its height, at a zoom setting of 25%. The top three rows of the matrix list each group’s ontology, size and ID number. The first five columns list the location of each gene in different hierarchical trees and the ID numbers of different clusters associated with these trees (see later). The actual gene IDs (SGD IDs) are in column 6. The matrix is colour-coded according to the GO ontologies for ease of interpretation: BP groups are red, CC green and MF purple. The solid orange rows represent genes that are not found in any groups.

The similarity level of 0.5 was chosen by visual analysis of pairwise definition and content similarities between all pairs of groups that have any overlap in both categories (2535 distinct pairs). Most pairs of groups show either very high levels of overlap in both categories, or very low levels of overlap. Only a few groups have a high level of overlap in one category but not in the other, with a notable prevalence of pairs with high content overlap but low definition overlap (347 pairs) compared to pairs with high definition overlap but low content overlap (78 pairs). The option of using two distinct similarity levels for definition and content overlap was considered, but due to time constraints and the ad hoc nature of this additional algorithm, it was decided to proceed with a single threshold for both types of overlap.
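The matching criterion can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: groups are represented as plain sets of GO terms (definition) and gene products (content), and cosine similarity is computed on binary set membership, which may differ in detail from the formulation in Section 3.2.1. The function names and the toy identifiers are hypothetical.

```python
from math import sqrt


def cosine_similarity(a: set, b: set) -> float:
    """Cosine similarity of two sets treated as binary membership vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def overlaps_enough(def_a, def_b, cont_a, cont_b, threshold=0.5):
    """True only if definition AND content similarity both reach the threshold."""
    return (cosine_similarity(def_a, def_b) >= threshold
            and cosine_similarity(cont_a, cont_b) >= threshold)


# Toy pair (hypothetical IDs): high definition overlap, low content overlap.
defs_a, defs_b = {"t1", "t2", "t3"}, {"t2", "t3", "t4"}
cont_a, cont_b = {"g1", "g2", "g3", "g4"}, {"g1", "g5", "g6", "g7"}
matched = overlaps_enough(defs_a, defs_b, cont_a, cont_b)
```

With a single threshold applied to both categories, a pair such as this toy one, which overlaps strongly in definition but weakly in content, is not matched.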

The merging algorithm was designed as follows. First, the overlap between all pairs of groups is calculated, both for their definitions and their content. Then, each group is matched with all the groups with which it has the required level of overlap. Immediately merging groups at this point would inevitably lead to duplicate supergroups: a group A, which overlaps with another group B, would become a supergroup, but B, overlapping with A, would become a distinct, yet identical supergroup. For this reason, the algorithm first checks every set of matched groups against all other sets and removes duplicates, so that each set of identical groups is only merged into a supergroup once. Sets of matched groups are also checked to ensure that they are not subsets of others, and subsets are removed. Finally, prior to merging, the algorithm checks that the new supergroup would have a group content of at least four gene products, in order to avoid generating non-meaningful supergroups.

Initial tests indicated that the supergroups had almost as much overlap in terms of their content as their original constituent groups. For this reason, the checking of sets of matched groups against each other was extended to consider the level of overlap between sets, and to further merge closely related matched sets. For this set-level comparison, the overlap threshold was set to 0.8. Specifically, this meant that two sets of related groups, for example {A, B, C, D, E} and {A, B, C, E, F}, that have an overlap of 0.8 or more are merged into a single set.
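As a worked illustration, suppose the overlap between two matched sets is measured with the same cosine similarity on set membership that is used for groups. This is an assumption, as the measure used at this step is not stated explicitly. Writing S_1 = {A, B, C, D, E} and S_2 = {A, B, C, E, F}, the two sets share four of their five members:

```latex
\[
\mathrm{overlap}(S_1, S_2)
  = \frac{|S_1 \cap S_2|}{\sqrt{|S_1|\,|S_2|}}
  = \frac{|\{A, B, C, E\}|}{\sqrt{5 \cdot 5}}
  = \frac{4}{5}
  = 0.8
\]
```

Under that assumption the pair sits exactly on the 0.8 threshold, and merging yields the single set {A, B, C, D, E, F}.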

initialise list allGroups
initialise list matchedSets
initialise list mergedGroups

FOR ALL G ∈ allGroups DO
    FOR ALL T ∈ allGroups − G DO
        calculate overlap_def(G, T)
        calculate overlap_cont(G, T)
        IF overlap_def(G, T) ≥ 0.5 && overlap_cont(G, T) ≥ 0.5 THEN
            add T to list(groups that overlap with G)
        END IF
    END FOR
    add list(groups that overlap with G) to matchedSets
END FOR

FOR ALL setG ∈ matchedSets DO
    FOR ALL setT ∈ matchedSets − setG DO
        IF setG == setT || setT ⊂ setG || (setG ∩ setT) ≥ 0.8 THEN
            FOR ALL t ∈ setT DO
                IF t ∉ setG THEN
                    add t to setG
                END IF
                remove setT from matchedSets
            END FOR
        END IF
    END FOR
END FOR

FOR ALL setG ∈ matchedSets DO
    merge all groups in setG into supergroupG
    IF supergroupG_cont ≥ 4 THEN
        add supergroupG to mergedGroups
    END IF
END FOR

RETURN mergedGroups

Table 7.4: Pseudocode for the supergroups algorithm

7.2.1 Pseudocode
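The pseudocode in Table 7.4 could be translated into Python roughly as follows. This is a minimal sketch, not the implementation used to produce the results reported in this chapter. It assumes that each group is held as a pair of sets (GO terms for the definition, gene products for the content), that overlap is the cosine similarity of binary set membership, that each matched set includes the group G itself so that mutually overlapping groups yield identical sets, and that a group with no matches is left unmerged. The names `build_supergroups` and `cosine`, the dictionary layout and the toy identifiers are all hypothetical.

```python
from math import sqrt


def cosine(a: set, b: set) -> float:
    """Cosine similarity of two sets treated as binary membership vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def build_supergroups(groups, pair_threshold=0.5, set_threshold=0.8, min_content=4):
    """groups maps a group ID to a (definition, content) pair of sets."""
    # Step 1: match each group with every group that overlaps it sufficiently
    # on both definition and content.
    matched_sets = []
    for g, (g_def, g_cont) in groups.items():
        matches = {g}  # assumption: G is a member of its own matched set
        for t, (t_def, t_cont) in groups.items():
            if (t != g and cosine(g_def, t_def) >= pair_threshold
                    and cosine(g_cont, t_cont) >= pair_threshold):
                matches.add(t)
        if len(matches) > 1:  # assumption: unmatched groups stay unmerged
            matched_sets.append(matches)

    # Step 2: absorb duplicate, subset and closely overlapping matched sets.
    i = 0
    while i < len(matched_sets):
        j = 0
        while j < len(matched_sets):
            set_g, set_t = matched_sets[i], matched_sets[j]
            if j != i and (set_t <= set_g or cosine(set_g, set_t) >= set_threshold):
                set_g |= set_t          # add the members of setT to setG
                del matched_sets[j]     # remove setT from matchedSets
                if j < i:
                    i -= 1              # setG moved one position to the left
            else:
                j += 1
        i += 1

    # Step 3: merge each remaining set, keeping supergroups with enough content.
    merged_groups = []
    for members in matched_sets:
        definition = set().union(*(groups[g][0] for g in members))
        content = set().union(*(groups[g][1] for g in members))
        if len(content) >= min_content:
            merged_groups.append(
                {"members": members, "definition": definition, "content": content}
            )
    return merged_groups


# Toy data with hypothetical IDs: g1 and g2 overlap, g3 stands alone.
toy_groups = {
    "g1": ({"t1", "t2"}, {"a", "b", "c", "d"}),
    "g2": ({"t1", "t2", "t3"}, {"a", "b", "c", "e"}),
    "g3": ({"t9"}, {"z"}),
}
supergroups = build_supergroups(toy_groups)
```

In this toy run, g1 and g2 each produce the same matched set, which is collapsed in the second step and merged once; g3 has no match and is not merged. Keeping the thresholds as parameters makes it straightforward to experiment with separate values for definition and content overlap, an option that was considered but not pursued.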

7.2.2 Merging results

Of the 481 groups originally obtained for the parameters used in this chapter, 244 groups were merged into 54 supergroups, leaving 237 unmerged original groups. Out of the 244 merged groups the majority were merged into a supergroup once, with only 10 groups merged twice and no groups more than twice. The supergroups range in size from 23 to 235 gene products and in definition size from 4 to 161 GO terms. The supergroups and unmerged original groups were visualised in a new colour-