

Having discovered transcript clusters based on retrotransposon content, I was interested in establishing whether similar clusters existed in different samples (e.g., in different cell types). Visual inspection of the plots described above suggests that this is the case, so I devised a method for comparing clusters in order to quantify and formalise these observations.

Separating data into clusters implies that the points within a single cluster are similar to each other; the concept of similarity is defined by the clustering algorithm used. Suppose we have two datasets $D_1$ and $D_2$ representing observations of the same phenomena under different conditions (so that $D_1, D_2 \subset D$, where $D$ is the set of all possible observations). In the context of this project, $D_1$ and $D_2$ could be the retrotransposon content of transcripts in B and T cells.

Now suppose we cluster each dataset, so that we obtain disjoint subsets $C_i^j \subset D_i$, $j = 1, \ldots, N_i$, where $N_i$ is the number of clusters in $D_i$. Choose a pair of clusters $C_1^j$, $C_2^k$. If the elements in each of these clusters are similar in the sense defined by the clustering algorithm (i.e., they would cluster together), then these two clusters can be said to be similar. By comparing every pair of clusters $C_1^j$, $C_2^k$ in this way, we can discover pairs of clusters that are similar.

To formalise this, we create a new dataset $E = D_1 \cup D_2$ and apply the clustering algorithm to create clusters $C_E^j$, $j = 1, \ldots, N_E$. For each cluster $C_1^i$, we can calculate what proportion of its elements are assigned to each $C_E^j$. In this way we can form a matrix $Q$ where each element $Q_{ij}$ is the proportion of elements from $C_1^i$ that are assigned to $C_E^j$. Similarly, we can form a matrix $R$ where each element $R_{ij}$ is the proportion of elements from $C_2^i$ that are assigned to $C_E^j$. Now define the matrix

$$M = Q R^T$$

Therefore the element $M_{ij} = \sum_{k=1}^{N_E} Q_{ik} R_{jk}$ represents the probability that an element from cluster $C_1^i$ and an element from cluster $C_2^j$ are found in the same cluster $C_E^k$, summed over all clusters in $E$. $M_{ij}$ can be used as a score to measure how similar two clusters are. If two clusters have elements that are often found in the same cluster $C_E^k$, then they will have a high score; if their elements are rarely found in the same cluster, the score will be low.
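The construction of the score matrix can be illustrated with a minimal sketch. It assumes that cluster assignments are available as integer labels, one per data point, both from clustering each dataset alone and from clustering the combined dataset E. The function names and the toy labels are hypothetical and chosen purely for illustration; plain Python lists are used to keep the sketch free of dependencies.

```python
def proportion_matrix(own_labels, joint_labels, n_own, n_joint):
    """Row i holds the fraction of cluster i's points found in each cluster of E."""
    counts = [[0] * n_joint for _ in range(n_own)]
    for own, joint in zip(own_labels, joint_labels):
        counts[own][joint] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # An empty cluster contributes a row of zeros (an assumption of this sketch).
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def similarity_matrix(Q, R):
    """Compute the product of Q with the transpose of R: M[i][j] = sum_k Q[i][k] * R[j][k]."""
    return [[sum(q * r for q, r in zip(q_row, r_row)) for r_row in R]
            for q_row in Q]


# Toy example (made-up labels): D1 has 2 clusters, D2 has 2 clusters, E has 3.
labels_1 = [0, 0, 0, 1, 1]   # cluster of each D1 point when D1 is clustered alone
joint_1 = [0, 0, 1, 2, 2]    # cluster of the same points when E is clustered
labels_2 = [0, 0, 1, 1]      # cluster of each D2 point when D2 is clustered alone
joint_2 = [0, 0, 2, 1]       # cluster of the same points when E is clustered

Q = proportion_matrix(labels_1, joint_1, n_own=2, n_joint=3)
R = proportion_matrix(labels_2, joint_2, n_own=2, n_joint=3)
M = similarity_matrix(Q, R)
```

In this toy case the first cluster of each dataset mostly lands in the same cluster of E, so M[0][0] is the largest entry, while M[1][0] is zero because those two clusters never share a cluster of E.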

However, these scores can have a large range, and are not comparable between different datasets, as they depend on the number of clusters found in $E$. In order to make them more comparable and easier to visualise, I transform the elements of $M$ as follows:

$$M_{ij} \to \hat{M}_{ij} = \log_2\left(\frac{1}{N_E} M_{ij} + 1\right)$$

where $N_E$ is the number of clusters found in $E$.
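A minimal sketch of this transformation follows. It assumes the formula is read as log2(M_ij / N_E + 1), applied to every element of M independently. The function name is hypothetical, and the scale of the input values (for example proportions versus percentages) is not specified here, so the inputs below are arbitrary illustrative numbers.

```python
import math


def transform_scores(M, n_joint_clusters):
    """Apply log2(m / N_E + 1) to each element m of the score matrix M."""
    return [[math.log2(m / n_joint_clusters + 1) for m in row] for row in M]


# Arbitrary illustrative values with N_E = 3.
scores = transform_scores([[0.0, 3.0], [9.0, 21.0]], n_joint_clusters=3)
```

Adding 1 inside the logarithm maps a raw score of zero to a transformed score of zero, so pairs of clusters that never share a cluster of E keep the lowest possible score.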

I tested this method by creating a dataset that samples data from several multivariate normal (MVN) distributions and combines them. The elements sampled from each MVN distribution therefore form natural clusters in the dataset. Increasing the covariance of the MVN distributions increases the overlap between these clusters, making accurate clustering more difficult.
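A test dataset of this kind could be generated roughly as follows. This is a sketch under stated assumptions: the covariance matrix is taken to be sigma times the identity (so each coordinate can be sampled independently), and the centres, sample sizes, and function name are hypothetical rather than the values used for the results reported here.

```python
import math
import random


def sample_mvn_mixture(centres, sigma, n_per_component, seed=0):
    """Sample points from MVN distributions with covariance sigma * I (assumed form)."""
    rng = random.Random(seed)   # fixed seed for reproducibility
    std = math.sqrt(sigma)      # standard deviation of each independent coordinate
    points, labels = [], []
    for label, centre in enumerate(centres):
        for _ in range(n_per_component):
            points.append(tuple(rng.gauss(mu, std) for mu in centre))
            labels.append(label)  # index of the distribution the point came from
    return points, labels


# Hypothetical centres in two dimensions; a larger sigma gives more overlap.
centres = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points, labels = sample_mvn_mixture(centres, sigma=1.0, n_per_component=50)
```

The returned labels record which distribution each point was drawn from, which gives a reference against which a clustering of the points can be checked.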

This dataset is clustered, and then split into two datasets based on the results. Some clusters are placed only in one dataset (A), some only in another (B), and the remaining clusters are divided between the two. In this way, A and B contain data that should form corresponding clusters (the clusters split between A and B), and data that certainly do not correspond (the clusters placed in only one of A or B). Hence, by clustering A and B separately, we can infer a true mapping between clusters.
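The splitting step can be sketched as below. The sketch assumes cluster labels for the full dataset are already available, and that points from shared clusters are assigned to A or B at random with equal probability; the exact division rule is not stated above, so this is an illustrative choice, as are the function name and the toy labels.

```python
import random


def split_by_cluster(labels, only_a, only_b, shared, seed=0):
    """Return index lists for A and B given which clusters go where."""
    rng = random.Random(seed)
    idx_a, idx_b = [], []
    for idx, label in enumerate(labels):
        if label in only_a:
            idx_a.append(idx)
        elif label in only_b:
            idx_b.append(idx)
        elif label in shared:
            # Shared clusters are divided at random (assumed 50/50 rule).
            (idx_a if rng.random() < 0.5 else idx_b).append(idx)
    return idx_a, idx_b


# Toy labels: cluster 0 goes to A only, cluster 1 to B only, cluster 2 is shared.
toy_labels = [0] * 10 + [1] * 10 + [2] * 20
idx_a, idx_b = split_by_cluster(toy_labels, only_a={0}, only_b={1}, shared={2})
```

Only the shared clusters can give rise to corresponding clusters in A and B, so they define the true mapping once A and B have been clustered separately.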

I then apply the cluster comparison algorithm described above to A and B. By choosing a score threshold to decide whether two clusters correspond or not, we can compare the found cluster mapping to the true mapping, and thus calculate the true positive rate (TPR) and false positive rate (FPR). By using different covariance values and different score thresholds, I was able to measure the method’s performance on noisy data, and find the optimal value for a score cutoff.
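The evaluation at a given threshold can be sketched as follows. The sketch assumes that a pair of clusters is called corresponding when its score is greater than or equal to the threshold (whether the comparison is strict is an assumption), and that the true mapping is given as a set of (i, j) index pairs. The function name and the score values are made up for illustration and are not results from this work.

```python
def rates_at_threshold(scores, true_pairs, threshold):
    """Return (TPR, FPR) when pairs scoring >= threshold are called corresponding."""
    tp = fp = fn = tn = 0
    for i, row in enumerate(scores):
        for j, score in enumerate(row):
            predicted = score >= threshold
            actual = (i, j) in true_pairs
            if predicted and actual:
                tp += 1
            elif predicted and not actual:
                fp += 1
            elif actual:
                fn += 1
            else:
                tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


# Made-up transformed scores for 2 clusters in A (rows) and 2 in B (columns).
toy_scores = [[8.0, 0.5],
              [2.0, 7.5]]
toy_truth = {(0, 0), (1, 1)}
results = {t: rates_at_threshold(toy_scores, toy_truth, t) for t in (1, 7, 10)}
```

With these toy values, a threshold of 1 accepts one false pair, a threshold of 7 recovers exactly the true pairs, and a threshold of 10 rejects everything, which mirrors the trade-off between lenient and stringent thresholds.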

The testing results are shown in Figure 3.6 and Table 3.3. As expected, choosing very low score thresholds results in high false positive rates, whereas overly stringent thresholds cause true positives to be missed. A threshold of 7 seems to perform well across the covariance values, even when noise is high, with true positive rates of 1 and false positive rates at or close to 0. It should be noted that in order to maintain reproducibility and reliable clustering structure, the testing data is somewhat artificial; however, it does indicate that the method performs well, and gives a guideline for choosing a score threshold for correspondence. Visual inspection of results from real data also suggests that this method performs well (see Figure 5.11 for an example).

Figure 3.6: Receiver operating characteristic (ROC) plots showing the performance of the cluster comparison algorithm with different covariance values and different score thresholds. Sigma represents the value used to construct the covariance matrix for the MVN distributions. As the covariance increases and clustering becomes more noisy, the false positive rate (FPR) increases; however, using a score threshold of 7.0 produces optimal true positive rates (TPRs) in every case, and low FPRs.

Covariance  Threshold  TPR   FPR
1.0         1          1.00  0.00
1.0         4          1.00  0.00
1.0         7          1.00  0.00
1.0         10         0.00  0.00
1.0         13         0.00  0.00
2.5         1          1.00  0.06
2.5         4          1.00  0.00
2.5         7          1.00  0.00
2.5         10         0.00  0.00
2.5         13         0.00  0.00
5.0         1          1.00  0.22
5.0         4          1.00  0.12
5.0         7          1.00  0.00
5.0         10         0.00  0.00
5.0         13         0.00  0.00
7.5         1          1.00  0.42
7.5         4          1.00  0.25
7.5         7          1.00  0.02
7.5         10         0.00  0.00
7.5         13         0.00  0.00

Table 3.3: A representative subset of the results from testing the cluster comparison algorithm (full results can be found in Online Resources). This confirms the observations from Figure 3.6 that 7.0 is a good choice of score threshold, as it produces optimal TPRs with minimal FPRs (usually zero or close to zero).