To enumerate combinations of multiple objects, we first need to collect the information for each single object. A naive approach, as used in [80], is to preprocess the raw data for each object according to its genotype and phenotype, together with a sample id list. Figure 3.3 (a1) displays the data format after preprocessing. The first column holds the object id. The second and third columns hold the phenotype and genotype for this object. The last column is a list of the ids of the samples whose value for this object matches the phenotype and genotype shown in the second and third columns. Therefore, all the rows with the same object id in Figure 3.3 (a1) belong to a single object.
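The naive layout described above can be sketched as follows. This is only an illustration under assumptions: the toy samples, the tuple layout and the name `naive_preprocess` are hypothetical and are not taken from Figure 3.1 or from [80].

```python
from collections import defaultdict

# Hypothetical toy input (not the data of Figure 3.1):
# one tuple per sample: (sample id, phenotype, genotype of each object).
samples = [
    (1, "Heart Attacks", ["AA", "AT"]),
    (2, "Breast Cancer", ["AA", "TT"]),
    (3, "Heart Attacks", ["AT", "AT"]),
    (4, "Heart Attacks", ["AA", "TT"]),
]


def naive_preprocess(samples):
    """Group sample ids by (object id, phenotype, genotype)."""
    groups = defaultdict(list)
    for sample_id, phenotype, genotypes in samples:
        # Each position in the genotype list is treated as one object (column).
        for object_id, genotype in enumerate(genotypes):
            groups[(object_id, phenotype, genotype)].append(sample_id)
    return dict(groups)


rows = naive_preprocess(samples)
for (object_id, phenotype, genotype), sample_ids in sorted(rows.items()):
    print(object_id, phenotype, genotype, sample_ids)
```

Each printed row has the four parts of the naive format: object id, phenotype, genotype and the sample id list.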

However, this preprocessing technique is inefficient both for statistical testing and in its memory usage. Therefore, we propose a new technique, called IRBI (Integer Representation and Bitmap Indexing), which is CPU-efficient for statistical testing as well as storage- and memory-efficient.

IRBI uses integers to represent the long string data values in the raw data. Furthermore, IRBI builds a Bitmap index for each object to facilitate CPU-efficient statistical testing. Considering the example in Figure 3.1, we use 0 to represent the phenotype value “Heart Attacks” and 1 to represent “Breast Cancer”. For the genotype data, AA, AT and TT are represented as 0, 1 and 2 respectively. It is important to note that the number of objects in these systems is large and many terms are long strings, such as “Pineoblastoma and supratentorial primitive neuroectodermal tumors”. The IRBI method therefore substantially reduces the data size.
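A minimal sketch of the integer representation follows. It assumes codes are assigned in first-seen order, which is one possible choice and not necessarily the one used in COSAC; the helper name `build_encoder` is hypothetical.

```python
def build_encoder(values):
    """Map each distinct string to a small integer, in first-seen order.

    First-seen ordering is an assumption made for this sketch.
    """
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


# Values taken from the example in the text.
phenotype_codes = build_encoder(["Heart Attacks", "Breast Cancer", "Heart Attacks"])
genotype_codes = build_encoder(["AA", "AT", "TT", "AT"])

print(phenotype_codes)
print(genotype_codes)
```

With these inputs the codes agree with the text: “Heart Attacks” and “Breast Cancer” become 0 and 1, and AA, AT and TT become 0, 1 and 2.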

To collect single object information, we build a Bitmap index for each object (each column of the table) based on the phenotype and genotype. Each bit in the Bitmap index corresponds to a sample id. For a given phenotype and genotype, the bit for a sample is set to 1 if the sample has that phenotype and genotype, and to 0 otherwise. Figure 3.3 (b1) shows the index data format under IRBI. For example, the first five rows are the index data for the first object. As the phenotype and genotype of the samples with ids 2, 6 and 8 are 0 and 0, the index data in the first row of the newly formatted data is “10100010”: counting positions from the right and starting at 1, the bits at positions 2, 6 and 8 are set to 1. Thus, the IRBI approach reduces both the data size and the memory usage.
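The bitmap for the example above can be reproduced with a short sketch. It assumes sample ids start at 1 and that sample id s is stored at bit s - 1 of a Python integer, which is consistent with the string “10100010” in the text; the function names are hypothetical.

```python
def build_bitmap(sample_ids):
    """Set one bit per sample id (assumption: ids start at 1, id s -> bit s - 1)."""
    bitmap = 0
    for sample_id in sample_ids:
        bitmap |= 1 << (sample_id - 1)
    return bitmap


def to_bit_string(bitmap, num_samples):
    """Render the bitmap with the highest sample id on the left."""
    return format(bitmap, "0{}b".format(num_samples))


# Samples 2, 6 and 8 share phenotype 0 and genotype 0 in the example.
index = build_bitmap([2, 6, 8])
print(to_bit_string(index, 8))  # prints 10100010
```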

Now, recall that in the statistical analysis, the contingency table has to be collected for each combination.

To construct the contingency table, the first step is to calculate n_i(j,k) for each cell in the table. To calculate n_i(j,k) for the pair of object x and object y, we need the information from ⟨x, i, j, list1(sampleID)⟩ in object x and ⟨y, i, k, list2(sampleID)⟩ in object y. We can derive n_i(j,k) from the intersection of the two sample id lists. Under the naive scheme (without Bitmaps), shown in Figure 3.3 (a2), this can be done using a hashing method: first, we build a hash table for the sample ids in the first list; second, we use the sample ids in the second list to probe the hash table for matching sample ids. For example, to get n_0(1,0) for the pair of object 0 and object 1, we intersect the two sample lists as shown in Figure 3.3 (a2). However, our preliminary study suggests that collecting the contingency table in this way is computationally expensive.
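The build-and-probe step of the naive scheme can be sketched as below. The function name and the two sample id lists are hypothetical, and the sketch assumes neither list contains duplicate ids.

```python
def count_intersection_hash(list1, list2):
    """Count sample ids present in both lists using a hash table."""
    table = set(list1)  # build phase: hash the first sample id list
    # Probe phase: look up every id of the second list in the hash table.
    return sum(1 for sample_id in list2 if sample_id in table)


# Hypothetical sample id lists for two objects.
n = count_intersection_hash([2, 6, 8], [1, 2, 8])
print(n)  # prints 2
```

Every cell of every contingency table needs one such build and probe, which is where the cost of the naive scheme accumulates.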

In our COSAC framework, our solution employs the Bitmap index. As mentioned above, instead of storing the sample ids in a list, we build a Bitmap index for each object. Figure 3.3 (b1) shows the Bitmap index data for the raw data in Figure 3.1. In our framework, all the operations are based on the index data. With the index data, to calculate n_i(j,k) for the pair of object x and object y, we need the information from ⟨x, i, j, index⟩ in object x and ⟨y, i, k, index⟩ in object y. We can apply an AND operation to the two index entries to find their intersection more efficiently, and then obtain the number of samples in the intersection by counting the 1 bits in the AND result. Figure 3.3 (b2) depicts how n_0(1,0) for the pair of object 0 and object 1 can be calculated using the Bitmap index. Thus, the IRBI approach is much more CPU-efficient than the naive scheme for collecting the data (the contingency table) during statistical testing. When we collect the contingency table for more than two objects, similar operations can be conducted. For example, to combine three objects, we can combine two objects first and then combine the result with the third object.
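The AND-and-count step, including the chaining over three objects, can be sketched as follows. The bitmaps are hypothetical Python integers rather than the data of Figure 3.3, and the dictionary layout keyed by (phenotype, genotype) is an assumption made for illustration.

```python
def count_intersection_bitmap(*bitmaps):
    """AND the bitmaps together, then count the 1 bits of the result."""
    combined = bitmaps[0]
    for bitmap in bitmaps[1:]:
        combined &= bitmap  # combine two first, then fold in the next object
    return bin(combined).count("1")


def contingency_counts(index_x, index_y):
    """Compute n_i(j,k) for every matching phenotype i of two objects.

    index_x, index_y: dicts mapping (phenotype, genotype) -> bitmap.
    This layout is an assumption for the sketch.
    """
    table = {}
    for (i, j), bitmap_x in index_x.items():
        for (i_y, k), bitmap_y in index_y.items():
            if i == i_y:
                table[(i, j, k)] = count_intersection_bitmap(bitmap_x, bitmap_y)
    return table


# Hypothetical index entries for two objects over 8 samples.
index_x = {(0, 0): 0b10100010, (0, 1): 0b00000101}
index_y = {(0, 0): 0b10000011, (1, 0): 0b01000000}

pair_table = contingency_counts(index_x, index_y)
triple = count_intersection_bitmap(0b10100010, 0b10000011, 0b00000010)
print(pair_table, triple)
```

Compared with the hash-based version, the intersection here is a single bitwise AND per pair of entries, and no hash table is built.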

Now, we introduce how to efficiently build the Bitmap index under the MR framework. In our COSAC framework, the index building is conducted in one MR job, where the map phase parses the objects from each sample and the reduce phase builds the index for each object. Taking the data in Figure 3.1 as an example, the map phase reads the raw data line by line, where each line is the data of an individual sample. The map function parses the different objects in each line and emits (key,value) pairs, one per object, which include all the other necessary information for that object, such as the sample id and the phenotype data. Our own partitioning function partitions the pairs based on the object, and the MR library shuffles all the (key,value) pairs belonging to the same object to a designated reducer. Once the reduce function has all the information for one object, it builds the index based on the sample ids related to this object, and then writes the index data to HDFS.
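The map, partition and reduce steps can be imitated in a single process, as in the sketch below. It is plain Python and not Hadoop code: the function names, the modulo partitioner and the integer-encoded toy samples are all assumptions made for illustration.

```python
from collections import defaultdict


def map_sample(sample_id, phenotype, genotypes):
    """Map step: emit one (object id, value) pair per object in the sample."""
    for object_id, genotype in enumerate(genotypes):
        yield object_id, (phenotype, genotype, sample_id)


def partition(object_id, num_reducers):
    """Assumed partitioner: all pairs of one object go to the same reducer."""
    return object_id % num_reducers


def reduce_object(values):
    """Reduce step: build one bitmap per (phenotype, genotype) of an object."""
    index = defaultdict(int)
    for phenotype, genotype, sample_id in values:
        index[(phenotype, genotype)] |= 1 << (sample_id - 1)
    return dict(index)


def build_index(samples, num_reducers=2):
    """Simulate map, shuffle and reduce sequentially in one process."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for sample_id, phenotype, genotypes in samples:
        for object_id, value in map_sample(sample_id, phenotype, genotypes):
            reducers[partition(object_id, num_reducers)][object_id].append(value)
    result = {}
    for bucket in reducers:
        for object_id, values in bucket.items():
            result[object_id] = reduce_object(values)
    return result


# Hypothetical integer-encoded samples: (sample id, phenotype, genotypes).
encoded_samples = [(1, 0, [0, 1]), (2, 1, [0, 2]), (3, 0, [1, 1]), (4, 0, [0, 2])]
object_index = build_index(encoded_samples)
print(object_index)
```

In a real MR job the shuffle is performed by the library and the reducers run in parallel; the sketch only mirrors the data flow.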

One optimization we have adopted is to combine the small index data into one big file. From our observation, if we store the index data of each object in its own file, there are too many small files. During further processing, the MR library assigns each small file to one mapper, which incurs too much overhead for launching a large number of mappers. Thus, we combine all the index data emitted from one reducer into one big file. This greatly reduces the overhead of further processing on the index data. Note that the integer representation is also conducted in the map phase, which builds a one-to-one mapping between the raw data values and the integer values.
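One way to picture the combined file is as one text record per index entry, with all records of a reducer written together. The tab-separated record layout and the function names below are assumptions; the source does not specify the on-disk format.

```python
def serialize_reducer_output(indexes, num_samples):
    """Write all index entries of one reducer as lines of a single file body.

    Assumed record layout: object id, phenotype, genotype, bit string.
    """
    lines = []
    for object_id in sorted(indexes):
        for (phenotype, genotype), bitmap in sorted(indexes[object_id].items()):
            bits = format(bitmap, "0{}b".format(num_samples))
            lines.append("{}\t{}\t{}\t{}".format(object_id, phenotype, genotype, bits))
    return "\n".join(lines)


def parse_reducer_output(text):
    """Read the combined file body back into per-object indexes."""
    indexes = {}
    for line in text.splitlines():
        object_id, phenotype, genotype, bits = line.split("\t")
        key = (int(phenotype), int(genotype))
        indexes.setdefault(int(object_id), {})[key] = int(bits, 2)
    return indexes


# Hypothetical indexes for two objects handled by the same reducer.
reducer_indexes = {0: {(0, 0): 0b10100010}, 1: {(0, 1): 0b00000101, (1, 0): 0b01000000}}
combined_file = serialize_reducer_output(reducer_indexes, 8)
print(combined_file)
```

A later job can then read one large input per reducer instead of one small file per object.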

For CSA applications, the Bitmap index is an effective structure. First, the workload is read-mostly: there are very few update operations, so there is no need to change the Bitmap index frequently. Second, each object has few candidate values. In other words, each column has a small domain (i.e., there are few distinct values), so the Bitmap index will not be too sparse. Last, the number of samples (rows) is small, which guarantees that the Bitmap index is not large. We note that index building is an efficient, cheap, one-time pre-processing step in our framework.
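The size argument can be made concrete with a rough upper bound: one bitmap per (phenotype, genotype) combination and one bit per sample. The function name and the sizes used below are hypothetical, and no compression or file overhead is considered.

```python
def bitmap_index_bytes(num_samples, num_phenotypes, num_genotypes):
    """Upper bound on the uncompressed bitmap bytes for one object."""
    bitmaps_per_object = num_phenotypes * num_genotypes
    bytes_per_bitmap = (num_samples + 7) // 8  # round up to whole bytes
    return bitmaps_per_object * bytes_per_bitmap


# Hypothetical sizes: 1000 samples, 2 phenotypes, 3 genotypes.
print(bitmap_index_bytes(1000, 2, 3))  # prints 750
```

Combinations that never occur need no bitmap, so the stored index can be smaller than this bound.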

Discussion: In a broader context, the technique proposed above can be used in many applications besides CSA applications. On one hand, the integer representation approach can be used in many computation-intensive analysis applications to reduce the data size, resulting in more memory-efficient computation. On the other hand, the proposed Bitmap index approach can be used in systems which use statistical methods as the evaluation tool. Conducting a large number of statistical tests is expensive, and the technique we adopted is a promising solution for making such tests efficient.

Table 3.2: Frequently Used Notations in COSAC

Symbol   Description
n        the number of total objects
s        the number of seed objects
n-s      the number of non-seed objects
m        the analysis level
r        the number of reducers (processing units)
