According to the general ‘knowledge discovery in databases’ (KDD) process, data selection and integration are followed by data transformation (Ester, 2000) into a format suitable for data mining. In general, this includes standardizing values, deleting irrelevant attributes, or converting numerical values into discrete values. Since we already excluded irrelevant attributes during data integration, the attribute deletion step can be omitted.
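The two remaining transformation steps, standardizing values and converting numerical values into discrete ones, can be illustrated with a minimal Python sketch. The function names, the accession format, the cut points, and the labels are all hypothetical and chosen only for illustration; they are not taken from the PHOSIDA implementation.

```python
from bisect import bisect_right


def standardize_accession(raw: str) -> str:
    """Trim whitespace, upper-case, and drop a version suffix such as '.2'."""
    return raw.strip().upper().split(".")[0]


def discretize(value: float, cut_points: list[float], labels: list[str]) -> str:
    """Map a numerical value to the label of the interval it falls into.

    cut_points must be sorted; len(labels) must equal len(cut_points) + 1.
    A value equal to a cut point is assigned to the upper interval.
    """
    return labels[bisect_right(cut_points, value)]


# Hypothetical cut points for a score between 0 and 1.
CUTS = [0.5, 0.75]
LABELS = ["low", "medium", "high"]

accession = standardize_accession(" ipi00021812.2 ")
score_class = discretize(0.8, CUTS, LABELS)
```

Here `accession` holds the cleaned identifier and `score_class` the discrete class of the score, so both values can be stored and compared consistently.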

Verifying the overlaps of large-scale data between different projects, for example, requires joins of the relevant relations. Each join is implemented by a single SQL statement, which demonstrates the strength of database technology. If each join statement yielded a ‘real’ relation, however, data redundancy would increase. The solution to this problem is the concept of ‘views’, or virtual relations, whose main purpose is to avoid data redundancy. Thus, to derive the number of identified peptides shared between different experiments, for instance, one joins the two tables that store the peptides identified in each experiment. This creates a virtual relation that contains the peptides common to the two given experiments (Figure 4.14).

Figure 4.14: A database join of two relation instances (‘peptides A’, ‘peptides B’) containing detected peptides of a given project results in the creation of a virtual relation (‘overlap’)
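The join-as-view idea can be sketched with Python's built-in `sqlite3` module. The table names, the single `sequence` column, and the peptide sequences are hypothetical placeholders, and SQLite merely stands in for the database system actually used; the sketch only shows that a view stores the join query rather than a second copy of the rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two hypothetical relations, one per experiment.
    CREATE TABLE peptides_a (sequence TEXT PRIMARY KEY);
    CREATE TABLE peptides_b (sequence TEXT PRIMARY KEY);
    INSERT INTO peptides_a VALUES ('AAAK'), ('SPLTR'), ('TGSPK');
    INSERT INTO peptides_b VALUES ('SPLTR'), ('TGSPK'), ('VVEER');

    -- The view stores only the query, not a copy of the joined rows.
    CREATE VIEW overlap AS
        SELECT a.sequence
        FROM peptides_a AS a
        JOIN peptides_b AS b ON a.sequence = b.sequence;
""")

# Peptides identified in both experiments.
shared = [row[0] for row in
          conn.execute("SELECT sequence FROM overlap ORDER BY sequence")]
```

Because the view is evaluated when it is queried, rows added to either base table later are reflected in `overlap` without any redundant storage.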

To deduce the overlaps of phosphoproteomes between different organisms, we used the database relations that store evolutionary information, such as homology between species, for each integrated phosphorylated protein (Chapter 4.2.4). This simple way of handling data, once it is stored in a consistent format, again underlines the benefit of databases.

As discussed in Chapter 4.5, the PHOSIDA database schema that stores non-redundant data, such as 1:1 assignments between peptides and proteins (Chapter 4.2.1.1), is the one used for data mining of phosphoproteomes. On the one hand, we implemented mining tools in C#, including statistical tests such as the χ² test to check for significant overrepresentation of matching kinase motifs. These self-coded methods rely on a consistent database schema with categorical requirements for data storage and for applications, including the mining tools that derive significant patterns from the managed data. Another prime example is the training of the support vector machine (Chapter 7). The implementation of organism-specific predictors requires consistent database storage to obtain positive instances, namely phosphorylation sites along with their surrounding sequence, as training sets. On the other hand, we used established public mining tools that are freely available to the community (Chapter 4.5). The software Cytoscape (Maere et al., 2005; Shannon et al., 2003) determines whether certain GO categories describing cell components, functions, or biological processes are significantly overrepresented in a given set of proteins in comparison to the whole gene ontology annotation of a specified species. Although established mining methods are relatively easy to handle and user-friendly, the required input files have to satisfy certain format specifications. Such format constraints demand conversions of accession numbers, combinations of various annotation sets, and further formatting. Hence, the PHOSIDA administration tool (Chapter 4.2.6) includes various C# classes that enable the database administrator to create the differently formatted files that are required as input for these mining tools (Figure 4.15). The underlying joins between relations storing protein annotations and relations containing phosphoproteome data, as well as accession number conversions, for example, are executed automatically.

Figure 4.15: The PHOSIDA administration tool allows the conversion of accession numbers, joins on various annotation tables, or specified formatting of files required as input for certain mining methods such as Cytoscape
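The kind of χ² test mentioned above for motif overrepresentation can be sketched as follows. The original tools were written in C#; this Python version is only an illustration, and the 2×2 layout (phosphosites versus background sites, motif match versus no match) as well as the counts are assumptions, since the exact contingency table is not specified here. The p-value uses the fact that, for one degree of freedom, the upper tail of the χ² distribution equals erfc(√(χ²/2)).

```python
from math import erfc, sqrt


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-square test (no continuity correction) for the table

                      motif match   no match
        phosphosites       a            b
        background         c            d

    Returns the test statistic and the p-value for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one count")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: 30 of 100 phosphosites match the motif,
# compared with 10 of 100 background sites.
chi2, p_value = chi_square_2x2(30, 70, 10, 90)
```

With these hypothetical counts the statistic is 12.5, which exceeds the usual 1% critical value of 6.63 for one degree of freedom, so the motif would be called significantly overrepresented.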

Besides the mining of integrated large-scale data, the web application of PHOSIDA also demands an appropriate transformation of uploaded data. One prime example is the unification of different project-specific subdatabases into one comprehensive organism-specific database (Figure 4.16). Because the various databases are updated regularly, the spectrum-to-peptide assignments are often based on different database releases. For example, the identification of the human phosphoproteome upon epidermal growth factor stimulation (Chapter 4.6.1.1.1) was based on the human IPI database version 3.24, whereas the study of cell cycle dependent phosphorylation dynamics of kinases in human cells (Chapter 4.6.1.1.2) was based on IPI version 3.13. To unify the two subdatabases into one consistent database comprising both detected human phosphoproteomes, it is indispensable to transfer the given data to a common database version. This is also required to determine the overlaps between large-scale studies. Therefore, we reassigned the detected phosphorylated peptides to a more current database version, resulting in new peptide-to-protein assignments. When the amino acid sequence of a database entry changes, the positions of identified phosphosites within the protein sequence can change as well. In very few cases (less than 1%), identified peptide sequences cannot be reassigned to a more recent database release. Although the number of peptides that are not present in a more current database version is minuscule, this shows that databases do lose correct protein sequences between versions. With phosphoproteomic data assigned to a common database, it is possible to compare the phosphorylation changes observed under different treatments using the PHOSIDA web page. The reassignment of peptides to an up-to-date database release was also essential to unite the different subdatabases annotated in the former version of the proteome database MAPU, resulting in the new release MAPU 2.0 (Chapter 5).
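The reassignment step can be illustrated with a minimal Python sketch based on exact substring matching. The function name, the accession `P1`, and the toy sequences are hypothetical, and the actual procedure may differ, for example in how a single protein is chosen when a peptide matches several entries.

```python
def reassign_peptide(peptide: str, site_offset: int,
                     proteins: dict[str, str]) -> list[tuple[str, int]]:
    """Locate a peptide in the protein sequences of a database release.

    site_offset is the 0-based index of the phosphorylated residue within
    the peptide. Returns (accession, 1-based site position in the protein)
    for every exact match; an empty list means the peptide cannot be
    assigned to this release.
    """
    hits = []
    for accession, sequence in proteins.items():
        start = sequence.find(peptide)
        while start != -1:
            hits.append((accession, start + site_offset + 1))
            start = sequence.find(peptide, start + 1)
    return hits


# Hypothetical entry whose sequence gained two residues between releases.
OLD_RELEASE = {"P1": "MKTSPLTRAAK"}
NEW_RELEASE = {"P1": "MAAKTSPLTRAAK"}

old_hits = reassign_peptide("SPLTR", 0, OLD_RELEASE)  # site at position 4
new_hits = reassign_peptide("SPLTR", 0, NEW_RELEASE)  # site moves to position 6
lost = reassign_peptide("VVEER", 2, NEW_RELEASE)      # no match in this release
```

The example shows both effects described above: the phosphosite position shifts when the protein sequence changes, and a peptide absent from the newer release yields no assignment.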

Finally, the reassignment of identified peptides to another database was also one of the main underlying principles of the genome annotation study, which used the genomic database EnsEMBL to assign peptides to gene transcript entries (Chapter 8).

Figure 4.16: To unify various large-scale data, the identified phosphopeptides have to be reassigned to a shared and more current database version