CHAPTER II THEORETICAL FRAMEWORK
2.7. ADMINISTRATIVE BASES OF THE COMPANY
In the era of high-throughput computational biology, discovering the biological functions of the genes/proteins within an organism is a central goal. Several studies have applied machine learning to infer functional properties of proteins, or directly predict one or more functions for unknown proteins (Clare and King, 2003; Qi and Noble, 2011). Indeed, the prediction of multiple biological functions with a single model, by using learning methods which exploit multi-label prediction, has made considerable progress (Barutcuoglu et al, 2006) in recent years.
A step forward is represented by models that consider possible structural relationships among functional class definitions (Jiang et al, 2008; Vens et al, 2008). This is motivated by the presence of ontologies and catalogs such as Gene Ontology (GO) (Ashburner et al, 2000) and MIPS-FUN (FUN henceforth) (Mewes et al, 1999), which are organized hierarchically (possibly in the form of Directed Acyclic Graphs (DAGs), where classes may have multiple parents), so that general functions include other, more specific functions (see Fig. 5.4). In this context, the hierarchical constraint must be considered: a gene annotated with a function must also be annotated with all the ancestors of that function in the hierarchy. To tackle this problem, hierarchical multi-label classifiers, which take the hierarchical organization of the classes into account during both the learning and the prediction phase, have recently been used (Barutcuoglu et al, 2006).
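The hierarchical constraint described above can be illustrated with a minimal Python sketch. It assumes the class hierarchy is stored as a dictionary mapping each class to its list of parents, which also covers DAGs with multiple parents. The class names, the `PARENTS` dictionary and the helper functions are hypothetical and serve only as an illustration; they are not taken from GO or FUN.

```python
from collections import deque

def ancestors(cls, parents):
    """Return every ancestor of `cls` in a hierarchy given as {child: [parents]}."""
    seen = set()
    queue = deque(parents.get(cls, ()))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(parents.get(node, ()))
    return seen

def close_annotations(labels, parents):
    """Add all ancestor classes so that the label set satisfies the constraint."""
    closed = set(labels)
    for cls in labels:
        closed |= ancestors(cls, parents)
    return closed

def is_consistent(labels, parents):
    """True if every annotated class comes with all of its ancestors."""
    return close_annotations(labels, parents) == set(labels)

# Hypothetical toy DAG: "lipid transport" has two parents.
PARENTS = {
    "metabolism": ["root"],
    "transport": ["root"],
    "lipid metabolism": ["metabolism"],
    "lipid transport": ["transport", "lipid metabolism"],
}

closed = close_annotations({"lipid transport"}, PARENTS)
print(sorted(closed))
```

In this toy example, a gene annotated only with "lipid transport" violates the constraint until both of its parents and their ancestors, up to the root, are added.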
120 Learning PCTs for HMC from Network Data
The use of protein-protein interaction (PPI) networks in the identification and prediction of protein functions has attracted increasing attention in recent years. This stream of research is mainly motivated by the consideration that “when two proteins are found to interact in a high throughput assay, we also tend to use this as evidence of functional linkage” (Jiang et al, 2008). As a confirmation, numerous studies have demonstrated that proteins sharing similar functional annotations tend to interact more frequently than proteins which do not share them (guilt-by-association principle). Interactions reflect the relation or dependence between proteins. In the context of such interactions, gene functions show some form of autocorrelation.
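The guilt-by-association principle can be illustrated with a simple neighbor-voting sketch in Python. This is a generic baseline and not the method proposed in this chapter: each candidate function is scored by the fraction of annotated interaction partners that carry it. The protein identifiers, edges and annotations are hypothetical toy data.

```python
from collections import Counter

def neighbor_vote(protein, edges, annotations):
    """Score each function by the fraction of annotated neighbors that carry it."""
    neighbors = set()
    for a, b in edges:
        if a == protein:
            neighbors.add(b)
        elif b == protein:
            neighbors.add(a)
    neighbors.discard(protein)  # ignore self-interactions
    annotated = [n for n in neighbors if n in annotations]
    if not annotated:
        return {}  # no evidence available from the network
    counts = Counter(f for n in annotated for f in annotations[n])
    return {f: c / len(annotated) for f, c in counts.items()}

# Hypothetical toy PPI network: P1 is unannotated, P4 has no known function.
EDGES = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")]
ANNOTATIONS = {
    "P2": {"transport"},
    "P3": {"transport", "metabolism"},
}

scores = neighbor_vote("P1", EDGES, ANNOTATIONS)
print(scores)
```

Note that such a vote ignores the class hierarchy and breaks down when a protein has few or no annotated neighbors, which is one reason why sparse networks are difficult for purely neighborhood-based approaches.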
Protein-protein interactions occur when two or more proteins bind together, often in order to carry out their biological function. Many of the most important molecular processes in the cell, such as DNA replication, are carried out by large molecular machines that are built from a large number of protein components organized by protein-protein interactions. Protein interactions have been studied from the perspectives of biochemistry, quantum chemistry, molecular dynamics, chemical biology, signal transduction and other metabolic or genetic/epigenetic networks. Indeed, protein-protein interactions are at the core of the entire interactomics system of any living cell. Interactions between proteins are important for the majority of biological functions. For example, signals from the exterior of a cell are mediated to the inside of that cell by protein-protein interactions of the signaling molecules.
The use of such relationships among proteins introduces the autocorrelation phenomenon into the problem of gene function prediction and violates the assumption that instances are independently and identically distributed (i.i.d.), adopted by most machine learning algorithms. Recall that while correlation denotes any statistical relationship between different variables (properties) of the same objects (in a collection of independently selected objects), autocorrelation (Cressie, 1993) denotes the statistical relationships between the same variable (e.g., protein function) on different but related (dependent) objects (e.g., interacting proteins). As described in the introductory chapters of this thesis, autocorrelation has been mainly studied in the context of regression analysis of temporal (Mitsa, 2010) and spatial (LeSage and Pace, 2001) data.
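To make the notion of autocorrelation on a network concrete, the following Python sketch computes Moran's I, a standard spatial autocorrelation statistic, for a single binary function label over an undirected interaction graph. It is a generic illustration under simplifying assumptions (one label at a time, unit edge weights, no hierarchy) and is not the HMC-specific measure introduced in this chapter. The node names and values are hypothetical.

```python
def morans_i(values, edges):
    """Moran's I for one variable on an undirected graph with unit edge weights.

    values: {node: numeric value}, e.g. 1 if a protein has the function, else 0
    edges:  list of (node_a, node_b) pairs, each undirected edge listed once
    """
    n = len(values)
    mean = sum(values.values()) / n
    dev = {node: v - mean for node, v in values.items()}
    denom = sum(d * d for d in dev.values())
    if denom == 0 or not edges:
        raise ValueError("Moran's I is undefined for constant values or no edges")
    # Each undirected edge contributes twice to the symmetric weight matrix.
    cross = 2 * sum(dev[a] * dev[b] for a, b in edges)
    total_weight = 2 * len(edges)
    return (n / total_weight) * (cross / denom)

# Hypothetical chain of four interacting proteins: A - B - C - D.
CHAIN = [("A", "B"), ("B", "C"), ("C", "D")]

clustered = morans_i({"A": 1, "B": 1, "C": 0, "D": 0}, CHAIN)
alternating = morans_i({"A": 1, "B": 0, "C": 1, "D": 0}, CHAIN)
print(clustered, alternating)
```

Positive values indicate that interacting proteins tend to share the label (as in the clustered assignment), while negative values indicate that neighbors tend to differ (as in the alternating assignment).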
Although autocorrelation has not yet been studied in the context of Hierarchical Multi-label Classification (HMC), it is not a new phenomenon in protein studies. For example, it has been used for predicting protein properties using sequence-derived structural and physicochemical features of protein sequences (Horne, 1988). In this work, we propose a definition of autocorrelation for the case of HMC and propose a method that considers such autocorrelation in gene function prediction.
The consideration of PPI network data for HMC has received limited attention so far. One of the works that addresses this problem is that of Jiang et al (2008), which presents a probabilistic approach that classifies a protein on the basis of the conditional probability that it belongs to a child function class, given that it belongs to the parent function class. The computation of this conditional probability takes the PPI network into account in terms of the number of neighbors of a node. Although similar to the approach we propose later in this chapter, this approach has several limitations. First, it has problems with sparse networks and does not properly exploit the autocorrelation phenomenon. Second, in order to deal with DAGs, it transforms a DAG into a hierarchy by removing hierarchical relationships present at lower levels, thus ignoring possibly useful information. Third, this work reports experiments only for GO annotations.
The work presented in this chapter is intended to be a further step toward the investigation of methods which originate from the intersection of these two promising research areas, namely hierarchical multi-label classification and learning in the presence of autocorrelation. In particular, we propose an algorithm for hierarchical multi-label prediction of protein functional classes. The algorithm exploits the non-stationary autocorrelation phenomenon which comes from protein-protein interaction by means of tree-based prediction models. In this way, we are able to:
• improve the predictive capabilities of learned models
• obtain predictions consistent with the network structure
Figure 8.1: An example of DIP Yeast network. Different colors correspond to different classes of the FUN hierarchy. (a) Examples that are not connected are arranged along the ellipse’s border; (b) Examples are arranged along the ellipse’s border to show the density of the PPI interactions; (c) Examples are arranged along the ellipse’s border and grouped according to the first level of the FUN hierarchy (not considering the root); (d) Examples are arranged along the ellipse’s border and grouped according to the second level of FUN. The networks are drawn using the Pajek software by Batagelj and Mrvar (1998).
[...] definitions and PPI networks)
• capture the non-stationary effect of autocorrelation at different levels of the hierarchy
• also work with DAG (directed acyclic graph) structures, where a class may have multiple parents.
We first introduce new measures of autocorrelation for HMC tasks. We then describe the learning of PCTs for the HMC task in more detail. We focus on the proposed algorithm and its computational complexity.