Large-scale acquisition of gene expression profiles can provide deeper insight into the function of cells. A variety of mathematical formalisms for modelling this type of data have been proposed in the literature; Ventura et al. [2006] provide a wider review of mathematical modelling and its application in biology. So far, these modelling approaches have been most successful for simpler organisms such as E. coli and S. cerevisiae [Cantone et al., 2009].

Genes forming a specific gene regulatory network (GRN) may be simulated under a variety of conditions and used to test hypotheses. Conversely, the observation of gene behaviour under specific conditions may be used to infer the underlying GRN. Generally speaking, the reconstruction of a GRN from observed measurements is known as a “reverse engineering” approach.

In general, there are two well-known information extraction approaches, characterised as “top-down” and “bottom-up”, which have been applied to inferring GRNs from high-throughput data. A “top-down” approach starts from experimental observations of the whole system and breaks it down in order to gain insight into its parts. Alternatively, in a “bottom-up” approach, researchers attempt to build up a system using observations from its individual components.

1.3.1 Modelling and reverse engineering approaches

Mathematical and statistical models offer a powerful way to understand and describe observations by expressing them in terms of a variety of alternative mathematical/statistical frameworks. The benefits of using mathematical models lie in their ability to enhance our understanding of a system, to make quantitative predictions from past and present observations, and to condense previously observed behaviour into a concise framework.

Ordinary differential equation (ODE) models describe the rate of change of each gene's expression with respect to time as a function of the expression of the other genes and of an external perturbation. The model has one differential equation for each gene in the network, and its parameters are inferred from the gene expression data.
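As a minimal sketch of this formalism, the following simulates a two-gene network under the assumption of a linear model dx/dt = A x + B u. The linear form, the matrices A and B, the perturbation u and all numeric values are illustrative assumptions rather than anything taken from the text; in a reverse engineering setting, A and B would be the parameters to infer from data.

```python
import numpy as np

def simulate_linear_grn(A, B, u, x0, dt=0.01, n_steps=2000):
    """Integrate dx/dt = A @ x + B @ u with forward Euler steps."""
    x = np.asarray(x0, dtype=float)
    trajectory = [x.copy()]
    for _ in range(n_steps):
        dxdt = A @ x + B @ u          # one rate equation per gene
        x = x + dt * dxdt
        trajectory.append(x.copy())
    return np.array(trajectory)

# Hypothetical network: gene 0 activates gene 1, and both genes decay.
A = np.array([[-1.0,  0.0],
              [ 0.8, -0.5]])
# The external perturbation acts on gene 0 only.
B = np.array([[1.0],
              [0.0]])
u = np.array([1.0])

traj = simulate_linear_grn(A, B, u, x0=np.zeros(2))

# At steady state dx/dt = 0, so A @ x = -B @ u.
steady_state = np.linalg.solve(A, -B @ u)
```

The final row of `traj` can be compared with `steady_state` to check that the simulated expression levels settle where the model predicts.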

In information-theoretic approaches, the gene network is reconstructed by considering one pair of genes at a time and checking the co-expression of the two genes across the experimental data set. Evaluation of co-expression between two genes can be done either by correlation or by using a mutual information score [Bansal et al., 2007].
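The two pairwise scores can be illustrated on synthetic data. The sketch below assumes a simple histogram-based estimator of mutual information; the helper `mutual_information`, the bin count and the three simulated genes are hypothetical choices made for illustration, not the estimator used in the cited work.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram-based estimate of the mutual information of x and y (nats)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()                 # empirical joint distribution
    p_x = p_xy.sum(axis=1, keepdims=True)      # marginal of x, shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)      # marginal of y, shape (1, bins)
    independent = p_x @ p_y                    # joint expected under independence
    mask = p_xy > 0                            # skip empty cells (0 * log 0 = 0)
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / independent[mask])))

rng = np.random.default_rng(0)
n_samples = 500
gene_a = rng.normal(size=n_samples)
gene_b = gene_a + 0.3 * rng.normal(size=n_samples)   # co-expressed with gene_a
gene_c = rng.normal(size=n_samples)                  # unrelated to gene_a

# Pairwise Pearson correlation between all genes (rows are genes).
corr = np.corrcoef(np.vstack([gene_a, gene_b, gene_c]))

mi_ab = mutual_information(gene_a, gene_b)
mi_ac = mutual_information(gene_a, gene_c)
```

A network would then be built by keeping the gene pairs whose score exceeds a chosen threshold; how that threshold is set is left open here.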

1.3.1.1 Bayesian Networks

A Bayesian network (BN) is a probabilistic graphical model defined on a directed acyclic graph (DAG). In the model each node describes a random variable, and the edges encode conditional dependence relations between the random variables. For example, an edge from node x to node y represents a statistical dependency between the variables x and y. Further, the arrow indicates that x influences y. Node x is a parent of y, and y is a child of x. In a broader sense these relations define the descendants of a node, i.e. the set of nodes that can be reached from it by following directed edges, and correspondingly its ancestors. No node can be its own ancestor, because the graph is acyclic.

A BN encodes the conditional independence statement that each variable is independent of its non-descendants in the graph given the state of its parents. This property is very useful for reducing the number of parameters needed to define the joint probability distribution of the variables. This reduction also leads to a better estimation of posterior probabilities.
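The size of this reduction can be made concrete with a small count. The sketch below assumes binary variables, for which a node with k parents needs one free probability per parent configuration (2^k in total), whereas an unrestricted joint distribution over n variables needs 2^n - 1. The function names and the ten-gene chain are hypothetical.

```python
def full_joint_parameters(n_vars):
    """Free parameters of an unrestricted joint over n binary variables."""
    return 2 ** n_vars - 1

def bn_parameters(parent_counts):
    """Free parameters of a BN over binary variables.

    parent_counts[i] is the number of parents of node i; each node needs
    one probability for every configuration of its parents.
    """
    return sum(2 ** k for k in parent_counts)

# Hypothetical chain of ten genes: the first has no parent, the rest have one.
chain = [0] + [1] * 9

n_full = full_joint_parameters(len(chain))   # 1023
n_bn = bn_parameters(chain)                  # 19
```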

These kinds of relationships are useful for representing gene-gene interactions, which can be visualised by a directed graph without cycles. “Without cycles” (acyclic) means that a gene cannot interact with itself, either directly or indirectly through other genes. This approach can be used to reverse engineer a gene network by finding the directed acyclic graph that best describes the gene expression data. The particular limitation of a directed acyclic graph can be overcome by using a dynamic Bayesian network if time series observations are available (for more details see next section) [Husmeier et al., 2005].
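The acyclicity constraint can be checked directly from the definition of descendants. The following sketch represents a graph as a dictionary mapping each node to its children; the representation, the helper names and the three-gene example are assumptions made for illustration.

```python
def descendants(children, node):
    """Return every node reachable from `node` by following directed edges."""
    seen = set()
    stack = list(children.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, ()))
    return seen

def is_acyclic(children):
    """A directed graph is acyclic when no node is its own descendant."""
    nodes = set(children) | {c for cs in children.values() for c in cs}
    return all(node not in descendants(children, node) for node in nodes)

# Hypothetical three-gene network with edges x -> y, x -> z and y -> z.
dag = {"x": ["y", "z"], "y": ["z"]}

# Adding the feedback edge z -> x creates a cycle, so this graph is not
# a valid structure for a (static) Bayesian network.
with_feedback = {"x": ["y", "z"], "y": ["z"], "z": ["x"]}
```

Here `is_acyclic(dag)` holds while `is_acyclic(with_feedback)` does not, which is the situation a structure search over DAGs has to exclude.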

BNs provide a flexible framework for giving a diagrammatic representation of the probabilistic relationships between sets of variables. In our case, these variables are gene expression measurements, and establishing relationships among them defines interactions between the genes. The interactions between a set of genes can be defined in terms of conditional independence relations [Husmeier et al., 2005]. The overall representation of a BN can be given by a graphical structure G = (V, E), where V is the set of vertices and E is the set of edges. Together with a conditional probability distribution for each node given its parents, G specifies a joint distribution over the set of random variables of interest.

For example, consider any given joint distribution P(x, y, z) over three variables x, y and z. By using the product rule of probability the joint distribution can be written as

P(x, y, z) = P(z|x, y) P(x, y).

By using a second application of the product rule we can factorise P(x, y) as P(y|x) P(x), giving

P(x, y, z) = P(z|x, y) P(y|x) P(x).

This factorization is shown graphically in Figure 1.6.

[Figure: directed graph with nodes X, Y and Z]

Figure 1.6: A directed graph representing a factorization of the joint probability distribution over three variables x, y, and z.
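The factorization P(x, y, z) = P(z|x, y) P(y|x) P(x) holds for any joint distribution, which can be checked numerically. The sketch below uses a randomly generated joint table over three discrete variables; the table sizes and the random seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Random joint distribution P(x, y, z); the axes are ordered (x, y, z).
joint = rng.random((2, 3, 2))
joint /= joint.sum()

p_x = joint.sum(axis=(1, 2))                 # P(x)
p_xy = joint.sum(axis=2)                     # P(x, y)
p_y_given_x = p_xy / p_x[:, None]            # P(y | x)
p_z_given_xy = joint / p_xy[:, :, None]      # P(z | x, y)

# Multiply the three factors back together.
reconstructed = (p_z_given_xy
                 * p_y_given_x[:, :, None]
                 * p_x[:, None, None])
```

`reconstructed` agrees with `joint` up to floating-point error, since the two applications of the product rule involve no approximation.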

Despite technological advances in measuring gene expression levels as time series for thousands of genes, the complex nature of the data does not allow us to explore all of the factors that might contribute to genetic regulation and the interactions among genes. Bayesian networks have the advantage of modelling hidden factors, making them very powerful tools for inferring gene networks. However, BNs have some limitations, e.g. (a) self-regulation and feedback loops are likely features of GRNs, but the strict use of a DAG makes it impossible to capture any direct cycle or feedback loop without the use of time series observations, and (b) discretization of data for BN analysis may result in a loss of information from continuous gene expression measurements.

1.3.1.2 Dynamic Bayesian Networks

Dynamic Bayesian Networks (DBNs) are Bayesian networks that model sequences of variables. Murphy and Mian [1999] first introduced the use of DBNs to model gene expression data. The benefits of DBNs include the ability to handle latent variables and missing data (such as TF protein concentrations that affect the steady-state concentrations of mRNA) and to model stochasticity. Friedman et al. [2000] explored experimental applications to microarray data analysis.

Feedback loops can also be unfolded with respect to time, by explicitly modelling the influence of a gene at time t = 1 (i.e. G1) on another gene at a later time t = 2 (i.e. G2), as shown in Figure 1.7.
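The unfolding can be sketched in a few lines. The example assumes a first-order model in which every regulatory edge links time t to time t + 1; the gene names "A" and "B" and the `unroll` helper are hypothetical.

```python
def unroll(edges, n_slices):
    """Unfold gene-level edges into edges between time-indexed nodes.

    Each edge (regulator, target) becomes
    (regulator, t) -> (target, t + 1) for t = 1, ..., n_slices - 1.
    """
    return [((src, t), (dst, t + 1))
            for t in range(1, n_slices)
            for src, dst in edges]

# A two-gene feedback loop, which a static DAG cannot represent.
feedback_loop = [("A", "B"), ("B", "A")]

unrolled = unroll(feedback_loop, n_slices=3)

# Every unrolled edge points strictly forward in time, so no path can
# return to its starting node and the unrolled graph is acyclic.
forward_only = all(src[1] < dst[1] for src, dst in unrolled)
```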