implying that the two simulators use these intermediate variables in a similar way.

• High correlations between iz.pd and the latter two points of iz.dn with SPE for both datasets suggest that these intermediate variables are used differently in the two simulators.

• When the OG99NPZD emulator is used over HAD1007, the zooplankton-related intermediates are quite strongly correlated with the SPE. This consolidates the conclusion from step 13 that these intermediates are more influential in HadOCC.

Summary

This stage has improved our understanding of how the different intermediate processes contribute to OG99NPZD and HadOCC, and where differences lie. Studying the correlations between intermediate variables and output gave an indication of the most important intermediate variables and of their effects. Using the emulator of one simulator over the intermediate variable data of the other, made possible by the formation of the same intermediate variables in each simulator, enables a direct analysis of how similarly the two simulators use their processes. Both of these steps enabled us to identify possible avenues for further investigation of differences between OG99NPZD and HadOCC.
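The cross-simulator step summarised above can be sketched in a few lines. The following is only an illustration under strong assumptions: a one-variable least-squares line stands in for the emulator, the data are synthetic, and the names `fit_line`, `predict`, `sim_a_iv` and so on are hypothetical. The analysis in this chapter was carried out in R with full emulators; Python is used here purely as neutral notation.

```python
def fit_line(x, y):
    """Least-squares fit of y = a + b * x; a simple stand-in for an emulator."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict(coef, x):
    """Evaluate the fitted line at each value in x."""
    intercept, slope = coef
    return [intercept + slope * xi for xi in x]


# Synthetic intermediate variable and output data (not thesis data).
sim_a_iv = [0.0, 1.0, 2.0, 3.0]
sim_a_out = [1.0, 3.0, 5.0, 7.0]
sim_b_iv = [0.5, 1.5, 2.5]
sim_b_out = [2.0, 4.5, 7.0]

# Build the stand-in emulator on simulator A, then use it over simulator B's data.
emulator_a = fit_line(sim_a_iv, sim_a_out)
pred_b = predict(emulator_a, sim_b_iv)
errors = [obs - pred for obs, pred in zip(sim_b_out, pred_b)]
```

In this toy example the errors grow with the intermediate variable, which is the kind of systematic pattern that would suggest the two simulators use that variable differently.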

6.7 Further directions

This chapter has covered the process of using intermediate variable emulation to compare two simulators and better understand each of them. Various aspects of the methods shown could be developed, refined or augmented to bring improvement. Some possible areas for further work are discussed here.

Dimension reduction

Principal variables have been used in this chapter, largely because they are relatively robust to different trends and patterns in the data, and simple to interpret in the latter stages of intermediate variable emulation. In some cases a functional approximation, for example a smoothing spline, might capture the features of the intermediate variable data much better, and allow more information to be retained. This requires careful study of how the method will be chosen using the data, how the parameters will be interpreted, and of situations in which using an alternative dimension reduction method would be particularly beneficial.

Choice of dimension reduction technique may also depend on the overall goal of emulation. If a particular output variable is of interest, it may be better to choose a subset of the full intermediate variables that best predicts this value, rather than choosing the subset that best represents the intermediate variables. This may involve purely data analytic techniques, but it could make use of understanding of the simulators. For example, if only the previous time point is used at each stage in the process, the intermediate variables from the time point before the one in question might be a good choice.

Experimental design in intermediate variable space

It has been noted that one of the main difficulties in emulating the output from the intermediate variables is our lack of control over the intermediate variables, and therefore our inability to specify a design over that space. While this is inevitably true (unless the simulator itself can be re-written to take intermediate variables as inputs), history matching techniques could be used, along with the emulators from input to intermediate variables, to create an approximate design.
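One way this idea might work is sketched below, using the implausibility measure that is standard in history matching: candidate inputs are kept only if the input-to-intermediate emulator cannot rule out the desired intermediate value. Everything here is an assumption made for illustration: the toy emulator `toy_emulator`, its constant variance, the target value and the conventional cut-off of 3 are all hypothetical choices, and Python is used only as neutral notation.

```python
import math


def implausibility(target, emulator_mean, emulator_var):
    """Standardised distance between a target value and an emulator prediction."""
    return abs(target - emulator_mean) / math.sqrt(emulator_var)


def toy_emulator(x):
    """Hypothetical input-to-intermediate emulator returning (mean, variance)."""
    return 2.0 * x, 0.01


# Desired value of the intermediate variable at one point of the intended design.
target_iv = 1.0

# Candidate input values 0.0, 0.1, ..., 1.0.
candidates = [i / 10 for i in range(11)]

# Keep the inputs whose predicted intermediate value is not implausibly far
# from the target; these could then be run through the simulator.
kept = [
    x for x in candidates
    if implausibility(target_iv, *toy_emulator(x)) < 3.0
]
```

Repeating this for each point of a target design in intermediate variable space would give a set of inputs whose simulator runs approximately fill that design, subject to the caveats discussed next.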

In order to be able to specify the nature of the intermediate variable space, the full set of intermediate variables would have to be jointly emulated from the input variables, which is not the case in this chapter. There is also no guarantee that all regions of the intermediate variable space can be filled; the values of intermediate variables may be inherently linked in such a way that prevents, for example, a high value of one and a low value of another. This may in itself be interesting, particularly if the limitations are different for the different simulators.

Where two intermediates are generally highly correlated, it may be possible to use the input to intermediate variable emulators to produce data where this is not the case. This could then provide more information about the effects of these variables, in situations like that of iz.pz and iz.dz in HadOCC, whose effects are difficult to tell apart in the example in Section 6.6.3.

Model selection for intermediate to output variables

In the examples in this chapter, the regression surfaces were built using the stepwise selection procedure in R (R Development Core Team, 2011), searching by adding and deleting terms, and allowing squared and second-order interaction terms. However, this has certain shortfalls, particularly in being unable to choose between highly correlated variables, and therefore potentially leading to spurious conclusions if the effects of some variables are attributed solely to one. Existing work on model selection in regression problems with highly collinear input variables could be used here.

One potential solution would be to emulate the output using the principal components of the intermediate variables, rather than the intermediate variables themselves (as in principal component regression). Although this might seem more difficult to interpret, the diagnostics and plots used in the example in Section 6.6.3 are all also possible with a principal component emulator.

When one simulator is ‘better’

In this chapter the simulators are being compared without one being judged to be more reliable or accurate. However, often it may be the case that one generally performs better against observed data, or has had much more time and effort invested in it, than another. In this case, intermediate variable emulation could perhaps be viewed more as a tool for using the ‘better’ simulator to inform the other. In particular, the ranges and behaviour of the intermediate variables could be viewed as standards, and used to deduce sensible input ranges for the other. This leads into the much broader area of using expert knowledge in order to interpret the findings of intermediate variable emulation.

As a general emulation strategy

It has been mentioned throughout that intermediate variable emulation can be of use in a single simulator context. This has mostly been related to an increased understanding of the simulator through use of intermediate variables. However, there might be situations in which intermediate variable emulation is a better strategy than standard emulation. Appendix C.3 explores the idea of combining the input to intermediate and intermediate to output variable emulators to form an emulator from input to output variables. Figures C.8 and C.9 show samples from the combined intermediate variable emulators.

Because intermediate variable emulation splits the simulator into different stages, and offers more flexibility in how the individual emulators are constructed, it may be better equipped to deal with simulators that contain complicated relationships for particular processes.

6.8 Summary

This chapter has presented intermediate variable emulation, a method enabling emulation of multiple simulators of the same system in a way that improves understanding of each, and facilitates comparison. Methods have been illustrated throughout using OG99NPZD and HadOCC. Some pairs of highly active input parameters, given different meanings in HadOCC and OG99NPZD, were shown to affect almost all intermediate variables similarly, therefore suggesting links between the two input spaces. Other inputs that are unique to one simulator were shown to be largely inactive, lessening the motivation to link the input spaces in full.

Emulators from the intermediate to output variables showed that there are systematic differences between the two simulators. The transfer of nitrogen from phytoplankton to nutrient, iz.pn, is the most active in both HadOCC and OG99NPZD, and appears to be treated very similarly. Other transfers, particularly iz.dn and iz.pd, appear to contribute quite differently to the two simulators.

Unlike hierarchical emulation, intermediate variable emulation does not require that the simulators’ input spaces be almost the same, but instead makes use of similar processes represented in each simulator, using them to create a set of ‘intermediate’ variables. By analysing the distributions and trends of the intermediate variables, differences in the general behaviour of the simulators can be understood.

For each simulator, the input variables can be used to emulate the intermediate variables, enabling a detailed study of the relationships between the input spaces. Using expert knowledge of the system, unrealistic values of the intermediate variables can be used to refine the input spaces through history matching and similar techniques.

Emulators of the output variables from the intermediate variables can also be created for each simulator. Although the intermediate variable spaces are likely to present difficulties for emulation because of their irregular shapes and collinearity, having emulators with the same input and output variables for all simulators enables direct comparison. Not only can the effects of the intermediate variables on the output be observed for each simulator, but the emulator of one simulator can also be used to predict the behaviour of another. Studying the behaviour of the errors for these predictions reveals the key systematic differences between the simulators’ representations of the system.

Chapter 7

An object-oriented structure for emulation

Up to this point, the focus has been on methods for emulation, rather than on their implementation. Because of the quantity of data and the number of operations involved in building emulators, the only feasible approach is to program. For this thesis, all emulation was done in R (R Development Core Team, 2011), and in an object-oriented way using the S4 class structure. In this chapter we explore the benefits of object-oriented programming and apply them specifically to emulation. First of all, we motivate object-oriented programming, and then introduce the S4 classes in R. A framework for emulation is then presented, and extended to incorporate the new methods from Chapters 5 and 6.

7.1 Why use objects?

In this thesis, methods for emulation have been presented that use large amounts of simulator data, perform many calculations, and result in large collections of results. Many collections of the same sort of data or results may be stored, and may need to be accessed by different people or after long breaks, and so the potential for mistakes and inefficiency is high. For instance, time-consuming calculations such as finding the inverse or Cholesky decomposition of the covariance matrix of the correlated errors (the matrix Σ(x) in the notation of Chapter 3) or estimating the correlation lengths (see Section 3.3.3) may be repeated often as new techniques are tried, or even as the same operations are carried out at different times or by different people. The details of the correlation or regression surface associated with a set of predictions may be lost or confused.
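To make the point about repeated calculations concrete, the sketch below keeps a covariance matrix and its Cholesky factor together in one object, so that the factor is computed at most once and cannot become separated from the matrix it belongs to. This is an illustration only, not the framework presented later in this chapter: the class `CovarianceModel` and its counter `n_factorisations` are hypothetical, the matrix is a toy example, and Python is used as neutral notation although the thesis code is in R.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L L^T = a, for symmetric positive-definite a."""
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                diag = a[i][i] - partial
                if diag <= 0:
                    raise ValueError("matrix is not positive definite")
                lower[i][j] = math.sqrt(diag)
            else:
                lower[i][j] = (a[i][j] - partial) / lower[j][j]
    return lower


class CovarianceModel:
    """Hypothetical object holding a covariance matrix and its cached factor."""

    def __init__(self, sigma):
        self.sigma = [list(row) for row in sigma]
        self._chol = None
        self.n_factorisations = 0  # counts how often the expensive step ran

    @property
    def chol(self):
        # Compute the factor on first use only, then reuse the stored copy.
        if self._chol is None:
            self._chol = cholesky(self.sigma)
            self.n_factorisations += 1
        return self._chol


model = CovarianceModel([[4.0, 2.0], [2.0, 3.0]])
first = model.chol
second = model.chol  # served from the cache, no second factorisation
```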

It is also likely that as understanding of the problem develops and new techniques are devised, existing code will need to be adapted to deal with new sorts of simulator data, or to perform new tasks. For example, code that performs standard emulation as in Chapter 3 may need to be extended to be able to use a new correlation function or to perform hierarchical emulation (Chapter 5) or intermediate variable emulation (Chapter 6). Ideally, this would not require a new set of functions written entirely from scratch, but could be built on an existing foundation. Object-oriented programming (OOP) addresses each of these issues.

Rather than focus on functions, OOP revolves around tightly structured objects and their interactions. In OOP, information that belongs together is encapsulated as one object. All objects belong to a particular class, and classes have strict definitions; knowing the class of a particular object means knowing exactly what each part of the object is, and how the different components relate to one another. The structure of the data is maintained without any part being lost, a feature that is not guaranteed when components are stored separately. In emulation, there may be delays between designing an experiment, running a simulator, building emulators and making predictions, and several similar processes may be ongoing at once. An object-oriented structure ensures that no information is confused or lost.
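A minimal sketch of this encapsulation is given below. It assumes a hypothetical class `SimulatorRuns` whose definition checks, at creation, that the components are consistent with one another. The field names and values are invented for illustration, and Python is used only as neutral notation; the S4 classes actually used are introduced later in this chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimulatorRuns:
    """Hypothetical class keeping the inputs and outputs of a set of runs together."""

    simulator: str
    input_names: tuple
    inputs: tuple   # one tuple of input values per run
    outputs: tuple  # one output value per run

    def __post_init__(self):
        # The class definition is strict: inconsistent components are rejected.
        if len(self.inputs) != len(self.outputs):
            raise ValueError("each run needs exactly one output")
        for row in self.inputs:
            if len(row) != len(self.input_names):
                raise ValueError("each run needs one value per input name")


# Synthetic values; "a" and "b" are placeholder input names.
runs = SimulatorRuns(
    simulator="HadOCC",
    input_names=("a", "b"),
    inputs=((0.1, 0.2), (0.3, 0.4)),
    outputs=(1.0, 2.0),
)
```

Because the object is frozen and validated, any code receiving `runs` can rely on the inputs, outputs and simulator name belonging together.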

Classes can be related to one another through inheritance. Alfons et al. (2010) describe this as one of the main advantages of object-oriented programming. Inheritance allows sub-classes to inherit their structure and behaviour from their super-class, each sub-class extending the super-class in some way. Thus several classes may be created representing fundamentally the same sort of information, but each in a slightly different way, or with extra features.
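As a small illustration of inheritance, the sketch below defines a hypothetical super-class `Emulator` with a constant mean, and a sub-class `LinearEmulator` that inherits the stored data and the `describe` behaviour but extends the mean to a linear trend. Both classes and the data are invented for this example, and Python is used only as neutral notation.

```python
class Emulator:
    """Hypothetical super-class: stores the runs and predicts a constant mean."""

    def __init__(self, inputs, outputs):
        self.inputs = list(inputs)
        self.outputs = list(outputs)

    def mean(self, x):
        return sum(self.outputs) / len(self.outputs)

    def describe(self):
        return f"{type(self).__name__} built from {len(self.outputs)} runs"


class LinearEmulator(Emulator):
    """Sub-class: inherits storage and describe(), replaces the mean function."""

    def __init__(self, inputs, outputs):
        super().__init__(inputs, outputs)
        n = len(self.inputs)
        mean_x = sum(self.inputs) / n
        mean_y = sum(self.outputs) / n
        sxx = sum((x - mean_x) ** 2 for x in self.inputs)
        sxy = sum(
            (x - mean_x) * (y - mean_y)
            for x, y in zip(self.inputs, self.outputs)
        )
        self.slope = sxy / sxx
        self.intercept = mean_y - self.slope * mean_x

    def mean(self, x):
        return self.intercept + self.slope * x


# Synthetic runs, used for both objects.
base = Emulator([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
linear = LinearEmulator([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
```

Any code written for `Emulator` objects, such as a call to `describe()`, works unchanged on a `LinearEmulator`.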

Outside of OOP, in order to deal with different forms of the same sort of data, functions must contain many checks to discern the meanings and features of their arguments each time they are evaluated, in order to know how to behave. Another advantage of OOP is multiple dispatch. This streamlines the way functions are made and used. A generic function is created representing the goal or task at hand, and methods are written for this function, dispatching on various combinations of classes of arguments, or signatures. When the function is called, the classes of the arguments are checked against the signature of each method, and the correct method is dispatched. Many methods can be written for each function, so that the same task can be performed using the same function with any manifestation of the same sort of information, so long as the relevant methods have been written.
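The mechanism can be sketched with a very small hand-written generic function. This is not how S4 is implemented; it is a toy version, assuming plain classes only, in which the helper `Generic`, the data classes and the correlation classes are all hypothetical names. Python has no built-in multiple dispatch, so the lookup is written out explicitly.

```python
class Generic:
    """Toy generic function that dispatches on the classes of all arguments."""

    def __init__(self, name):
        self.name = name
        self.methods = {}  # signature (tuple of classes) -> implementation

    def method(self, *signature):
        def register(func):
            self.methods[signature] = func
            return func
        return register

    def __call__(self, *args):
        # Find every method whose signature matches the classes of the arguments.
        matches = [
            sig for sig in self.methods
            if len(sig) == len(args)
            and all(isinstance(arg, cls) for arg, cls in zip(args, sig))
        ]
        if not matches:
            classes = ", ".join(type(arg).__name__ for arg in args)
            raise TypeError(f"no method of {self.name} for ({classes})")

        # Prefer the most specific match, measured by inheritance distance.
        def distance(sig):
            return sum(type(arg).__mro__.index(cls) for arg, cls in zip(args, sig))

        return self.methods[min(matches, key=distance)](*args)


class SimulatorData: pass
class TimeSeriesData(SimulatorData): pass
class SquaredExponential: pass
class Matern: pass


build = Generic("build")


@build.method(SimulatorData, SquaredExponential)
def build_scalar_sqexp(data, corr):
    return "scalar, squared exponential"


@build.method(TimeSeriesData, SquaredExponential)
def build_series_sqexp(data, corr):
    return "time series, squared exponential"


@build.method(SimulatorData, Matern)
def build_any_matern(data, corr):
    return "any data, Matern"
```

Here `build(TimeSeriesData(), SquaredExponential())` selects the method written for that exact pair of classes, while `build(TimeSeriesData(), Matern())` falls back to the method written for the super-class `SimulatorData`.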

This helps enormously with maintaining and adapting code. Suppose one has a framework for emulation for a particular sort of simulator data, held in objects of a particular class, and that throughout the code the simulator data is handled using functions with methods defined for that class. In the event that another class is created, containing a slightly different form of simulator data, one can simply write new methods for each function to be able to handle the new class, and any code using these functions will still work. In a non-object-oriented setting, it can be much more difficult to adapt code to deal with such a fundamental change. Methods also provide flexibility in allowing objects of the same class to be created from various different combinations of arguments.

OOP is also often preferred at an ideological level. Rather than think in terms of long sequences of instructions and procedures starting with primitive items, it is posited by many that people generally think in terms of meaningful objects, and interesting operations one might want to perform on them. Leisch (2004), who holds this view, uses the example of probability distributions. He argues that it is intuitive to store pdfs and cdfs as objects, and to define operations on them, for example the mean, variance, random sampling or some sort of plot, rather than to write separate functions for each operation and distribution.
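The probability distribution example might look as follows in outline. This is our own sketch of the idea, not code from Leisch (2004): the class names, the choice of two distributions and the operations shown are assumptions made for illustration, and Python is used only as neutral notation.

```python
import math
import random


class Distribution:
    """Common interface: every distribution object supports the same operations."""

    def mean(self):
        raise NotImplementedError

    def variance(self):
        raise NotImplementedError

    def sample(self, n, rng):
        raise NotImplementedError

    def sd(self):
        # Defined once here and shared by every sub-class.
        return math.sqrt(self.variance())


class Normal(Distribution):
    def __init__(self, mu, sigma):
        self.mu = mu
        self.sigma = sigma

    def mean(self):
        return self.mu

    def variance(self):
        return self.sigma ** 2

    def sample(self, n, rng):
        return [rng.gauss(self.mu, self.sigma) for _ in range(n)]


class Uniform(Distribution):
    def __init__(self, low, high):
        self.low = low
        self.high = high

    def mean(self):
        return (self.low + self.high) / 2

    def variance(self):
        return (self.high - self.low) ** 2 / 12

    def sample(self, n, rng):
        return [rng.uniform(self.low, self.high) for _ in range(n)]


rng = random.Random(1)
distributions = [Normal(0.0, 2.0), Uniform(0.0, 6.0)]

# The same operations apply to each object, whatever its class.
summaries = [(d.mean(), d.variance(), d.sd()) for d in distributions]
draws = [d.sample(5, rng) for d in distributions]
```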

As the S4 emulation framework is presented in Section 7.3 and extended in Sections 7.4 and 7.5, the benefits of OOP for emulation will become clearer. Before outlining this framework, we will introduce the S4 class structure in R (R Development Core Team, 2011; Chambers, 1998).