Classical control examples are commonly used to illustrate the general purpose of a control technique and to validate or benchmark an algorithmic implementation. Two typical examples are the “Left-Right” and the “Car on the Hill” problems, both of which we use here. The “Left-Right” problem is intuitive, relatively easy to learn, and highly stochastic. The “Car on the Hill” problem is less intuitive and deterministic, and the numerical derivation of the optimized policy is considered difficult because the success rate in the training set is extremely low.
Classical control problem – Left-Right
A simple example that illustrates the purpose of control in classical engineering is the Brownian motion of a particle confined between two absorbing boundaries, x ∈ [0, 10]. The particle can be pushed to the left or to the right by applying discrete control signals u ∈ U = {−2, 2}. If it crosses the left boundary a reward of 50 is given; if it crosses the right boundary a reward of 100 is given. The dynamics of the particle are described by the simple SDE in Eq. 5.6:

$$x_t = x_0 + \int_0^t d\omega_s + \int_0^t u(s)\,ds \qquad (5.6)$$

with $d\omega_s$ Gaussian white noise. The aim of the control policy is thus to direct the stochastic particle to one of the boundaries within a given time interval, that is, within a constrained finite time horizon, while maximizing the reward.
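As an illustration, Eq. 5.6 can be simulated with an Euler-Maruyama scheme. The Python sketch below is not the implementation used in this work. The function name `simulate_left_right`, the step size `dt`, the horizon `t_max`, and the noise scale `sigma` (where 1 corresponds to Eq. 5.6) are assumptions introduced for the example, as are the two constant policies.

```python
import numpy as np

def simulate_left_right(x0, policy, dt=0.01, t_max=10.0, sigma=1.0, rng=None):
    """Euler-Maruyama rollout of Eq. 5.6 with absorbing boundaries at 0 and 10.

    policy maps the current position x to a control u in {-2, +2}.
    Returns (trajectory, reward), where reward is 50 for an exit on the left,
    100 for an exit on the right, and 0 if no boundary is hit before t_max.
    dt, t_max and sigma are illustrative assumptions, not values from the text.
    """
    rng = np.random.default_rng() if rng is None else rng
    xs = [float(x0)]
    for _ in range(int(round(t_max / dt))):
        x = xs[-1]
        dw = sigma * np.sqrt(dt) * rng.normal()   # Brownian increment
        xs.append(x + policy(x) * dt + dw)        # drift u*dt plus noise
        if xs[-1] <= 0.0:
            return np.array(xs), 50.0             # absorbed at the left boundary
        if xs[-1] >= 10.0:
            return np.array(xs), 100.0            # absorbed at the right boundary
    return np.array(xs), 0.0                      # horizon reached, no reward

def always_left(x):
    return -2.0

def always_right(x):
    return 2.0
```

Calling `simulate_left_right(5.0, always_right)` returns one noisy trajectory together with the reward collected at the boundary it reached, if any. Setting `sigma=0.0` switches the noise off and leaves only the drift, which is convenient for checking the boundary handling.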
Intuitively, we expect the reward within a finite time horizon to be maximized by directing the particle to the left boundary if it starts close to that boundary. If the particle starts somewhere in the middle, it is better to steer it to the right boundary, which carries the larger reward. We therefore expect a threshold rule as the optimized control strategy, but what is the threshold value?
[Figure 5.4: panel a) plots Q10 against x; panel b) plots sample trajectories, x against t.]
Figure 5.4: Algorithmic estimates of the optimal control for the benchmark “Left-Right”. a) Numerical estimate of the optimal control policy obtained with the fitted-Q algorithm after 10 iterations. The score for the left kick is shown in green and that for the right kick in red. The estimated control policy is state dependent and is obtained for each x by taking the action with the largest Q-score. The lines cross at the expected threshold value x = 2.7. b) Two trajectories starting from initial conditions below (green) and above (red) the threshold, controlled using the estimated policy shown in a). The absorbing boundaries at x = 0 and x = 10 are shown as gray dashed lines.
fying the most suitable threshold value given the data set. It turns out to be 2.7 for the discount factor γ = 0.75. Even though this is only a one-dimensional problem and the solution can almost be guessed, learning the threshold rule is impeded by the noise. This example is therefore often chosen as a benchmark to show both that a proposed algorithm can learn under noise and that its implementation is correct. The results of our implementation on this classical example are shown in Fig. 5.4; they demonstrate that fitted-Q is a suitable framework for learning under noise and that the implementation is correct.
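To make the learning procedure concrete, the following Python sketch runs a fitted-Q style iteration on one-step samples of the “Left-Right” problem. It is a minimal illustration under stated assumptions, not the implementation evaluated in Fig. 5.4: the regressor is replaced by a simple average over a uniform grid of `N_BINS` cells, transitions are sampled with a unit time step, and the sample size, seed, and all function names are hypothetical choices made for the example. Only the action set, the rewards, the discount factor γ = 0.75, and the 10 iterations are taken from the text.

```python
import numpy as np

GAMMA = 0.75                       # discount factor quoted in the text
ACTIONS = np.array([-2.0, 2.0])    # left kick, right kick
N_BINS = 40                        # assumption: resolution of the stand-in regressor

def sample_transitions(n, rng):
    """Draw n one-step transitions of Eq. 5.6 with unit time step (assumption)."""
    x = rng.uniform(0.0, 10.0, n)
    a = rng.integers(0, 2, n)                          # index into ACTIONS
    x_next = x + ACTIONS[a] + rng.normal(0.0, 1.0, n)
    reward = np.where(x_next <= 0.0, 50.0, np.where(x_next >= 10.0, 100.0, 0.0))
    done = (x_next <= 0.0) | (x_next >= 10.0)          # absorbed at a boundary
    return x, a, reward, x_next, done

def bin_index(x):
    """Map positions to cells of a uniform grid on [0, 10]."""
    return np.clip((x / 10.0 * N_BINS).astype(int), 0, N_BINS - 1)

def fitted_q(n_samples=20000, n_iter=10, seed=0):
    """Return Q as an (N_BINS, 2) array after n_iter fitted-Q iterations."""
    rng = np.random.default_rng(seed)
    x, a, reward, x_next, done = sample_transitions(n_samples, rng)
    cell = bin_index(x) * 2 + a                        # flat (bin, action) index
    counts = np.bincount(cell, minlength=2 * N_BINS)
    next_bin = bin_index(x_next)
    q = np.zeros((N_BINS, 2))
    for _ in range(n_iter):
        # Regression target: reward plus discounted best score at the next state.
        target = reward + GAMMA * np.where(done, 0.0, q[next_bin].max(axis=1))
        # "Fit" step of the stand-in regressor: average targets per (bin, action).
        sums = np.bincount(cell, weights=target, minlength=2 * N_BINS)
        q = (sums / np.maximum(counts, 1)).reshape(N_BINS, 2)
    return q

def greedy_policy(q, x):
    """Control with the largest Q-score at position x."""
    return ACTIONS[np.argmax(q[bin_index(np.asarray(x, dtype=float))], axis=-1)]
```

With `q = fitted_q()`, the two columns of `q` play the role of the green and red curves in Fig. 5.4 a), and `greedy_policy(q, x)` returns the corresponding control. The position at which the greedy action switches can be compared with the threshold quoted above; how closely it agrees depends on the grid resolution and the sample size chosen here.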
Classical control problem – Car on the Hill
Another classical problem often used to illustrate the purpose of optimal control in engineering is the “Car on the Hill”. Consider the landscape illustrated in Fig. 5.5, which contains a dip between a hill and a sheer drop, with a car positioned at the minimum. The control aim is to drive the car to the top of the hill without letting it fall over the sheer drop by applying discrete control signals u ∈ U = {−4, 4}, that is, “left acceleration” or “right acceleration”. Intuitively, it is clear that if we just applied
the signals at random, the probability of falling over the sheer drop would be far higher than that of reaching the top of the hill, because the hill is higher. We therefore expect the optimal strategy to be hard to estimate, owing to the low success rate in the training set. Indeed, the training set of 60 000 observations contained only 18 successes, in which the reward 1 was assigned because the car reached the top of the hill.
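The dynamics are described by the ODE in Eq. 5.7 on the support (p, s) ∈ [−1, 1] × [−3, 3]:

$$\frac{dp}{dt} = s, \qquad \frac{ds}{dt} = \frac{u}{m\,(1 + h'(p)^2)} - \frac{h'(p)\,g}{1 + h'(p)^2} - \frac{s^2\,h'(p)\,h''(p)}{1 + h'(p)^2} \qquad (5.7)$$

with the hill potential h(p)

$$h(p) = \begin{cases} p^2 + p & \text{if } p < 0, \\ \dfrac{p}{\sqrt{1 + 5p^2}} & \text{if } p \ge 0. \end{cases}$$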
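The right-hand side of Eq. 5.7 can be written down directly once the derivatives of the hill potential are known: h'(p) = 2p + 1 and h''(p) = 2 for p < 0, and h'(p) = (1 + 5p²)^(−3/2) and h''(p) = −15p (1 + 5p²)^(−5/2) for p ≥ 0. The Python sketch below encodes this. The values `M = 1.0` and `G = 9.81` are assumptions, since the mass and the gravitational constant are not stated in this section, and the explicit Euler integrator with `n_sub` substeps is an illustrative choice rather than the scheme used here.

```python
import numpy as np

M = 1.0    # assumption: car mass m (not given in this section)
G = 9.81   # assumption: gravitational acceleration g

def hill(p):
    """Return h(p), h'(p), h''(p) for the piecewise hill potential."""
    if p < 0.0:
        return p**2 + p, 2.0 * p + 1.0, 2.0
    w = 1.0 + 5.0 * p**2
    return p / np.sqrt(w), w**-1.5, -15.0 * p * w**-2.5

def derivatives(p, s, u):
    """Right-hand side of Eq. 5.7: returns (dp/dt, ds/dt)."""
    _, h1, h2 = hill(p)
    denom = 1.0 + h1**2
    dp = s
    ds = u / (M * denom) - h1 * G / denom - s**2 * h1 * h2 / denom
    return dp, ds

def step(p, s, u, dt=0.1, n_sub=100):
    """Advance (p, s) by dt with constant control u (explicit Euler substeps)."""
    h = dt / n_sub
    for _ in range(n_sub):
        dp, ds = derivatives(p, s, u)
        p, s = p + h * dp, s + h * ds
    return p, s
```

At the bottom of the dip, p = −0.5, the slope h'(p) vanishes, so `derivatives(-0.5, 0.0, 4.0)` reduces to the pure control term u/m. Repeated calls to `step` with controls from U = {−4, 4} generate sample trajectories of the kind used as training data.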
The instantaneous reward function is defined on the boundaries: it equals 1 at the hill top (p ≥ 1 and s ≤ 3), −1 at the sheer drop (p ≤ −1, any s), and zero everywhere else on the support (p, s) ∈ [−1, 1] × [−3, 3].
Figure 5.5: Classical control problem “Car on the Hill”.
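The reward definition above translates into a short function. This is a sketch that follows the wording of the text literally; the function name `reward` is hypothetical, and states with p ≥ 1 but s > 3, which the text does not assign a value to, are given reward 0 here as an assumption.

```python
def reward(p, s):
    """Instantaneous reward for the "Car on the Hill" state (p, s)."""
    if p >= 1.0 and s <= 3.0:
        return 1.0      # hill top reached
    if p <= -1.0:
        return -1.0     # fell over the sheer drop, for any speed s
    return 0.0          # interior of the support (and unspecified cases)
```

In a rollout, a nonzero reward also marks the end of the episode, so a data-generation loop would stop as soon as `reward(p, s) != 0.0`.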
Figure 5.6 shows the algorithmic output for this control problem. The algorithmically optimized trajectory in Fig. 5.6 c) has an interesting feature: the optimized use of the applied signals combines the intuitively obvious acceleration towards the maximum point, u = 4, with the less intuitive counterforce, u = −4. Energetically, it is favorable to accelerate towards the other end in order to take advantage of the gravitational force (second term in Eq. 5.7). This is one of the useful characteristics of the approach: the estimated policy takes advantage of the system's inherent characteristics although the algorithmic input contains neither the equations of motion nor any explicit information on this force, but only sample trajectories.
This is also the reason why we want to apply the algorithm to biologically motivated examples: in many biological applications only limited information is directly accessible, and inherent characteristics are often unknown. The approach can exploit the inherent system dynamics using only time trajectories of the key variables and control signals.