We now illustrate the general solution to the problem of constant creation in symbolic regression. Suppose we are given a sampling of the numerical values of the curve
2.718x² + 3.1416x
over 20 randomly chosen points in some domain, such as the interval [-1, +1].
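The sampling step can be sketched in Python as follows. This is only an illustration: the function names, the uniform sampling, and the seed are assumptions and do not come from the text, which specifies only the target curve, the number of points, and the interval.

```python
import random

def target(x):
    """Target curve named in the text: 2.718x^2 + 3.1416x."""
    return 2.718 * x**2 + 3.1416 * x

def make_fitness_cases(n=20, lo=-1.0, hi=1.0, seed=0):
    """Return n (x, y) pairs sampled from the target curve.

    Uniform sampling and the fixed seed are assumptions made so that
    the sketch is reproducible.
    """
    rng = random.Random(seed)
    xs = [rng.uniform(lo, hi) for _ in range(n)]
    return [(x, target(x)) for x in xs]

cases = make_fitness_cases()
```

Each pair in `cases` plays the role of one fitness case: the value of the independent variable together with the value the evolved expression is expected to reproduce.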
Because of the presence of the coefficients 2.718 and 3.1416 in the target expression above, it is unlikely that we could genetically discover an S-expression that closely fits the 20 sample points using only the techniques described for symbolic regression in section 7.3. Clearly, in order to do symbolic regression in general, we need the ability to create arbitrary floating-point constants to appear in the S-expressions produced by genetic programming.
The problem of constant creation can be solved by expanding the terminal set with one special new terminal, called the ephemeral random constant and denoted ℜ. Thus, the terminal set for a symbolic regression problem with one independent variable x is expanded to
T = {X, ℜ}.
Whenever the ephemeral random constant ℜ is chosen for any endpoint of the tree during the creation of the initial random population in generation 0, a random number of a specified data type in a specified range is generated and attached to the tree at that point.
For example, in the real-valued symbolic regression problem at hand, it would be natural for the ephemeral random constant to be of the floating-point type and to yield a number in some convenient range, say between -1.000 and +1.000. In a problem involving integers (e.g., induction of a sequence of integers), ℜ might yield a random integer over some convenient range (such as -5 to +5). In a problem involving modular numbers (say, 0, 1, 2, 3, and 4 for a problem involving modulo-5 numbers), the ephemeral random constant ℜ would yield a random modulo 5 integer. In a Boolean problem, the ephemeral random constant ℜ would necessarily yield one of the two Boolean constants, namely T (True) or NIL (False).
Note that this random generation is done anew each time an ephemeral terminal ℜ is encountered, so the initial random population contains a variety of different random constants. Once generated and inserted into an initial random S-expression, these constants remain fixed.
When we create floating-point random constants, we use a granularity of 0.001 in selecting floating-point numbers within the specified range. Figure 10.2 shows an initial random individual containing two random constants, +0.1297 and -0.3478.
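A minimal Python sketch of this mechanism is given below. The tuple-based tree representation, the function set (+, -, *), the depth limit, and the 0.3 probability of stopping early at a terminal are all assumptions chosen for illustration. The only parts taken from the text are the replacement of ℜ by a fresh random number at each endpoint, the range -1.000 to +1.000, and the granularity of 0.001.

```python
import random

EPHEMERAL = "R"  # placeholder standing in for the ephemeral random constant

def ephemeral_constant(rng, lo=-1.0, hi=1.0, digits=3):
    """Draw a float in [lo, hi] on a grid of 10**-digits (0.001 by default)."""
    scale = 10 ** digits
    k = rng.randint(0, round((hi - lo) * scale))  # grid index, inclusive
    return round(lo + k / scale, digits)

def random_tree(rng, depth, functions=("+", "-", "*"),
                terminals=("X", EPHEMERAL)):
    """Grow a random expression tree as nested tuples (op, left, right).

    Whenever the ephemeral terminal is chosen for an endpoint, it is
    replaced on the spot by a newly drawn constant, which then stays fixed.
    """
    if depth == 0 or rng.random() < 0.3:  # 0.3 is an arbitrary assumption
        terminal = rng.choice(terminals)
        if terminal == EPHEMERAL:
            return ephemeral_constant(rng)  # drawn anew at every endpoint
        return terminal
    op = rng.choice(functions)
    return (op,
            random_tree(rng, depth - 1, functions, terminals),
            random_tree(rng, depth - 1, functions, terminals))

tree = random_tree(random.Random(0), depth=3)
```

Because the placeholder is resolved at creation time, the finished tree contains only the variable X and concrete numbers; the symbol for the ephemeral terminal never appears in an individual.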
After the initial random generation, the numerous different random constants arising from the ephemeral ℜ terminals will then be moved around from tree to tree by the crossover operation. These random constants will become embedded in various subtrees, which then carry out various operations on them.
This moving around of the random constants is not at all haphazard; it is driven by the overall goal of achieving ever-higher fitness. For example, a symbolic expression that is a reasonably good fit to a target function may become a better fit if a particular constant is decreased slightly. A slight decrease can be achieved in several different ways. For example, there may be a multiplication by 0.90, a division by 1.11, a subtraction of 0.008, or an addition of -0.0004. If a decrease of precisely 0.09 in a particular constant would produce a perfect fit, a decrease of 0.07 is usually fitter than a decrease of only 0.05. The creation of the value π/2, after a long sequence of intermediate steps, as described in the previous section, is another example.

Figure 10.2 Initial random S-expression containing two random constants, +0.1297 and -0.3478.
Thus, the relentless pressure of the fitness function in the process of natural selection determines both the directions and the magnitudes of the adjustments in numerical constants.
In one run of the problem of symbolic regression for the target function 2.718x² + 3.1416x, the best-of-generation S-expression in generation 41 was
(+ (- (+ (* -0.50677 X)
         (+ (* -0.50677 X) (* -0.76526 X))))
   (* (+ 0.11737)
      (+ (- X (* -0.76526 X)) X))).
This best-of-run S-expression is equivalent to
2.76x² + 3.15x.
The numerical constants -0.50677, -0.76526, and +0.11737 appearing in the above S-expression were originally created at random for some individuals in generation 0. These constants survived to generation 41 because they were carried from generation to generation as part of some individual in the population. If the individual carrying a particular constant is selected to participate in crossover or reproduction more than once in a particular generation, the constant would then appear in an increasing number of individuals. If no individual carrying a particular constant is selected to participate in crossover or reproduction in a particular generation, that constant would disappear from the population. As previously mentioned, crossover can combine expressions containing one or more existing constants to create new constant values.

The run producing the above S-expression was terminated at generation 41 because the S-expression came within 0.01 of the value of the target function for all 20 randomly chosen values of the independent variable x in the domain [-1, +1]. That is, this individual scored 20 hits. Scoring 20 hits is one of the termination criteria for this problem (the other being that the run has reached the maximum specified generation number, i.e., 50).

Unlike the S-expression produced in section 7.3 for the symbolic regression problem involving the quartic polynomial x⁴ + x³ + x² + x, the S-expression above is not an exact solution to the problem. The coefficient 2.76 is near 2.718 and the coefficient 3.15 is near 3.1416, so this S-expression produces a value that is close to the given target expression for the 20 fitness cases.
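The hit count and the two termination criteria can be sketched in Python as follows. The function names, the seed, and the candidate expression are hypothetical; the text supplies only the tolerance of 0.01, the 20 fitness cases, and the maximum generation number of 50. The comparison is written as "less than or equal to", which is one reading of "within 0.01".

```python
import random

def target(x):
    """Target curve from the text: 2.718x^2 + 3.1416x."""
    return 2.718 * x**2 + 3.1416 * x

def count_hits(candidate, cases, tolerance=0.01):
    """Number of fitness cases on which candidate is within tolerance."""
    return sum(1 for x, y in cases if abs(candidate(x) - y) <= tolerance)

def should_terminate(hits, generation, n_cases=20, max_generation=50):
    """Stop on a full hit count or on reaching the last generation."""
    return hits == n_cases or generation >= max_generation

# 20 random fitness cases in [-1, +1]; the seed is an arbitrary assumption.
rng = random.Random(0)
cases = [(x, target(x)) for x in (rng.uniform(-1.0, 1.0) for _ in range(20))]

# A hypothetical candidate with slightly perturbed coefficients; it is
# not the S-expression reported in the text.
def candidate(x):
    return 2.72 * x**2 + 3.14 * x

hits = count_hits(candidate, cases)
```

For this hypothetical candidate the error is 0.002x² - 0.0016x, whose magnitude never exceeds 0.0036 on [-1, +1], so it scores a hit on every fitness case even though it is not an exact solution.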
The above genetically produced best-of-run S-expression is, with certainty, an approximately correct solution to the problem only for the particular 20 randomly chosen values of the independent variable x that were available to the genetic programming paradigm. If the best-of-run S-expression were a polynomial of order 19, we would wonder whether it was merely a polynomial that happened to pass through the particular 20 given x-y points. This particular suspicion does not arise here, since the best-of-run polynomial is only quadratic. However, the question remains as to how well this approximately correct quadratic expression discovered by genetic programming generalizes over the entire domain [-1, +1].

Figure 10.3 Performance curves for the symbolic regression problem with 2.718x² + 3.1416x as the target function.
We can begin to address this question concerning the generality of an S-expression discovered from only a limited number of fitness cases by retesting the S-expression against a much larger number of fitness cases. For example, when we retest this S-expression over 1,000 randomly chosen values of the independent variable x in the domain [-1, +1], we find that the S-expression returns a value that comes within 0.01 of the target function for all 1,000 of the new fitness cases. That is, this S-expression scores 1,000 hits on the retest. This success increases our confidence that the genetically produced S-expression is a good fit for the given target function over the entire domain [-1, +1].
Figure 10.3 presents the performance curves showing, by generation, the cumulative probability of success P(M, i) and the number of individuals that must be processed I(M, i, z) to guarantee, with 99% probability, that at least one S-expression comes within 0.01 of the target function for all 20 fitness cases for the symbolic regression problem with 2.718x² + 3.1416x as the target function. The graph is based on 100 runs and a population size of 500. The cumulative probability of success P(M, i) is 30% by generation 46 and 31% by generation 50. The numbers in the oval indicate that, if this problem is run through to generation 46, processing a total of 305,500 (i.e., 500 × 47 generations × 13 runs) individuals is sufficient to guarantee solution of this problem with 99% probability.
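The arithmetic behind the figure of 305,500 can be checked with a short Python sketch. It assumes that runs are independent, so that the number of runs needed to reach success probability z is the smallest R with 1 - (1 - P)^R ≥ z; the function names are hypothetical.

```python
import math

def runs_required(p_success, z=0.99):
    """Smallest number of independent runs R with 1 - (1 - P)^R >= z."""
    if p_success >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - z) / math.log(1.0 - p_success))

def individuals_to_process(pop_size, generation, p_success, z=0.99):
    """I(M, i, z) = M * (i + 1) * R, counting generation 0 as a generation."""
    return pop_size * (generation + 1) * runs_required(p_success, z)

runs = runs_required(0.30)                        # 13 runs
total = individuals_to_process(500, 46, 0.30)     # 500 * 47 * 13 = 305,500
```

With P(M, i) = 30% at generation 46, the logarithm ratio is about 12.9, which rounds up to 13 runs; multiplying by the population size of 500 and the 47 generations (0 through 46) gives the 305,500 individuals quoted above.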