
We will now conduct a series of experiments to empirically demonstrate the lemmas and theorems presented above. We will perform experiments in 3 different MDP environments.

The first MDP environment consists purely of an IRTC. The experiment will be repeated with IRTCs of different lengths. This will demonstrate that consecutive, continuous loops through an IRTC lead to all Q-values in the IRTC tending toward a common limit (Lemmas 1 and 2). The MDP models the sub-MDPs shown in Figure 3.1.
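The behaviour this first environment is meant to expose can be sketched in a few lines of Python. The sketch assumes that an IRTC of length n can be modelled as a cycle of n state-action pairs that all yield the same reward r and offer no competing action, so the max in the Q-learning update reduces to the next Q-value in the cycle. The function name `simulate_irtc`, the zero initial values, and the parameter values in the usage lines are illustrative choices, not taken from the experiments.

```python
def simulate_irtc(length, alpha, gamma, r, loops, q0=0.0):
    """Run `loops` complete passes through a cycle of `length` Q-values.

    Assumption: every transition in the cycle yields reward `r` and leads to
    the next entry, so each update is
        Q_i <- Q_i + alpha * (r + gamma * Q_next - Q_i).
    Returns the Q-values recorded after each complete loop.
    """
    q = [q0] * length
    history = []
    for _ in range(loops):
        for i in range(length):
            q_next = q[(i + 1) % length]  # wraps around to close the cycle
            q[i] += alpha * (r + gamma * q_next - q[i])
        history.append(list(q))
    return history


if __name__ == "__main__":
    # gamma < 1: all Q-values should approach r / (1 - gamma).
    final = simulate_irtc(length=3, alpha=0.5, gamma=0.9, r=2.0, loops=2000)[-1]
    print(final, 2.0 / (1 - 0.9))
    # gamma = 1 with negative r: the Q-value keeps decreasing.
    print(simulate_irtc(length=1, alpha=0.5, gamma=1.0, r=-1.0, loops=5))
```

Under these assumptions, every entry of the cycle ends up at the same value when γ < 1, while with γ = 1 and r ≠ 0 the values drift without bound, which is the qualitative behaviour the first experiment is designed to show.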

The second MDP environment consists of 4 states, two of which are goal states. The agent begins in either state s1 or s2, and the episode ends when the agent arrives in state s3 or s4. There are 2 actions available, a1 and a2. The reward for reaching state s4 is 1, the reward for reaching state s3 is -1, and all other transitions yield a reward of 0. Figure 3.2a shows the MDP. The probability of transitioning from a non-goal state to a goal state is 0.5; the probability of transitioning between two non-goal states is 1. The second MDP is designed in this way to demonstrate that even after an agent has learnt the optimal policy, the interaction with IRTCs can cause the agent’s policy to temporarily diverge. Moreover, it is only with further exploration that the temporary divergence ceases.

The third MDP environment consists of 5 states, one of which is a goal state. The agent begins in state s1 and the episode ends when the agent arrives in state s5. There are two actions available in all states except s4, where there are 3 actions. The reward for arriving in s5 is 1; all other transitions yield a reward of 0. Figure 3.2b shows the MDP. The third MDP is designed to demonstrate that another goal state is not required for temporary divergence from the optimal policy to occur.

Figure 3.2: MDP environments to demonstrate theory. (a) Second MDP. (b) Third MDP.

3.3.3.1 Experiments & Results

In our first environment, an MDP that consists purely of an IRTC, we have set up the results to show, for each length of IRTC, an example of the Q-values tending toward each of the three limits described in Lemma 2. We randomly chose the values of r, α, and γ for each experiment to demonstrate that the theory does not depend on particular parameter values. However, α and γ were kept within their bounds, |r| ≤ 100, and in the cases where the limit is ±∞, γ was set to 1.

Figure 3.3: Results from an MDP consisting purely of a single IRTC of length 1. Transition Length = 1, α = 0.899628, γ = 0.63834, r = −47.018

The graphs in Figures 3.3 through 3.11 show the Q-values in the IRTC over the number of complete loops through the IRTC. The x-axis represents the number of updates the Q-values have had and the y-axis represents the value of the Q-value. Furthermore, Qn is the nth Q-value in the IRTC. We find that the graphs in these figures show precisely what the theory predicted. Lemmas 1 and 2 state that each Q-value in an IRTC has the same recurrence relation and that their limits are common. Lemma 3 states that the Q-value with the greatest value can change.
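For an IRTC of length 1 the common limit can be worked out by hand, which gives a quick sanity check on plots such as Figures 3.3 and 3.4. Assuming the single Q-value is updated as Q ← Q + α(r + γQ − Q) (a self-loop with no competing action, which is our reading of the setup rather than a restatement of Lemma 2), the fixed point is r / (1 − γ) when γ < 1. When γ = 1 each update simply adds αr, so the value drifts toward +∞ or −∞ according to the sign of r. The helper name `predicted_limit` is hypothetical.

```python
import math


def predicted_limit(gamma, r):
    """Limit of Q <- Q + alpha * (r + gamma * Q - Q) for 0 < alpha <= 1.

    Assumption: length-1 cycle with a single action and 0 <= gamma <= 1.
    The limit does not depend on alpha under this assumption.
    """
    if gamma < 1.0:
        return r / (1.0 - gamma)        # fixed point of the recurrence
    if r == 0.0:
        return None                     # gamma = 1, r = 0: Q stays at its initial value
    return math.copysign(math.inf, r)   # gamma = 1: unbounded drift in the direction of r


# Parameters quoted in the captions of Figures 3.3 and 3.4.
print(predicted_limit(gamma=0.63834, r=-47.018))   # roughly -130
print(predicted_limit(gamma=1.0, r=-44.6894))      # -inf
```

If the length-1 assumption holds, the curve in Figure 3.3 should level off near −130 and the curve in Figure 3.4 should keep decreasing.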

In our second and third MDP environments we set up the parameters as stated below the graphs (see Figures 3.14 and 3.15). These results show that even if looping through an IRTC is not continuous, the policy can change if the looping occurs often enough. Furthermore, the Q-values never reach the limit, in this case 0, but the greedy policy does change. This empirically demonstrates Lemma 3. Moreover, the comparison between α = 0.2 and α = 0.4 strengthens Lemma 4, since the change in α coincides with a change in the frequency at which the policy changes. The graphs in Figures 3.14 and 3.15 show the Q-values changing over time. The x-axis shows the total number of updates that have occurred and the y-axis shows the Q-value at that point.

Figure 3.4: Results from an MDP consisting purely of a single IRTC of length 1. Transition Length = 1, α = 0.83881, γ = 1, r = −44.6894

In Figure 3.14 we observe all Q-values in the environment. We observe that the greedy policy changes when the value of Q(s2, a1) drops below Q(s2, a2). At the times this occurs, the agent’s currently held optimal policy is to never perform a1 in s2 and therefore never receive the reward of 1. Furthermore, in the case where α = 0.4, we observe that around update 3300 Q(s1, a2) > Q(s1, a1) at the same time as Q(s2, a2) > Q(s2, a1); at this point the agent’s currently held optimal policy is to move to s3, the most sub-optimal policy. Therefore, these results empirically demonstrate our above theory.

In Figure 3.15 we observe the 3 Q-values available in state s4. We observe that for the most part Q(s4, a3) has the highest value. This matches the correct optimal policy of the environment. However, on occasion we observe that its Q-value drops below one of the other Q-values. At these points the agent’s currently held optimal policy is not to attempt to move to state s5, where the reward is 1, but rather to move to state s2 or s3. Therefore, once again these results empirically demonstrate that the policy can change before convergence has occurred in MDPs that have IRTCs.
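One way to locate the moments at which the greedy policy changes, as discussed for states s2 and s4 above, is to scan a recorded Q-value trace for updates where the arg-max action in a state switches. The sketch below assumes the trace is available as a list of per-update dictionaries mapping action names to Q-values for a single state. This data layout and the name `greedy_switches` are hypothetical, and the numbers in the usage example are made up for illustration rather than taken from the experiments.

```python
def greedy_switches(trace):
    """Return (update_index, old_action, new_action) for each change of greedy action.

    `trace` is a list of dicts, one per update, mapping action -> Q-value
    for a single state (every dict is assumed to hold the same actions).
    Ties keep the previously greedy action.
    """
    switches = []
    greedy = None
    for step, q_values in enumerate(trace):
        best = max(q_values, key=q_values.get)
        if greedy is None:
            greedy = best                      # first update fixes the initial greedy action
        elif q_values[best] > q_values[greedy]:
            switches.append((step, greedy, best))
            greedy = best
    return switches


# Illustrative values only: a3 is usually greedy but briefly drops below a1.
trace = [
    {"a1": 0.10, "a2": 0.05, "a3": 0.40},
    {"a1": 0.12, "a2": 0.05, "a3": 0.11},
    {"a1": 0.12, "a2": 0.05, "a3": 0.30},
]
print(greedy_switches(trace))  # [(1, 'a3', 'a1'), (2, 'a1', 'a3')]
```

Counting the returned switches for traces recorded at different values of α would give a simple numerical handle on how often the policy changes.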