
Análisis DOFA ALUSUD Colombia

In document Follow this and additional works at: (página 85-101)


3.3.1 Análisis DOFA ALUSUD Colombia

In this section I assume that the agent is uncertain only about the reward function, i.e., about the user's preferences for how it should act. To learn more about these preferences, the agent can ask action queries. An action query asks the user what action she would take in some state s, which corresponds to asking for the optimal action for state s given knowledge of the underlying reward function ω. Note that independent Dirichlet priors cannot account for the correlations introduced among state-rewards when incorporating the response to an action query (no such correlations are introduced for the reward and transition queries studied in the previous section), and hence are not a viable choice for parameterizing the agent's uncertainty when it considers action queries. Instead, I assume that the set of possible reward functions is finite, so that the agent's uncertainty ψ is expressed as an arbitrary discrete probability distribution (i.e., a categorical distribution) over a finite set Ω of reward functions.

Let $Q_A$ denote the set of all action queries (which is implicitly a function of the state space of the underlying MDP, which is known to the agent). Recall that EVOI is specified as follows:

$$\mathrm{EVOI}(q, \psi, s_c) = \mathbb{E}_{j \sim q;\psi}\left[ V_{\psi \mid q=j}(s_c) \right] - V_{\psi}(s_c). \tag{3.4}$$

For the case considered in this section, where the agent has only reward uncertainty, a drastic simplification applies. As described in Section 3.2.2, the mean-MDP method can be used to compute Bayes-optimal values exactly when only the reward is uncertain. Specifically, the expected value of a policy over a reward distribution is its value for the single mean-reward function, denoted $\bar{\psi}$, which implies that the Bayes-optimal policy for ψ is the optimal policy for $\bar{\psi}$. (See the proof of Theorem 3 in Ramachandran and Amir 2007.) Thus, Equation 3.4 can be simplified as

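$$\mathrm{EVOI}(q, \psi, s_c) = \mathbb{E}_{j \sim q;\psi}\left[ V_{\overline{\psi \mid q=j}}(s_c) \right] - V_{\bar{\psi}}(s_c),$$

where $\overline{\psi \mid q=j}$ denotes the mean-reward function for the posterior distribution of ψ induced upon incorporating response j to q, or more formally, $\overline{\psi \mid q=j} = \mathbb{E}_{\omega \sim \psi \mid q=j}[\omega]$.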
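The step from a reward distribution to its mean can be spelled out as a short derivation. The following is a sketch under assumptions not stated explicitly in this section: a discounted infinite-horizon MDP with discount factor $\gamma$, bounded state-based rewards, and a fixed policy $\pi$ whose value under reward function $\omega$ is written $V^{\pi}_{\omega}$ (notation introduced here for illustration). Because the transition model is known, the distribution over trajectories does not depend on $\omega$, so the two expectations can be exchanged:

```latex
\begin{align*}
\mathbb{E}_{\omega \sim \psi}\left[ V^{\pi}_{\omega}(s) \right]
  &= \mathbb{E}_{\omega \sim \psi}\left[ \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t}\, \omega(s_t) \,\middle|\, \pi,\ s_0 = s \right] \right] \\
  &= \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t}\, \mathbb{E}_{\omega \sim \psi}\left[ \omega(s_t) \right] \,\middle|\, \pi,\ s_0 = s \right] \\
  &= \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t}\, \bar{\psi}(s_t) \,\middle|\, \pi,\ s_0 = s \right]
   = V^{\pi}_{\bar{\psi}}(s).
\end{align*}
```

Since this holds for every policy, a policy that maximizes value under the mean reward also maximizes expected value under the distribution.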

Updating ψ to incorporate the response j to an action query q, yielding ψ|q = j, is not as trivial as it is for the reward and transition queries considered in the previous section, but fortunately this problem is closely connected to related work. Namely, it is exactly the problem solved in Bayesian Inverse Reinforcement Learning (BIRL) (Ramachandran and Amir, 2007), and like Lopes et al. (2009) I use BIRL to perform Bayes updates over the reward space in this section.

In BIRL, the starting assumption is a noisy model of the user's action selection. For the MDP given by reward function ω, there is an associated action-value function Q such that Q(s, a, ω) is the expected value obtained when the start state is s, the first action is a, and the optimal policy is followed thereafter. The user is assumed to respond with action j to an action query q for state s with probability $\Pr(q = j \mid \omega) = \frac{1}{Z_\omega} e^{\alpha Q(s, j, \omega)}$, where $Z_\omega$ is the normalization term and α is a noise (or confidence) parameter such that the larger α is, the more confident the agent is that the response received is indeed optimal for the user. Setting α lower can help in situations in which the user's responses are noisy, or inconsistent with respect to all rewards in the reward space. Given a response j to query q about state s, and a current distribution ψ over rewards, the posterior distribution ψ|q = j over rewards is defined by the Bayes update

$$\psi(\omega \mid q = j) = \frac{1}{Z} \Pr(q = j \mid \omega)\, \psi(\omega), \tag{3.5}$$

where again Z is the appropriate normalization term. The finiteness of the reward set allows the agent to tractably update ψ exactly according to Equation 3.5 upon receiving the response to an action query, since the agent can precompute (or cache) value functions for each reward parameter, resulting in substantial computational savings when updating the reward distribution (provided the set of reward functions is small). Although it can be more natural from the agent designer's perspective to use continuous reward spaces to account for many possible policies, performing BIRL in continuous reward spaces comes at a significant computational cost. (See Lopes et al. (2009) and Ramachandran and Amir (2007) for Monte Carlo methods for approximating BIRL in continuous reward spaces.)
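The update in Equation 3.5 over a finite reward set can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation used in this work: the function names, the list-based representation of ψ, and the toy Q-values are all assumptions made for the example. The cached action values for the queried state are passed in as `q_table`, one row per reward function.

```python
import math

def response_likelihood(q_values, alpha):
    """Pr(q = j | omega) for every action j: softmax of alpha * Q(s, j, omega)."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp(alpha * (q - m)) for q in q_values]
    z_omega = sum(weights)  # normalization term Z_omega
    return [w / z_omega for w in weights]

def bayes_update(prior, q_table, j, alpha):
    """Posterior psi(omega | q = j) following Equation 3.5.

    prior:   psi(omega_k), one entry per reward function in Omega
    q_table: q_table[k][a] = cached Q(s, a, omega_k) for the queried state s
    j:       index of the action the user responded with
    """
    unnormalized = [p * response_likelihood(row, alpha)[j]
                    for p, row in zip(prior, q_table)]
    z = sum(unnormalized)  # normalization term Z
    return [u / z for u in unnormalized]

# Hypothetical toy problem: two reward functions that prefer opposite actions.
prior = [0.5, 0.5]
q_table = [[1.0, 0.0],   # Q(s, ., omega_0): action 0 is better
           [0.0, 1.0]]   # Q(s, ., omega_1): action 1 is better
posterior = bayes_update(prior, q_table, j=0, alpha=2.0)
```

After the user answers with action 0, probability mass shifts toward the reward function under which action 0 is optimal; a smaller `alpha` would shift it less.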

Combining the above model for action query responses with the finiteness of the agent's action space, Equation 3.3.1 can be explicitly computed as a weighted sum over the |A| possible responses as follows:

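$$\begin{aligned}
\mathrm{EVOI}(q, \psi, s_c) &= \left[ \sum_{j=1}^{|A|} \Pr(q = j; \psi)\, V_{\overline{\psi \mid q=j}}(s_c) \right] - V_{\bar{\psi}}(s_c) \\
&= \left[ \sum_{j=1}^{|A|} V_{\overline{\psi \mid q=j}}(s_c) \sum_{\omega \in \Omega} \psi(\omega) \Pr(q = j \mid \omega) \right] - V_{\bar{\psi}}(s_c).
\end{aligned} \tag{3.6}$$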

Assuming that the value caching scheme above is used so that each Pr(q = j|ω) and each posterior ψ|q = j can be computed efficiently, the main computational bottleneck in using Equation 3.6 to approximate EVOI is the computation of the Bayes-optimal posterior values $V_{\overline{\psi \mid q=j}}(s_c)$. These are not always cached by the above scheme, since each mean reward $\overline{\psi \mid q=j}$ is a convex combination of members of Ω and thus not necessarily itself a member of Ω.

This completes the description of how the agent updates its reward function distribution after each query, and how it exploits the structure present in reward function distributions to simplify EVOI computations. I will refer to the algorithm that exhaustively computes the EVOI of every action query in this manner and then selects the best one as EMG-based Action Query Selection (EMG-AQS).
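The exhaustive procedure just described can be illustrated on a tiny MDP. The sketch below is an assumption-laden illustration rather than the implementation evaluated in this work: it assumes state-based rewards, a discount factor `gamma`, plain value iteration as the solver, and a hypothetical three-state MDP; all function and variable names are invented for the example. It evaluates Equation 3.6 for each candidate query state and then picks the query with the highest EVOI.

```python
import math

def q_values(T, R, gamma, iters=500):
    """Value iteration; returns Q[s][a] for state-reward vector R.
    T[s][a] is a list of (probability, next_state) pairs."""
    V = [0.0] * len(T)
    Q = []
    for _ in range(iters):
        Q = [[R[s] + gamma * sum(p * V[s2] for p, s2 in T[s][a])
              for a in range(len(T[s]))] for s in range(len(T))]
        V = [max(row) for row in Q]
    return Q

def softmax(xs, alpha):
    m = max(xs)
    w = [math.exp(alpha * (x - m)) for x in xs]
    z = sum(w)
    return [v / z for v in w]

def mean_reward(psi, rewards):
    """Mean-reward function: the psi-weighted average of the reward vectors."""
    return [sum(p * R[s] for p, R in zip(psi, rewards))
            for s in range(len(rewards[0]))]

def value_at(T, R, gamma, s):
    return max(q_values(T, R, gamma)[s])

def evoi(s_query, s_c, psi, rewards, T, gamma, alpha):
    """Equation 3.6 for the action query about state s_query."""
    cached_q = [q_values(T, R, gamma) for R in rewards]   # one Q table per omega
    lik = [softmax(Q[s_query], alpha) for Q in cached_q]  # lik[k][j] = Pr(q=j | omega_k)
    total = 0.0
    for j in range(len(T[s_query])):
        pr_j = sum(p * l[j] for p, l in zip(psi, lik))    # Pr(q = j; psi)
        if pr_j == 0.0:
            continue
        post = [p * l[j] / pr_j for p, l in zip(psi, lik)]  # Equation 3.5
        total += pr_j * value_at(T, mean_reward(post, rewards), gamma, s_c)
    return total - value_at(T, mean_reward(psi, rewards), gamma, s_c)

# Hypothetical MDP: from state 0 the agent moves to absorbing state 1 or 2.
T = [[[(1.0, 1)], [(1.0, 2)]],
     [[(1.0, 1)], [(1.0, 1)]],
     [[(1.0, 2)], [(1.0, 2)]]]
rewards = [[0.0, 1.0, 0.0],   # omega_0 rewards state 1
           [0.0, 0.0, 1.0]]   # omega_1 rewards state 2
psi = [0.5, 0.5]

# Exhaustive selection: score every action query, keep the best one.
best_query = max(range(len(T)),
                 key=lambda s: evoi(s, 0, psi, rewards, T, 0.9, 5.0))
```

In this toy problem only the query about state 0 is informative: in the absorbing states every action has the same value under every reward function, so the response likelihoods are uniform, the posterior equals the prior, and the EVOI is zero. Note that the posterior mean rewards are solved from scratch for each response, which is the bottleneck discussed above.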

Reward function parameterizations. Parameterizing the reward function allows for a compact representation of the reward function distribution, even when defined over an infinite state space. Consider a distribution ψ over a discrete parameter space Ω and a function φ that maps ω ∈ Ω to a reward function R. The mean reward function can always be safely used for computing policies and values that are optimal with respect to a reward distribution, but the reward function associated with the mean reward parameters can be safely used instead only when $\mathbb{E}_{\omega \sim \psi}[\phi(\omega)] = \phi(\mathbb{E}_{\omega \sim \psi}[\omega])$ (an extension of Theorem 3 in Ramachandran and Amir 2007). Due to linearity of expectation, this equality holds when φ(ω) is a linear function of ω, but not necessarily otherwise. Therefore I use the mean reward parameters when this equality holds, but otherwise use the mean reward function, calculated as $\mathbb{E}_{\omega \sim \psi}[\phi(\omega)] = \sum_{\omega \in \Omega} \phi(\omega)\, \psi(\omega)$. I explain in more detail how I make use of parameterized reward functions in the descriptions of the experiments that follow.
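The condition on φ can be checked numerically. The following sketch uses a hypothetical scalar parameter space and two made-up mappings (one linear, one quadratic), neither of which comes from the experiments in this work; it only illustrates when the mean of the mapped rewards coincides with the mapping of the mean parameters.

```python
def expectation(psi, values):
    """Expected value of `values` under the discrete distribution psi."""
    return sum(p * v for p, v in zip(psi, values))

# Hypothetical discrete parameter space with a uniform distribution over it.
omegas = [0.0, 2.0]
psi = [0.5, 0.5]

def phi_linear(w):
    return 3.0 * w   # linear in omega

def phi_quadratic(w):
    return w ** 2    # nonlinear in omega

mean_param = expectation(psi, omegas)

# Linear phi: the mean reward function equals phi of the mean parameters.
mean_of_linear = expectation(psi, [phi_linear(w) for w in omegas])
linear_of_mean = phi_linear(mean_param)

# Nonlinear phi: the two quantities differ, so the mean reward function
# must be computed as the psi-weighted sum of phi(omega).
mean_of_quadratic = expectation(psi, [phi_quadratic(w) for w in omegas])
quadratic_of_mean = phi_quadratic(mean_param)
```

Here the linear mapping gives the same number either way, while the quadratic mapping does not, which is why the mean reward parameters are a safe shortcut only in the linear case.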
