
DTGOLOG was proposed by Boutilier et al. (2000). It extends GOLOG with decision-theoretic planning. Formally, the domain axiomatization together with an optimization theory specifies a fully observable finite-horizon MDP $M = \langle S, A, T, R \rangle$, where $S$ is a finite set of states, $A$ is a finite set of actions, $T$ is a transition model, and $R$ is a real-valued reward function. The set of states of the MDP is implicitly given by the situation terms of the situation calculus, the action set is defined by the domain axiomatization, and the transition model is implicitly defined via the successor state axioms. Additionally, a reward function must be specified.

3.3. REASONING ABOUT ACTION AND CHANGE 57

DTGOLOG then works as follows. It takes a GOLOG program as input and interprets it with an evaluation semantics as given below. Decision-theoretic planning is modeled by nondeterministic choices of actions. At a choice point, DTGOLOG evaluates all possible successor branches according to the optimization theory and inserts the best one into the policy. The policy is a (conditional) GOLOG program in which all but the best (nondeterministic) agent choices are optimized away. DTGOLOG implements a forward-search value iteration algorithm (cf. Section 3.1.1). The great advantage of DTGOLOG over ordinary value iteration is that it does not rely on an explicit state enumeration. The MDP is induced by the action theory of GOLOG, and the reachable states are induced by the successor state axioms. In theory, this allows it to solve MDPs with infinite (continuous) state spaces, as only the reachable states are selected and iterated over.
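The idea of forward search without state enumeration can be illustrated with a minimal Python sketch. It is not DTGOLOG itself: the toy domain (integer states, the action names `inc` and `double`, and the functions `successors` and `reward`) is purely hypothetical. The state space is unbounded, yet only the states reachable within the horizon are ever expanded.

```python
from functools import lru_cache

ACTIONS = ("inc", "double")  # hypothetical action names, not from the source


def successors(state, action):
    """Return (probability, next_state) pairs; states are unbounded integers."""
    if action == "inc":
        return [(1.0, state + 1)]
    # "double" is stochastic: it succeeds with probability 0.5, else nothing changes
    return [(0.5, state * 2), (0.5, state)]


def reward(state):
    """Hypothetical reward: the numeric value of the state."""
    return float(state)


@lru_cache(maxsize=None)
def best(state, horizon):
    """Return (value, best_action), expanding only states reachable from `state`."""
    if horizon == 0:
        return reward(state), None
    candidates = []
    for action in ACTIONS:
        expected = sum(
            p * best(nxt, horizon - 1)[0] for p, nxt in successors(state, action)
        )
        candidates.append((reward(state) + expected, action))
    return max(candidates)
```

With horizon 1, `best(1, 1)` prefers `inc` and `best(3, 1)` prefers `double`, although no table over all integers is ever built.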

DTGOLOG introduces a notion of stochastic actions, which Reiter's basic action theories do not provide. At first sight, it seems that a new sort for stochastic actions has to be introduced. However, the character of a stochastic action is such that the agent performs the respective action and nature chooses among the possible outcomes of this action with a certain probability. These outcomes can be regarded as deterministic actions, and therefore the basic action theories are extended only by a new predicate $\mathit{choice}(A, n, s)$, where $A$ is a stochastic action and $n$ is one of the outcomes of action $A$ in situation $s$ which nature can choose from. Assuming a finite number $m$ of possible outcomes $N_1, \dots, N_m$ for the stochastic action $A$ under the condition that certain formulas $\varphi(s)$ hold, we can define

\[
\begin{aligned}
\mathit{choice}(A, a, s) \stackrel{\mathrm{def}}{=}{}
 & \bigl(\varphi_1(s) \supset (a = N_{1,1} \lor \dots \lor a = N_{m,1})\bigr) \land {} \\
 & \qquad \vdots \\
 & \bigl(\varphi_k(s) \supset (a = N_{1,k} \lor \dots \lor a = N_{m,k})\bigr),
\end{aligned}
\]

with $\varphi_1(s), \dots, \varphi_k(s)$ a set of mutually disjoint logical conditions which are situation calculus formulas such that $\varphi_1(s) \lor \dots \lor \varphi_k(s)$ is true for any $s$. Further, we have to model the probability with which nature chooses a certain outcome. Let $\mathit{prob}(n, a, s)$ denote this probability. For simplicity of notation, assume that the outcomes for action $A$ remain the same in each situation, i.e. $\mathit{choice}(A) \stackrel{\mathrm{def}}{=} \{N_1, \dots, N_m\}$. We add the following sentences to the domain axiomatization:

\[
\mathit{prob}(N_1, A, s) = p_1, \;\dots,\; \mathit{prob}(N_m, A, s) = p_m,
\]

where $p_1, \dots, p_m$ are probabilities summing up to 1. If an outcome $N_i$ is not possible in situation $s$, the probability $\mathit{prob}(N_i, A, s)$ must be zero, as this outcome cannot occur according to the background theory. Therefore the following must hold:

\[
\bigl(\mathit{Poss}(N_1, s) \lor \dots \lor \mathit{Poss}(N_m, s)\bigr) \supset \bigl(\mathit{Poss}(N_i, s) \equiv \mathit{prob}(N_i, A, s) > 0\bigr), \quad i \in \{1, \dots, m\},
\]

\[
\bigl(\mathit{Poss}(N_1, s) \lor \dots \lor \mathit{Poss}(N_m, s)\bigr) \supset \sum_{i=1}^{m} \mathit{prob}(N_i, A, s) = 1.
\]

To satisfy the assumption of full observability for MDPs, one has to extend the action theory by formulas $\mathit{senseCond}(n, \varphi)$, which define how the different outcomes of a stochastic action can be discriminated. To sense the state, i.e. to evaluate the condition $\varphi$ in order to determine which outcome has occurred, the sensing action senseEffect is introduced. The axiomatizer of the domain has to take care that $\varphi$ discriminates the different outcomes.

In the following we give the semantics of DTGOLOG. Similarly to the evaluation semantics of GOLOG, it is defined by abbreviations of situation calculus formulas.

1. Zero horizon

This is a termination condition for the recursive “calls” of BestDo. As DTGOLOG implements a solution algorithm for finite-horizon MDPs, the search for the optimal policy terminates when the remaining horizon reaches zero.

\[
\mathit{BestDo}(p, s, h, \pi, v, pr) \stackrel{\mathrm{def}}{=}
h = 0 \land \pi = \mathit{Nil} \land v = \mathit{reward}(s) \land pr = 1.
\]

2. The null program

If the input program from which the policy is calculated is the Nil program, the recursion terminates.

\[
\mathit{BestDo}(\mathit{Nil}, s, h, \pi, v, pr) \stackrel{\mathrm{def}}{=}
\pi = \mathit{Nil} \land v = \mathit{reward}(s) \land pr = 1.
\]

3. Deterministic Action

As in GOLOG, it is checked whether a primitive action is possible. If the action is not possible, the policy $\pi$ is terminated with a Stop action, the probability of success $pr$ is set to zero, and the value of the policy equals the reward in the current state.⁶ If the action is possible, the policy for the remaining program is calculated. The resulting policy is then the primitive action in sequence with the policy for the remaining program, and the value is the reward in the current situation plus the value of the remaining policy.

\[
\begin{aligned}
\mathit{BestDo}(a; p, s, h, \pi, v, pr) \stackrel{\mathrm{def}}{=}{}
 & \neg \mathit{Poss}(a, s) \land \pi = \mathit{Stop} \land pr = 0 \land v = \mathit{reward}(s) \\
 & \lor \mathit{Poss}(a, s) \land \exists \pi', v', pr'.\; \mathit{BestDo}(p, \mathit{do}(a, s), h - 1, \pi', v', pr') \\
 & \quad \land \pi = a; \pi' \land v = \mathit{reward}(s) + v' \land pr = pr'
\end{aligned}
\]

4. Stochastic action

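In the case of a stochastic action, the predicate BestDoAux is expanded with the set of all outcomes of this stochastic action. Following Soutchanski (2003), we use $\mathit{choice}'(a) \stackrel{\mathrm{def}}{=} \{n_1, \dots, n_k\}$ as an abbreviation for the outcomes of the stochastic action $a$.

\[
\begin{aligned}
\mathit{BestDo}(a; p, s, h, \pi, v, pr) \stackrel{\mathrm{def}}{=}{}
 & \exists \pi', v'.\; \mathit{BestDoAux}(\mathit{choice}'(a), a, p, s, h, \pi', v', pr) \land {} \\
 & \pi = a; \mathit{senseEffect}(a); \pi' \land v = \mathit{reward}(s) + v'
\end{aligned}
\]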


The resulting policy is $a; \mathit{senseEffect}(a); \pi'$. The pseudo action senseEffect is introduced to fulfill the requirement of full observability. The remainder policy $\pi'$ branches over the possible outcomes, and the agent must be able to sense the state it is in after having executed this action. The remainder policy is evaluated using the predicate BestDoAux. For the base case of a single outcome, BestDoAux is defined as

\[
\begin{aligned}
\mathit{BestDoAux}(\{n_k\}, a, p, s, h, \pi, v, pr) \stackrel{\mathrm{def}}{=}{}
 & \neg \mathit{Poss}(n_k, s) \land \pi = \mathit{Stop} \land v = 0 \land pr = 0 \\
 & \lor \mathit{Poss}(n_k, s) \land \mathit{senseCond}(n_k, \varphi_k) \\
 & \quad \land \exists \pi', v', pr'.\; \mathit{BestDo}(p, \mathit{do}(n_k, s), h - 1, \pi', v', pr') \\
 & \quad \land \pi = \varphi_k?; \pi' \land v = v' \cdot \mathit{prob}(n_k, a, s) \land pr = pr' \cdot \mathit{prob}(n_k, a, s)
\end{aligned}
\]

If the outcome action is not possible, the Stop action is inserted into the policy and no further calculations are conducted. Otherwise, if the current outcome action is possible, the remainder policy $\pi'$ for the remaining program is calculated. The policy $\pi$ consists of a test action on the condition $\varphi_k$ from the senseCond predicate with the remainder policy $\pi'$ attached. The case of more than one remaining outcome action is defined as

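\[
\begin{aligned}
\mathit{BestDoAux}(\{n_1, \dots, n_k\}, a, p, s, h, \pi, v, pr) \stackrel{\mathrm{def}}{=}{}
 & \neg \mathit{Poss}(n_1, s) \land \mathit{BestDoAux}(\{n_2, \dots, n_k\}, a, p, s, h, \pi, v, pr) \\
 & \lor \mathit{Poss}(n_1, s) \land \exists \pi', v', pr'.\; \mathit{BestDoAux}(\{n_2, \dots, n_k\}, a, p, s, h, \pi', v', pr') \\
 & \quad \land \exists \pi_1, v_1, pr_1.\; \mathit{BestDo}(p, \mathit{do}(n_1, s), h - 1, \pi_1, v_1, pr_1) \land \mathit{senseCond}(n_1, \varphi_1) \\
 & \quad \land \pi = \mathbf{if}\ \varphi_1\ \mathbf{then}\ \pi_1\ \mathbf{else}\ \pi'\ \mathbf{endif} \\
 & \quad \land v = v' + v_1 \cdot \mathit{prob}(n_1, a, s) \land pr = pr' + pr_1 \cdot \mathit{prob}(n_1, a, s)
\end{aligned}
\]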

The difference from the previous BestDoAux predicate is that the other outcomes are interpreted recursively, and that the resulting policy now consists of a conditional instead of a test action. The value is the value of the remaining policy which has $n_1$ as prefix, weighted by the probability of occurrence of $n_1$, plus the value gathered from the other possible outcomes. The probability of success $pr$ is calculated similarly.
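The way BestDoAux accumulates value and success probability over the outcomes can be sketched in Python. This is only an illustration under simplifying assumptions: each outcome is given as a hypothetical tuple that already contains the value and success probability of its remainder policy, and the recursion bottoms out at an empty list rather than at a single outcome. Both variants yield the same numbers, since an impossible outcome contributes nothing.

```python
def best_do_aux(outcomes):
    """Accumulate (value, success_probability) over the outcomes of one action.

    `outcomes` is a list of (possible, probability, value, success_probability)
    tuples; this encoding is an assumption made for illustration only.
    """
    if not outcomes:
        return 0.0, 0.0
    (possible, prob, v1, pr1), rest = outcomes[0], outcomes[1:]
    v_rest, pr_rest = best_do_aux(rest)
    if not possible:
        # An impossible outcome is skipped; only the remaining outcomes count.
        return v_rest, pr_rest
    # Weight this outcome by its probability and add the remaining outcomes.
    return v_rest + v1 * prob, pr_rest + pr1 * prob
```

For two equally likely outcomes with remainder values 10 and 4, of which only the first one leads to a successful execution, the result is the expected value 7 and the success probability 0.5.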

5. Test Action

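A test action is handled as in GOLOG, except that $\pi$, $v$, and $pr$ have to be instantiated appropriately. As for a deterministic action, Stop is inserted if the test condition does not hold, and the calculation of the policy is terminated.

\[
\begin{aligned}
\mathit{BestDo}(\varphi?; p, s, h, \pi, v, pr) \stackrel{\mathrm{def}}{=}{}
 & \varphi[s] \land \mathit{BestDo}(p, s, h, \pi, v, pr) \\
 & \lor \neg\varphi[s] \land \pi = \mathit{Stop} \land pr = 0 \land v = \mathit{reward}(s)
\end{aligned}
\]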

6. Nondeterministic Choice of Actions

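Nondeterministic choices of actions allow for decision-theoretic planning. For both choices the continuation policies $\pi_1$ and $\pi_2$ are calculated. A multi-criteria analysis over the values and the probabilities of success is then performed, and the optimal policy is returned.

\[
\begin{aligned}
\mathit{BestDo}((p_1 \mid p_2); p, s, h, \pi, v, pr) \stackrel{\mathrm{def}}{=}{}
 & \exists \pi_1, v_1, pr_1.\; \mathit{BestDo}(p_1; p, s, h, \pi_1, v_1, pr_1) \land {} \\
 & \exists \pi_2, v_2, pr_2.\; \mathit{BestDo}(p_2; p, s, h, \pi_2, v_2, pr_2) \land {} \\
 & \bigl((v_1, pr_1) \geq (v_2, pr_2) \land \pi = \pi_1 \land pr = pr_1 \land v = v_1 \\
 & \;\lor (v_1, pr_1) < (v_2, pr_2) \land \pi = \pi_2 \land pr = pr_2 \land v = v_2\bigr)
\end{aligned}
\]

7. Conditional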

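As in GOLOG, a conditional is an abbreviation for a choice between two test actions, each followed by the respective branch of the conditional.

\[
\begin{aligned}
 & \mathit{BestDo}((\mathbf{if}\ \varphi\ \mathbf{then}\ p_1\ \mathbf{else}\ p_2\ \mathbf{endif}); p, s, h, \pi, v, pr) \stackrel{\mathrm{def}}{=} \\
 & \qquad \mathit{BestDo}(((\varphi?; p_1) \mid (\neg\varphi?; p_2)); p, s, h, \pi, v, pr)
\end{aligned}
\]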

8. Nondeterministic Finite Choice of Arguments

DTGOLOG allows for optimal choices over action arguments. In the program $p$, all free occurrences of the variable $x$ are substituted by the elements of $\tau$. The domain $\tau = \{c_1, \dots, c_n\}$ must be finite. An optimization is initiated for each substitution, and the best argument is chosen for the policy. We also refer to this statement as pickBest, an optimizing version of the GOLOG “pick” construct.⁷

\[
\mathit{BestDo}((\pi(x : \tau)\, p); p', s, h, \pi, v, pr) \stackrel{\mathrm{def}}{=}
\mathit{BestDo}((p|^{x}_{c_1} \mid \dots \mid p|^{x}_{c_n}); p', s, h, \pi, v, pr)
\]

9. Sequential Composition

Sequential composition is the same as in GOLOG.

\[
\mathit{BestDo}((p_1; p_2); p_3, s, h, \pi, v, pr) \stackrel{\mathrm{def}}{=}
\mathit{BestDo}(p_1; (p_2; p_3), s, h, \pi, v, pr).
\]

Loops and procedures are not given a formal semantics in Soutchanski (2003). He remarks that these constructs require a second-order definition and refers to the implementation of his DTGOLOG interpreter (Soutchanski 2003, Appendix C.1).
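To make the recursive structure of the deterministic cases more tangible, the following Python sketch evaluates a toy program over integer states. It is a simplified illustration, not the DTGOLOG interpreter: programs are encoded as flat tuples of steps (so sequential composition is implicit), the actions `inc` and `dec` with their preconditions are hypothetical, stochastic actions and pickBest are omitted, and the pairs of value and success probability are assumed to be compared lexicographically.

```python
def reward(s):
    """Hypothetical reward: the numeric value of the state."""
    return float(s)


# Hypothetical deterministic actions: name -> (precondition, effect)
ACTIONS = {
    "inc": (lambda s: True, lambda s: s + 1),
    "dec": (lambda s: s > 0, lambda s: s - 1),
}


def best_do(program, s, h):
    """Return (policy, value, success_probability) for a tuple of steps."""
    if h == 0 or not program:  # zero horizon and null program
        return ("nil",), reward(s), 1.0
    step, rest = program[0], program[1:]
    kind = step[0]
    if kind == "act":  # deterministic action
        poss, effect = ACTIONS[step[1]]
        if not poss(s):
            return ("stop",), reward(s), 0.0
        policy, v, pr = best_do(rest, effect(s), h - 1)
        return (step[1],) + policy, reward(s) + v, pr
    if kind == "test":  # test action
        if not step[1](s):
            return ("stop",), reward(s), 0.0
        return best_do(rest, s, h)
    if kind == "choice":  # nondeterministic choice of actions
        left = best_do(step[1] + rest, s, h)
        right = best_do(step[2] + rest, s, h)
        # Assumption: (value, success probability) is compared lexicographically.
        return left if (left[1], left[2]) >= (right[1], right[2]) else right
    raise ValueError(f"unknown step: {kind}")
```

For the program that first chooses between `inc` and `dec` and then executes `inc`, started in state 1 with horizon 2, the sketch selects the `inc` branch, since it collects the higher accumulated reward.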

⁷ In GOLOG the “pick” statement is abbreviated with $\pi$. This should not be confused with the variable $\pi$ for policies.
