In Section 6.4 above, we pointed out that the user is the ‘environment’ of this learning problem, and that the task of EURS is to learn how the environment responds to different recommendations in different situations, so as to give the most ‘satisfactory’ recommendations to the user in each situation. Given this understanding, we model the simulated user as an MDP, whose four components are as follows (i ranges over all six appliances, i.e. i = 1, …, 6):
• States. Each day is a learning episode, divided into forty-eight equal-length time slots of 30 minutes each. Each state is a vector consisting of the current time index, the status of every appliance (on/off), and the time at which each appliance was started today (if an appliance has not been used up to the current time slot of the day, this value is −1). In the remainder of this chapter, unless stated otherwise, we use s to represent a state, s.time is the index of the current time slot in state s, and $s.t_i$ is the time when appliance i was started today.
• Actions. Each primitive action is a recommendation to the user. Since we consider six appliances, each primitive action consists of six sub-recommendations, or atomic actions, one for each appliance. We design five kinds of sub-recommendations for each appliance i (s is the current state, and l is the most likely number of time slots for which the user will use appliance i; l can be obtained easily from the $P_i$ probability distributions):
– switch on: advises the user to turn on appliance i in the current time slot and keep using it for at least another time slot. To be more specific, this sub-recommendation recommends the user to use appliance i from time s.time to time s.time + l. Note that this sub-recommendation can only be given for appliance i if i is currently off; when i is on, this atomic action is not available because it is ‘useless’.
– switch off: advises the user to turn off appliance i in the ensuing time slot. To be more specific, this sub-recommendation recommends the user to use appliance i from $s.t_i$ to the current time s.time (note that $s.t_i$ is the time when the user actually started using appliance i). Note that this sub-recommendation can only be given for an appliance that has been turned on and has not finished running; otherwise, it is ‘useless’.
– keep on: advises the user to keep using appliance i for at least another time slot; i.e. recommends the user to use appliance i from $s.t_i$ to the next time slot s.time + 1. Note that this sub-recommendation can only be given for appliance i if i is currently on.
– keep off: when $s.t_i = t$, i.e. the user plans to use appliance i now, it advises the user to postpone the usage of i for one time slot, i.e. it recommends the user to use appliance i from s.time + 1 to s.time + 1 + l; otherwise, it simply recommends the user not to use appliance i in the current time slot. Note that this sub-recommendation can only be given for appliance i if i is currently off.
– on and off: advises the user to turn on and finish using i in the ensuing time slot, i.e. it recommends the user to use appliance i only during the current time slot s.time. Note that this sub-recommendation can only be given for appliance i if i is currently off.
For example, the vector $a_1$ = (wm : keep off, hv : switch on, pc : keep off, tv : keep on, kt : switch off, ds : switch on) is a primitive action, which recommends the user to keep off the washing machine (wm), switch on the hoover (hv), keep off the PC (pc), keep on the TV (tv), switch off the kitchen electronics (kt) and switch on the dishwasher (ds). In the remainder of this chapter, we use recommendation and primitive action interchangeably, and use sub-recommendation and atomic action interchangeably. Also note that, as the user does not need to turn off the dishwasher, kitchen electronics and washing machine (these appliances turn off automatically after they finish their work), switch off and on and off are not advised for these three appliances.
• Rewards. Note that RL-based EURS uses the rewards to evaluate to what
extent the user likes or dislikes a recommendation in a state. Because an action is a vector consisting of six sub-recommendations, the reward returned by the user is the sum of six sub-rewards, one for each sub-recommendation. To be more specific, if the current time slot is not the last time slot of the day, the reward is $R = \sum_{i=1}^{6} R_i$, where $R_i$ is the sub-reward for appliance i. Sub-reward $R_i = 0$ iff the user accepts the ith sub-recommendation in the recommendation, i.e. the ith sub-recommendation is the same as the action the user actually performs on the corresponding appliance; otherwise, $R_i = pun$ (pun for punishment), where $pun \in \mathbb{R}$, $pun \le 0$ is a constant whose value we specify later in our experimental settings (Section 6.7). When the current time slot is the last time slot of the current day, $R = \sum_{i=1}^{6} (R_i + E_i^o - E_i^r)$, where $E_i^o$ and $E_i^r$ are the whole-day expenses for appliance i based on the user’s original planned usage and his real usage, respectively. Note that $E_i^o$ and $E_i^r$ are computed at the end of each day, when the real usage of each appliance is known. Since the power of every appliance is known a priori (it can be read from the original data set), computing these two values is straightforward and we omit the details here.
To understand why we give 0 when the system’s advice is accepted, consider a primitive action that is exactly the same as the user’s plan. This advice should not receive a positive reward, because it does not save any money, nor should it receive a negative reward, because it does not cause any loss. From the rewarding rules above, we can see that the system only receives positive rewards when it helps the user save money, i.e. when $E_i^r < E_i^o$. By adjusting the rejection punishment pun, we can adjust how much weight the simulated user puts on ‘minimising disruption’. For example, when pun = 0, the simulated user does not mind being disrupted and only cares about saving money, while when pun = −∞, the user only wants to avoid disruption and does not care about the expense.
• Transition function. In this problem, the transition function describes how the user adjusts his original usage after reading the advice. Consider a situation where the user’s original plan is to use appliance i from $t_s$ to $t_e$, and the system’s sub-recommendation for i is to use it from $t_1$ to $t_2$. The simulated user accepts this sub-recommendation iff $P_i(t_1, t_2) - P_i(t_s, t_e) \ge Th_1$ (Th for threshold), where $Th_1 \in \mathbb{R}$, $-1 \le Th_1 \le 1$ (we will specify the value of $Th_1$ later in Section 6.7). $Th_1$ is thus a threshold that controls how willing the simulated user is to change his original planned usage: the larger $Th_1$ is, the less willing the user is to change his original usage. For example, when $Th_1 > 0$, the user accepts a suggestion iff the advised actions are, according to the user’s existing habits, performed more often than the original planned actions.
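The four MDP components above can be sketched in code. The following is a minimal illustration under stated assumptions, not the implementation used in our experiments: the names `State`, `available_subrecs`, `accepts` and `reward` are hypothetical, the appliance keys reuse the abbreviations from the example action, and the probabilities $P_i(\cdot,\cdot)$ and the daily expenses are assumed to be supplied by the caller rather than computed from the data set.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

APPLIANCES = ("wm", "hv", "pc", "tv", "kt", "ds")
# Appliances that turn themselves off when done, so "switch off" and
# "on and off" are never advised for them.
AUTO_OFF = {"wm", "kt", "ds"}


@dataclass
class State:
    time: int                # s.time: index of the current 30-minute slot
    on: Dict[str, bool]      # on/off status of every appliance
    started: Dict[str, int]  # s.t_i: start slot today, or -1 if unused so far


def available_subrecs(s: State, i: str) -> List[str]:
    """Sub-recommendations that may be given for appliance i in state s."""
    if s.on[i]:
        options = ["keep on"]
        if i not in AUTO_OFF:
            options.append("switch off")
    else:
        options = ["switch on", "keep off"]
        if i not in AUTO_OFF:
            options.append("on and off")
    return options


def accepts(p_advised: float, p_planned: float, th1: float) -> bool:
    """Acceptance rule: P_i(t1, t2) - P_i(ts, te) >= Th_1.

    p_advised and p_planned are assumed to be precomputed by the caller.
    """
    return p_advised - p_planned >= th1


def reward(accepted: Dict[str, bool], pun: float, last_slot: bool = False,
           planned_cost: Optional[Dict[str, float]] = None,
           real_cost: Optional[Dict[str, float]] = None) -> float:
    """Sum of sub-rewards: 0 for an accepted sub-recommendation, pun otherwise.

    In the last slot of the day the saving E_i^o - E_i^r is added per appliance.
    """
    total = sum(0.0 if accepted[i] else pun for i in APPLIANCES)
    if last_slot:
        total += sum(planned_cost[i] - real_cost[i] for i in APPLIANCES)
    return total
```

Keeping the acceptance rule and the reward as separate functions mirrors the separation between the transition function and the reward function in the MDP.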
Now we design the MAXQ hierarchy. Recall that we only consider the activity-appliance hierarchy (because we focus on one specific type of day, see Section 6.4), so the composite sub-tasks in our MAXQ hierarchy correspond to activities, and the primitive actions correspond to appliance usages. Also note that in the execution of MAXQ, at each layer, only one sub-task can be selected. In other words, all activities are mutually exclusive, because at each time only one of them can be recommended to the user. In order to design mutually exclusive activities (sub-tasks), we first divide all six appliances into two groups: exclusive appliances, including the hoover, kitchen electronics and PC, and compatible appliances, including all the other appliances. This division is based on our domain knowledge that, at each time slot, the user is unlikely to use more than one exclusive appliance, but may use one or more compatible appliances. For example, we think it is unlikely for a user to use the hoover and kitchen electronics at the same time, but it is common to use the hoover and dishwasher simultaneously.
Given the above division of appliances, we design five activities (composite sub-tasks): Cleaning, Cooking, Relaxing, UseCompatible and KeepAllOff. In Cleaning, Cooking and Relaxing, only one exclusive appliance can be advised to be used, namely the hoover, kitchen electronics and PC, respectively, while all compatible appliances can also be advised to be used. In UseCompatible, one or more compatible appliances can be advised to be used, while no exclusive appliances can be advised to be used. In KeepAllOff, all appliances are advised to stop working (i.e. switch off or keep off).

[Figure 6.3: The task graph for the energy adviser system. Nodes: Root; composite sub-tasks Cleaning, Cooking, Relaxing, UseCompatible and KeepAllOff; bottom-layer boxes Use Hoover, Use Kitchen Electronics, Use PC, Use Dishwasher, Washing Machine or TV, and Stop Using All Appliances.]

The MAXQ graph is illustrated in Fig. 6.3. Each bottom-layer box represents a collection of primitive actions: for example, box
Use Hoover includes all primitive actions that advise the user to use the hoover. So the action $a_1$ = (wm : keep off, hv : switch on, pc : keep off, tv : keep on, kt : switch off, ds : switch on) mentioned above is in the box Use Hoover, whereas action $a_2$ = (wm : keep off, hv : switch on, pc : keep off, tv : keep on, kt : switch on, ds : switch on) is not included in this box. In fact, action $a_2$ is not included in any bottom-layer box in Figure 6.3, because we assume that two exclusive appliances (in this case, the hoover and kitchen electronics) should not be used by the user in the same time slot and, therefore, we do not allow MAXQ-based EURS to give this recommendation. In SARSA-based EURS, however, we still allow action $a_2$ to be recommended to the user. By excluding such actions, we shrink the action space of MAXQ-based EURS, at the risk of being unable to obtain the optimal policy: if the user indeed performs this action, our MAXQ-based EURS will never be able to give the user this recommendation.
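The membership of a primitive action in the bottom-layer boxes can be checked mechanically. The sketch below is only illustrative: the function name `bottom_box` is hypothetical, and it assumes that switch on, keep on and on and off are the sub-recommendations that count as advising the user to use an appliance in the current time slot.

```python
from typing import Dict, Optional

# Exclusive appliances and the bottom-layer box each one belongs to.
EXCLUSIVE = {"hv": "Use Hoover", "kt": "Use Kitchen Electronics", "pc": "Use PC"}
COMPATIBLE = ("ds", "wm", "tv")
# Assumption: these sub-recommendations advise using the appliance now.
USE = {"switch on", "keep on", "on and off"}


def bottom_box(action: Dict[str, str]) -> Optional[str]:
    """Return the bottom-layer box containing the action, or None if no box does."""
    used_exclusive = [i for i in EXCLUSIVE if action[i] in USE]
    used_compatible = [i for i in COMPATIBLE if action[i] in USE]
    if len(used_exclusive) > 1:
        return None  # two exclusive appliances at once: not allowed in MAXQ
    if len(used_exclusive) == 1:
        return EXCLUSIVE[used_exclusive[0]]
    if used_compatible:
        return "Use Dishwasher, Washing Machine or TV"
    return "Stop Using All Appliances"


a1 = {"wm": "keep off", "hv": "switch on", "pc": "keep off",
      "tv": "keep on", "kt": "switch off", "ds": "switch on"}
a2 = {"wm": "keep off", "hv": "switch on", "pc": "keep off",
      "tv": "keep on", "kt": "switch on", "ds": "switch on"}
```

Under these assumptions, `bottom_box(a1)` returns the Use Hoover box, while `bottom_box(a2)` returns `None`, which matches the discussion of the two example actions.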
As for the termination predicates, UseCompatible, KeepAllOff and all primitive actions terminate immediately, Root terminates when the day ends, and the other three composite sub-tasks terminate when their corresponding exclusive appliance finishes working.
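These termination predicates can be written as a single function. This is a rough sketch under assumptions that the text does not fix: time slots are indexed from 0 to 47 and the day is treated as over once the index reaches 48, and an exclusive appliance is treated as having ‘finished working’ when it has been started today and is now off. The function name `terminated` is hypothetical.

```python
from typing import Dict

SLOTS_PER_DAY = 48
# Composite sub-tasks tied to one exclusive appliance.
EXCLUSIVE_APPLIANCE = {"Cleaning": "hv", "Cooking": "kt", "Relaxing": "pc"}


def terminated(subtask: str, time: int,
               on: Dict[str, bool], started: Dict[str, int]) -> bool:
    """Termination predicate for a sub-task of the MAXQ hierarchy."""
    if subtask == "Root":
        # Assumption: slots are 0-based, so the day has ended at index 48.
        return time >= SLOTS_PER_DAY
    if subtask in EXCLUSIVE_APPLIANCE:
        i = EXCLUSIVE_APPLIANCE[subtask]
        # Assumption: "finished working" = started today and currently off.
        return started[i] != -1 and not on[i]
    # UseCompatible, KeepAllOff and all primitive actions end immediately.
    return True
```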