2. Reference Framework
2.2 Research Background
Restless Bandits
In this section, we examine the performance of Whittle’s index policy for the stochastic deadline scheduling problem when the power limit M is finite. We show that when M < N, no optimal index policy exists in general, so Whittle’s index policy incurs a gap-to-optimality, for which we derive an upper bound. This result provides the essential ingredient for establishing the asymptotic optimality of Whittle’s index policy in the next section.
3.4.1 Performance in the Finite Power Limit Cases
In general, Whittle’s index policy is not optimal except in some special cases [93]. The same conclusion holds for the deadline scheduling problem. In fact, we show that no optimal index policy exists in general.
Property 1. An optimal index policy for the RMAB problem formulated in (3.8) does not exist in general when M < N. When M = N, Whittle’s index policy is optimal.
Proof. That Whittle’s index policy is optimal when M = N is intuitive, and a formal proof can be found in Appendix C of [149]. To show that an optimal index policy does not exist in general, it suffices to construct a counterexample for which no index policy can be optimal.
Set the capacity of the queue to N = 3, the power limit to M = 1, the discount factor to β = 0.4, the penalty function to F(j) = j^2, and the charging cost to c[t] = 1.
Assume the arrival is busy (Q(0, 0) = 0) and the initial laxity is zero (T = j at arrival). For this small-scale MDP, a linear programming formulation is used to solve for the optimal policy [117].
Consider two different states,
s = ((1, 1), (2, 2), (2, 2)),   s′ = ((1, 1), (1, 1), (2, 2)),
where s = ((T_1, j_1), (T_2, j_2), (T_3, j_3)) ∈ S is the state of the system, consisting of the states of the individual arms.
For state s, the optimal action is to charge an EV in state (2, 2); that is, (2, 2) is preferred to (1, 1) in this case. Charging (2, 2) incurs an instant penalty of 1, and the state changes to ((T, j), (1, 1), (1, 2)), where (T, j) is a new arrival. In the next stage, the last two EVs incur a penalty of 2. If a policy charges (1, 1) instead, there is no penalty in the first stage and the state changes to ((T, j), (1, 2), (1, 2)). The last two EVs then incur a penalty of at least 5.
For state s′, the optimal action is to charge an EV in state (1, 1); that is, (1, 1) is preferred to (2, 2) in this case. Charging (1, 1) incurs an instant penalty of 1, and the state changes to ((T, j), (T′, j′), (1, 2)), where (T, j) and (T′, j′) are new arrivals. If a policy charges (2, 2) instead, the first two EVs incur an instant penalty of 2 in the first stage and the state changes to ((T, j), (T′, j′), (1, 1)). In this case, charging (2, 2) in the first stage saves a penalty of 1 in the next stage. However, because of the discount factor, it is more profitable to charge (1, 1).
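The penalty arithmetic in the two cases above can be checked with a short sketch. It is only an illustration under simplifying assumptions that are not part of the original analysis: new arrivals, charging costs, and charging rewards are ignored, so only the discounted non-completion penalties of the EVs already in the system are compared. The helper names (`step`, `q_value`, `best_cost`) are hypothetical, and the sketch does not replace the linear programming solution used for the full MDP.

```python
BETA = 0.4  # discount factor from the example


def penalty(j):
    """Non-completion penalty F(j) = j^2 from the example."""
    return j ** 2


def step(state, charge_idx):
    """Charge one EV (M = 1) and advance one stage.

    `state` is a tuple of (T, j) pairs. Returns (instant_penalty, next_state).
    Assumption: next_state lists only the surviving EVs; new arrivals are ignored.
    """
    cost, nxt = 0, []
    for k, (T, j) in enumerate(state):
        if k == charge_idx:
            j = max(j - 1, 0)      # one unit of charge delivered
        T -= 1                     # one stage closer to the deadline
        if T == 0:
            cost += penalty(j)     # EV departs, unfinished demand is penalized
        elif j > 0:
            nxt.append((T, j))     # EV stays with remaining demand
    return cost, tuple(nxt)


def best_cost(state):
    """Minimum discounted penalty over the remaining stages."""
    if not state:
        return 0.0
    return min(q_value(state, k) for k in range(len(state)))


def q_value(state, charge_idx):
    """Discounted penalty of charging `charge_idx` now and acting optimally after."""
    cost, nxt = step(state, charge_idx)
    return cost + BETA * best_cost(nxt)


s = ((1, 1), (2, 2), (2, 2))
s_prime = ((1, 1), (1, 1), (2, 2))

print("s : charge (2,2) ->", q_value(s, 1), "| charge (1,1) ->", q_value(s, 0))
print("s': charge (1,1) ->", q_value(s_prime, 0), "| charge (2,2) ->", q_value(s_prime, 2))
```

Under these assumptions, charging (2, 2) in state s costs 1 + 0.4 · 2 = 1.8 against 0 + 0.4 · 5 = 2.0 for charging (1, 1), while in the second state charging (1, 1) costs 1 + 0.4 · 1 = 1.4 against 2 for charging (2, 2), which matches the preference reversal described in the text.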
An index policy assigns each EV an index that depends only on the EV’s current state, and charges the EVs with the highest indices [51]. Therefore, under any index policy, the indices of EVs (1, 1) and (2, 2) are fixed, and the preference between these two EVs must be the same in both states considered above, which contradicts the optimal actions found here. This counterexample shows that no index policy is optimal in general.
Note that Whittle’s index policy is itself an index policy, and is therefore suboptimal in general when M < N.
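The structural point of the counterexample, namely that a state-dependent index fixes the preference between (1, 1) and (2, 2) once and for all, can be illustrated with a minimal sketch. The function `index_policy` and the two index functions below are hypothetical examples chosen for illustration; neither is Whittle’s index.

```python
def index_policy(states, index_fn, M):
    """Return the positions of the M arms with the largest index.

    The index of an arm depends only on that arm's own state (T, j).
    """
    ranked = sorted(range(len(states)),
                    key=lambda k: index_fn(states[k]),
                    reverse=True)
    return ranked[:M]


def more_demand_first(state):
    """Hypothetical index: larger remaining demand j gets priority."""
    T, j = state
    return j


def earlier_deadline_first(state):
    """Hypothetical index: smaller lead time T gets priority."""
    T, j = state
    return -T


def chosen_state(states, index_fn):
    """State of the single EV charged when M = 1."""
    return states[index_policy(states, index_fn, 1)[0]]


s = ((1, 1), (2, 2), (2, 2))
s_prime = ((1, 1), (1, 1), (2, 2))

for fn in (more_demand_first, earlier_deadline_first):
    print(fn.__name__, chosen_state(s, fn), chosen_state(s_prime, fn))
```

Each of the two index functions picks the same EV type in both states, so each matches the optimal action in one state and misses it in the other. Any other index function that ranks (1, 1) and (2, 2) strictly faces the same constraint.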
3.4.2 An Upper Bound of the Gap-to-Optimality
In the following lemma, we first establish a result that holds generally for any finite queue size N and finite power limit M.
Lemma 1. Let $G^N(s)$ be the optimal value function defined in (3.4) and $G^N_{\mathrm{RMAB}}(s)$ be the value function achieved by Whittle’s index policy. We have
$$G^N(s) - G^N_{\mathrm{RMAB}}(s) \le \frac{C}{1-\beta}\, \mathbb{E}\big[I^N[t] \,\big|\, I^N[t] > M\big]\, \Pr\big(I^N[t] > M\big), \tag{3.12}$$
where $I^N[t]$ is the number of EVs admitted to the station with N chargers within the time window $[t - \bar{T} + 1,\, t]$, $\bar{T}$ is the maximum lead time of EVs, and C is a constant determined by the charging cost and the penalty of non-completion.
The proof can be found in Appendix A.3. The gap-to-optimality is bounded by the tail expectation of the number of EVs admitted to the system. Note that the conditional expectation on the right-hand side (RHS) of (3.12) is related to the conditional value at risk (CVaR) [115], which measures the expected loss beyond a given risk level and is an important tool in risk management.
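To make the tail term concrete, the following sketch evaluates the right-hand side of (3.12) for a given probability mass function of the number of admitted EVs, reading the prefactor as C/(1 − β). The distribution, the constant C, and the function names are hypothetical placeholders chosen for illustration; the source does not specify them.

```python
def tail_term(pmf, M):
    """Return E[I | I > M] * Pr(I > M) = sum over k > M of k * Pr(I = k).

    `pmf[k]` is the probability that exactly k EVs were admitted
    in the window; it is an assumed input, not given by the source.
    """
    return sum(k * p for k, p in enumerate(pmf) if k > M)


def gap_bound(pmf, M, C, beta):
    """Right-hand side of (3.12), with the prefactor read as C / (1 - beta)."""
    return C / (1.0 - beta) * tail_term(pmf, M)


# Hypothetical distribution of admitted EVs over {0, 1, 2, 3}.
pmf = [0.4, 0.3, 0.2, 0.1]

for M in range(4):
    print("M =", M, "bound =", gap_bound(pmf, M, C=1.0, beta=0.4))
```

The bound shrinks as M grows because fewer outcomes fall in the tail I > M, and it vanishes once M covers the whole support of the distribution.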
3.4.3 Least Laxity and Longer Processing Time (LLLP) Principle
In this section, we apply the Less Laxity and Longer remaining Processing time (LLLP) principle, originally proposed in [147], to improve Whittle’s index policy. The LLLP principle is a priority rule for scheduling, defined as follows.
Given two EVs i and i′, we say that i′ dominates i (i′ ≻ i) if i′ has less laxity and longer remaining charging time, i.e., $L_{i'}[t] \le L_i[t]$ and $j_{i'}[t] \ge j_i[t]$, with at least one of the inequalities holding strictly.
LLLP defines a partial order over the EVs’ states such that the EV with less laxity and longer remaining charging demand should be given priority. In [147], the authors applied an interchange argument to show that LLLP could improve the performance of any given policy along every sample path, and further, there exists an optimal stationary policy that follows the LLLP principle under mild conditions.
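The dominance relation can be written as a small predicate. The sketch below assumes that the laxity of an EV with lead time T and remaining demand j is T − j, which is consistent with the earlier remark that zero initial laxity means T = j at arrival; the `EV` type and the function name are hypothetical.

```python
from typing import NamedTuple


class EV(NamedTuple):
    T: int  # lead time (stages until the deadline)
    j: int  # remaining charging demand

    @property
    def laxity(self) -> int:
        # Assumed definition: slack left after serving all remaining demand.
        return self.T - self.j


def dominates(a: EV, b: EV) -> bool:
    """True if `a` dominates `b` under LLLP.

    `a` needs no more laxity and no less remaining demand than `b`,
    with at least one of the two comparisons strict.
    """
    weak = a.laxity <= b.laxity and a.j >= b.j
    strict = a.laxity < b.laxity or a.j > b.j
    return weak and strict


print(dominates(EV(4, 3), EV(3, 1)))  # less laxity and more demand
print(dominates(EV(1, 1), EV(4, 3)))  # less laxity but less demand
print(dominates(EV(4, 3), EV(1, 1)))  # more demand but more laxity
```

The last two calls show why the order is only partial: an EV with less laxity but also less remaining demand is incomparable with the other, so LLLP alone does not rank every pair.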
To apply the LLLP principle, note that Whittle’s index policy for the multi-armed bandit problem is a stationary policy: at each time, it orders (the states of) the M + N arms and activates the first M arms. The proposed heuristic policy reorders every pair of arms that violates the LLLP principle (cf. Algorithm 2). As such, the proposed heuristic policy always gives priority to EVs with less laxity and longer remaining processing time.
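One possible way to realize such a reordering is sketched below. It is not a transcription of Algorithm 2, whose details are not reproduced here; it is a simple variant that repeatedly takes the highest-ranked arm not dominated by any remaining arm, so that the result respects LLLP while staying close to the input order. The input order stands in for a ranking by Whittle’s index, the laxity is assumed to be T − j, and dummy arms are left out.

```python
def dominates(a, b):
    """LLLP dominance for states (T, j), assuming laxity = T - j."""
    lax_a, lax_b = a[0] - a[1], b[0] - b[1]
    weak = lax_a <= lax_b and a[1] >= b[1]
    strict = lax_a < lax_b or a[1] > b[1]
    return weak and strict


def lllp_reorder(order):
    """Reorder arms so that no arm precedes an arm that dominates it.

    `order` lists arm states (T, j), highest priority first, for example
    as ranked by an index policy. Ties and incomparable pairs keep their
    original relative order.
    """
    remaining = list(order)
    result = []
    while remaining:
        for k, cand in enumerate(remaining):
            # A finite strict partial order always has an undominated element,
            # so this loop finds a candidate on every pass.
            if not any(dominates(other, cand) for other in remaining):
                result.append(remaining.pop(k))
                break
    return result


# Hypothetical ranking that violates LLLP: (4, 1) is ranked first
# although both other EVs dominate it.
ranking = [(4, 1), (3, 3), (2, 1)]
M = 1
new_order = lllp_reorder(ranking)
print(new_order, "-> activate", new_order[:M])
```

A ranking that already respects LLLP is returned unchanged, so the heuristic only differs from the underlying index policy when the index ordering violates the principle.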