The Decision Unit was implemented using the modified Q-Learning algorithm discussed in Section 3.3.
As mentioned before, the knowledge of the Q-Learning algorithm is stored in the Q-Table, made of States (rows) and Actions (columns). In this implementation, the States are defined as the amount of workload to be processed. The Actions are the possible choices of the algorithm, which in this context are the V-F pairs. The number of Actions (V-F options) for this platform is 4. Given that the performance counters of the platform analysed are 32-bit registers, the maximum number of Central Processing Unit (CPU) cycles countable is $2^{32} - 1$, or 4,294,967,295. The number of States was set to 16, with each State representing a range of workloads, e.g. State 0 spans 0 to $1 \times 10^{8}$ cycles, State 1 spans $1 \times 10^{8}$ to $1.5 \times 10^{8}$ cycles, etc. The limits of each range were selected based on how often workloads occurred over several runs of the Governor. The Q-Table is declared as a 16×4 two-dimensional array of signed 16-bit integers. Listing 4.2 shows an auto-generated Q-Table after running the Governor for 4 minutes of video decoding. The number of States and the workload ranges could be optimised to reduce memory overhead and improve power consumption; the optimisation of these parameters (number of States, range of each State) is left as future work.
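As a rough illustration of the storage involved, the declaration described above could look like the following sketch. The identifiers `NUM_STATES`, `NUM_ACTIONS`, `q_table` and `clear_q_table` are hypothetical names chosen for this example; only the dimensions and the element type come from the text.

```c
#include <stdint.h>
#include <string.h>

#define NUM_STATES  16  /* workload ranges (rows) */
#define NUM_ACTIONS 4   /* V-F pairs available on this platform (columns) */

/* Q-Table: one signed 16-bit accumulated reward per (State, Action) pair. */
static int16_t q_table[NUM_STATES][NUM_ACTIONS];

/* Clear every accumulated reward, as would be done at initialisation. */
static void clear_q_table(void)
{
    memset(q_table, 0, sizeof(q_table));
}
```

With these dimensions the table occupies 16 × 4 × 2 = 128 bytes, which gives a sense of the memory overhead that changing the number of States would affect.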
The main functions executed by the Decision Unit are:
• initialise_decision_unit(fps_requirement): Clears the Q-Table and adds the performance constraint to the cost function.
• map_state(predicted_workload): Based on the predicted workload, identifies the current Q-Table State. Returns the State number.
• take_action(state): Based on the Q-Table and the current State, selects the Action to be taken, which is a V-F pair. The exploration/exploitation algorithm (Section 3.3) is used in this function.
• calculate_reward(real_workload): Called after the frame has been processed. It uses the cost function to calculate the reward or penalty for the decision taken.
• update_qtable(reward): Updates the Q-Table by accumulating the reward for the State and Action taken.
When a new frame is about to start, the Decision Unit receives the predicted workload for that frame. The predicted workload is a 32-bit integer that represents the number of CPU cycles the next frame is expected to consume. As mentioned before, the Q-Table of the Decision Unit consists of 16 States, each State representing a range of CPU cycles. The predicted workload is therefore mapped to one of the 16 States, which determines the row of the Q-Table that will be used.
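The mapping from predicted workload to State can be sketched as a lookup over the upper limits of each range. This is only an illustration under stated assumptions: the first two limits are the ones quoted in the text, every other limit is a hypothetical placeholder (the real limits were tuned from Governor runs and are not listed here), and the upper limit of each range is assumed to be exclusive.

```c
#include <stdint.h>

#define NUM_STATES 16

/* Upper limit (exclusive, by assumption) of States 0..14, in CPU cycles.
 * Only the first two values come from the text; the remaining values are
 * HYPOTHETICAL evenly spaced placeholders, not the tuned limits. */
static const uint32_t state_upper_limit[NUM_STATES - 1] = {
    100000000u, 150000000u,                 /* State 0, State 1 (from text) */
    200000000u, 250000000u, 300000000u,     /* placeholders from here on */
    350000000u, 400000000u, 450000000u,
    500000000u, 550000000u, 600000000u,
    650000000u, 700000000u, 750000000u,
    800000000u
};

/* Return the State whose workload range contains predicted_workload. */
static unsigned int map_state(uint32_t predicted_workload)
{
    unsigned int state;

    for (state = 0; state < NUM_STATES - 1; state++) {
        if (predicted_workload < state_upper_limit[state])
            return state;
    }
    /* Anything above the last limit, up to 2^32 - 1, falls in State 15. */
    return NUM_STATES - 1;
}
```

A linear scan is used because 16 entries are few enough that a more elaborate search is unlikely to pay off in a per-frame call.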
Given the restrictions for implementing algorithms in a Linux kernel module (Section 4.1.1), the exploration/exploitation algorithm was designed using a simple random number. The exploration/exploitation ratio ε is defined as an 8-bit integer. The higher the value of ε, the more probable exploitation becomes, and vice versa. Therefore, at the start of the application, ε is set to 0. The maximum value of ε is 255. Every frame, ε is incremented, gradually reducing exploration in favour of exploitation. The decision to explore or exploit is made by obtaining the simple random number r(n) and comparing it to ε: if ε ≤ rand8bit then explore, else exploit.
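The comparison described above can be sketched with two small helpers. The names `epsilon`, `should_explore` and `next_epsilon` are hypothetical, `rand8bit` is assumed to be supplied by whatever simple random number source the Governor uses, and holding the ratio at its maximum of 255 instead of letting it wrap around is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Explore when the ratio is less than or equal to the 8-bit random number.
 * With epsilon == 0 this is always true, so early frames always explore. */
static bool should_explore(uint8_t epsilon, uint8_t rand8bit)
{
    return epsilon <= rand8bit;
}

/* Increment the ratio once per frame, saturating at 255 (assumption). */
static uint8_t next_epsilon(uint8_t epsilon)
{
    return (epsilon < 255u) ? (uint8_t)(epsilon + 1u) : epsilon;
}
```

Note that under this rule exploration never disappears entirely: at the maximum ratio of 255 the Governor still explores whenever the random number is also 255.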
When exploring, another simple random number rand2bit is generated to select the Action. Given that only 4 Actions (4 V-F pairs) are possible on this particular platform, only 2 bits of s(n) are used, giving a range of 0 to 3. This 2-bit random number is then used as the exploring decision.
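A minimal sketch of reducing a random value to a 2-bit Action index follows. Which 2 bits of the random number are used is not stated in the text, so taking the two least significant bits is an assumption, and `explore_action` is a hypothetical name.

```c
#include <stdint.h>

#define NUM_ACTIONS 4u  /* must be a power of two for the mask below */

/* Keep only the two least significant bits, giving an Action in 0..3. */
static unsigned int explore_action(uint32_t random_value)
{
    return random_value & (NUM_ACTIONS - 1u);
}
```

Masking avoids a modulo operation, which is consistent with the aim of keeping arithmetic cheap inside the Governor.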
When exploiting, the knowledge of the Q-Table is used to decide the next Action. Given the State identified from the predicted workload, the Action is selected from the Q-Values of that particular State. This Action is the most suitable one, based on the experience of the Decision Unit itself. The implementation of the selection of the most suitable Action is explained in Section 4.3.2.
After the decision has been taken, whether exploring or exploiting, the Governor sends a cpufreq_change() request to the CPU. After the V-F has been changed, the Governor becomes dormant and control is returned to the application, where the workload is processed. Once the frame has been executed and a new frame is ready, the Shepherd Governor is called again to learn from the previous decision.
The implementation of the learning is explained in the next section (Section 4.3.1).
4.3.1 Implementation of Learning Phase in Shepherd
The new State has been identified and the Action has been taken. After the frame is executed, the Decision Unit must learn about its latest choice. The Governor emits a request to the Performance Counters Module (Section 4.4) to obtain the CPU cycles of that particular frame. The real deadline time $t_{deadline}$ is calculated based on the performance requirement $pr$ and the time overhead $t_{overhead}$ it takes for the Shepherd algorithm to process (Equation 4.1). The time it took for the workload to process, $t_{workload}$, is calculated from the processor cycles $CPU\_CYCLES$ used for the workload, retrieved from the Performance Counters Module, and the frequency $f$ at which the workload was executed (Equation 4.2). The reward $r$ is calculated as the difference between $t_{deadline}$ and $t_{workload}$ (Equation 4.3). A $t_{workload}$ greater than $t_{deadline}$ yields a negative reward. In order to improve the execution time of the Governor and reduce overhead, there are two major differences from the originally designed algorithm (shown in Figure 3.8 and Equation 3.3):
• In the design the reward is normalised to 1, with rewards being a fraction and the highest reward being 1. This was done differently for two reasons: fractions imply the use of floating point numbers, which cannot be used natively in a kernel module; and division is costly in CPU cycles [71, 72], so it was avoided. Therefore, unsigned integers were used for calculating the reward.
• Given that the reward is not normalised to 1, and to reduce the use of arithmetic operations, the larger the yielded $r$, i.e. $t_{deadline} \gg t_{workload}$, the lower the reward is in reality. Therefore, the maximum reward was implemented to be 0, i.e. $t_{deadline} = t_{workload}$. When exploiting, for the current state $s_{current}$ the take_action(state) function takes the current action $a_{current}$ to be the one with the lowest positive accumulated reward $Q(s_{current}, a)$, so $a_{current} \leftarrow \arg\min_{a} Q(s_{current}, a)$ subject to $Q(s_{current}, a) \geq 0$.
$$t_{deadline} = \frac{1}{pr} - t_{overhead} \tag{4.1}$$

$$t_{workload} = \frac{CPU\_CYCLES}{f} \tag{4.2}$$

$$r = t_{deadline} - t_{workload} \tag{4.3}$$
As an example, the performance requirement $pr$ is set to 23.976 fps, equivalent to a frame period of 41708 µs. Taking a worst-case estimate of the Shepherd Governor execution time, the overhead $t_{overhead}$ is set to 0.5 ms, which is larger than the overhead recorded (presented in Section 5.4). The real deadline is then 41208 µs (Equations 4.4 and 4.5). On receiving $CPU\_CYCLES = 25659200$ from the Performance Counters Module, the time $t_{workload}$ is calculated in microseconds using the frequency $f$ in MHz, as in the example in Equation 4.6. The reward $r$ is then computed as the difference $t_{deadline} - t_{workload}$, yielding a reward of $r = 9134$; the time units are dropped.
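$$t_{deadline} = \frac{1}{pr} - t_{overhead} = \frac{1}{23.976\,\text{fps}} - t_{overhead} \tag{4.4}$$

$$t_{deadline} = 41708\,\mu s - 500\,\mu s = 41208\,\mu s \tag{4.5}$$

$$t_{workload} = \frac{CPU\_CYCLES}{f} = \frac{25659200\,\text{cycles}}{800\,\text{MHz}} = 32074\,\mu s \tag{4.6}$$

$$r = t_{deadline} - t_{workload} = 41208\,\mu s - 32074\,\mu s \tag{4.7}$$

$$r = 9134 \tag{4.8}$$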
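The worked example can be reproduced with integer arithmetic only, as in the following sketch. The function and parameter names are hypothetical, and passing the frame period already expressed in microseconds (so that 1/pr is never computed at run time) is an assumption of this sketch, not a description of the actual module.

```c
#include <stdint.h>

/* Equation 4.1 with the frame period (1/pr) precomputed in microseconds. */
static int32_t deadline_us(uint32_t frame_period_us, uint32_t overhead_us)
{
    return (int32_t)frame_period_us - (int32_t)overhead_us;
}

/* Equation 4.2: cycles divided by MHz gives microseconds directly.
 * freq_mhz must be non-zero. */
static int32_t workload_us(uint32_t cpu_cycles, uint32_t freq_mhz)
{
    return (int32_t)(cpu_cycles / freq_mhz);
}

/* Equation 4.3: negative when the frame overran its deadline. */
static int32_t calculate_reward(int32_t t_deadline_us, int32_t t_workload_us)
{
    return t_deadline_us - t_workload_us;
}
```

Called with a frame period of 41708 µs, an overhead of 500 µs, 25659200 cycles and 800 MHz, these helpers give a deadline of 41208, a workload time of 32074 and a reward of 9134, matching Equations 4.5 to 4.8.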
The calculated reward is then sent to update the Q-Table. The Q-Table is updated as in Section 3.3. The implementation is shown in Listing 4.1. The Learning Rate α (ALPHA) is set to 40%. The shifts are done to avoid overflow in the variables.
temp1 = ALPHA * (reward >> 4); // ALPHA = 40
temp2 = (100 - ALPHA) * (QTable[previous_state][previous_action] >> 4);
temp3 = temp1 + temp2;
QTable[previous_state][previous_action] = (temp3 / 100) << 4;
Listing 4.1: Q-Table learning code
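To make the arithmetic of Listing 4.1 easier to follow, the update can be wrapped in a standalone function, as sketched below. The function name and the 32-bit temporaries are assumptions. The sketch also multiplies by 16 where the listing shifts left by 4: the two agree for non-negative values, and the multiplication avoids left-shifting a negative number, which C leaves undefined.

```c
#include <stdint.h>

#define ALPHA 40  /* learning rate in percent, as in Listing 4.1 */

/* Blend the new reward into the old Q-Value: 40% reward, 60% old value.
 * Both inputs are scaled down by 16 first to keep intermediates small.
 * Right-shifting a negative value is implementation-defined in C; GCC and
 * Clang perform an arithmetic shift, which this sketch relies on. */
static int16_t update_q_value(int16_t old_q, int32_t reward)
{
    int32_t temp1 = ALPHA * (reward >> 4);
    int32_t temp2 = (100 - ALPHA) * ((int32_t)old_q >> 4);
    int32_t temp3 = temp1 + temp2;

    /* Scale back up by 16; the result is always a multiple of 16. */
    return (int16_t)((temp3 / 100) * 16);
}
```

For instance, a reward of 9134 applied to an empty entry (Q-Value 0) gives 40 × 570 = 22800, then 22800 / 100 = 228, and finally 228 × 16 = 3648. The scaling explains why every entry in an auto-generated Q-Table is a multiple of 16.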
4.3.2 Action Suitability for Exploiting Learning
As explained before, the highest accumulated reward corresponds to the lowest positive value in that particular State, and the Action holding it is selected as the ‘best’ action. As an example, Listing 4.2 shows a Q-Table generated from a run of Shepherd on the BBxM. As a reference, let the new State be 3. Table 4.1 shows the Q-Values of State 3. The most suitable Q-Value is Q(3, 1), so for State 3 Action 1 is used (marked in blue in Table 4.1).
QTABLE              ACTIONS
STATES        0        1        2        3
 0        15872    15856        0        0
 1            0        0        0        0
 2            0    -2528     6272     6688
 3       -15520     3712    11856    11216
 4       -27728    -3584     8480    13104
 5       -19808    -7440     8560    12832
 6       -16304    -7040     4944     9792
 7       -15232    -3024     4736    13312
 8            0        0        0        0
 9        -8400        0        0        0
10            0        0        0        0
11            0        0        0        0
12       -16640        0        0        0
13         -352        0        0        0
14            0        0        0        0
15        -8240        0        0        0
Listing 4.2: Auto-generated Q-Table for a video decoding application.
                  Actions
State         0        1        2        3
3        -15520     3712    11856    11216
Table 4.1: Q-Values of State 3 of Q-Table from Listing 4.2
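The selection rule (lowest non-negative Q-Value in the row of the current State) can be sketched as a single pass over the four Actions. The function name, the `NO_ACTION` return value for a row with only negative entries, and breaking ties in favour of the lower Action index are all assumptions of this sketch; the text does not say how those cases are handled.

```c
#include <stdint.h>

#define NUM_ACTIONS 4
#define NO_ACTION   (-1)  /* hypothetical marker: no non-negative Q-Value */

/* Return the Action with the smallest Q-Value that is >= 0. */
static int best_action(const int16_t q_row[NUM_ACTIONS])
{
    int best = NO_ACTION;
    int a;

    for (a = 0; a < NUM_ACTIONS; a++) {
        if (q_row[a] < 0)
            continue;  /* penalised Action: deadline was missed */
        if (best == NO_ACTION || q_row[a] < q_row[best])
            best = a;
    }
    return best;
}
```

Applied to the State 3 row of Table 4.1, {-15520, 3712, 11856, 11216}, the function returns Action 1. Under this rule an entry that is still 0 counts as the best possible value, so in a row such as State 2 of Listing 4.2 the untried Action 0 would be chosen.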