This strategy applies the Bayes Theorem to the development of a reinforcement learning algorithm based on the estimation of success probabilities [Vale et al., 2011a].
3.6.3.1. Context
The Bayes Theorem has been applied in many fields over time, since the underlying probability theory supports applications in widely different contexts. One advantage of this theorem is that it can be used as a reinforcement learning algorithm [Pinto et al., 2011a]. This use relies on probability estimation to determine which of the different alternatives (the strategies’ suggestions) presents the highest probability of success in each context, and is therefore considered the most appropriate approach.
Bayesian networks were developed to facilitate prediction and abduction in artificial intelligence systems. Put simply, these networks, also known as causal networks or probabilistic networks, are graphical models for reasoning under uncertainty [Dore and Regazzoni, 2010]. A Bayesian network is represented by a directed acyclic graph in which the knots (nodes) represent domain variables and the arches (arcs) represent the conditional or informative dependencies between the variables. The strength of each dependency is expressed through probabilities associated with each group of parent-child knots in the network.
The values that a domain variable can take must be mutually exclusive and exhaustive, i.e. a variable must always assume one and only one of those values. These values are usually boolean, enumerated, or numeric. The set of values must represent the domain efficiently, yet with enough detail for the relevant cases to be distinguished.
Two knots must be connected directly whenever one affects the other, with the arch pointing in the direction of the effect. Once the network’s topology is defined, the next step is to quantify the relationships between the connected knots by specifying the conditional probability distribution of each knot [Korb and Nicholson, 2003].
The conditional probability is represented as (3.4):
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \qquad (3.4)$$
This conditional probability provides the basis for propagating each knot’s probability, given the facts observed for the variables that influence it. The expected utility of the action that each knot represents is given by (3.5), where E is the available evidence, A is an action with possible outcomes Oi, U(Oi|A) is the utility of each outcome state given that action A is taken, and P(Oi|E,A) is the conditional probability distribution over the possible outcome states, given that evidence E is observed and action A is taken.
$$EU(A \mid E) = \sum_{i} P(O_i \mid E, A)\, U(O_i \mid A) \qquad (3.5)$$
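The expected utility computation can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name ExpectedUtilitySketch, the method expectedUtility and the numbers in main are hypothetical and do not come from the implementation described in this section.

```java
// Hypothetical helper written for illustration; not part of any API named in the text.
class ExpectedUtilitySketch {

    /**
     * Expected utility of one action, as in (3.5):
     * the sum over outcomes i of P(O_i|E,A) * U(O_i|A).
     *
     * @param outcomeProbs P(O_i|E,A) for each outcome i
     * @param utilities    U(O_i|A) for each outcome i
     */
    static double expectedUtility(double[] outcomeProbs, double[] utilities) {
        if (outcomeProbs.length != utilities.length) {
            throw new IllegalArgumentException("one utility per outcome is required");
        }
        double eu = 0.0;
        for (int i = 0; i < outcomeProbs.length; i++) {
            eu += outcomeProbs[i] * utilities[i];
        }
        return eu;
    }

    public static void main(String[] args) {
        // Assumed, purely illustrative numbers: outcomes "success" and "failure".
        double[] probs = {0.75, 0.25};
        double[] utils = {10.0, -2.0};
        System.out.printf("EU = %.2f%n", expectedUtility(probs, utils));
    }
}
```

With these assumed numbers the expected utility is 0.75 × 10 + 0.25 × (-2) = 7. Comparing this value across actions is what allows the most advantageous one to be selected.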
The Bayes Theorem reinforcement learning algorithm applies the Bayes Theorem through the implementation of a Bayesian network. In this adaptation, each strategy (suggested action) is represented by a different knot. All strategy knots are connected to an output knot, which is responsible for the final decision on which action is the most advantageous. This decision is based on the success probability calculated for each strategy from the observed events: whether a strategy was the most successful of all in a given negotiation period (event Yes) or not (event No). Figure 3.29 presents the topology of the Bayesian network, for an example with three strategies.
Figure 3.29 – Bayesian network’s topology considering three strategies.
At the start of a simulation all strategies must be initialized with the same probability of success; this value is then updated throughout the simulation according to the performance of each strategy. However, a Bayesian network is not a temporal probabilistic model, so a dynamic Bayesian network was found to be necessary. This dynamism was achieved by including the Counting-Learning algorithm.
Counting-Learning Algorithm
The Counting-Learning algorithm considers, for each case to be learned, an experience value that defines the degree to which the case affects the conditional probability of each knot. This algorithm guarantees that the probabilities of all knots are updated.
For each new case, the updated experience value exper’ is calculated from the previous experience value exper, as presented in (3.6).
exper' = exper + degree (3.6)
where degree represents the weight of the current case. The common approach (applied in this implementation) is to set degree to 1, meaning that the case is learnt with the same importance as all the others. However, the Counting-Learning algorithm also supports other situations: assigning a higher value, e.g. 2, makes the algorithm learn the case as if it represented two similar cases at once (giving it twice the importance), while assigning a negative value, e.g. -1, makes it “unlearn” a case.
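As a small worked illustration of (3.6), assume (hypothetically, since the text does not state an initial value) a previous experience value of exper = 3. The three degree values mentioned above then give:

$$
\begin{aligned}
\mathit{degree} = 1 &: \quad \mathit{exper}' = 3 + 1 = 4 \\
\mathit{degree} = 2 &: \quad \mathit{exper}' = 3 + 2 = 5 \\
\mathit{degree} = -1 &: \quad \mathit{exper}' = 3 - 1 = 2
\end{aligned}
$$

A degree of 2 therefore advances the experience exactly as two consecutive cases of degree 1 would, and a degree of -1 cancels one previously learnt case of degree 1.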
The experience value and the degree are used to update a knot’s probability. The probability of a knot that has observed an event is updated according to (3.7): a new probability probc’ is calculated, taking into account the previous probability probc.
$$\mathit{prob}_c' = \frac{\mathit{prob}_c \times \mathit{exper} + \mathit{degree}}{\mathit{exper}'} \qquad (3.7)$$
The remaining knots i (for which the event was not observed) are updated according to (3.8), so that the probability vector (which contains the probabilities of all knots for each context) remains normalized.
$$\mathit{prob}_i' = \frac{\mathit{prob}_i \times \mathit{exper}}{\mathit{exper}'} \qquad (3.8)$$
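The update rules above can be sketched as follows. This is a minimal illustration, not the BNJ 3 API and not the implementation used in this work. The class CountingLearningSketch, its fields and the initial experience value are assumptions made for the example, and the knots that did not observe the event are rescaled by exper / exper’, which is the rescaling that keeps the probability vector normalized.

```java
import java.util.Arrays;

// Hypothetical sketch of the update in (3.6)-(3.8); it does not use the BNJ 3 API.
class CountingLearningSketch {
    final double[] probs; // success probability of each knot (strategy)
    double exper;         // experience accumulated so far

    CountingLearningSketch(int knots, double initialExper) {
        probs = new double[knots];
        Arrays.fill(probs, 1.0 / knots); // equal initial probabilities
        exper = initialExper;            // assumption: the text gives no initial value
    }

    /** Learns one case in which knot c observed the event, weighted by degree. */
    void learn(int c, double degree) {
        if (c < 0 || c >= probs.length) {
            throw new IndexOutOfBoundsException("unknown knot: " + c);
        }
        double newExper = exper + degree; // (3.6)
        if (newExper <= 0.0) {
            throw new IllegalArgumentException("experience must stay positive");
        }
        for (int i = 0; i < probs.length; i++) {
            // (3.7) for the knot that observed the event, (3.8) for the others
            double numerator = probs[i] * exper + (i == c ? degree : 0.0);
            probs[i] = numerator / newExper;
        }
        exper = newExper;
    }

    public static void main(String[] args) {
        CountingLearningSketch net = new CountingLearningSketch(3, 3.0);
        net.learn(0, 1.0);  // knot 0 was the most successful strategy
        System.out.println(Arrays.toString(net.probs));
        net.learn(0, -1.0); // "unlearn" the same case
        System.out.println(Arrays.toString(net.probs));
    }
}
```

With three knots and an assumed initial experience of 3, learning one case for knot 0 with degree 1 gives probabilities of 0.5, 0.25 and 0.25, which still sum to 1. Applying degree -1 to the same knot afterwards restores the uniform vector, which illustrates the “unlearn” behaviour.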
Once the network’s dynamics are defined, the following step is to integrate them with the reinforcement learning algorithm, while guaranteeing the independence between probabilities concerning different contexts.
The adopted approach was to use an existing and tested API that already implements the basics of a Bayesian network, so that this strategy could abstract from the probabilistic calculations and from the propagation of probabilities. Several Java APIs were tested; however, many of them did not work properly, others were incomplete, and others lacked the up-to-date documentation needed to work with them and manipulate their functionalities as required. Some of the APIs that were tested but did not qualify as adequate were: JavaBayes [JAVABAYES, 2011], jBNC [JBNC, 2011] and Banjo [BANJO, 2011].
Finally, an adequate API that fulfils all the requirements was found: Bayesian Network tools in Java 3 (BNJ 3) [BNJ3, 2011]. To define a Bayesian network with this API, it is necessary to create the knots and the connections between them, so that the structure is built. Afterwards, the initial probabilities of each knot must be defined, to indicate how each one affects the network. Once these initial definitions are complete, the network can be executed and interacted with. To hide these calculations from the Main Agent class, an auxiliary class, BayesianNetworkStatsCalculator, was created; it contains the network and automatically defines its structure, initializations, probabilities, and the actions that can be performed on the network.
This way, by instantiating an object of this class and initializing it with the required parameters, it is possible to create an adequate Bayesian network without constantly repeating the low-level steps, which makes its use and manipulation much simpler. This class also applies the Counting-Learning algorithm to the network, while hiding its complexity.
The most important methods of the BayesianNetworkStatsCalculator class are:
• public void initialize(Object args[]), which receives three variables: the first two contain vectors with the names of the knots (strategies) and the preferred weights for each strategy, and the third indicates the number of contexts that were defined for the current simulation, so that the probability vectors are defined accordingly;
• public double getStats(), which returns the current probabilities of each knot, for the current context, so that the most adequate one can be selected as the chosen action;
• public void updateStats(boolean[] values), which receives a vector of boolean variables indicating, for each strategy, whether or not it was the one that effectively presented the best results in the market.
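To make the role of these three methods more concrete, the following stand-in mirrors them in plain Java. It is a sketch under several assumptions and is neither the actual BayesianNetworkStatsCalculator nor BNJ 3 code. The class StrategyStatsSketch, the methods setContext and bestStrategy, the typed parameters of initialize, the array returned by getStats, the initial experience value and the placeholder names S1 and S3 are all hypothetical. In particular, the text does not specify how the preference weights enter the update; purely for illustration, the sketch uses each strategy’s weight as the degree of the cases it wins, which departs from the fixed degree of 1 mentioned for the Counting-Learning algorithm.

```java
import java.util.Arrays;

// Hypothetical stand-in mirroring the three listed methods.
// It is NOT the author's BayesianNetworkStatsCalculator and does not use BNJ 3.
class StrategyStatsSketch {
    private String[] names;
    private double[] weights; // assumption: used as the learning degree
    private double[][] probs; // probs[context][strategy], one vector per context
    private double[] exper;   // one experience value per context
    private int context = 0;  // hypothetical "current context" selector

    void initialize(String[] names, double[] weights, int contexts) {
        this.names = names.clone();
        this.weights = weights.clone();
        probs = new double[contexts][names.length];
        exper = new double[contexts];
        for (double[] row : probs) {
            Arrays.fill(row, 1.0 / names.length); // equal initial probabilities
        }
        Arrays.fill(exper, names.length);         // assumption: initial experience
    }

    void setContext(int context) {
        this.context = context;
    }

    /** Current probabilities of each knot for the current context. */
    double[] getStats() {
        return probs[context].clone();
    }

    /** values[c] is true if strategy c presented the best results in the market. */
    void updateStats(boolean[] values) {
        if (values.length != names.length) {
            throw new IllegalArgumentException("one flag per strategy is required");
        }
        for (int c = 0; c < values.length; c++) {
            if (!values[c]) {
                continue;
            }
            double degree = weights[c];
            double newExper = exper[context] + degree;
            for (int i = 0; i < names.length; i++) {
                double numerator = probs[context][i] * exper[context] + (i == c ? degree : 0.0);
                probs[context][i] = numerator / newExper;
            }
            exper[context] = newExper;
        }
    }

    /** Name of the strategy with the highest probability in the current context. */
    String bestStrategy() {
        double[] p = probs[context];
        int best = 0;
        for (int i = 1; i < p.length; i++) {
            if (p[i] > p[best]) {
                best = i;
            }
        }
        return names[best];
    }

    public static void main(String[] args) {
        StrategyStatsSketch stats = new StrategyStatsSketch();
        // "S1" and "S3" are placeholder names; Average 2 gets double weight.
        stats.initialize(new String[]{"S1", "Average 2", "S3"}, new double[]{1, 2, 1}, 2);
        stats.updateStats(new boolean[]{false, true, false});
        System.out.println(stats.bestStrategy() + " " + Arrays.toString(stats.getStats()));
    }
}
```

Keeping one probability vector and one experience value per context is one simple way to obtain the independence between contexts: an update made while one context is selected leaves the vectors of the other contexts untouched.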
3.6.3.3. Experimental Findings
This sub-section presents three tests, in order to verify the correct performance of the Bayes Theorem reinforcement learning algorithm. The tests include three seller agents, for whom the results are analyzed, and seven buyer agents. The simulations were performed considering solely the first period of the day.
The first two tests present simulations using only three strategies. Figure 3.30 presents the bids that each of the three strategies proposed in both simulations.
Figure 3.30 – Seller agents’ proposed bids.
In the first test all strategies are assigned an equal importance weight, while in the second the Average 2 strategy is given double the weight of the other strategies. Figure 3.31 presents the success probability of each strategy throughout both simulations. Note that, contrary to the case of the Simple Reinforcement Learning Algorithm and the Roth-Erev Reinforcement Learning Algorithm, the success probability of the Bayes Theorem reinforcement learning algorithm reflects the actual confidence in a strategy’s success, and not in its failure. So, the higher this value is, the better the strategy is expected to perform.
Figure 3.31 – Seller agents’ success probability for an Average 2 weight: a) equal to the other strategies, b) double of the other strategies’.
Figure 3.31 shows that when the Average 2 strategy is assigned a higher weight, its confidence values detach from the others faster. As a result, Average 2 stands out as the strategy with the best confidence values on all days.
Figure 3.32 presents the incomes achieved by ALBidS when using the Average 2 strategy, in both simulations.
Figure 3.32 – Average 2’s achieved incomes with a preference weight: a) equal to the other strategies, b) double of the other strategies’.
The detachment of Average 2 as the strategy with the best confidence value was reflected in an improvement in the incomes obtained with this strategy. This is due to the constant use of this strategy’s proposal as the ALBidS system’s final action: because the strategy was used more often, its contribution to the system increased, and so did the results it achieved.
The third and final test intends to show the ability of the Bayes Theorem reinforcement learning algorithm to support a larger set of alternative strategies, which is reflected in a larger number of knots in the network. Figure 3.33 presents the evolution of the strategies’ confidence values when a higher number of strategies is considered.
Figure 3.33 – Strategies’ confidence values in the third test.
Figure 3.33 shows the evolution of the confidence values of the several strategies. This demonstrates the capability of the Bayes Theorem reinforcement learning algorithm to consider a high number of alternative strategies.