The previous sections assumed that the ad hoc agent already knew the underlying distributions of the arms (i.e. the POMDP’s transition function), but in many cases the ad hoc agent may not have this information. Therefore, it is desirable for the ad hoc agent to reason about trading off between exploring the domain, exploring its teammates, and exploiting its current knowledge. In this section, we prove that the ad hoc agent can optimally handle this tradeoff while planning in polynomial time. We again assume that the ad hoc agent knows its teammates’ pulls and results, either by observing them directly or by listening to its teammates’ messages.
The belief space of the POMDP is increased to track two additional values, one for the Bernoulli success probability of each arm. The probabilities of these values can be tracked using a beta distribution similar to ε in Lemma 1, resulting in an additional multiplicative factor of (nR)^2. Therefore, the covering number has size poly(R, n, 1/δ). Theorem 2 follows naturally from this result and the reasoning in Theorem 1.
Theorem 2. Consider an ad hoc agent that does not know the true arm distributions, but has a uniform prior over their success probabilities, knows that its teammates’ behaviors are drawn from a continuous set of ε-greedy and UCB teammates, and can observe the results of their actions. This agent can calculate an η-optimal behavior in poly(n, R, 1/η) time.
Proof. We know that a POMDP can be solved in time polynomial in its covering number. From Theorem 1, we know that the ad hoc agent’s beliefs about its teammates’ behaviors and the observed pulls can be covered in a polynomial number of points. In this setting, the ad hoc agent must also track its beliefs about the success probability of each arm. The reasoning proceeds similarly to the reasoning about ε in Lemma 1. The agent starts with uniform beliefs about each arm’s success probability, so the posterior is a beta distribution, which can be represented using two integer parameters. These parameters correspond to the (fully observed) numbers of successes and pulls; thus the integers can be bounded by (n + 1)R for each arm. Representing the probability distribution of the two arms’ success probabilities leads to a factor of size ((n + 1)R)^2. Therefore, the η-optimal behavior can still be calculated in poly(n, R, 1/η) time.
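The belief tracking used in the proof can be illustrated with a minimal sketch. It assumes, as the bound (n + 1)R suggests, that n counts the teammates and R the rounds, so that at most (n + 1)R pulls of an arm are ever observed. The names `ArmBelief` and `belief_factor_bound` are hypothetical and are not part of PLASTIC–Model.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class ArmBelief:
    """Beta posterior over one arm's Bernoulli success probability.

    With a uniform prior, Beta(1, 1), observing `successes` successes in
    `pulls` pulls gives the posterior Beta(successes + 1, pulls - successes + 1).
    Two integers are therefore enough to represent the belief exactly.
    """
    successes: int = 0
    pulls: int = 0

    def update(self, success: bool) -> "ArmBelief":
        # Every observed pull increments the pull count; a success also
        # increments the success count.
        return ArmBelief(self.successes + int(success), self.pulls + 1)

    def mean(self) -> Fraction:
        # Posterior mean of Beta(s + 1, p - s + 1) is (s + 1) / (p + 2).
        return Fraction(self.successes + 1, self.pulls + 2)


def belief_factor_bound(n: int, rounds: int) -> int:
    """Factor ((n + 1) * R)^2 from the proof for the two arms' beliefs.

    Assumption: `n` is the number of teammates and `rounds` is R.
    """
    per_arm = (n + 1) * rounds
    return per_arm ** 2


belief = ArmBelief().update(True).update(False).update(True)
posterior_mean = belief.mean()        # two successes in three pulls
factor = belief_factor_bound(2, 5)    # ((2 + 1) * 5)^2
```

Under these assumptions the factor grows only quadratically in (n + 1)R, which is why adding the arm beliefs keeps the covering number polynomial.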
6.6 Chapter Summary
This chapter presents theoretical analysis of the PLASTIC–Model algorithm in the multi-armed bandit setting described in Section 3.2.1. These results show that PLASTIC–Model can calculate an η-optimal policy for the ad hoc agent to follow in a variety of scenarios in polynomial time. The analysis proceeds by bounding the number of states and actions in the resulting MDPs and POMDPs. When the ad hoc agent is uncertain about its teammates’ behaviors or the true success probabilities of the arms, it can efficiently represent its uncertainty about these beliefs. The compactness of these beliefs and the size of the state space enable us to prove that these policies can be computed in polynomial time. This result suggests that empirical approaches for solving POMDPs will be effective in this domain, a hypothesis which is explored in Chapter 7.
Chapter 7
Empirical Evaluation
While the previous chapter describes the theoretical analysis of the PLASTIC algorithm, this chapter presents its empirical analysis. This empirical analysis covers the three domains introduced in Section 3.2: the multi-armed bandit domain, the pursuit domain, and half field offense in the 2D RoboCup simulator. An overview of the experiments is presented in Table 7.1. This table lists the domains and teammate types used in each experiment as well as whether the teammates have been previously seen or provided in HandCodedKnowledge. Furthermore, for each experiment, we describe whether the ad hoc agent knows the environment, how many teammates it is cooperating with, whether it uses communication to cooperate with its teammates, and whether the domain provides continuous states and actions. Finally, we specify whether we test PLASTIC–Model or PLASTIC–Policy in each experiment. In the table, we bold the factors that result in extra complexities and show that PLASTIC is applicable to other complex domains. Specifically,
1 This chapter contains material from four publications: [20, 17, 18, 19]. Note that the work in Section 7.1 is joint work with Noa Agmon, Noam Hazon, and Sarit Kraus in addition to my advisor Peter Stone [20]. In addition, Sections 7.2.1, 7.2.6, and 7.2.7 are joint work with Sarit Kraus and Avi Rosenfeld in addition to Peter Stone [19].
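| Section | Domain      | Teammate Type | Teammate Knowledge             | Teammates Previously Seen | Environment Known | Number of Teammates | Uses Comm. | Continuous State/Actions | PLASTIC–Model or PLASTIC–Policy |
|---------|-------------|---------------|--------------------------------|---------------------------|-------------------|---------------------|------------|--------------------------|---------------------------------|
| 7.1.3   | Bandit      | HC            | Param. HC Set                  | Yes                       | Yes               | 7                   | Yes        | No                       | Model                           |
| 7.1.4   | Bandit      | Ext.          | Param. HC Set                  | No                        | Yes               | 1–9                 | Yes        | No                       | Model                           |
| 7.1.5   | Bandit      | HC and Ext.   | Param. HC Set                  | Yes and No                | No                | 1–9                 | Yes        | No                       | Model                           |
| 7.2.3   | Pursuit     | HC            | Known                          | Yes                       | Yes               | 3                   | No         | No                       | Model                           |
| 7.2.4   | Pursuit     | HC            | HC Set                         | Yes                       | Yes               | 3                   | No         | No                       | Model                           |
| 7.2.5   | Pursuit     | Ext.          | HC Set                         | No                        | Yes               | 3                   | No         | No                       | Model                           |
| 7.2.6   | Pursuit     | Ext.          | Learned Set                    | Yes and No                | Yes               | 3                   | No         | No                       | Model                           |
| 7.2.7   | Pursuit     | Ext.          | Learned Set + TwoStageTransfer | Briefly                   | Yes               | 3                   | No         | No                       | Model                           |
| 7.3.4   | Limited HFO | Ext.          | Learned Set                    | Yes                       | Yes               | 1                   | No         | Yes                      | Policy                          |
| 7.3.5   | Full HFO    | Ext.          | Learned Set                    | Yes                       | Yes               | 3                   | No         | Yes                      | Policy                          |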
Table 7.1: An overview of the experiments described in this chapter. We denote hand-coded by HC, externally created by Ext., and parameterized by Param.
we highlight when the teammates were externally created, when the teammates are previously unseen or only seen briefly, when the domain has continuous states and actions, when the environment is unknown, and when PLASTIC has to select from a set of parameterized models.
This analysis tests the hypothesis that PLASTIC is effective for enabling agents to quickly adapt to new teammates in a variety of possible ad hoc teamwork scenarios. In Section 7.1, we start by evaluating whether PLASTIC–Model can efficiently communicate with teammates given a limited language as well as whether it can select good models from a set of parameterized hand-coded models for HandCodedKnowledge, including the case where these models do not cover its teammates’ true behaviors. Then, in Section 7.2, we test the hypothesis that PLASTIC–Model can use models it learned from previous teammates to adapt quickly to new teammates. Furthermore, we assess whether TwoStageTransfer is effective for learning models of new teammates while transferring knowledge about past teammates. Specifically, we perform this evaluation by using the learned models in PLASTIC–Model for cooperating with these new teammates. Finally, in Section 7.3, we test the hypothesis that PLASTIC–Policy allows ad hoc agents to quickly adapt to new teammates in a complex domain (HFO).