This introductory chapter served as an initial exposition of the context of this thesis and its main contributions. In the following chapters, we elaborate on each of these contributions. The chapters are intended to be self-contained so that they can be read on their own, though we point to connections to previous chapters where appropriate.
For each chapter x, we aim to convey the following information in a similar structure. We start with an introductory part that introduces the reader to the setting; we advise the reader to also refer to the corresponding subsection in Chapter 1, as not all points are repeated in this part. Subsequently, in Section x.1, we provide the technical background needed to read the chapter. In Section x.2, we provide the details of the particular model and the desiderata that we wish to achieve in the chapter. Section x.3 serves as an exposition of what can go wrong and why classical approaches fail to address the question under investigation. Sections x.4 and x.5 describe the main result of the chapter as well as an additional result (either a warm-up or a follow-up to the main result). Finally, Section x.6 puts the particular work in the broader context, discusses other works in the area, elaborates on particular assumptions, and points to important open questions.
CHAPTER 2
MITIGATING EXPLORATION IN ONLINE LEARNING
Perhaps the biggest bottleneck in online decision-making is insufficient information about the system at hand. In complex systems, the decision-maker often deals repeatedly with a similar task, trying to decide across a set of different alternatives. The challenge is to make effective decisions despite initially knowing nothing about the system and only receiving partial feedback determined by the selected action. We focus on settings where the reward or loss of different alternatives does not follow nice stochastic patterns (e.g., it is not i.i.d. across time). This non-stochasticity is often due to interactions of multiple agents whose decisions affect the performance of different alternatives.
Originating from game-theoretic considerations in multi-agent dynamics [25, 64], adversarial online learning emerged as a way to deal with online decision-making without imposing any distributional assumption on the input. This powerful framework provides a robust way to balance exploring different alternatives and exploiting ones that have been profitable in the past. Surprisingly, even without any prior information about the system and even when losses are adversarially selected, these techniques can guarantee performance asymptotically as good as that of the best alternative in hindsight. Intuitively, although the learner is initially clueless about the different options, she can soon realize that some actions perform well, and can therefore start selecting them. In Chapter 6, we will see that this property has important consequences regarding the efficiency of complex systems with selfish participants.
In this chapter, we address an important limitation of current adversarial online learning techniques when applied in realistic settings where the learner only has access to partial feedback. In particular, although classical partial-feedback online learning techniques achieve asymptotically good performance, they approach it relatively slowly, which makes them ineffective in practice. This issue arises because these approaches need to resort to over-exploration to deal with the non-stochasticity of the environment, even when there exist actions that are really good (and learning to follow them should therefore be easier). We will show how to mitigate this phenomenon when there exists one such good action (with small loss) without significantly sacrificing the worst-case performance of the system. This is one example of obtaining data-dependent guarantees for online decision-making that can utilize some well-behaved structure in the data while being robust to this structure not holding. In the subsequent two chapters, we will see two more examples where such data-dependent guarantees can arise.
2.1 Preliminaries on adversarial online learning
Online learning setting. We first introduce the basic online learning setting, which describes the framework in which the sequential decisions are made. The decision-maker or learner has access to a set of $d$ alternatives that we will refer to as arms or actions $a = 1, \ldots, d$. At round $t = 1, \ldots, T$, the following process occurs:
1. The learner selects a probability distribution $p^t \in \Delta(d)$ over the $d$ possible arms; that is, $\sum_{a=1}^{d} p^t_a = 1$.
2. The adversary then selects losses $\ell^t = (\ell^t_1, \ldots, \ell^t_d)$, where $\ell^t_a \in [0, 1]$ denotes the loss of arm $a$ at round $t$.
3. The learner then draws action $A(t) \sim p^t$ from the distribution $p^t$ she committed to and suffers the loss of the selected action, $\ell^t_{A(t)}$.
4. The learner observes feedback about the losses based on a feedback model.
In the full feedback model (experts setting), the learner observes the losses of all the actions $\{\ell^t_a\}_{\forall a}$ regardless of what she selected. In the bandit feedback model, she observes feedback only for the selected action, $\ell^t_{A(t)}$. We will focus on a general feedback model interpolating between these two extremes.
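The four steps of the protocol, together with the two extreme feedback models, can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `play_round`, the `feedback` flag, and the 0-based arm indexing are assumptions made here and do not appear in the text.

```python
import random

def play_round(p, losses, feedback="bandit", rng=random):
    """Simulate one round of the protocol (arms are indexed 0..d-1 here).

    p        -- the learner's distribution over the d arms (step 1)
    losses   -- the adversary's loss vector, each entry in [0, 1] (step 2)
    feedback -- "full" (experts setting) or "bandit" (hypothetical flag)
    """
    d = len(p)
    assert len(losses) == d
    assert all(x >= 0 for x in p) and abs(sum(p) - 1.0) < 1e-9
    assert all(0.0 <= loss <= 1.0 for loss in losses)

    # Step 3: draw the action from p and suffer its loss.
    arm = rng.choices(range(d), weights=p, k=1)[0]
    suffered = losses[arm]

    # Step 4: reveal losses according to the feedback model.
    if feedback == "full":
        observed = {a: losses[a] for a in range(d)}
    elif feedback == "bandit":
        observed = {arm: losses[arm]}
    else:
        raise ValueError("unknown feedback model: " + feedback)
    return arm, suffered, observed
```

Under full feedback the returned dictionary always has $d$ entries; under bandit feedback it has exactly one, keyed by the drawn arm.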
Graph-based feedback model. In this chapter, we focus on a feedback model suggested by Mannor and Shamir [96], where the learner receives partial feedback based on an undirected feedback graph $G(t)$ that possibly varies across rounds. The learner observes the loss $\ell^t_{A(t)}$ of the selected arm $A(t)$ and, in addition, she also observes the losses of all arms connected to the selected arm $A(t)$ in $G(t)$. More formally, she observes the loss $\ell^t_{a'}$ for all the arms $a' \in N^t_{A(t)}$, where $N^t_a$ denotes the set containing arm $a$ and all neighbors of $a$ in $G(t)$ at round $t$. The full feedback setting and the bandit feedback setting are special cases of this model where the graphs $G(t)$ are the complete and the empty graph, respectively, for all rounds $t$.
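The observation rule can be made concrete with a minimal sketch. It assumes, purely for illustration, that the graph is stored as a dictionary mapping each arm to the set of its neighbors; the helper names are hypothetical.

```python
def observed_losses(graph, arm, losses):
    """Return the losses revealed when `arm` is played.

    graph  -- dict mapping each arm to the set of its neighbors in G(t)
              (undirected, so b in graph[a] iff a in graph[b])
    losses -- the loss vector of the current round
    The revealed set is the neighborhood N_arm = {arm} plus its neighbors.
    """
    neighborhood = {arm} | set(graph.get(arm, ()))
    return {a: losses[a] for a in sorted(neighborhood)}

def complete_graph(d):
    """Every arm is adjacent to every other arm: the full feedback setting."""
    return {a: {b for b in range(d) if b != a} for a in range(d)}

def empty_graph(d):
    """No edges: the bandit feedback setting."""
    return {a: set() for a in range(d)}
```

With the complete graph every loss is revealed, and with the empty graph only the loss of the played arm is revealed, matching the two special cases described above.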
We allow the feedback graph $G(t)$ to change each round $t$, but assume that the graph $G(t)$ is known to the player before selecting her distribution $p^t$. This model also includes the contextual bandits problem of [12, 87] as a special case, where each round the learner is presented with an additional input $x^t$, the context. In this contextual setting, the learner is offered $d$ policies, each suggesting an action depending on the context, and each round the learner can decide which policy's recommendation to follow. To model this with our evolving feedback graph model, we use the policies as nodes, and connect two policies with an edge in $G(t)$ if they recommend the same action in the context $x^t$ of round $t$.
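The reduction from contextual bandits to feedback graphs can be sketched as follows. The representation of a policy as a Python callable from context to action, and the function name `policy_feedback_graph`, are assumptions made for this example only.

```python
def policy_feedback_graph(policies, context):
    """Build the feedback graph over policies for one round.

    policies -- list of callables, each mapping a context to an action
                (a hypothetical representation of the d policies)
    context  -- the context revealed in this round
    Two distinct policies are adjacent iff they recommend the same action.
    """
    recommendations = [policy(context) for policy in policies]
    d = len(policies)
    return {
        i: {j for j in range(d)
            if j != i and recommendations[j] == recommendations[i]}
        for i in range(d)
    }
```

Because the recommendations depend on the context, the resulting graph generally differs from round to round, which is why the model needs a time-varying graph that is known before the learner commits to her distribution.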
Regret. The goal of the learner is to minimize the loss of the algorithm. On its own, the loss of the algorithm does not provide enough insight into whether the algorithm is good or not. The loss of the algorithm may be large because the algorithm selects suboptimal arms, but it may also be large because no arm has good performance. As a result, to evaluate how well the algorithm is doing, we typically focus on the so-called regret against an appropriate benchmark. The traditional notion of regret compares the performance of the algorithm to the best fixed action $f$ in hindsight. For an arm $f$, we define regret as:
$$\mathrm{Regret}(f) = \sum_{t=1}^{T} \left[ \ell^t_{A(t)} - \ell^t_f \right].$$
To evaluate performance, we consider regret against the best arm:
$$\mathrm{Regret} = \max_f \mathrm{Regret}(f).$$
Note that both Regret( f ) and Regret are random variables, depending on the randomness in the algorithm.
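As a small worked example, both quantities can be computed directly from a realized sequence of losses and actions. The function names and the 0-based indexing of rounds and arms are choices made for this sketch.

```python
def regret_against(f, losses, actions):
    """Regret(f): the algorithm's total loss minus the total loss of arm f.

    losses  -- losses[t][a] is the loss of arm a in round t (0-based)
    actions -- actions[t] is the arm the learner drew in round t
    """
    return sum(losses[t][actions[t]] - losses[t][f]
               for t in range(len(actions)))

def regret(losses, actions):
    """Regret: the maximum of Regret(f) over all arms f."""
    d = len(losses[0])
    return max(regret_against(f, losses, actions) for f in range(d))

# Two arms, three rounds; the learner picks the worse arm in every round.
example_losses = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
example_actions = [1, 0, 1]
print(regret_against(0, example_losses, example_actions))  # 2.0
print(regret_against(1, example_losses, example_actions))  # 1.0
print(regret(example_losses, example_actions))             # 2.0
```

Here the learner's total loss is 3, arm 0 accumulates loss 1, and arm 1 accumulates loss 2, so arm 0 is the best fixed arm in hindsight and determines the regret.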
A slightly weaker notion of regret is pseudoregret (cf. [35]), which compares the expected loss of the algorithm to the expected loss of any fixed arm $f$, fixed in advance and not in hindsight. More formally, this notion of expected regret is:
$$\mathrm{PseudoReg} = \max_f \mathbb{E}_{A(1) \ldots A(T)} \left[ \mathrm{Regret}(f) \right].$$
This is weaker than the expected regret $\mathbb{E}_{A(1) \ldots A(T)} \left[ \mathrm{Regret} \right] = \mathbb{E}_{A(1) \ldots A(T)} \left[ \max_f \mathrm{Regret}(f) \right]$.¹
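The gap between taking the maximum outside and inside the expectation can be illustrated with the balls-in-bins comparison that the footnote alludes to. The simulation below is only a sketch with hypothetical function names: it throws $n$ balls into $n$ bins uniformly at random, so each bin's expected load is 1, and it reports the average load of the fullest bin, which can never be below 1.

```python
import random

def bin_loads(n, rng):
    """Throw n balls into n bins uniformly at random; return the loads."""
    loads = [0] * n
    for _ in range(n):
        loads[rng.randrange(n)] += 1
    return loads

def average_max_load(n, trials, seed=0):
    """Empirical average, over `trials` runs, of the fullest bin's load."""
    rng = random.Random(seed)
    return sum(max(bin_loads(n, rng)) for _ in range(trials)) / trials

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, average_max_load(n, trials=200))
```

The average load per bin is exactly 1 in every run, whereas the fullest bin is chosen after the randomness is realized; this mirrors the difference between a benchmark fixed in advance and one chosen in hindsight.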
¹ To see the difference, consider $n$ arms that are similar but have high variance. Pseudoregret compares the algorithm's performance against the expected performance of arms, while regret compares against the "best" arm depending on the outcomes of the randomness. This difference can be quite substantial, like when throwing $n$ balls into $n$ bins the expected load of any bin is 1,
We aim for an even stronger notion of regret, guaranteeing low regret with high probability, i.e., for all $\delta > 0$, with probability $1 - \delta$, instead of only in expectation, at the expense of a logarithmic dependence on $1/\delta$ in the regret bound for any fixed $\delta$. Note that any high-probability guarantee concerning $\mathrm{Regret}(f)$ for any fixed arm $f$ with failure probability $\delta'$ automatically provides an overall regret guarantee with failure probability $\delta = d\delta'$. A high-probability guarantee on low Regret also implies low regret in expectation.²
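Both implications follow from short standard arguments, sketched here under an assumption introduced only for illustration: for every fixed arm $f$, $\mathrm{Regret}(f) \le B(\delta')$ holds with probability at least $1 - \delta'$, where $B(\delta') \ge 0$ is a placeholder for the bound and is not notation from the text. The first step is a union bound over the $d$ arms; the second uses that $\mathrm{Regret} \le T$ always holds because losses lie in $[0, 1]$.

```latex
% Union bound over the d arms, with \delta = d\,\delta':
\Pr\bigl[\mathrm{Regret} > B(\delta')\bigr]
  = \Pr\bigl[\exists f : \mathrm{Regret}(f) > B(\delta')\bigr]
  \le \sum_{f=1}^{d} \Pr\bigl[\mathrm{Regret}(f) > B(\delta')\bigr]
  \le d\,\delta' = \delta .

% From high probability to expectation, using \mathrm{Regret} \le T:
\mathbb{E}\bigl[\mathrm{Regret}\bigr]
  \le B(\delta')\,\Pr\bigl[\mathrm{Regret} \le B(\delta')\bigr]
    + T\,\Pr\bigl[\mathrm{Regret} > B(\delta')\bigr]
  \le B(\delta') + \delta\,T .
```

Choosing, for instance, $\delta = 1/T$ makes the additive term equal to 1, so the expected regret is at most $B(\delta') + 1$ with $\delta' = 1/(dT)$.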