In this section, we will experiment with several settings for each of the four parameters identified in the previous section. Graphs will be provided to show how the choice of each parameter affects the resulting FBR counter-strategies, and we will state and justify our choices for each parameter.
4.4.1
Parameter 1: Collecting Enough Training Data
Figure 4.1 shows how the amount of training data observed affects the resulting FBR counter-strategies. For this graph, we created a series of FBR counter-strategies against PsOpti4, Smallbot2298, and Attack80, using different amounts of training data. Against all three opponents, increasing the amount of training data results in stronger counter-strategies. Depending on how exploitable σopp is, we may require considerably more or less training data. Attack80 (a very
[Figure 4.1 plot: "Performance of FBR Counter-strategies to Several Opponents as Training Hands Varies"; x-axis: Training Time (games), 10000 to 1e+06; y-axis: millibets/game won by FBR counter-strategy with 95% confidence interval; series: FBR(PsOpti4), FBR(Smallbot2298), FBR(Attack80)]
Figure 4.1: Performance of FBR counter-strategies to PsOpti4, Smallbot2298, and Attack80, using different amounts of training data. The x-axis is the number of training games observed, the y-axis is the utility in millibets/game of the FBR strategy, and the error bars indicate the 95% confidence interval of the result.
it can be defeated by a much larger margin when more data is used. Against Smallbot2298 (an ε-Nash equilibrium strategy), the FBR counter-strategy requires 600,000 games just to break even.
For our five-bucket abstraction, we typically generate and use one million hands of training data. If we use more training data for this abstraction, the counter-strategies do not improve by an appreciable amount.
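As a rough illustration of how the error bars in these graphs can be read, the following sketch computes a mean in millibets/game and a 95% confidence interval from a list of per-game winnings. It assumes a normal approximation (mean plus or minus 1.96 standard errors) over independent games. The function name `mean_with_ci95` and the synthetic data are hypothetical, and the intervals reported in this chapter may have been computed differently (for example, over duplicate matches).

```python
import math
import random
import statistics


def mean_with_ci95(winnings_mb):
    """Return (mean, half_width) for a normal-approximation 95% interval.

    winnings_mb: per-game winnings, already expressed in millibets.
    Assumption: games are treated as independent samples.
    """
    n = len(winnings_mb)
    if n < 2:
        raise ValueError("need at least two games to estimate a variance")
    mean = sum(winnings_mb) / n
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    std_err = statistics.stdev(winnings_mb) / math.sqrt(n)
    return mean, 1.96 * std_err


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic winnings for illustration only; not data from our experiments.
    games = [rng.gauss(100.0, 6000.0) for _ in range(10_000)]
    mean, half_width = mean_with_ci95(games)
    print(f"{mean:.1f} +/- {half_width:.1f} millibets/game")
```

Under this approximation the interval width shrinks with the square root of the number of games, so halving the width of an error bar requires roughly four times as many games.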
4.4.2
Parameter 2: Choosing An Opponent For σopp
In Figure 4.2, we present the results of several different choices for σtrain used to create FBR counter-strategies to PsOpti4. “Probe” is a simple agent that never folds, and calls and raises equally often. “0,3,1” is similar to Probe, except that it calls 75% and raises 25% of the time. “0,1,3” is the opposite, raising 75% and calling 25%. “PsOpti4” indicates an FBR strategy created with self-play data. From this experiment, we find that even a large amount of self-play data is not useful for creating an FBR counter-strategy to PsOpti4. Instead, if we use strategies like Probe for σtrain, FBR is able to create effective counter-strategies. Unless otherwise noted, the results presented in this chapter will use Probe as σtrain, and we will refer to the Probe strategy as σprobe.
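The labels such as “0,3,1” can be read as relative weights on the fold, call, and raise actions. The sketch below shows one way to turn such a label into action probabilities and to sample from it. The names `triple_to_probs` and `sample_action` are hypothetical, and the sketch ignores game details such as situations in which raising is not legal. Under this reading, Probe would correspond to the weights “0,1,1”.

```python
import random

ACTIONS = ("fold", "call", "raise")


def triple_to_probs(triple):
    """Normalise a "fold,call,raise" weight label into probabilities."""
    weights = [float(w) for w in triple.split(",")]
    if len(weights) != 3 or any(w < 0 for w in weights) or sum(weights) == 0:
        raise ValueError(f"bad weight triple: {triple!r}")
    total = sum(weights)
    return {action: w / total for action, w in zip(ACTIONS, weights)}


def sample_action(triple, rng):
    """Draw one action according to the normalised weights."""
    probs = triple_to_probs(triple)
    return rng.choices(ACTIONS, weights=[probs[a] for a in ACTIONS], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    for label in ("0,1,1", "0,3,1", "0,1,3"):
        print(label, triple_to_probs(label), sample_action(label, rng))
```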
4.4.3
Parameter 3: Choosing the Default Policy
Figure 4.3 shows the effect that the default policy used in unobserved states has on the resulting strategy. “0,1,0” is the normal policy that is used in FBR; in unobserved states, it always calls.
[Figure 4.2 plot: "Performance of FBR(PsOpti4) with Different Training Strategies"; x-axis: Training Time (games), 10000 to 1e+06; y-axis: millibets/game won by the FBR counter-strategy with 95% confidence interval; series: Probe; 0,1,3; 0,3,1; PsOpti4]
Figure 4.2: Performance of an FBR counter-strategy to PsOpti4, using different training opponents. The x-axis is the number of training games observed, the y-axis is the utility in millibets/game of the FBR strategy, and the error bars indicate the 95% confidence interval of the result.
[Figure 4.3 plot: "Performance of FBR Counter-Strategies with Different Default Actions"; x-axis: Training Time (games), 10000 to 1e+06; y-axis: millibets/game won by FBR counter-strategy with 95% confidence interval; series: 0,1,0; 0,0,1; 0,1,1; 0,3,1]
Figure 4.3: Performance of FBR counter-strategies to PsOpti4, using different default actions for unobserved states. The x-axis is the number of training games observed, the y-axis is the utility in millibets/game of the FBR strategy, and the error bars indicate the 95% confidence interval of the result.
[Figure 4.4 plot: "Performance of FBR(PsOpti4) Counter-strategies using Different Abstractions"; x-axis: Training Time (games), 10000 to 1e+07; y-axis: millibets/game won by the FBR counter-strategy with 95% confidence interval; series: 5 E[HS]; 5 E[HS^2]; 6 E[HS^2]; 8 E[HS^2]]
Figure 4.4: Performance of FBR counter-strategies to PsOpti4, using different abstractions. The x-axis is the number of training games observed, the y-axis is the utility in millibets/game of the FBR strategy, and the error bars indicate the 95% confidence interval of the result.
“0,0,1” always raises, “0,1,1” calls and raises with equal probability, and “0,3,1” calls 75% and raises 25% of the time.
From the graph, we see that the “always call” default policy consistently performs the best among these options. As we increase the amount of training data, the default policies are used less often, and the difference between the best and worst of the policies diminishes. Unless otherwise noted, the results presented in this chapter will use the “always call” default policy.
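The role of the default policy can be illustrated with a small frequency-count model: where the opponent has been observed, the model uses the observed action frequencies, and where it has not, the model falls back to the default policy. This is a minimal sketch under our own assumptions rather than the implementation used in these experiments; the class name `FrequencyModel` and the string state keys are hypothetical.

```python
from collections import defaultdict

ACTIONS = ("fold", "call", "raise")


class FrequencyModel:
    """Opponent model built from observed action counts (illustrative only)."""

    def __init__(self, default_weights=(0, 1, 0)):
        # (0, 1, 0) is the "always call" default policy.
        total = sum(default_weights)
        if total <= 0:
            raise ValueError("default weights must sum to a positive value")
        self.default = tuple(w / total for w in default_weights)
        self.counts = defaultdict(lambda: [0, 0, 0])

    def observe(self, state, action):
        """Record one observed opponent action in an (abstracted) state."""
        self.counts[state][ACTIONS.index(action)] += 1

    def action_probs(self, state):
        """Observed frequencies if available, otherwise the default policy."""
        counts = self.counts.get(state)  # .get() does not create an entry
        if not counts or sum(counts) == 0:
            return self.default
        total = sum(counts)
        return tuple(c / total for c in counts)


if __name__ == "__main__":
    model = FrequencyModel()
    for action in ("raise", "call", "raise"):
        model.observe("preflop:bucket3:r", action)
    print(model.action_probs("preflop:bucket3:r"))  # observed state
    print(model.action_probs("flop:bucket1:rc"))    # unobserved state
```

As more training games are observed, fewer lookups reach the fallback branch, which is consistent with the shrinking gap between default policies noted above.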
4.4.4
Parameter 4: Choosing the Abstraction
Figure 4.4 shows the effect that the choice of abstraction has on the performance of the FBR counter-strategy. In this figure, we have computed several FBR strategies for the 5 bucket E[HS] abstraction and each of the 5, 6, and 8 bucket E[HS^2] abstractions.
First, we notice that the 5 bucket E[HS^2] counter-strategy outperforms the 5 bucket E[HS] counter-strategy at all points on the curve. Both of these abstractions use percentile buckets and history, as described in Section 2.5.5. In that section, we explained that E[HS^2] bucketing represented potential better than E[HS] bucketing, and this graph shows how this representation of potential can be used to produce stronger strategies.
Second, we notice that using the larger 6 and 8 bucket abstractions produces counter-strategies that are better able to exploit the opponent. With 10 million games of training data, all four types of counter-strategies have stopped improving, and are in the order we predicted: the performance of the counter-strategy increases as the abstraction grows.
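To make the difference between the two metrics concrete, the sketch below compares E[HS] and E[HS^2] for two hypothetical hands with the same expected hand strength, and then assigns values to percentile buckets of near-equal population. It is a simplified illustration only: the function names and the sampled hand strengths are assumptions, ties are split arbitrarily, and the history component of the abstractions from Section 2.5.5 is omitted.

```python
def expected_hs(final_strengths):
    """E[HS]: mean hand strength over sampled final outcomes."""
    return sum(final_strengths) / len(final_strengths)


def expected_hs_squared(final_strengths):
    """E[HS^2]: mean of squared hand strength over the same outcomes."""
    return sum(s * s for s in final_strengths) / len(final_strengths)


def percentile_buckets(values, num_buckets):
    """Map each value to a bucket in 0..num_buckets-1 of near-equal size."""
    if num_buckets < 1:
        raise ValueError("num_buckets must be at least 1")
    # Rank the values, then split the ranking into equal-sized slices.
    order = sorted(range(len(values)), key=lambda i: values[i])
    buckets = [0] * len(values)
    for rank, idx in enumerate(order):
        buckets[idx] = rank * num_buckets // len(values)
    return buckets


if __name__ == "__main__":
    steady_hand = [0.5, 0.5]    # hypothetical: strength does not change
    drawing_hand = [0.1, 0.9]   # hypothetical: usually weak, sometimes strong
    print(expected_hs(steady_hand), expected_hs(drawing_hand))
    print(expected_hs_squared(steady_hand), expected_hs_squared(drawing_hand))
    metric = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 1.0]
    print(percentile_buckets(metric, 5))
```

The two hypothetical hands are indistinguishable under E[HS], but the drawing hand scores higher under E[HS^2], since squaring rewards outcomes that end with high strength.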
        PsOpti4  PsOpti6  Attack60  Attack80  Smallbot1239  Smallbot1399  Smallbot2298  Average
FBR         137      330      2170      1048           106           118            33      563
CFR5         36      123        93        41            70            68            17       64
Table 4.1: Results of FBR counter-strategies and an ε-Nash equilibrium strategy against a variety of opponent programs in full Texas Hold’em, with winnings in millibets/game for the row player. Results involving PsOpti4 or PsOpti6 used 10 duplicate matches of 10,000 games and are significant to 20 mb/g. Other results used 10 duplicate matches of 500,000 games and are significant to 2 mb/g.
Unless otherwise noted, the results presented in this chapter will use the 5 bucket E[HS^2] abstraction. This choice is made for practical reasons, as the 5 bucket FBR counter-strategies can be produced faster and require a smaller amount of memory to store on disk or use in a match than the 6 or 8 bucket FBR counter-strategies. When we evaluate a new poker program, we typically use the largest abstraction available.