Through an example, we illustrate the decisions managers have to make about their experiments, and the business implications of these issues. We do this to bring key dimensions of the bandit problem “to life” concretely. Suppose an online retailer wants to optimize how it manages relationships with newly acquired customers, to improve conversion rates from trial to repeat purchase. Under the current policy, they always send customers the same email one week after a customer’s first purchase. So they propose a test to redesign that email with the goal of generating another purchase in the next month. There are four binary factors that they vary to form 16 different emails in a 2x2x2x2 design, e.g., 2 (promotional discount or not) x 2 (personalized or not) x 2 (thank you message or not) x 2 (customer service link or not). One of the 16 conditions is actually not sending an email, to serve as a control. Another of the 16 conditions is the email they currently use.
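To make the attribute structure concrete, the 16 conditions can be enumerated as combinations of four on/off flags. This is only a sketch: the attribute names follow the example above, but the 0/1 coding and the helper name `build_design` are assumptions, and the text does not say which cells correspond to the no-email control or to the current email.

```python
from itertools import product

# Attribute names follow the example in the text; the 0/1 coding is an assumption.
ATTRIBUTES = ("discount", "personalized", "thank_you", "service_link")

def build_design():
    """Return all 2 x 2 x 2 x 2 = 16 attribute combinations as dicts of 0/1 flags."""
    return [dict(zip(ATTRIBUTES, levels))
            for levels in product((0, 1), repeat=len(ATTRIBUTES))]

design = build_design()
print(len(design))  # 16
for email_id, email in enumerate(design):
    print(email_id, email)
```

Each dict is one candidate email (one action, or arm, of the bandit), and the flags are the attributes that an attribute-based model can share information across.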

Before the test launches, the firm considers a few issues for planning. The goal is to maximize the number of customers making their second purchase within a month after the first purchase (i.e., conversion from trial to repeat; for this illustration, we do not consider profit contribution from purchase or future customer value). With the existing email, the firm has observed that 5% of its customers convert from trial to repeat. They hope to increase that percentage with a better email follow-up.

But the firm is uncertain about the impact on conversion rate that each email may have in total, and each email attribute may have separately. While the firm hopes to find an email that dramatically increases trial to repeat conversion rate, it is unclear how much lift the best email will really bring. They are conservative and believe that the improvement is going to be modest, such as a 10% increase over the current email (raise conversion rate from 5.0% to 5.5%) as opposed to a large lift, like a 100% increase (5% to 10%). The firm is also interested in how each email attribute affects that conversion rate, but they don’t believe there are interaction effects between those attributes. So the firm will analyze the attributes in the multivariate test, and even though the full-factorial data are available, the firm believes that only the four attributes’ main effects will be sufficient to include in their regression-based model of conversions.
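The lift arithmetic and the main-effects assumption can be sketched in a few lines. The logit link, the coefficient values, and the function names below are assumptions made for illustration; the text only says the model is regression-based and uses the four main effects.

```python
import math

def lifted_rate(base_rate, relative_lift):
    """Conversion rate after applying a relative lift to a base rate."""
    return base_rate * (1.0 + relative_lift)

def main_effects_rate(intercept, effects, levels):
    """Conversion probability from an intercept plus four main effects.

    The logit link is an assumption; the text only says "regression-based".
    No interaction terms are included, matching the firm's belief.
    """
    z = intercept + sum(b * x for b, x in zip(effects, levels))
    return 1.0 / (1.0 + math.exp(-z))

modest = lifted_rate(0.05, 0.10)  # 10% lift: 5.0% -> 5.5%
large = lifted_rate(0.05, 1.00)   # 100% lift: 5% -> 10%

# Hypothetical coefficients: the intercept gives a 5% rate when all flags
# are 0 (an assumption about the coding); effect sizes are made up.
baseline_logit = math.log(0.05 / 0.95)
effects = (0.10, 0.05, 0.00, -0.02)
rate_all_on = main_effects_rate(baseline_logit, effects, (1, 1, 1, 1))
print(modest, large, rate_all_on)
```

With main effects only, the model has five parameters (an intercept and four effects) rather than one per email, so every observation informs the estimate for all 16 emails.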

The firm would typically run the experiment as a balanced design, with equal allocation of customers to all 16 emails for 10 weeks. Now, however, the firm’s email marketers plan to run an adaptive experiment, changing the allocation of new customers to emails as soon as they start seeing results. Further, they plan to do this using an attribute-based and batched multi-armed bandit policy. But how frequently should they adapt and send emails in different proportions to another batch of newly acquired customers? In particular, how many initial customers should they observe during an equal allocation policy before making their first adjustment? And how many customers should be in each subsequent batch?

The question underlying all of these questions is: how robust is the adaptive experiment policy they are going to use? That is, how many more (or fewer) conversions will they get if they use 7,000 customers, but make either 10 weekly updates of 700 new customers each week, or make 70 daily updates of 100 customers each day? Or should they be using more than 7,000 total customers, e.g., 21,000 customers, even though it will take three times longer?
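The batching options in this example reduce to simple arithmetic, sketched below. The helper `batch_schedule` is hypothetical, and the calculation for the larger budget assumes customers keep arriving at 700 per week.

```python
def batch_schedule(total_customers, n_updates):
    """Split a fixed customer budget into equally sized batches."""
    if total_customers % n_updates != 0:
        raise ValueError("budget does not divide evenly into batches")
    return [total_customers // n_updates] * n_updates

weekly = batch_schedule(7000, 10)  # 10 batches of 700
daily = batch_schedule(7000, 70)   # 70 batches of 100

# Assumption: customers keep arriving at 700 per week, so a larger budget
# only lengthens the experiment.
weeks_for_21000 = 21000 // 700     # 30 weeks, three times the 10-week test
```

Both schedules use the same 7,000 customers; they differ only in how often the allocation can be revised (10 times versus 70 times).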

Certain dimensions of the problem are either under the firm’s control or already predetermined (e.g., design of experiment / attribute structure, number of total observations, number of decision periods, batch size). Other dimensions are definitely preset (e.g., true distribution of means) or definitely controllable by the experimenter (e.g., model to use, allocation rule to use). But we will consider all of these to be predetermined before the experiment begins.

The key decision always under the manager’s control is: which bandit policy should be used? A balanced design (equal allocation across all actions) will yield, on average, a number of conversions proportional to the average of all actions’ conversion rates. So any reasonable bandit policy should be better than that. On the other hand, the best hypothetical policy would be to send only the truly optimal email to everyone all of the time, yielding a total reward proportional to the conversion rate of the best email. Of course, the identity of the truly best email is unknown (and this is what needs to be learned). However, the average and the maximum of those conversion rates (the actions’ mean rewards) establish the range of performance of any bandit policy. Since all other policies fall in that range, we keep this in mind for the empirical results. While this is intuitive, it provides better framing of the results and of when and why different policies may perform well. For instance, one simple heuristic is a test-rollout policy. Suppose the firm runs a balanced design test for 20% of the planned experimental period. They identify a winner with the highest observed mean, and then they only use that winning treatment for the remaining 80% of the time. We anticipate this policy will be most effective when enough information is revealed in the results during the first 20% of the test, so that the best action can be correctly identified and used for the remaining 80% of the test. One obvious case of this occurs when the sample size is large enough, given the incidence rate, for even a simple ANOVA for proportions to uncover a significant difference between the best performing action and all others.
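The performance range and the test-rollout heuristic described above can be sketched with a small simulation. The 16 conversion rates are hypothetical values chosen only for illustration, and the function names, the seed, and the tie-breaking rule (lowest index wins) are assumptions.

```python
import random

def balanced_expected(true_rates, total_customers):
    """Expected conversions under equal allocation: total x average rate."""
    return total_customers * sum(true_rates) / len(true_rates)

def oracle_expected(true_rates, total_customers):
    """Expected conversions if the truly best email were always sent."""
    return total_customers * max(true_rates)

def simulate_test_rollout(true_rates, total_customers, test_share=0.2, seed=0):
    """Balanced test on the first share of customers, then roll out the winner."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    per_arm = int(total_customers * test_share) // n_arms
    # Test phase: every arm gets the same number of customers.
    successes = [sum(rng.random() < p for _ in range(per_arm)) for p in true_rates]
    # Equal sample sizes, so most successes means highest observed mean.
    # Ties go to the lowest index (an arbitrary choice).
    winner = max(range(n_arms), key=successes.__getitem__)
    # Rollout phase: all remaining customers receive the winning email.
    n_rollout = total_customers - per_arm * n_arms
    rollout = sum(rng.random() < true_rates[winner] for _ in range(n_rollout))
    return winner, sum(successes) + rollout

# Hypothetical true conversion rates for the 16 emails (4.0% to 5.5%).
rates = [0.040 + 0.001 * i for i in range(16)]
low = balanced_expected(rates, 7000)
high = oracle_expected(rates, 7000)
winner, conversions = simulate_test_rollout(rates, 7000)
print(low, high, winner, conversions)
```

With rates this close together and fewer than 100 customers per email in the test phase, the observed winner will often not be the truly best email, which is the situation where test-rollout is expected to do poorly.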

We review a variety of MAB policies because choosing a “good” one is important. When a manager faces a MAB problem, she selects a MAB policy to follow. This is similar to the way a manager faces a dataset and chooses a model to analyze it. In the literature, however, the bandit problem and policy are often stated together. This confound is problematic because it blurs the line between the challenges of the problem and the features of the solution. We will not only disentangle the MAB problem from the MAB policy, but also break down each MAB policy into its model (if it has one) and its allocation rule.

Further, while the model and allocation are typically tied together, even those two ingredients of the bandit policy can be chosen as “almost independent” decisions. Given a bandit problem, described by the above dimensions, different bandit policies yield different results, and the relative improvement of one policy over another is moderated by those dimensions.
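One way to picture this separation is to represent the problem, the model, and the allocation rule as distinct objects, so that each can be swapped without touching the others. The sketch below is an illustrative decomposition, not the authors' implementation; all class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass(frozen=True)
class BanditProblem:
    """Dimensions fixed before the experiment begins."""
    true_rates: Sequence[float]  # unknown to the policy
    n_periods: int
    batch_size: int

# A model maps (successes, trials) to estimated rates; an allocation rule
# maps estimated rates to the share of the next batch given to each action.
Model = Callable[[Sequence[int], Sequence[int]], List[float]]
AllocationRule = Callable[[List[float]], List[float]]

@dataclass(frozen=True)
class BanditPolicy:
    model: Model
    allocation_rule: AllocationRule

    def allocate(self, successes, trials):
        return self.allocation_rule(self.model(successes, trials))

def sample_mean_model(successes, trials):
    """Independent-arm estimates: observed rate per action (0 if untried)."""
    return [s / n if n else 0.0 for s, n in zip(successes, trials)]

def greedy_rule(estimates):
    """Send the whole next batch to the action with the highest estimate."""
    best = max(range(len(estimates)), key=estimates.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(estimates))]

def balanced_rule(estimates):
    """Ignore the estimates and allocate equally."""
    return [1.0 / len(estimates)] * len(estimates)

problem = BanditProblem(true_rates=[0.05, 0.055, 0.045], n_periods=10, batch_size=700)
greedy_policy = BanditPolicy(sample_mean_model, greedy_rule)
balanced_policy = BanditPolicy(sample_mean_model, balanced_rule)
print(greedy_policy.allocate([5, 9, 2], [100, 100, 100]))
print(balanced_policy.allocate([5, 9, 2], [100, 100, 100]))
```

The two policies share a model and differ only in the allocation rule; replacing `sample_mean_model` with an attribute-based regression would change the model while leaving either rule untouched.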

We also note that bandit policies are not solutions that exactly optimize an objective function, since no such exact solution exists for the common problem; rather the policies are better called algorithms, heuristics, or decision rules for managing the challenging problem. For each bandit problem that we create in the numerical experiment, we run several bandit policies: from simple heuristics to more sophisticated policies. This includes: the standard benchmarks like greedy and epsilon-greedy algorithms in reinforcement learning (Auer et al. 2002; Sutton and Barto 1998); slightly more advanced heuristics, such as randomized probability matching, assuming actions are independent without an attribute structure (Thompson 1933; Berry 1972, 2004); and versions of those allocation methods with appropriate binomial regression models accounting for attributes (Chapelle and Li 2011; Scott 2010).
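The three independent-arm allocation rules named above can be sketched as follows. This is a minimal illustration under stated assumptions: each action gets an independent Beta(1, 1) prior, probabilities of being best are approximated by Monte Carlo draws, and the function names, seed, and counts are hypothetical. The attribute-based versions would replace the per-arm Beta posteriors with a binomial regression, which is not shown here.

```python
import random

def greedy(successes, trials):
    """Pick the action with the highest observed conversion rate."""
    rates = [s / n if n else 0.0 for s, n in zip(successes, trials)]
    return max(range(len(rates)), key=rates.__getitem__)

def epsilon_greedy(successes, trials, epsilon, rng):
    """With probability epsilon explore uniformly; otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(trials))
    return greedy(successes, trials)

def probability_matching_weights(successes, trials, rng, n_draws=2000):
    """Estimate P(action is best) under independent Beta(1, 1) priors.

    Randomized probability matching allocates the next batch in
    proportion to these weights. The uniform prior is an assumption.
    """
    n_arms = len(trials)
    wins = [0] * n_arms
    for _ in range(n_draws):
        draws = [rng.betavariate(1 + s, 1 + n - s)
                 for s, n in zip(successes, trials)]
        wins[max(range(n_arms), key=draws.__getitem__)] += 1
    return [w / n_draws for w in wins]

rng = random.Random(0)
successes, trials = [5, 30, 2], [100, 100, 100]  # made-up counts
choice = greedy(successes, trials)
weights = probability_matching_weights(successes, trials, rng)
print(choice, weights)
```

Greedy commits fully to the current leader, epsilon-greedy reserves a fixed share for exploration, and probability matching lets the amount of exploration shrink on its own as the posterior concentrates on one action.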