Every model and every prediction is designed to improve how you interact with your customers: to send them better-targeted messages and more personalized content, and to make them spend more in general. We have segmentation to tell us a customer’s profile, whether he is a family type or an adrenaline seeker. We know which products he should be interested in, and we know whether he is losing interest in your products. We know what we should say, so the remaining question is how we should say it. We can, for example, run an email campaign or contact him by phone. If it’s email, which template should we use, at what time of day should we send it, and in which style should the message be written?

Marketers have been doing A/B testing for a long time, and it’s a great way to be statistically confident that your A is better than your B, but think of how many letters we have here. Let’s say you have 3 different templates, 3 different subject lines, 3 options for when to send the email and 3 versions of the email text. The number of experiments is already 3 × 3 × 3 × 3 = 81. Some of these variables might turn out to be irrelevant, but testing is the only way to be certain. With a typical click rate of 0.3%, if you want to be 95% certain that A is 10% better than B, you need well over 100k emails sent. And then you have to do that 80 more times. If you don’t have millions of customers, it could take a looong time.
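How many emails a single comparison needs depends on the significance level and statistical power you choose, which the paragraph above leaves implicit. As a sketch, assuming the textbook two-proportion z-test with a two-sided 5% significance level and 80% power (both are assumptions, not stated in the original), the standard sample-size formula can be evaluated with Python's `statistics.NormalDist`. The helper name `emails_per_variant` is hypothetical.

```python
import math
from statistics import NormalDist


def emails_per_variant(base_rate, relative_lift, alpha=0.05, power=0.8):
    """Sample size per variant for a two-sided two-proportion z-test (normal approximation)."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


combinations = 3 ** 4  # templates x subject lines x send times x email texts
n = emails_per_variant(base_rate=0.003, relative_lift=0.10)
print(f"{combinations} combinations, about {n:,} emails per variant")
```

Under these assumptions the result is roughly 550,000 emails per variant. Using a one-sided test or accepting lower power reduces the figure, but it stays well above 100k, which is why testing every combination one pair at a time gets expensive so quickly.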

Without disputing the benefits of A/B testing, it is time to move on to something more advanced. Reinforcement learning has a useful concept, the trade-off between exploration and exploitation: we keep sending emails to explore different options, but while doing so, we use whatever we know so far to optimize our target (click rate, conversion, expected revenue, …). Multi-armed bandits have gained a lot of popularity over time, and there are many algorithms and even more implementations, but let’s cover the basics.

Let’s say we have 200 customers and 2 experiments, templates A and B. Initially we have no reason to believe either of them is better, so we send each experiment to 100 people.

Let’s say that we receive 5 clicks for A and 7 clicks for B (left side image). 100 people per template doesn’t seem like a lot, but even with these results the probability that B outperforms A is over 72% (the area above zero in the right side image). So what can we do now? First of all, keep learning: 12 clicks in total is really not a lot, so we want to keep sending both A and B. But what a multi-armed bandit also gives us is exploitation, so we can optimize while we learn. In the simplest case, we can send B to two thirds of the customers and A to only one third in the next round of emails. Assuming the click rates don’t change, the average click rate jumps from 6% to 2/3 × 7% + 1/3 × 5% ≈ 6.33% (a relative increase of about 5.6%).
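Both numbers in this example can be reproduced with a short sketch. It assumes uniform Beta(1, 1) priors on each template's click rate, since the original does not say how the figure was produced, so the Monte Carlo estimate of P(B > A) may differ slightly from the value quoted above depending on the prior. The helper `prob_b_beats_a` is a hypothetical name.

```python
import random


def prob_b_beats_a(clicks_a, n_a, clicks_b, n_b, draws=100_000, seed=42):
    """Monte Carlo estimate of P(rate_B > rate_A), assuming uniform Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each template: Beta(1 + clicks, 1 + non-clicks)
        rate_a = rng.betavariate(1 + clicks_a, 1 + n_a - clicks_a)
        rate_b = rng.betavariate(1 + clicks_b, 1 + n_b - clicks_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws


p = prob_b_beats_a(clicks_a=5, n_a=100, clicks_b=7, n_b=100)
print(f"P(B > A) is about {p:.2f}")

# Next round: 1/3 of the audience gets A, 2/3 gets B, observed rates unchanged.
rate_a, rate_b = 0.05, 0.07
even_split = (rate_a + rate_b) / 2        # 0.06
tilted = rate_a / 3 + 2 * rate_b / 3      # about 0.0633
print(f"{even_split:.4f} -> {tilted:.4f} ({tilted / even_split - 1:.1%} relative gain)")
```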

There are many multi-armed bandit algorithms, and a quick list of common ones would be:

Epsilon-Greedy: a (1 - epsilon) fraction of customers receives whichever experiment is currently the best, while the remaining epsilon fraction receives a randomly chosen experiment.

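Upper Confidence Bound (UCB): based on the principle of optimism in the face of uncertainty. Each experiment is scored by the highest average response that is still plausible given the data so far, and the experiment with the highest score is sent.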

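Thompson Sampling: each experiment gets an audience share proportional to the probability that it's the best experiment. In practice, this is done by drawing a click rate for each experiment from its posterior distribution and sending the experiment with the highest draw.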
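The three policies above can be sketched in plain Python on simulated data. This is a minimal illustration, not a production implementation: the 5% and 7% "true" click rates, the epsilon of 0.1, the choice of UCB1 as the Upper Confidence Bound variant and the Beta(1, 1) priors for Thompson sampling are all assumptions made for the example, and the function names are hypothetical.

```python
import math
import random

# Each policy sees per-template counts (sends, clicks) and returns the index
# of the template to send next. All three share the signature (sends, clicks, rng).


def epsilon_greedy(sends, clicks, rng, epsilon=0.1):
    """Explore a random template with probability epsilon, else exploit the best observed rate."""
    if rng.random() < epsilon:
        return rng.randrange(len(sends))
    rates = [c / s if s else 0.0 for s, c in zip(sends, clicks)]
    return max(range(len(rates)), key=rates.__getitem__)


def ucb1(sends, clicks, rng):
    """UCB1: pick the template with the highest observed rate + sqrt(2 ln t / n) bonus.

    rng is unused; it is accepted only so all policies share one signature.
    """
    for i, s in enumerate(sends):
        if s == 0:
            return i  # send every template once before the bound is defined
    t = sum(sends)
    scores = [c / s + math.sqrt(2 * math.log(t) / s) for s, c in zip(sends, clicks)]
    return max(range(len(scores)), key=scores.__getitem__)


def thompson(sends, clicks, rng):
    """Thompson sampling: draw a rate from each Beta(1 + clicks, 1 + misses) posterior, pick the max."""
    draws = [rng.betavariate(1 + c, 1 + s - c) for s, c in zip(sends, clicks)]
    return max(range(len(draws)), key=draws.__getitem__)


def simulate(policy, true_rates, n_emails=20_000, seed=0):
    """Send emails one at a time with simulated clicks; return (sends, clicks) per template."""
    rng = random.Random(seed)
    sends = [0] * len(true_rates)
    clicks = [0] * len(true_rates)
    for _ in range(n_emails):
        arm = policy(sends, clicks, rng)
        sends[arm] += 1
        if rng.random() < true_rates[arm]:
            clicks[arm] += 1
    return sends, clicks


if __name__ == "__main__":
    true_rates = [0.05, 0.07]  # hypothetical "true" click rates for templates A and B
    for policy in (epsilon_greedy, ucb1, thompson):
        sends, clicks = simulate(policy, true_rates)
        print(f"{policy.__name__:>14}: sends={sends}, total clicks={sum(clicks)}")
```

With click rates this low, the plain UCB1 bonus is large compared to the rates themselves, so UCB1 keeps exploring much longer than the other two policies here; tighter bounds for binary rewards exist, but they go beyond this sketch.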

A final word: with ML-driven marketing campaigns, it is imperative that the learning never stops. We could, of course, train a supervised classifier once we have gathered enough data and hold tightly to its accuracy in the hope that it will make some money. The fact is that expectations evolve, new ideas are born, people want fresh thoughts, and we can't respond to emerging trends by sticking to what we have in our past data.
