Common-sense Bayesian A/B Tests are flawed

The beta-binomial model comparing two independent conversion rates is flawed

This is part 1 of a multi-part series. In this post we discuss the math, and in the next post we use code and visualisations for the exposition.
Categories: Product, Experimentation, AB Testing, Bayesian, Causal Inference

Published: November 8, 2025

Note: This is part 1 of a series of posts I will be writing on Bayesian A/B Testing. This part is about common errors in application of Bayesian inference. The next one will use simulations to illustrate the point.

Note 2: The statement about having a Type 1 Error Rate of 50% was removed in an edit, but is addressed in a separate post here.

Introduction

It’s common in product experimentation and CRO circles to suggest that Bayesian A/B testing can reduce the required sample size. This is usually followed by a prompt suggestion to use the beta-binomial model of comparing two independent conversion rates with a prior slapped on. Often this prior is calculated by averaging the historical conversion rate of the metric of interest (KPI) over the last ‘n’ days. We’ll be examining the flaws of this method in detail. This is not to say that Bayesian methods are unusable, but rather that they are often applied incorrectly.

Frequentist A/B Testing: A quick recap

Please note that this will not be a full recap of the mathematics of a frequentist A/B test; there are enough resources for that on the web, and I will take it up in a later post. Here we are going to examine the null hypothesis in some detail.

This will also not be a mathematically rigorous derivation of the Average Treatment Effect (ATE). For a detailed view from that perspective, please see this book; refer to the chapter on Randomized Control Trials.

Briefly, the null hypothesis for a usual A/B Test is that there is no material difference between conversion rates of two designs, as measured by independent and randomised assignment of users to one of the two designs. Please note two key ideas here:

  1. The phrase material difference indicates that failing to reject the null does not require the observed difference to be zero. It means that the deviation from zero is not large enough to justify calling out a meaningful difference in conversion rates: for the amount of data collected, the evidence isn’t strong enough to say that the non-zero difference is in fact attributable to the design change. What counts as a meaningful difference is of course defined by our choice of the ‘minimum detectable effect’.

  2. Statistically, we are not looking at two separate conversion rates. The statistic under test is the difference (we can also formulate it as a ratio) of the two conversion rates. Mathematically, \(H_0 : \lvert p_b-p_a \rvert = 0\) and the statistic under test is \(\lvert p_b-p_a \rvert\). If we assume the statistic under test to be normally distributed, then, under null: \[\hat{\delta} = \frac{\lvert\hat{p_b} - \hat{p_a}\rvert - 0}{\sqrt{\frac{\hat{p_a}*\hat{q_a}}{n_a} + \frac{\hat{p_b}*\hat{q_b}}{n_b}}} \]

    where \(q_i = 1 - p_i,\ i \in \{a,b\}\) and typically, \(\hat{p_b} = \hat{p_a}*(1+lift_{observed})\). Under the central limit theorem, \(\hat{\delta}\) is approximately normally distributed, so long as we collect enough data.

The ‘hat’ sign here indicates that the rates are estimators of the unobservable underlying conversion rates. We have made no assumptions about the distributions of the conversion rates themselves. More precisely, the numerator \(\hat{p_b} - \hat{p_a}\) is, under reasonable assumptions, an unbiased estimator of the average treatment effect, which is a fancy way of saying it is an unbiased estimate of the true lift.
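As a quick illustration, the statistic above can be computed in R as follows; the counts are made up purely for illustration:

n_a <- 5000; y_a <- 520   # hypothetical: design A, 520 conversions out of 5000 users
n_b <- 5000; y_b <- 565   # hypothetical: design B, 565 conversions out of 5000 users

p_a_hat <- y_a / n_a
p_b_hat <- y_b / n_b

# z-statistic for the difference in proportions (unpooled standard error, as in the formula above)
delta_hat <- (abs(p_b_hat - p_a_hat) - 0) /
  sqrt(p_a_hat * (1 - p_a_hat) / n_a + p_b_hat * (1 - p_b_hat) / n_b)

# Two-tailed p-value under the normal approximation
2 * pnorm(-abs(delta_hat))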

This distinction is subtle and counter-intuitive but crucial. Combined with the type I error rate, it implies that there is a zone of uncertainty. That is, if the observed value of \(\hat{\delta}\) is not sufficiently far from zero, we cannot conclusively say that there is a material difference in performance, or that the observed lift for a given set of parameters is reliable evidence of the impact of the design (campaign, back-end change, whatever you are testing) on the conversion rate. Let’s use a visualisation to understand this better.

For simplicity we will assume that we are planning a fixed-sample A/B test and that our planning parameters are as follows: historical rate = 10% (this will substitute for \(p_a\)), MDE = 5%, Type 1 Error Rate = 11%, Power = 90%. We’ll consider a 2-tailed parametric test of proportions and assume that the split of traffic is even (\(n_a = n_b = n\)).

Show the code
library(plotly)
library(tidyverse)
library(ggthemes)
library(extrafont)

# Planning parameters
p1 = .1       # historical (baseline) conversion rate
mde = .05     # minimum detectable effect (relative lift)
alpha = .11   # type 1 error rate
power = .9    # power

p2 = p1*(1+mde)
q1 = 1 - p1
q2 = 1 - p2

# Required sample size per arm for a 2-tailed test of proportions
n = power.prop.test(p1 = p1, p2 = p2, sig.level = alpha, power = power, alternative = 'two.sided')$n

# Standard error of the difference in proportions at that sample size
delta_sd = sqrt(p1*q1/n + p2*q2/n)

# Critical values of the difference under the null, converted to the lift scale
crit_points = qnorm(mean = 0, sd = delta_sd, p = c(alpha/2, 1 - alpha/2))

crit_lift_bounds = crit_points/p1

# Simulated null distribution of the observable lift
rand_data = density(rnorm(mean = 0, sd = delta_sd, n = 10^5)/p1, from = -.1, to = .1, n = 1024)

dt <- tibble(x = rand_data$x, y = rand_data$y) |>
  mutate(loc = factor(x >= crit_lift_bounds[1] & x <= crit_lift_bounds[2], ordered = F))

ggplot(data = dt, aes(x = x, ymin = 0, ymax = y, fill = loc)) +
  geom_ribbon(col = 'black') + geom_vline(xintercept = crit_lift_bounds, lwd = .1) +
  annotate(geom = 'text', x = crit_lift_bounds, y = 15,
           label = c('Negative Lift\nDetectable', 'Positive Lift\nDetectable'), hjust = c(1.2, -.2)) +
  annotate(geom = 'text', x = 0, y = 10, label = 'Inconclusive', col = 'white') +
  scale_x_continuous(breaks = round(c(-.1, -.05, 0, .05, .1, crit_lift_bounds), 3)) +
  labs(
    title = "Zone of uncertainty for an A/B Test",
    subtitle = "Base Rate = 10%, MDE = 5%, alpha = 11%, Power = 90%",
    caption = "Note: Plot is under the null hypothesis that a priori expected lift is zero.",
    x = "Observable Lift",
    y = NULL) + theme_clean() +
  theme(text = element_text(family = "Georgia"),
        axis.ticks.y = element_blank(),
        axis.text.y = element_blank(),
        legend.position = 'none')

From the chart above we can observe that to establish a meaningful difference in an A/B test, the observed lift has to be above or below a certain threshold, i.e. in the red shaded region on the x-axis. The shaded area indicates the probability of the lift falling in that zone of the x-axis.

This long-winded recap was to point out a critical flaw in the way a lot of people misinterpret A/B tests: a non-zero observed lift doesn’t automatically mean that the lift is statistically significant (for a given sample size determined by the chosen power threshold). Nor is it a claim that the magnitude of change is ineffectual. It is about saying that, given the amount of data I have, I cannot be reasonably certain that this non-zero observed lift is not a fluke.

I should also point out how many people misinterpret the null hypothesis when it is expressed as \(H_0 : p_b = p_a\), either implicitly running a 1-tailed Bayesian test or creating a set-up where an effect is always present.

Essentially, if the only time we fail to reject the null is when the two estimated rates are exactly equal, the zone of uncertainty vanishes.

Yet this is a common error in how Bayesian A/B tests are interpreted! We look at this error in the decision rule in this post. The more common error is that Bayesian A/B tests are often implicitly run as a 1-tailed comparison and then compared to a 2-tailed test of proportions.

Let me reiterate that I am not insisting that Bayesian inference methodologies are meaningless. I am in fact a proponent of Bayesian inference so long as you have useful prior information that can be correctly encoded in the estimators. I am critical of a common implementation that is incorrect.

And this takes us into the next section.

The Bias-Variance Trade-off

In my view, the prior can be looked at through a bias-variance trade-off. Adding a prior injects information into the estimate, i.e. bias. We can also reasonably say that collecting data is essentially about increasing the amount of available information. Adding a bias helps us get away with collecting less data. Or, more correctly in the case of Bayesian estimation, using prior information to augment the currently collected data helps us be less uncertain.

However, this certainty comes at the cost of either amplifying the new results or dampening them (positive or negative bias). Assuming the same prior performance for both designs is reasonable, but it also means that we are dampening the signal in the new data collected in the A/B test. This is almost always a good thing, as Bayesian A/B testing is most often reached for in scenarios where the currently collected data is somewhat insufficient.
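To make this dampening concrete, here is a minimal sketch in R; the prior strength and the observed counts are assumptions chosen purely for illustration:

# Hypothetical prior worth 500 pseudo-observations with a 10% mean
prior_alpha <- 50
prior_beta  <- 450

# Hypothetical new data: 30 conversions out of 200 users (15% observed)
y <- 30
n <- 200

prior_mean     <- prior_alpha / (prior_alpha + prior_beta)
observed_rate  <- y / n
posterior_mean <- (y + prior_alpha) / (n + prior_alpha + prior_beta)

# The posterior mean is a compromise between prior and data: the stronger the
# prior, the more the observed 15% is pulled back towards the 10% prior mean
# (less variance, more bias).
c(prior = prior_mean, observed = observed_rate, posterior = posterior_mean)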

There is therefore no free lunch: a really strong prior can bias the results towards a non-zero lift and drown out any contrary evidence from new data, or it can over-amplify the new data and lead to an incorrect decision.

Our aim in Bayesian methodologies is to choose a reasonably informative prior and therefore make a reliable bias-variance trade-off.

Bayesian A/B Testing: The flawed way

The mathematical model: Independent Beta Estimation (Hoffmann, Hofman, and Wagenmakers 2022)

I’ll try to limit the amount of time spent on theory. For a detailed view, please check the citation above or this link by the same authors.

Briefly,

  1. It is assumed that \(n_A\) and \(n_B\) samples are collected for the two designs A and B respectively, with successful conversions denoted as \(y_A\) and \(y_B\).
  2. \(y_A\) and \(y_B\) are assumed to be binomially distributed, with respective underlying (successful) conversion rate parameters \(\theta_A\) and \(\theta_B\).
  3. \(\theta_A\) and \(\theta_B\) are assumed to be independently beta distributed (Beta Distribution), with their individual hyperparameters \(\alpha\) and \(\beta\). These can be interpreted as counts of hypothetical \(\alpha\) successes out of a total of \(\alpha+\beta\) data points.

The choice of beta distribution is useful as it is the conjugate prior of the binomial distribution. In lay person terms, it allows the posterior estimates of the conversion rates of designs A and B to follow a beta distribution as well! The posterior rates are essentially \(Beta(y_i+\alpha_i , (n_i-y_i)+\beta_i)\) where i is A or B. Neat!

For a more detailed explanation on conjugate priors in Bayesian estimators, please check out this excellent website. I can’t recommend it enough, it has collected all key ideas of probability and statistics in one place.

To summarise, \[y_A \sim Binomial(n_A,\theta_A)\]

\[y_B \sim Binomial(n_B,\theta_B)\]

\[\theta_A \sim Beta(\alpha_A,\beta_A)\]

\[\theta_B \sim Beta(\alpha_B,\beta_B)\]

\[p(\theta_A|y_A,n_A) = \frac{p(\theta_A)*p(y_A,n_A|\theta_A)}{p(y_A,n_A)}\]

\[p(\theta_B|y_B,n_B) = \frac{p(\theta_B)*p(y_B,n_B|\theta_B)}{p(y_B,n_B)}\]

These last equations simplify as follows:

\[p(\theta_A|y_A,n_A) = Beta(y_A + α_A,(n_A - y_A) + β_A)\]

\[p(\theta_B|y_B,n_B) = Beta(y_B + α_B,(n_B - y_B) + β_B)\]

Generally, we also assume that \(\alpha_A = \alpha_B = \alpha\) and \(\beta_A = \beta_B = \beta\). This is like saying that the prior for both designs is the same.

This itself is a hint that things are a bit off. Why would I need to explicitly define two priors and then state that they are the same?

If you have been able to follow along so far, you’ll see that the problem is obvious. Earlier we were dealing with modelling a difference in rate or a lift. Now, we are modelling two independent rates. A more accurate Bayesian model would define a prior on the lift itself. Fortunately, some experimentation SaaS vendors out there implement precisely such a model. But we’ll leave that for subsequent posts.

Referring to the paper cited earlier, we’ll look at two key assumptions of this model.

The first assumption is the independence of success probabilities: learning about design A tells us nothing about design B. In practice, this assumption is rarely valid. Unless something is terribly off, if the conversion rate of design A (the historical design) is around 10%, it is unlikely that the conversion rate of design B lies outside the range of 8 to 12%. And I would be damned if it were in the range of 80 to 90%.

When defining an MDE or lift in the frequentist approach, we are placing an expectation on the difference in conversion rates from a common baseline. Yet somehow this isn’t the case in this model. That in itself isn’t particularly problematic, as we can very easily use the same prior for both designs. But how do we place a prior on our expectation of there being some difference in the conversion rates of the two designs if we set the two priors’ hyperparameters to be equal? There is no reason why we can’t have an expectation on the range of possible lifts even if, a priori, we may expect the median lift to be zero.

This too happens, just implicitly. When modelling the ratio or difference of these priors, the choice of \(\alpha_A = \alpha_B = \alpha\) and \(\beta_A = \beta_B = \beta\) implicitly defines the median difference to be zero and imposes a certain ‘most probable’ range for the a priori lift.
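A quick way to see this implied prior on the lift is to simulate it; the hyperparameters below are assumptions purely for illustration:

# Two identical, independent Beta(10, 90) priors (prior mean 10% for both designs)
set.seed(7)
alpha_prior <- 10
beta_prior  <- 90

theta_A_prior <- rbeta(1e5, alpha_prior, beta_prior)
theta_B_prior <- rbeta(1e5, alpha_prior, beta_prior)
prior_lift    <- theta_B_prior / theta_A_prior - 1

# The median is ~0 by symmetry, but the 'most probable' range of the a priori
# lift is fixed by the prior strength, whether we intended that or not.
quantile(prior_lift, probs = c(.05, .5, .95))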

This isn’t exactly a stellar way of approaching an A/B test, where the aim, after all, is to establish a non-zero lift after the experiment.

The other key assumption is that an effect is always present; that is, one of designs A or B is always better and there is never a situation where the evidence is inconclusive. This follows from the fact that we haven’t assigned any prior probability to the specific point value \(\delta = 0\). In fact, we didn’t define a hypothesis test anywhere.

In practice, practitioners and proponents often take draws from the two independent posterior distributions, treat their ratio as a derived empirical random variable, and then check how often that ratio is greater than one (i.e. how often the conversion rate of design B beats that of the historical design A). This is essentially an implicit inflation of the type 1 error rate. It is what leads to the reduction in required sample size relative to a 2-sample test of proportions, or equivalently, to a viable but unreliable decision from insufficient data.
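For concreteness, here is a minimal sketch of that decision rule in R; the prior and the counts are made-up numbers purely for illustration:

set.seed(42)

# Same Beta(10, 90) prior for both designs (hypothetical)
alpha_prior <- 10
beta_prior  <- 90

# Hypothetical data: A converts 100/1000, B converts 112/1000
n_A <- 1000; y_A <- 100
n_B <- 1000; y_B <- 112

draws   <- 1e5
theta_A <- rbeta(draws, y_A + alpha_prior, (n_A - y_A) + beta_prior)
theta_B <- rbeta(draws, y_B + alpha_prior, (n_B - y_B) + beta_prior)

# The usual decision quantity: how often the posterior ratio exceeds one.
# Note there is no mass on 'no difference' -- one design is always declared better.
mean(theta_B / theta_A > 1)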

This rule biases the model towards detecting effects and fails to place appropriate weight on the possibility that no effect is present.

With an uninformative prior combined with an appropriate decision boundary, this is essentially equivalent to running a 2-sample test of proportions. So why even bother with a Bayesian method? Clearly, we need a different Bayesian model if it is to be an improvement over the classical A/B test.
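A rough numerical check of this near-equivalence, with made-up counts and flat \(Beta(1,1)\) priors; this is a sketch of an approximate correspondence, not a formal proof:

set.seed(1)

# Hypothetical data
n_A <- 2000; y_A <- 200
n_B <- 2000; y_B <- 226

# Posterior draws under flat Beta(1, 1) priors
theta_A <- rbeta(1e6, y_A + 1, (n_A - y_A) + 1)
theta_B <- rbeta(1e6, y_B + 1, (n_B - y_B) + 1)

# 'Probability that B beats A' under the common Bayesian decision rule...
mean(theta_B > theta_A)

# ...is numerically close to one minus the one-sided p-value of a test of proportions
1 - prop.test(c(y_B, y_A), c(n_B, n_A), alternative = 'greater', correct = FALSE)$p.value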

Thanks, but my brain hurts, can you show this through code and visualisations?

That is exactly what I have done in the next post. You can check it here.

For a head-start on how the IBE methodology is usually applied, you can refer to this post by Will Kurt on his blog countbayesie.com.

For my gripes with the flaws in the aforementioned methodology, refer here.

References

Hoffmann, Tabea, Abe Hofman, and Eric-Jan Wagenmakers. 2022. “Bayesian Tests of Two Proportions: A Tutorial with R and JASP.” Methodology 18 (4): 239–77. https://doi.org/10.5964/meth.9263.