I’ll try to limit the amount of time spent on theory. For a detailed view, please check the citation above or this link by the same authors.
Briefly,
- It is assumed that \(n_A\) and \(n_B\) samples are collected for the two designs A and B respectively, with the number of successful conversions denoted as \(y_A\) and \(y_B\).
- \(y_A\) and \(y_B\) are assumed to be binomially distributed, with underlying (successful) conversion rate parameters \(\theta_A\) and \(\theta_B\) respectively.
- \(\theta_A\) and \(\theta_B\) are assumed to be independently beta distributed (Beta Distribution), each with its own hyperparameters \(\alpha\) and \(\beta\). These can be interpreted as counts of hypothetical ‘\(\alpha\)’ successes from a total of ‘\(\alpha+\beta\)’ data points.
The choice of the beta distribution is useful as it is the conjugate prior of the binomial distribution. In layperson’s terms, it allows the posterior distributions of the conversion rates of designs A and B to follow a Beta distribution as well! The posterior rates are essentially \(Beta(y_i+\alpha_i, (n_i-y_i)+\beta_i)\) where i is A or B. Neat!
For a more detailed explanation on conjugate priors in Bayesian estimators, please check out this excellent website. I can’t recommend it enough, it has collected all key ideas of probability and statistics in one place.
To summarise,
\[y_A \sim Binomial(n_A, \theta_A)\]
\[y_B \sim Binomial(n_B, \theta_B)\]
\[\theta_A \sim Beta(\alpha_A, \beta_A)\]
\[\theta_B \sim Beta(\alpha_B, \beta_B)\]
\[p(\theta_A|y_A,n_A) = \frac{p(\theta_A) \cdot p(y_A,n_A|\theta_A)}{p(y_A,n_A)}\]
\[p(\theta_B|y_B,n_B) = \frac{p(\theta_B) \cdot p(y_B,n_B|\theta_B)}{p(y_B,n_B)}\]
These last equations simplify as follows:
\[p(\theta_A|y_A,n_A) = Beta(y_A + \alpha_A, (n_A - y_A) + \beta_A)\]
\[p(\theta_B|y_B,n_B) = Beta(y_B + \alpha_B, (n_B - y_B) + \beta_B)\]
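To make this concrete, here is a minimal sketch of that posterior update in Python. The counts and the flat Beta(1, 1) priors are made up purely for illustration:

```python
from scipy import stats

# Hypothetical counts, purely for illustration
n_A, y_A = 1000, 100   # design A: 1000 visitors, 100 conversions
n_B, y_B = 1000, 112   # design B: 1000 visitors, 112 conversions

# Prior hyperparameters for each design; Beta(1, 1) is a flat prior
alpha_A, beta_A = 1, 1
alpha_B, beta_B = 1, 1

# Conjugacy: the posterior of each conversion rate is again a Beta distribution
post_A = stats.beta(y_A + alpha_A, (n_A - y_A) + beta_A)
post_B = stats.beta(y_B + alpha_B, (n_B - y_B) + beta_B)

print(f"Posterior mean rate for A: {post_A.mean():.4f}")
print(f"Posterior mean rate for B: {post_B.mean():.4f}")
```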
Generally, we also assume that \(\alpha_A = \alpha_B = \alpha\) and \(\beta_A = \beta_B = \beta\). This is like saying that the prior for both designs is the same.
This itself is a hint that things are a bit off. Why would I need to explicitly define two priors and then state that they are the same?
If you have been able to follow along so far, you’ll see that the problem is obvious. Earlier we were dealing with modelling a difference in rate or a lift. Now, we are modelling two independent rates. A more accurate Bayesian model would define a prior on the lift itself. Fortunately, some experimentation SaaS vendors out there implement precisely such a model. But we’ll leave that for subsequent posts.
Referring to the paper cited earlier, we’ll look at two key assumptions of this model.
The first assumption is the independence of success probabilities: learning about design A tells us nothing about design B. In practice, this assumption is rarely valid. Unless something is terribly off, if the conversion rate of design A (the historical design) is around 10%, it is unlikely that the conversion rate of design B is beyond the range of 8 to 12%. And I would be damned if it were in the range of 80 to 90%.
When we define an MDE or lift in the frequentist approach, we are placing an expectation on the difference in conversion rates from a common baseline. Yet somehow that isn’t the case in this model. Using the same prior for both designs isn’t particularly problematic in itself, but how do we place a prior on our expectation that there is some difference in the conversion rates of the two designs if the hyperparameters of the two priors are set to be equal? There is no reason why we can’t have an expectation on the range of possible lifts, even if a priori we expect its median value to be 0.
This too happens implicitly. When modeling the ratio or difference of rates drawn from these priors, the choice of \(\alpha_A = \alpha_B = \alpha\) and \(\beta_A = \beta_B = \beta\) implicitly defines the median difference to be zero and imposes a certain ‘most probable’ range on the a priori lift.
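A quick way to see this implicit prior on the lift is to simulate it. The sketch below uses an arbitrary shared prior roughly centred on a 10% conversion rate; the hyperparameter values are my own assumption, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# An arbitrary shared prior, roughly centred on a 10% conversion rate
alpha, beta = 10, 90

# Draw both rates from the same prior, as the model assumes
theta_A = rng.beta(alpha, beta, size=100_000)
theta_B = rng.beta(alpha, beta, size=100_000)

# Implied a priori distribution of the relative lift
lift = theta_B / theta_A

print("Median a priori lift:", np.median(lift))                      # ~1, i.e. no lift
print("Central 95% a priori range:", np.percentile(lift, [2.5, 97.5]))
```

How wide or narrow that range turns out to be depends entirely on how concentrated the shared prior is, not on any explicit statement about plausible lifts.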
This isn’t exactly a stellar way of approaching an A/B test where, after all, the aim is to establish a non-zero post-experiment lift.
The other key assumption is that an effect is always present, that is, one of designs A or B is always better and there is never a situation where the evidence is inconclusive. This assumption follows from the fact that we aren’t able to assign any prior probability to a specific point value such as \(\delta = 0\). In fact, we didn’t define a hypothesis test anywhere.
In practice, practitioners and proponents often take posterior draws from the two independent distributions, treat their ratio as a derived empirical random variable, and then check how often the ratio is greater than one (i.e. how often the conversion rate of design B is better than that of the historical design A). This is essentially an implicit inflation of the type 1 error rate. It is what leads to a smaller required sample size compared with the 2-sample test of proportions, or equivalently, a viable but unreliable decision from insufficient data.
It biases the model to detect effects and fails to place appropriate weight on no effect being present.
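For concreteness, here is roughly what that posterior-sampling procedure looks like. This is a generic sketch with a flat prior and made-up counts, not any particular vendor’s implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data and a flat Beta(1, 1) prior
n_A, y_A = 1000, 100
n_B, y_B = 1000, 112
alpha, beta = 1, 1

# Independent draws from the two posteriors
draws_A = rng.beta(y_A + alpha, (n_A - y_A) + beta, size=100_000)
draws_B = rng.beta(y_B + alpha, (n_B - y_B) + beta, size=100_000)

# Derived quantity: how often does B beat A in the posterior?
prob_B_beats_A = np.mean(draws_B / draws_A > 1)
print(f"P(theta_B > theta_A | data) = {prob_B_beats_A:.3f}")

# The usual (and problematic) decision rule: declare B the winner as soon as
# this probability crosses some threshold, e.g. 0.95, however little data we have.
```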
With an uninformative prior combined with an appropriate decision boundary, this is the same as running a 2-sample test of proportions. So why even bother with a Bayesian method? Clearly, we must use a different Bayesian model that is an improvement over the classical A/B test.
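To see that equivalence for yourself, you can compare the posterior-sampling decision above against a 2-sample proportions z-test on the same made-up counts (statsmodels is used here; with a flat prior the two numbers track each other closely):

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(1)

n_A, y_A = 1000, 100
n_B, y_B = 1000, 112

# Frequentist: one-sided 2-sample test of proportions (H1: rate_B > rate_A)
z_stat, p_value = proportions_ztest([y_B, y_A], [n_B, n_A], alternative="larger")

# Bayesian: flat Beta(1, 1) priors on both rates
draws_A = rng.beta(y_A + 1, n_A - y_A + 1, size=200_000)
draws_B = rng.beta(y_B + 1, n_B - y_B + 1, size=200_000)
prob_B_beats_A = np.mean(draws_B > draws_A)

print(f"Frequentist one-sided p-value:  {p_value:.3f}")
print(f"Bayesian P(theta_B > theta_A): {prob_B_beats_A:.3f}")
# With flat priors and reasonable sample sizes, P(theta_B > theta_A) is roughly
# 1 - p_value, so a 0.95 decision threshold mirrors a one-sided test at the 5% level.
```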