This will be a relatively short post. We are generally used to defining the null hypothesis in an A/B test as there being no difference in performance (most often the conversion rate) between two designs. But from a strictly mathematical point of view, this doesn’t necessarily have to be the case.
We can instead define a non-zero null hypothesis and run an experiment to see if the new design results in a better performance by some non-zero margin. This may be useful, for example, in a situation where a new design must guarantee a certain minimum improvement to be viable.
The more useful scenario is when running sequential tests, which will be discussed in a separate post. Now, on to the math.
The Math
Somewhat informally, statistical power is the probability of the statistic under test falling in the rejection region under the alternative hypothesis. Unless we are dealing with average power across multiple possible alternatives, this boils down to establishing the required sample-size for a fixed-sample statistical test under a given choice of the alternative hypothesis, i.e. a point alternative.
We’ll denote the impact of the change as the random variable \(\Delta\), whose realised value will be denoted as \(\delta\); \(\hat{\Delta}\) is the estimator for \(\Delta\). Under the usual i.i.d. assumptions and random allocation, \(\hat{\Delta}\) is approximately normally distributed with some mean \(\delta\) and sampling variance \(\tau^2\).
The key thing to notice here is that the effect size is no longer defined by the alternative point-hypothesis alone, but by the difference between the null and the alternative point-hypotheses.
It’s easy to see that this reduces to the usual form for \(\delta_0=0\). The tricky part, however, is analysing the impact on sample-size and the potential lift values where the null will be rejected.
The curious case of CRO
We’ll study non-zero null hypothesis testing for conversion rates. For CRO, we usually define an MDE1. With the usual zero null, \(\delta_a = p_1*MDE\), where \(p_1\) represents the historical rate. For convenience, we’ll define the non-zero null in a similar fashion: \(\delta_0 = p_1*TE\), where TE represents the threshold effect under the null hypothesis.
We’ll look at two cases of the alternative hypothesis with a non-zero null:
Inclusive of null: The (scaled) effect-size is the difference between the null and the alternative. Therefore, \(\delta_a - \delta_0 = p_1*(MDE-TE)\).
Over-and-above null: \(\delta_a\) is defined as a difference over and above the null. Therefore, \(\delta_a -\delta_0 = p_1*(MDE)\) in the usual way.
This distinction is necessary due to the fact that the sample-size is a function of the effect-size and the choice of the definition of \(\delta_a\) will have implications for the amount of data that needs to be collected.
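As a quick numerical sketch of the two definitions (the values below are purely illustrative placeholders, not from any real test):

```r
# Scaled effect-size (absolute difference in conversion rate) under the
# two definitions of the point alternative. p1, MDE and TE are placeholders.
p1  <- 0.10   # historical conversion rate for design A
MDE <- 0.05   # minimum detectable effect (relative)
TE  <- 0.02   # threshold effect under the null (relative)

es_inclusive  <- p1 * (MDE - TE)  # inclusive of null: delta_a - delta_0 = 0.003
es_over_above <- p1 * MDE         # over-and-above null: delta_a - delta_0 = 0.005
```

The inclusive definition yields a smaller effect-size for the same MDE, which is what drives the sample-size implications discussed below.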
It’s common practice to estimate a sample-size from a historical rate, so we’ll use \(p_1\) as the historical conversion rate (for design A). For the purpose of sample-size estimation, the conversion rate for design B will be \(p_2 = p_1*(1 + MDE)\). Generally we can ignore either the first or the second term in the RHS of Equation 3, so the calculation of total sample-size simplifies to:
\[
N = \Big[ \frac{\Phi^{-1}(1-\beta)+\Phi^{-1}(1-\alpha/2)}{E.S.}\Big]^2
\tag{4}\]
where E.S. represents the effect size. For conversion rates,

\[
E.S. = \frac{\delta_a - \delta_0}{\sqrt{\frac{p_1 q_1}{\mathcal{v}} + \frac{p_2 q_2}{1-\mathcal{v}}}}
\]

where \(q_i=1-p_i\) and \(\mathcal{v} \in (0,1)\) represents the share of the sample-size allocated to design A. The optimum, of course, is at \(\mathcal{v}=50\%\). Okay, let’s look at the implications.
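As a concrete sketch, Equation 4 combined with the conversion-rate effect size can be wrapped into a small R helper. The function and argument names are my own; `qnorm` supplies the normal quantile \(\Phi^{-1}\), and the illustrative calls use the example values that appear later in the post.

```r
# Total sample-size (both designs combined) per Equation 4, for conversion
# rates. Function/argument names are my own; formulas follow the post.
total_sample_size <- function(p1, mde, te = 0, inclusive = TRUE,
                              alpha = 0.10, beta = 0.20, v = 0.5) {
  p2 <- p1 * (1 + mde)                                   # assumed rate for design B
  sd_unit <- sqrt(p1 * (1 - p1) / v + p2 * (1 - p2) / (1 - v))
  delta <- if (inclusive) p1 * (mde - te) else p1 * mde  # delta_a - delta_0
  es <- delta / sd_unit                                  # effect size E.S.
  ceiling(((qnorm(1 - beta) + qnorm(1 - alpha / 2)) / es)^2)
}

total_sample_size(0.10, 0.05)                                # zero null: 90995
total_sample_size(0.10, 0.05, te = 0.02, inclusive = TRUE)   # inclusive: 252764
total_sample_size(0.10, 0.05, te = 0.02, inclusive = FALSE)  # over-and-above: 90995
```

Note that the ‘over-and-above’ definition gives the same N as the zero null, while the ‘inclusive’ definition inflates it considerably.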
The Implications
Sample-size
It’s easy enough to see that the sample-size increases as the effect-size decreases, and the effect-size does decrease when the point alternative is ‘inclusive of null’ as defined earlier. If the point alternative is ‘over-and-above’ the null, the sample-size remains the same as what we would get for a zero null.
Note that MDE should not equal TE, as that would set the effect-size to zero.
The more crucial implications are about how the null hypothesis gets rejected.
Rejection Region
Rejection region2 is the set of values of the observed statistic \(\hat{\Delta}\) for which the null hypothesis gets rejected. Shifting the null hypothesis shifts the rejection region.
Visualising the implications
library(tidyverse)
library(glue)
#library(furrr)
library(ggthemes)
#library(progressr)
library(extrafont) # You can skip it if you like, it's for text on charts.
library(gridExtra)
library(knitr)
library(rlang)

## Modify this as per your computing setup
#plan(strategy = 'multisession', workers = 16)
Let’s take some example values as follows:
\[
p_1 = 10\% \quad MDE = 5\% \quad TE = 2\% \quad \mathcal{v}=0.5 \quad \alpha=10\% \quad \beta=20\%
\]

We’ll use a 2-tailed test of proportions.
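To make the bounds concrete, here is a sketch of how such critical bounds can be computed: the null is rejected when the observed lift falls outside \(\delta_0 \pm \Phi^{-1}(1-\alpha/2)\cdot\tau\), expressed below as a relative lift over \(p_1\). The function and argument names are my own.

```r
# Critical bounds on the observed relative lift (as % of p1), at the
# fixed-sample size implied by Equation 4. Names are my own choices.
critical_bounds <- function(p1, mde, te = 0, inclusive = TRUE,
                            alpha = 0.10, beta = 0.20, v = 0.5) {
  p2 <- p1 * (1 + mde)
  sd_unit <- sqrt(p1 * (1 - p1) / v + p2 * (1 - p2) / (1 - v))
  delta <- if (inclusive) p1 * (mde - te) else p1 * mde   # delta_a - delta_0
  n <- ceiling(((qnorm(1 - beta) + qnorm(1 - alpha / 2)) / (delta / sd_unit))^2)
  half_width <- qnorm(1 - alpha / 2) * sd_unit / sqrt(n)  # z_{1-alpha/2} * tau
  100 * (p1 * te + c(-1, 1) * half_width) / p1            # relative lift, in %
}

round(critical_bounds(0.10, 0.05), 2)                           # zero null
round(critical_bounds(0.10, 0.05, te = 0.02, inclusive = TRUE), 2)   # inclusive
round(critical_bounds(0.10, 0.05, te = 0.02, inclusive = FALSE), 2)  # over-and-above
```

Rounded to two decimal places, these reproduce the bounds tabulated below.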
Table 2: Critical bounds for observed lift where the null will be rejected

| Scenario                   | Lower Bound | Upper Bound |
|----------------------------|-------------|-------------|
| Zero Null                  | -3.31%      | 3.31%       |
| Inclusive non-zero Null    | 0.02%       | 3.98%       |
| Over-n-above Non-zero Null | -1.31%      | 5.31%       |
When the point alternative hypothesis is defined as ‘over-and-above’ the null hypothesis, we see that the critical bounds shift by the value of the threshold effect (i.e., the non-zero null). So while we retain the original sample-size requirement, the bounds at which the null gets rejected in favour of the new design are harder to breach: we will need to observe a larger positive lift. On the other hand, a smaller lift will allow us to reject the null in favour of the incumbent design much faster, owing to the potential harm to the conversion rate.
Hence, this is quite a subjective call, as it amounts to a form of bias in favour of the incumbent design (usually denoted as design A).
In the case of the inclusive shift, however, there is a sizeable increase in the sample-size requirement. The same caveat of a bias in favour of the incumbent design applies here as well. But owing to the increased sample-size requirement, the critical bounds are tighter. It may therefore seem rather pointless to incorporate an inclusive shift.
Its true utility comes through in sequential testing, which will (hopefully) be discussed in a later post.
…
We briefly looked at the implications of defining a non-zero point null hypothesis, even if the motivating reason to explore these scenarios was simply the question, “why do I never see a non-zero null hypothesis? Is it even feasible?”
Footnotes
MDE: Minimum Detectable Effect. The smallest change in a metric that an A/B test can detect for a given sample-size. Please refer here for a detailed discussion.↩︎