In my work as a data scientist at Splitwise, I often need to estimate the relative improvement (relative lift) of key metrics in experiments. Doing these calculations naturally leads to the question of how to characterize the relative lift that is measured from some sample of users. In other words: how do you think about putting a confidence interval on relative lift? Trying to find an accurate and closed-form expression for such confidence intervals led me down a bit of a rabbit hole that converged on something called the Fieller interval. If you just want the formula, here it is for the common case of equally-sized treatment and control groups (with notation as explained in the post below):
$$ \frac{1}{1- g} \left( \frac{\overline{X}_T}{\overline{X}_C} - 1 + g \pm \frac{q}{\overline{X}_C \sqrt{n}} \sqrt{(1 - g)\text{var}(X_T) + \left( \frac{\overline{X}_T}{\overline{X}_C} \right)^2 \text{var}(X_C)} \right). \tag{1} $$
Introduction
A primary measure of interest in many A/B tests is relative lift, or the relative percent increase (or decrease!) of the treatment cohort over the control cohort, evaluated for one or more key metrics. Taking $\overline{X}_T$ as the mean of the treatment observations and $\overline{X}_C$ as the mean of the control observations, the relative lift is defined as
$$ \text{relative lift} = \frac{\overline{X}_T - \overline{X}_C}{\overline{X}_C} = \frac{\overline{X}_T}{\overline{X}_C} - 1. $$
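In code, this is just a ratio of sample means minus one. Here is a trivial sketch with made-up numbers:

```python
import numpy as np

# Hypothetical per-user metric values for the two cohorts.
x_t = np.array([1.2, 0.0, 3.4, 2.1, 0.5])  # treatment
x_c = np.array([1.0, 0.3, 2.8, 1.9, 0.2])  # control

lift = x_t.mean() / x_c.mean() - 1
print(f"relative lift: {lift:+.1%}")
```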
The question of putting confidence intervals on relative lift turns out to be somewhat complicated. If the true value of the denominator is close to zero, the measured value of the relative lift can fluctuate quite a bit over different sample realizations. One approach to such confidence intervals was developed by Fieller in the mid-twentieth century, in the context of the more general problem of calculating confidence intervals for the ratio of two (normally distributed) random variables. However, the translation from the ratio of two random variables to “the ratio minus one” as required for the relative lift can take a bit of algebra, and sometimes mistakes are made.
Some problematic implementations
One implementation of Fieller intervals is described in the 2018 paper “Applying the Delta Method in Metric Analytics” by Deng et al. (https://arxiv.org/abs/1803.06336). To be fair, here the authors are presenting the Fieller method as a foil for their preferred method, the delta method, which is the main topic of their paper. However, this paper does get referenced as providing an implementation of the Fieller method, so it is an instructive case.
Deng et al. first define a quantity $g$ as
$$ g = \frac{n s_x^2 t_{\alpha/2}^2(r)}{\overline{X}^2}. $$
Here $n$ is the number of observations (assumed to be the same for both the treatment and control groups), $s_x^2$ is the sample variance of the control observations, $\overline{X}$ is the mean of the control observations, and $t_{\alpha/2}(r)$ is the $(1 - \alpha/2)$ quantile of the $t$-distribution with $r$ degrees of freedom. (Deng et al. do not specify the value used for $r$: I’ll provide a formula with this specified later on.) The Fieller interval for the relative lift is then given by Deng et al. as follows:
$$ \frac{1}{1-g} \left( \frac{\overline{Y}}{\overline{X}} - 1 - \frac{g s_{xy}}{s_x^2} \pm \frac{t_{\alpha/2}(r)}{\sqrt{n}\overline{X}} \sqrt{s_y^2 - 2 \frac{\overline{Y}}{\overline{X}} s_{xy} + \frac{\overline{Y}^2}{\overline{X}^2} s_x^2 - g \left( s_y^2 - \frac{s_{xy}^2}{s_x^2} \right)}\right). $$
Staring at this formula and the prior expression for $g$ for a bit, and thinking about various limits, reveals a problem. In the limit of large sample sizes ($n \rightarrow \infty$), $g$ blows up. This drives everything to zero for large $n$, including the central value (aside from a term that is independent of the actual ratio), which can’t be right. Based on other references that I’ll discuss later, it is clear that Deng et al. have a typo here, and $g$ should actually be defined as follows, with $n$ in the denominator instead of the numerator:
$$ g = \frac{s_x^2 t_{\alpha/2}^2(r)}{n \overline{X}^2}. $$
Then $g \rightarrow 0$ in the large $n$ limit, so the overall factor of $1/(1 - g)$ goes to 1 as $n \rightarrow \infty$, which makes more sense.
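For concreteness, the corrected $g$ is straightforward to compute. Here is a minimal sketch; I’m taking $r = n - 1$ degrees of freedom, which is an assumption here (Deng et al. leave $r$ unspecified) but matches the formulation I use later in the post:

```python
import numpy as np
from scipy import stats

def g_factor(x_control, alpha=0.05):
    """Corrected g: q^2 * s_x^2 / (n * xbar^2), with n in the denominator.

    Uses r = n - 1 degrees of freedom for the t quantile (an assumption;
    the Deng et al. paper leaves r unspecified).
    """
    n = len(x_control)
    q = stats.t.ppf(1 - alpha / 2, df=n - 1)  # (1 - alpha/2) quantile
    return q**2 * np.var(x_control, ddof=1) / (n * np.mean(x_control) ** 2)
```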
There is a remaining, more subtle problem with the Deng et al. formula. In a footnote on page 2 of their paper, Deng et al. say the following: “For A/B testing where the treatment and control groups are independently sampled from a super population, $\sigma_{xy} = 0$.” In this case, we can set the sample covariance $s_{xy}$ to zero, and arrive at the following simplification for the Fieller confidence interval on the relative lift:
$$ \frac{1}{1-g} \left( \frac{\overline{Y}}{\overline{X}} - 1 \pm \frac{t_{\alpha/2}(r)}{\sqrt{n}\overline{X}} \sqrt{(1 - g) s_y^2 + \frac{\overline{Y}^2}{\overline{X}^2} s_x^2 }\right). \tag{2} $$
This formula looks plausible. The large-$n$ scaling makes sense, and the width of the confidence interval converges as $1/\sqrt{n}$, just like “normal” confidence intervals. And in fact, this formula will work for much of the common parameter space encountered in experiments “in the wild.” Deng et al. provide sampling results showing that the nominal coverage is achieved for various scenarios in Table 1 of their paper. However, there are places where something odd seems to be going on. For example, take $X_i \sim \text{Bern}(p = 0.06)$ and $Y_i \sim \text{Bern}(p = 0.065)$, which is an actual relative lift of 8.3%. Naively applying the formula and sampling 100K times with sample sizes of 150 observations and $\alpha = 0.05$ shows a coverage of 91.4% as compared to the desired coverage of 95%. I’ll show what the formula provided by Deng et al. is missing here in the next section of this post.
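Here is a sketch of the kind of sampling check behind that number (my own code; it applies equation 2 naively, ignoring the missing piece and the unbounded cases discussed below):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, alpha, n_sims = 150, 0.05, 100_000
true_lift = 0.065 / 0.06 - 1  # about 8.3%
q = stats.t.ppf(1 - alpha / 2, df=n - 1)  # assuming r = n - 1

covered = 0
for _ in range(n_sims):
    x = rng.binomial(1, 0.06, size=n).astype(float)   # control
    y = rng.binomial(1, 0.065, size=n).astype(float)  # treatment
    g = q**2 * x.var(ddof=1) / (n * x.mean() ** 2)
    ratio = y.mean() / x.mean()
    half = (q / (np.sqrt(n) * x.mean())) * np.sqrt(
        (1 - g) * y.var(ddof=1) + ratio**2 * x.var(ddof=1)
    )
    lo = (ratio - 1 - half) / (1 - g)
    hi = (ratio - 1 + half) / (1 - g)
    # (Rare all-zero control samples produce NaNs and count as misses.)
    covered += lo <= true_lift <= hi

print(f"coverage: {covered / n_sims:.1%}")  # noticeably below the nominal 95%
```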
Another incorrect implementation of Fieller intervals that I came across is from the company Statsig (they have since corrected it after some friendly discussion). The formula in their documentation had a couple of problems. First, the scaling with $n$ could not be correct: assuming equally-sized control and treatment samples for simplicity, the width of the interval decreased as $n^{-3/2}$, much faster than the expected $n^{-1/2}$ scaling for confidence intervals. Second, the formula did not square the ratio of the treatment and control means in the second term under the square root. Some sampling quickly verifies that these issues do indeed cause problems (the issue with $n$ causes immediate problems, while the issue with the square of the ratio of means is more subtle). The formula was also missing the same piece as the Deng et al. formula, which I will now describe.
Deep dive
To isolate the subtle piece that is missing from both of the above formulas (and to more clearly demonstrate the problems this can cause via sampling methods) requires one to go back to the derivation of the Fieller interval formula. I won’t derive the formula entirely from scratch here. Instead I’ll start in medias res with a 2007 paper by Luxburg and Franz titled “A Geometric Approach to Confidence Sets for Ratios” (https://arxiv.org/abs/0711.0198). This paper presents a formulation of the Fieller interval for the ratio of two normally distributed random variables as follows:
$$ l_{1, 2} = \frac{1}{\hat{\mu}_1^2 - q^2 \hat{c}_{11}} \left( ( \hat{\mu}_1 \hat{\mu}_2 - q^2 \hat{c}_{12}) \pm \sqrt{ ( \hat{\mu}_1 \hat{\mu}_2 - q^2 \hat{c}_{12})^2 - (\hat{\mu}_1^2 - q^2 \hat{c}_{11})(\hat{\mu}_2^2 - q^2 \hat{c}_{22})} \right). \tag{3} $$
The confidence set for the ratio $\rho = \mu_2/\mu_1$ is then defined as:
$$
R_{\text{Fieller}} = \begin{cases}
(-\infty, \infty) & \text{if } q^2_{\text{complete}} \leq q^2 \newline
(-\infty, \text{min}(l_1, l_2)] \cup [\text{max}(l_1, l_2), \infty) & \text{if } q^2_{\text{exclusive}} < q^2 < q^2_{\text{complete}} \newline
[\text{min}(l_1, l_2), \text{max}(l_1, l_2)] & \text{otherwise}
\end{cases} \tag{4}
$$
Note that there are three possible cases for these intervals: completely unbounded (the confidence interval is the entire real line), exclusive unbounded (the confidence interval is everything outside of a specified interval), and bounded (“normal” confidence intervals). Luxburg and Franz note that “different researchers…have shown that any method which is not able to generate such unbounded confidence limits for a ratio leads to arbitrary [sic] large deviations from the intended confidence level.”
Some notation definitions are in order before we can translate this result to the specific case we are interested in of relative lift. First, the labeled $q$ variables are defined as follows:
$$ \begin{aligned} q^2_{\text{exclusive}} &= \frac{\hat{\mu}_1^2}{\hat{c}_{11}} \newline q^2_{\text{complete}} &= \frac{\hat{\mu}_2^2 \hat{c}_{11} - 2 \hat{\mu}_1 \hat{\mu}_2 \hat{c}_{12} + \hat{\mu}_1^2 \hat{c}_{22}}{\hat{c}_{11}\hat{c}_{22} - \hat{c}_{12}^2}. \end{aligned} $$
Next, the unlabeled $q$ variable is the $(1 - \alpha/2)$ quantile of the Student-$t$ distribution with $(n -1)$ degrees of freedom, where $n$ is the number of (paired) observations. Also, $\hat{\mu}_{1}$ is the sample mean of the denominator of the ratio, and $\hat{\mu}_2$ is the sample mean of the numerator. Finally, $\hat{c}_{11}$ is the sample variance of the denominator, divided by $n$, and $\hat{c}_{22}$ is defined similarly, so
$$ \hat{c}_{11} = \frac{\text{var}(X_1)}{n} \text{ and } \hat{c}_{22} = \frac{\text{var}(X_2)}{n}, $$
and $\hat{c}_{12}$ is the sample covariance of the numerator with the denominator, divided by $n$, so
$$ \hat{c}_{12} = \frac{\text{cov}(X_1, X_2)}{n}. $$
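Before doing the translation, it may help to see equations 3 and 4 written out in code. Here is a minimal sketch (my own translation of the Luxburg and Franz result; the function and variable names are mine):

```python
import numpy as np

def fieller_ratio_set(mu1, mu2, c11, c22, c12, q):
    """Fieller confidence set for mu2/mu1, following equations 3 and 4.

    Returns (kind, (lo, hi)); for "exclusive" the set is everything
    OUTSIDE (lo, hi), and for "complete" it is the whole real line.
    """
    q2 = q**2
    q2_exclusive = mu1**2 / c11
    q2_complete = (mu2**2 * c11 - 2 * mu1 * mu2 * c12 + mu1**2 * c22) / (
        c11 * c22 - c12**2
    )
    if q2_complete <= q2:
        return "complete", (-np.inf, np.inf)
    # Roots l_1 and l_2 from equation 3 (real whenever q^2 < q^2_complete).
    a = mu1**2 - q2 * c11
    b = mu1 * mu2 - q2 * c12
    c = mu2**2 - q2 * c22
    root = np.sqrt(b**2 - a * c)
    l1, l2 = (b - root) / a, (b + root) / a
    kind = "exclusive" if q2_exclusive < q2 else "bounded"
    return kind, (min(l1, l2), max(l1, l2))
```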
To translate equation 3 to the case of relative lift, note that in this case the denominator of the ratio is $X_1 = X_C$ (the control), and the numerator is $X_2 = X_T - X_C$. The sample means then become (in our earlier notation)
$$ \begin{aligned} \hat{\mu}_1 &= \overline{X}_C \newline \hat{\mu}_2 &= \overline{X}_T - \overline{X}_C, \end{aligned} $$
and then
$$ \begin{aligned} \hat{c}_{11} &= \frac{\text{var}(X_C)}{n} \newline \hat{c}_{22} &= \frac{\text{var}(X_C) + \text{var}(X_T)}{n} \newline \hat{c}_{12} &= \frac{\text{cov}(X_C, X_T - X_C)}{n} = -\frac{\text{var}(X_C)}{n} \end{aligned} $$
This last equality is key to the translation, and is quite easy to miss when working through the algebra. Even though the treatment and control are indeed independent in the case of a standard A/B test (so that $\text{cov}(X_C, X_T) = 0$), the numerator and denominator of the relative lift are not independent!
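This is easy to confirm numerically, even with independently generated cohorts (a quick sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
x_c = rng.normal(1.0, 0.2, size=1_000_000)
x_t = rng.normal(1.1, 0.2, size=1_000_000)  # independent of x_c

# cov(X_C, X_T - X_C) = cov(X_C, X_T) - var(X_C), and cov(X_C, X_T) ~ 0
print(np.cov(x_c, x_t - x_c)[0, 1])  # ~ -0.04
print(-np.var(x_c, ddof=1))          # ~ -0.04 = -var(X_C)
```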
Making these replacements in equation 3 and doing some algebra (there are some very pleasing cancellations!) gives the following result for the case of relative lift:
$$ \frac{1}{1- g} \left( \frac{\overline{X}_T}{\overline{X}_C} - 1 + g \pm \frac{q}{\overline{X}_C \sqrt{n}} \sqrt{(1 - g)\text{var}(X_T) + \left( \frac{\overline{X}_T}{\overline{X}_C} \right)^2 \text{var}(X_C)} \right). \tag{1} $$
This is the equation shown at the top of this post. Here the variances are sample variances, and $g$ is defined as in the Deng et al. paper (after correcting for their typo). In the current notation:
$$ g = \frac{q^2 \text{var}(X_C)}{n \overline{X}_C^2}. $$
Note that the key difference between the above expression for the interval and equation 2 (the simplified form of the Deng et al. formula) is the addition of a $+g$ term to the central point of the interval. Both this term and consideration of the completely unbounded and exclusive unbounded cases outlined above are necessary for achieving the nominal coverage in certain situations.
Before we can implement this result, we also need to translate the boundaries given in equation 4 that determine which of the three different interval types (bounded, completely unbounded, or exclusive unbounded) should be used. Translating the labeled $q$ variables to the relative lift case gives
$$ \begin{aligned} q^2_{\text{exclusive}} &= \frac{n \overline{X}_C^2}{\text{var}(X_C)} \newline q^2_{\text{complete}} &= \frac{n \overline{X}_C^2}{\text{var}(X_C)} + \frac{n \overline{X}_T^2}{\text{var}(X_T)} \end{aligned} $$
The condition for a completely unbounded interval then becomes
$$ \frac{n \overline{X}_C^2}{\text{var}(X_C) q^2} + \frac{n \overline{X}_T^2}{\text{var}(X_T) q^2} \leq 1. $$
This can be expressed quite simply if we define a new variable $g_T$ as follows:
$$ g_T = \frac{q^2 \text{var}(X_T)}{n \overline{X}_T^2}. $$
Then the condition for a completely unbounded interval is
$$ \frac{1}{g} + \frac{1}{g_T} \leq 1. $$
The condition for an exclusive unbounded interval can similarly be expressed quite simply:
$$ \frac{1}{g} + \frac{1}{g_T} > 1 \text{ and } g > 1. $$
All other cases ($g \leq 1$) are bounded intervals. There is a heuristic way of thinking about these boundaries. In a sense $g$ is measuring how close the denominator of the relative lift is to zero. If $g$ is small ($g < 1$), the denominator is safely away from zero, and we can give a “nice” confidence interval. If $g$ is large, we still might be able to say something as long as $g_T$ is sufficiently small, which means that $\overline{X}_T$ is somewhat away from zero. In this case we don’t know the sign, but we do know that the magnitude of the whole expression is large. If both $g$ and $g_T$ are large enough, then both $\overline{X}_T$ and $\overline{X}_C$ are too close to zero for any conclusions to be drawn, so the confidence interval is the entire real line.
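Putting all of this together, here is a minimal sketch of the full procedure in Python (the function name and structure are my own, assuming equally-sized independent samples and $n - 1$ degrees of freedom for the $t$ quantile):

```python
import numpy as np
from scipy import stats

def fieller_relative_lift(x_t, x_c, alpha=0.05):
    """Fieller confidence set for relative lift: equation 1 plus the
    bounded / exclusive unbounded / completely unbounded conditions."""
    n = len(x_c)
    q = stats.t.ppf(1 - alpha / 2, df=n - 1)
    g = q**2 * np.var(x_c, ddof=1) / (n * np.mean(x_c) ** 2)
    g_t = q**2 * np.var(x_t, ddof=1) / (n * np.mean(x_t) ** 2)
    if 1 / g + 1 / g_t <= 1:
        return "complete", (-np.inf, np.inf)  # whole real line
    ratio = np.mean(x_t) / np.mean(x_c)
    half = (q / (np.mean(x_c) * np.sqrt(n))) * np.sqrt(
        (1 - g) * np.var(x_t, ddof=1) + ratio**2 * np.var(x_c, ddof=1)
    )
    lo = (ratio - 1 + g - half) / (1 - g)
    hi = (ratio - 1 + g + half) / (1 - g)
    lo, hi = min(lo, hi), max(lo, hi)  # dividing by 1 - g < 0 flips the order
    # For the exclusive case the set is everything OUTSIDE (lo, hi).
    kind = "exclusive" if g > 1 else "bounded"
    return kind, (lo, hi)
```

Note that once the completely unbounded case has been ruled out, the quantity under the square root works out to be proportional to $1/g + 1/g_T - 1 > 0$, so the limits here are always real.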
Sampling results
Running 100K iterations of the situation outlined above that seemed to indicate a problem with the Deng et al. formula ($X_i \sim \text{Bern}(p = 0.06)$, $Y_i \sim \text{Bern}(p = 0.065)$, 150 observations from each) gives exactly 95% coverage. Most of the samples result in bounded intervals: 98.1% are bounded, 1.87% are exclusive unbounded, and 0.007% are completely unbounded. The coverage for just the bounded cases is slightly high (95.8%), while the coverage for the exclusive unbounded cases is quite low (53.8%). This is fine, though: what matters is the coverage of the entire procedure, not the coverage conditional on landing in one of the cases (if you only select sample realizations that fall into the completely unbounded case, the coverage will always be 100% instead of the nominal 95%!).
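The per-case accounting can be reproduced with a loop along these lines (a sketch reusing the `fieller_relative_lift` function above; coverage for the exclusive case means the true lift falls outside the excluded interval):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(3)
true_lift = 0.065 / 0.06 - 1
kinds, covered = Counter(), Counter()
for _ in range(100_000):
    x_c = rng.binomial(1, 0.06, size=150).astype(float)
    x_t = rng.binomial(1, 0.065, size=150).astype(float)
    kind, (lo, hi) = fieller_relative_lift(x_t, x_c)
    kinds[kind] += 1
    if kind == "complete":
        covered[kind] += 1  # the whole real line always covers
    elif kind == "exclusive":
        covered[kind] += not (lo < true_lift < hi)
    else:
        covered[kind] += lo <= true_lift <= hi

print(dict(kinds))
print(f"overall coverage: {sum(covered.values()) / sum(kinds.values()):.1%}")
```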
Below is a plot showing sampling results for several different cases. This uses $X_C \sim \text{Bern}(p)$ and $X_T \sim \text{Bern}(p(1 + 0.1))$ so that there is always a true relative lift of 10%. The success probability $p$ ranges from 5% to 95% in steps of 5%. The methods used are labeled as follows:
- `correct_fieller`: the correctly derived formula given above
- `missing_g`: the correct formula but missing the factor of $+g$ in the central point
- `no_square`: the correct formula including the $+g$ factor, but not squaring $\overline{X}_T/\overline{X}_C$ in the square root
For each sampling scenario, 100 observations are sampled from both control and treatment (so $n = 100$), 10K sampling iterations are taken, and $\alpha = 0.05$, so that the nominal coverage is 95%. In the case of the `no_square` method, it can happen that the number under the square root is negative as a result of not squaring the ratio; I counted these cases as failing coverage.

Note that the incorrect methods (`missing_g` and `no_square`) fail the most at low values of the true success probability $p$. In particular, `missing_g` fails here because $g$ scales as $(1 - p)/p$ for the Bernoulli distribution, and thus becomes larger as $p \rightarrow 0$. Also note that the `correct_fieller` method looks to be slightly high at the lowest value of $p$, and slightly low at the highest value. This is likely due to deviations from normality becoming more pronounced at the extremes of the binomial distribution, which can be verified by decreasing the sample size so that the distributions of the sample means are further from normality. Going down to 20 observations gives:

Here the correct method is overconfident for low values of $p$, and underperforms for high values of $p$. The `missing_g` method has interesting behavior: its worst performance is no longer at the lowest values of $p$, as it was before. This is because we have now moved into a region where the intervals are mostly completely unbounded, so it doesn't matter what the expression is! Here is the distribution of the various interval types for the different values of $p$:
For samples drawn from normal distributions, the behavior is very nice even at such low sample sizes. Here are sampling results for $X_C \sim N(\mu, 0.2^2)$ and $X_T \sim N(\mu (1 + 0.1), 0.2^2)$, taking 20 observations from each distribution:
Note that the region tested in the Deng et al. paper is on the right (they tested normal distributions with means of 1 and 1.1), where the missing $+g$ term causes no anomalous effects. Here is the breakdown of the different interval types for the above normal samples:
Another approach
After discussing this problem with my Splitwise colleague and fellow data scientist Andy Biekert, Andy suggested an alternate approach, which is to calculate the confidence interval for the ratio of treatment and control means, and then just subtract one from the result (in other words, once you have the upper limit and lower limit for the raw ratio, subtract one from the upper limit to get the upper limit for the relative lift, and do the same with the lower limit). This is simpler than the approach I used above, and I was initially skeptical that it would work. However, the algebra does indeed work out, and this gives the same result as in equation 1.
In a bit more detail, use the result from Luxburg and Franz (equation 3) to directly calculate the confidence intervals for the ratio $\overline{X}_T/\overline{X}_C$. This requires going through and redefining all of the variables I used above (for example now $\hat{c}_{12} = 0$ for the A/B testing case we are interested in). The result is the same $\pm$ terms shown above, but now with a central value of
$$ \frac{1}{1 - g} \frac{\overline{X}_T}{\overline{X}_C}. $$
Subtracting 1 from the limits defined in this way to get intervals for the relative lift will give the same intervals as in equation (1), since
$$ \frac{1}{1 - g} \frac{\overline{X}_T}{\overline{X}_C} - 1 = \frac{1}{1 - g} \left( \frac{\overline{X}_T}{\overline{X}_C} - 1 + g \right). $$
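As a numeric sanity check (reusing the imports and the `fieller_relative_lift` sketch from above), the two routes land on the same limits:

```python
rng = np.random.default_rng(2)
x_c = rng.normal(1.0, 0.2, size=20)
x_t = rng.normal(1.1, 0.2, size=20)

n = len(x_c)
q = stats.t.ppf(0.975, df=n - 1)
g = q**2 * np.var(x_c, ddof=1) / (n * np.mean(x_c) ** 2)
ratio = np.mean(x_t) / np.mean(x_c)
half = (q / (np.mean(x_c) * np.sqrt(n))) * np.sqrt(
    (1 - g) * np.var(x_t, ddof=1) + ratio**2 * np.var(x_c, ddof=1)
)
# Interval for the raw ratio, then subtract one from each limit:
lo_alt, hi_alt = (ratio - half) / (1 - g) - 1, (ratio + half) / (1 - g) - 1
# Direct relative lift interval (equation 1):
_, (lo, hi) = fieller_relative_lift(x_t, x_c)
assert np.isclose(lo, lo_alt) and np.isclose(hi, hi_alt)
```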
The boundaries in equation 4 also work out to be the same as in the direct relative lift calculation. Again, thanks to Andy for pointing this out (and if you’re interested in working with us at Splitwise, check out our jobs page!).