Monday, March 6, 2017

Hypothesis testing: Using the t-test to evaluate an A/B test

A/B testing is a common tool to improve customer engagement with a website. The objective is often to increase the sign-up or download rate.

Let's assume we have a website and we want to test it against a modification $A$ of that website (for example, we changed the download button). How do we determine whether the new layout does better than the old one?

Well, we divide the website traffic into two groups, each directed to one of the layouts. We then check whether the new layout increases our measure of interest (the download rate).

Before starting the A/B test we have to clearly specify:

(1) What is the null hypothesis?
(2) What is the minimum/expected improvement we are looking for?
(3) What is the level of significance we are content with?

Let's start with (1): what is the null hypothesis that we want to disprove? Usually the null hypothesis is that the new layout does not do any better than the original layout, which can be written as
\[
H_0\colon \mu_A - \mu_0 \leq 0,
\]where $\mu_A$ and $\mu_0$ are the mean download rates of the new and the original layout. The alternative hypothesis is that $\mu_A > \mu_0$, meaning that there is an actual improvement. However, we can't just look for any improvement; we need to look for a clearly specified goal. That is the point of (2).

Let's assume we are looking for an improvement of at least $10\%$ in the download rate. This is a clear goal and we can write the alternative hypothesis as $\mu_A - \mu_0 > 0.1\mu_0$.

Finally (3), we have to specify a level of confidence that is enough for us to make a decision. A common choice is a $95\%$ confidence level, meaning that if we claim at the end that the null hypothesis has been ruled out, there is a $5\%$ chance that we are wrong and just observed a random fluctuation in the data (a type 1 error). This might be too risky for some applications, e.g. clinical trials. In that case one can require a higher confidence level like $99.9\%$, which reduces the chance of a type 1 error to $0.1\%$. Note that statistics will never give you $100\%$ certainty, but if you are willing to collect more and more data, you can approach $100\%$ as closely as you want.
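As a side note, the relation between the confidence level, the allowed type 1 error rate and the corresponding threshold of a standard Normal distribution can be looked up directly with scipy. This is just an illustration; we will come back to these thresholds below:

from scipy.stats import norm

# For each confidence level, the type 1 error rate is 1 - confidence, and
# norm.ppf gives the corresponding one-sided threshold of a standard Normal
for confidence in [0.95, 0.999]:
    alpha = 1. - confidence
    threshold = norm.ppf(confidence)
    print("confidence = %.3f -> type 1 error rate = %.3f, threshold = %.2f"
          % (confidence, alpha, threshold))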

After specifying the setup of the A/B test we can now calculate the number of users required to rule out or confirm the null hypothesis. This is usually required in an industry environment, since running an A/B test always comes at a cost.

Before we calculate the number of users we have to specify how we want to evaluate the A/B test. There are at least two options here, a Z-test or a t-test, so what is the difference between the two?

Difference between Z-test and t-test


The Z-test and the t-test are both used to test the hypothesis that a population mean equals a particular value or that two population means are equal. In both cases we only approximate the confidence level, since we calculate it under certain assumptions about the probability distribution of the test statistic. The Z-test assumes that the test statistic under the null hypothesis follows a Normal distribution, while the t-test assumes that it follows a Student-t distribution. Note that in both cases we assume that the null hypothesis is true.

Both the Z-test and the t-test rely on other assumptions, which are often broken in real data, so it is important to be aware of these assumptions (and test them):

(1) The samples we look at are random selections from the population
(2) The samples are independent
(3) The sampling distribution of the mean is approximately Normal

The t-test accounts for additional variations because of small number statistics by using the degrees of freedom (dof) in the probability distribution. For a large number of degrees of freedom the t-test becomes very similar to the Z-test, since the Student-t distribution approximates a Normal distribution. So in general one should use the t-test if the sample is small, while the Z-test can be used if the sample is large.
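This convergence is easy to see by comparing one-sided critical values; a small sketch (the choice of degrees of freedom below is arbitrary):

from scipy.stats import norm, t as student_t

# One-sided 95% critical values: the Student-t value approaches the Normal one
# as the number of degrees of freedom grows
print("Normal: %.3f" % norm.ppf(0.95))
for dof in [2, 5, 10, 30, 100]:
    print("Student-t (dof = %3d): %.3f" % (dof, student_t.ppf(0.95, dof)))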

Here is the definition for the Z-test:
\[
Z = \frac{x - \mu_p}{\frac{\sigma_p}{\sqrt{N}}},
\]where $x$ is the sample mean, $\sigma_p$ is the population standard deviation, $\mu_p$ is the population mean and $N$ is the sample size. The Z-value calculated with this equation has to be compared to a Normal distribution to get the p-value.
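As a minimal, made-up example of how this is used in practice (all numbers below are invented for illustration):

import numpy as np
from scipy.stats import norm

# Made-up example: the population mean and standard deviation are assumed known,
# and we observe a sample mean from N users
mu_p, sigma_p = 0.5, 0.1
N = 100
x = 0.52                                     # observed sample mean
Z = (x - mu_p)/(sigma_p/np.sqrt(N))          # Z value from the equation above
p_value = norm.sf(Z)                         # one-sided p-value
print("Z = %.2f, one-sided p-value = %.3f" % (Z, p_value))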

The t-test is defined as
\[
t = \frac{x - \mu_p}{\frac{\sigma_s}{\sqrt{N}}},
\]where $x$ is again the sample mean, $\mu_p$ is the population mean under the null hypothesis and $\sigma_s$ is the standard deviation of the sample. The t-test has to deal with the additional uncertainty coming from the fact that the standard deviation $\sigma_s$ has been estimated from the sample itself. It does this by using the Student-t distribution instead of the Normal distribution, which is generally broader than a Normal distribution, but approximates the Normal distribution for large sample sizes $N$.
Figure 1: Several Student-t distributions with different degrees of freedom (black lines) compared to a normal distribution (blue line).
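To make the definition concrete, here is a minimal example with made-up numbers; scipy's ttest_1samp computes the same statistic:

import numpy as np
from scipy.stats import ttest_1samp

# Made-up sample of 8 measurements, tested against a hypothesized population mean of 2.0
sample = np.array([2.1, 1.9, 2.4, 2.2, 2.0, 2.3, 1.8, 2.5])
mu_p = 2.0

# t value from the equation above, using the sample standard deviation
t_manual = (sample.mean() - mu_p)/(sample.std(ddof=1)/np.sqrt(len(sample)))

# scipy computes the same statistic (and a two-sided p-value)
t_scipy, p_two_sided = ttest_1samp(sample, mu_p)
print("t (manual) = %.3f, t (scipy) = %.3f" % (t_manual, t_scipy))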

So the claim here is that if we draw many samples $X_i$ from the population, compute the t statistic for each of them and plot the distribution of these t values, they will follow a Student-t distribution if the null hypothesis is true.
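We can check this claim with a quick simulation; a minimal sketch, where the population mean, the sample size and the number of repetitions are arbitrary choices:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import t as student_t

# Draw many small samples from a population for which the null hypothesis is true
# (a Normal population with mean mu_true) and compute the t value of each sample
np.random.seed(42)
mu_true, N, n_samples = 0.5, 10, 20000

t_values = []
for _ in range(n_samples):
    sample = np.random.normal(mu_true, 1.0, size=N)
    t_values.append((sample.mean() - mu_true)/(sample.std(ddof=1)/np.sqrt(N)))

# The histogram of t values should follow a Student-t distribution with N - 1 dof
x = np.linspace(-5, 5, 500)
plt.hist(t_values, bins=100, range=(-5, 5), density=True, alpha=0.5, label="simulated t values")
plt.plot(x, student_t.pdf(x, N - 1), color="black", label="Student-t, dof = N - 1")
plt.xlabel('$t$')
plt.legend()
plt.show()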

The number of degrees of freedom is given by the number of data points minus the number of parameters estimated from the data. For a one-sample t-test we estimate one parameter (the mean) from the data, which leaves $N-1$ degrees of freedom; for a two-sample test with a pooled variance, two means are estimated and we get $N_A + N_0 - 2$. However, you should check in your particular case whether this is still true. We will talk more about the exact treatment later.


Example


Here we will look at the download rate of a website, meaning the ratio of users who download our product, relative to the total number of users who come to the site. We have two layouts for our website, the new layout $A$ and the old (control) layout. First we specify the setup of our test:

(1) The null hypothesis is that the new layout is not any better than the control: $H_0\colon \mu_A - \mu_0 \leq 0$
(2) We are looking for a $10\%$ improvement in the download rate
(3) We want to see a confidence level of $95\%$

The next question is: do we use a Z-test or a t-test? Well, first of all, do we know the standard deviation of the population? Not really... we know that the downloads follow a binomial distribution, with a probability $p$ that a user clicks on the download button and a probability $(1 - p)$ that they do not. This means the standard deviation per user is
\[
\sigma_s = \sqrt{p(1-p)}.
\]And usually $p$ is estimated from the sample itself. So the t-test seems more appropriate, since we do not have much information on the total population.

Before we move on we should note that $\sigma_s$ for this test is not really given by the equation above. In our case, both the sample testing the new layout $A$ and the control sample have an associated error, and hence the standard error of the difference is given by
\[
\sigma_{\mu_A - \mu_0} = \sqrt{\frac{(N_A - 1)\sigma_{s,A}^2 + (N_0 - 1)\sigma_{s,0}^2}{N_A + N_0 - 2}}\sqrt{\frac{1}{N_A} + \frac{1}{N_0}} \approx \sqrt{\frac{\sigma^2_{s,A}}{N_A} + \frac{\sigma^2_{s,0}}{N_0}},
\]where the approximation on the right is justified for large samples (it is exact for $N_A = N_0$).
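As a quick sanity check, here is a small sketch with made-up group sizes and download rates that evaluates both sides of this equation numerically:

import numpy as np

# Made-up group sizes and download rates, just to compare the two expressions
N_A, N_0 = 1000, 1200
p_A, p_0 = 0.022, 0.020
var_A, var_0 = p_A*(1. - p_A), p_0*(1. - p_0)

# Pooled standard error (middle expression above)
pooled = np.sqrt(((N_A - 1)*var_A + (N_0 - 1)*var_0)/(N_A + N_0 - 2))
se_pooled = pooled*np.sqrt(1./N_A + 1./N_0)

# Unpooled approximation (right-hand side above)
se_approx = np.sqrt(var_A/N_A + var_0/N_0)

print("pooled SE = %.6f, approximate SE = %.6f" % (se_pooled, se_approx))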

Using what we have established so far, we can formulate our t-test as
\[
t = \frac{(\mu_A - \mu_0)}{\sqrt{\frac{\sigma^2_{s,A}}{N_A} + \frac{\sigma^2_{s,0}}{N_0}}}.
\]Here we want to test a specific case: we want to know whether layout $A$ provides a $10\%$ improvement over the control sample. So we want to know how large the samples have to be before the statistics are good enough to detect such an improvement.

We can re-write the t-test to answer that question
\[
t = \frac{(0.1\mu_0)}{\sqrt{2\sigma_s^2/N}},
\]where we assumed $N_A = N_0$ (because any other choice is sub-optimal) and $\sigma^2_{s,A} = \sigma^2_{s,0}=\sigma^2_{s}$ for simplicity. Now we solve this equation for the sample size
\[
N = \frac{2t^2\sigma_s^2}{(0.1\mu_0)^2}.
\]Before we can calculate $N$ we have to know the value of $t$ corresponding to our $95\%$ (one-sided) confidence level. To get this from the Student-$t$ distribution we need the number of degrees of freedom, which can be calculated as
\[
dof = \frac{\left(\frac{\sigma_{s,A}^2}{N_A}+ \frac{\sigma_{s,0}^2}{N_0}\right)^2}{ \left[ \frac{\left(\frac{\sigma_{s,A}^2}{N_A}\right)^2}{(N_A - 1)} \right] + \left[ \frac{\left(\frac{\sigma_{s,0}^2}{N_0}\right)^2}{(N_0 - 1)} \right] }.
\]If we assume $N_A = N_0$ and $\sigma_{s,A} = \sigma_{s,0}$, this simplifies to
\[
dof = 2(N - 1).
\]This means there is a problem here. To calculate $N$ we need to know $t$ and to calculate $t$ we need to know the degrees of freedom, which depend on $N$.

However, we can just assume that we are going to deal with a sample $N > 30$, which means the Normal distribution is a good approximation to the Student-$t$ distribution. For the normal distribution we do not need the degrees of freedom, hence
print("The 90\% confidence interval is = (%0.2f %0.2f)" % (norm.interval(0.9, loc=0, scale=1))
>> The 90% confidence interval is = (-1.64 1.64)

We calculated the $90\%$ confidence interval instead of the $95\%$ one, since we are interested in a one-tailed test. We do not care about the probability that the new layout $A$ is worse than the control; we just want to know when we can say with $95\%$ confidence that layout $A$ is doing better than the control. Since the distribution is symmetric, $5\%$ of the probability lies above $1.64$ and $5\%$ below $-1.64$, so we ignore the lower limit and set $t = 1.64$.

Putting this information into our equation for the number of users, we get
\[
N \geq \frac{5.3792 \sigma_s^2}{(0.1\mu_0)^2}.
\]We still need to estimate the standard deviation $\sigma_s$ somehow. One could use the download rate in the past, but one should be aware that this might have a time dependence... so think carefully where you get this estimate from. Let's assume we estimate the download rate to be $2\%$ so
\[
\sigma_s^2 = 0.02(1 - 0.02) = 0.0196.
\]With this we can calculate $N \geq 26359$.
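For completeness, here is a small sketch that repeats this calculation without the Normal approximation, iterating between $N$ and the degrees of freedom until the result stabilises. Since it uses the unrounded critical value, it lands slightly above the $26359$ quoted above (which was derived with $t$ rounded to $1.64$):

import numpy as np
from scipy.stats import norm, t as student_t

mu_0 = 0.02                            # estimated baseline download rate
variance_s = mu_0*(1. - mu_0)          # Bernoulli variance p(1 - p)
effect = 0.1*mu_0                      # minimum improvement we want to detect

# Start from the Normal approximation and iterate N -> dof -> t -> N
t_crit = norm.ppf(0.95)                # one-sided 95% critical value
N = 2.*t_crit**2*variance_s/effect**2
for _ in range(20):
    dof = 2*(N - 1)                        # equal group sizes and variances assumed
    t_crit = student_t.ppf(0.95, dof)      # updated one-sided critical value
    N_new = 2.*t_crit**2*variance_s/effect**2
    if abs(N_new - N) < 1.:
        break
    N = N_new
print("required users per group: %d" % np.ceil(N))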

Ok so let's run the test with this many users... imagine that after a day or so you get this table back:

sample     visitors    downloads    fraction    t
control    26359       500          0.0187      N/A
A          26359       560          0.0210      1.893

This looks great! Since $t_A = 1.89 > 1.64$ we have ruled out the null hypothesis at our $95\%$ confidence level, and we should probably switch to the new layout $A$.
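To put a number on this, we can also convert the observed $t$ value into a one-sided p-value; a short sketch using the fractions from the table (with this many users the degrees of freedom are so large that a Normal distribution would give the same answer). The p-value comes out below $0.05$, consistent with the conclusion above:

import numpy as np
from scipy.stats import t as student_t

# Numbers from the table above
N_A = N_0 = 26359
p_A, p_0 = 0.0210, 0.0187              # measured download fractions

variance_A = p_A*(1. - p_A)
variance_0 = p_0*(1. - p_0)
t_value = (p_A - p_0)/np.sqrt(variance_A/N_A + variance_0/N_0)
dof = 2*(N_A - 1)
p_value = student_t.sf(t_value, dof)   # one-sided p-value
print("t = %.2f, one-sided p-value = %.3f" % (t_value, p_value))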

To wrap up, below I attach a small Python script that produces the plot above and the numbers used in this example (also available on GitHub).
cheers
Florian

'''
This program goes through the different steps of an A/B test analysis
'''
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import t as student_t
from scipy.stats import norm


def main():

    mu = 0
    variance = 1
    sigma = np.sqrt(variance)
    x = np.linspace(-10, 10, 1000)
    linestyles = ['-', '--', ':', '-.']
    dofs = [1, 2, 4, 30]

    # plot the different student t distributions
    for dof, ls in zip(dofs, linestyles):
        dist = student_t(dof, mu)
        label = r'$\mathrm{t}(dof=%1.f, \mu=%1.f)$' % (dof, mu)
        plt.plot(x, dist.pdf(x), ls=ls, color="black", label=label)

    plt.plot(x,norm.pdf(x, mu, sigma), color="green", linewidth=3, label=r'$\mathrm{N}(\mu=%1.f,\sigma=%1.f)$' % (mu, sigma))
    
    plt.xlim(-5, 5)
    plt.xlabel('$x$')
    plt.ylabel(r'$p(x|k)$')
    plt.title("Student's $t$ Distribution approximates Normal")

    plt.legend()
    plt.show()

    print "The 90%% confidence interval is = (%0.2f %0.2f)" % (norm.interval(0.9, loc=0, scale=1))
    # Estimate the required number of users per group for a 10% improvement
    download_rate_estimate = 0.02
    variance_s = download_rate_estimate*(1. - download_rate_estimate)  # Bernoulli variance p(1 - p)
    N = np.ceil(5.3792*variance_s/(0.1*download_rate_estimate)**2)
    print("estimate of N = %d" % N)

    # Calculate the t value given the measured download fractions 
    download_fractions = [0.0187, 0.0210]
    print "t_A = %0.2f (for a measured download rate of 2.1%%)" % t_test(N, download_fractions[1], N, download_fractions[0])

    return 


# function to calculate the t value
def t_test(N_A, p_A, N_0, p_0):
    variance_A = p_A*(1. - p_A)
    variance_0 = p_0*(1. - p_0)
    return (p_A - p_0)/np.sqrt(variance_A/N_A + variance_0/N_0)

if __name__ == '__main__':
  main()
