AB Testing (Udacity) Learning Notes (1)

Introduction to AB Testing.

1. Concepts

1.1 What is AB Testing

AB testing is a general methodology used online to test out new products or new features. It takes two sets of users: the control set is shown the existing product / feature, and the experiment set is shown the new version. By observing how these two sets of users respond differently, we can determine which version is better.
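In practice the split is usually deterministic, so a returning user keeps seeing the same version. Here is a minimal sketch of one common assignment scheme (the hashing approach and the 50/50 split are illustrative assumptions, not something from the course):

```python
import hashlib

def assign_group(user_id: str, experiment: str = "button-color") -> str:
    """Deterministically assign a user to control or experiment.

    Hashing (experiment, user_id) keeps the split stable across visits
    and independent across different experiments.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # map the hash to a bucket in 0..99
    return "experiment" if bucket < 50 else "control"  # 50/50 split

# The same user always lands in the same group:
print(assign_group("user-42"))
print(assign_group("user-42"))
```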

1.2 What to test

There are a variety of things to test:

  • visible changes: new features, additions to the UI, ranking lists, a different look for the website;
  • invisible changes: page load time, new recommendation algorithms.

However, there are also things that AB testing is not able to test. AB testing isn’t useful for testing out a new experience, for two reasons:

  • users who don’t like change will prefer the old version (change aversion);
  • users will feel excited and test out everything (novelty effect).

The issues with using AB testing in these cases include:

  • AB testing needs a baseline, which is hard to define here;
  • AB testing cannot run forever, and here it is hard to know when to stop.

1.3 What is Complementary to AB Testing

  • Retrospective Analysis
  • User Experience Research
  • Focus Group
  • Surveys

2. One Example

2.1 Background

Audacity is a company that provides online finance courses. The user flow of Audacity, i.e. the sequence of user behaviors on the Audacity website, could include these steps:

  • Homepage visit
  • Explore site (click pages and view more details)
  • Create user account
  • Complete courses

The number of events decreases from the top to the bottom of this list, which is also known as the customer funnel.
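To see the funnel in numbers, here is a tiny sketch with made-up event counts (all figures are hypothetical):

```python
# Hypothetical event counts, ordered from the top to the bottom of the funnel
funnel = [
    ("homepage visits",   10000),
    ("site explorations",  4000),
    ("accounts created",   1200),
    ("courses completed",   300),
]

# Conversion rate from each stage to the next one
for (stage, n), (next_stage, next_n) in zip(funnel, funnel[1:]):
    print(f"{stage} -> {next_stage}: {next_n / n:.1%}")
```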

Experiment (the hypothesis to test): there is a new feature, and the question is whether changing the “Start Now” button color from orange to pink will lead more users to explore the website.

2.2 Metrics

There are multiple options for measuring the level of user engagement on the website:

  • total number of courses completed
  • total number of clicks
  • click-through rate (total number of clicks / total number of pageviews)
  • click-through probability (unique user who clicks / unique user who visits page)

Here, click-through probability is selected, and the hypothesis can be made more specific: whether changing the “Start Now” button color from orange to pink will increase the click-through probability.

In general, use a rate metric when you want to measure the usability of the website; use a probability metric when you want to measure the total impact of the change.
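To make the distinction concrete, here is a minimal sketch computing both metrics over a small hypothetical click log (the log format is an assumption for illustration):

```python
from collections import defaultdict

# Hypothetical pageview log: (user_id, clicked_start_now)
events = [
    ("u1", True), ("u1", True), ("u1", False),
    ("u2", False), ("u2", False),
    ("u3", True),
]

# Click-through rate: total clicks / total pageviews.
# A user who reloads and clicks twice is counted twice.
ctr = sum(clicked for _, clicked in events) / len(events)

# Click-through probability: unique users who clicked at least once
# / unique users who visited the page.
clicked_by_user = defaultdict(bool)
for user, clicked in events:
    clicked_by_user[user] |= clicked
ctp = sum(clicked_by_user.values()) / len(clicked_by_user)

print(f"CTR = {ctr:.2f}, click-through probability = {ctp:.2f}")
# CTR = 0.50, click-through probability = 0.67
```

Note how the rate double-counts the user who clicked twice, while the probability does not.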

2.3 Hypothesis Test

Here, the statistical hypothesis testing part (distribution of metrics, confidence intervals, statistical significance, null / alternative hypotheses, standard error) is skipped.

In the real world, the statistical hypothesis test is just the first step. After it, more people get involved to determine what is practically significant.
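For reference, here is a minimal sketch of the skipped test; one standard approach is the pooled two-proportion z-test (the counts below are made up):

```python
from math import sqrt

def two_proportion_test(x_cont, n_cont, x_exp, n_exp, z_star=1.96):
    """Compare two click-through probabilities with a pooled z-test.

    x_* = unique users who clicked, n_* = unique users who saw the page.
    Returns the estimated difference and its 95% confidence interval.
    """
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)        # pooled probability
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    d_hat = x_exp / n_exp - x_cont / n_cont             # observed difference
    margin = z_star * se_pool
    return d_hat, (d_hat - margin, d_hat + margin)

# Hypothetical counts; reject H0 (no difference) if the CI excludes 0
d, (lo, hi) = two_proportion_test(x_cont=974, n_cont=10072,
                                  x_exp=1242, n_exp=9886)
print(f"difference = {d:.4f}, 95% CI = ({lo:.4f}, {hi:.4f})")
```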

2.4 Size and Power (Trade-Off)

Before running the test, we need to decide when to stop it, i.e. how many pageviews are required to detect the change with the desired statistical power. Here is the trade-off: the smaller the change we want to detect, the larger the sample we need, and hence the longer it takes to collect the data.
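A sketch of this trade-off using statsmodels (the baseline probability and the candidate detectable changes are made-up numbers):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

p_baseline = 0.10  # assumed baseline click-through probability

# Smaller detectable changes require much larger samples per group
for d_min in (0.02, 0.01, 0.005):
    effect = proportion_effectsize(p_baseline + d_min, p_baseline)
    n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                     power=0.8, alternative="two-sided")
    print(f"d_min = {d_min}: ~{n:,.0f} users per group")
```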

2.5 Confidence Interval Analysis

Here is a list of the possible confidence intervals for the measured change versus the significance level (the solid line marks 0):

[Figure: six candidate confidence intervals for the change, plotted against the zero line, numbered to match the recommendations below.]

Recommendations for conclusions, one per interval in the figure (a sketch of this decision logic follows the list):

  1. the change is significant; it is helpful to launch the new change
  2. the change is not significant, since its interval overlaps with 0
  3. the change looks helpful but is not significant enough
  4. better to run a new test
  5. better to run a new test
  6. better to run a new test
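One way to encode these recommendations as code, assuming a practical significance boundary d_min (the six figure cases collapse into three outcomes here):

```python
def decide(ci_lower, ci_upper, d_min):
    """Map a confidence interval for the change onto a launch decision.

    d_min is the smallest change that matters in practice
    (the practical significance boundary); it is an assumed input.
    """
    if ci_lower > d_min:
        return "launch: significant and practically meaningful"  # case 1
    if ci_lower <= 0 <= ci_upper and ci_upper < d_min:
        return "do not launch: no significant change"            # case 2
    return "ambiguous: run a new test or escalate"               # cases 3-6

# Hypothetical intervals with d_min = 0.02
print(decide(0.025, 0.045, d_min=0.02))   # clearly above the boundary
print(decide(-0.005, 0.015, d_min=0.02))  # straddles 0, below the boundary
print(decide(0.005, 0.030, d_min=0.02))   # straddles d_min
```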

How do we make a decision if there is no time to run a new test? Usually, ask the decision maker, who should be aware of the risk being taken because the data is uncertain, or bring in other factors besides the data.