Pitfall 2: Declaring winners of multiple offer tests with no statistically significant difference
With multiple offer testing, marketers often declare the offer with the highest lift as the test winner, even though there is no statistically significant difference between the winner and the runner-up. This situation occurs when the difference between the alternatives is smaller than the difference between the alternatives and the control. The figure below illustrates this concept, with the black error bars representing 95% lift confidence intervals. The true lift for each offer relative to the control offer is 95% likely to be included within the confidence interval, the range shown by the error bars.
Offers A and B have the highest observed lift during the test, and it would be unlikely that offer C would outperform those offers in a future test, because the confidence interval of C does not overlap with the confidence intervals of A or B. However, even though offer A has the highest observed lift during the test, it is possible that offer B could perform better in a future test because the confidence intervals overlap.
The takeaway here is that both offers A and B should be considered winners of the test.
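To make this concrete, here is a minimal sketch, with assumed conversion counts rather than data from the figure, that estimates a 95% confidence interval for each offer’s lift over the control using a simple parametric bootstrap. Offers whose intervals overlap should be treated as statistical ties.

```python
# A minimal sketch (assumed data, not from the article) that estimates 95%
# confidence intervals for lift over the control with a parametric bootstrap.
import numpy as np

rng = np.random.default_rng(seed=1)

# Assumed test results: (conversions, visitors) per offer.
control = (2_500, 50_000)
offers = {"A": (3_200, 50_000), "B": (3_100, 50_000), "C": (2_725, 50_000)}

def bootstrap_lift_ci(offer, control, n_boot=10_000, level=0.95):
    """Percentile bootstrap CI for lift = offer_rate / control_rate - 1."""
    conv_o, n_o = offer
    conv_c, n_c = control
    rate_o = rng.binomial(n_o, conv_o / n_o, n_boot) / n_o
    rate_c = rng.binomial(n_c, conv_c / n_c, n_boot) / n_c
    lifts = rate_o / rate_c - 1
    lo, hi = np.quantile(lifts, [(1 - level) / 2, (1 + level) / 2])
    return lo, hi

for name, counts in offers.items():
    lo, hi = bootstrap_lift_ci(counts, control)
    print(f"Offer {name}: lift 95% CI = [{lo:+.1%}, {hi:+.1%}]")
# Offers whose intervals overlap should be treated as statistical ties.
```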
It’s typically not feasible to run the test long enough to identify the true relative performance of the alternatives, and oftentimes the difference in performance between the alternatives is too small to substantially impact the conversion rate. In such cases, you can interpret the outcome as a tie and use other considerations, such as strategy or alignment with other elements of the page, to determine which offer to implement. With tests of multiple offers, you must be open to more than one winner, which can considerably open up the possibilities for the direction to take with your website development.
If you do want to identify the offer with the highest conversion rate, you must compare every offer to every other offer. In the example above, with n = 5 offers, you have to make n(n-1)/2 = 5*(5-1)/2 = 10 comparisons. In this case, the Bonferroni correction requires that the significance level of each comparison be 5%/10 = 0.5%, which corresponds to a confidence level of 99.5%. However, such a high confidence level might require you to run the test for an unreasonably long period.
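The Bonferroni arithmetic above is short enough to sketch directly; this is just the calculation from the example, not anything specific to a testing tool.

```python
# A short sketch of the Bonferroni arithmetic described above: with n offers
# compared pairwise, each comparison must clear a stricter significance bar.
n_offers = 5
overall_alpha = 0.05                            # 5% overall significance level
comparisons = n_offers * (n_offers - 1) // 2    # n(n-1)/2 pairwise comparisons
per_test_alpha = overall_alpha / comparisons    # Bonferroni-corrected level

print(f"Comparisons: {comparisons}")                                # 10
print(f"Per-comparison significance level: {per_test_alpha:.3%}")   # 0.500%
print(f"Required confidence level: {1 - per_test_alpha:.1%}")       # 99.5%
```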
Pitfall 3: Ignoring the effects of statistical power
Statistical power is the probability that a test detects a real difference in conversion rate between offers. Because of the random, or as statisticians like to call it, “stochastic,” nature of conversion events, a test might not show a statistically significant difference even when a real difference in conversion rate exists between two offers. Call it bad luck or simply chance. Failing to detect a true difference in conversion rate is called a false negative or a Type II error.
There are two key factors that determine the power of a test. First is the sample size, that is, the number of visitors included in the test. Second is the magnitude of the difference in conversion rate that you want the test to detect. Perhaps this is intuitive, but if you are interested in detecting only large conversion rate differences, there’s a higher probability that the test will actually detect such large differences. Along those lines, the smaller the difference you want to detect, the larger the sample size you require, and therefore, the longer you need to run the test to collect it.
Today’s marketers under-power a remarkable number of tests. In other words, they use a sample size that is too small. That means that they have a slim chance of detecting true positives, even when a substantial difference in conversion rate actually exists. In fact, if you continually run underpowered tests, the number of false positives can be comparable to, or even dominate, the number of true positives. This often leads to implementing neutral changes to a site (a waste of time) or changes that actually reduce conversion rates.
To avoid under-powering your test, consider that a typical standard for a well-powered test includes a confidence level of 95% and a statistical power of 80%. Such a test gives you a 95% probability of avoiding a false positive when no real difference exists, and an 80% probability of detecting a difference at least as large as the one you planned the test to find.
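As a rough planning aid, here is a minimal sample-size sketch using the standard normal-approximation formula for a two-sided, two-proportion test at 95% confidence and 80% power. The baseline rate and target lift are assumptions for illustration, not figures from the article.

```python
# A minimal sample-size sketch: normal-approximation formula for a two-sided,
# two-proportion test (95% confidence, 80% power). Baseline and target rates
# are illustrative assumptions.
from scipy.stats import norm

baseline = 0.10          # control conversion rate (assumed)
target = 0.11            # smallest rate worth detecting (10% relative lift)
alpha, power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)   # two-tailed critical value
z_beta = norm.ppf(power)
variance = baseline * (1 - baseline) + target * (1 - target)
n_per_offer = (z_alpha + z_beta) ** 2 * variance / (target - baseline) ** 2

print(f"Visitors needed per offer: {n_per_offer:,.0f}")
# Roughly 14,700 visitors per offer with these assumptions; halving the
# detectable lift roughly quadruples this number.
```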
Pitfall 4: Using one-tailed tests
One-tailed tests require a smaller observed difference in conversion rates between the offers to call a winner at a given significance level. This type of test seems appealing because winners can be called earlier and more often than with two-tailed tests. But in keeping with the saying, “There’s no free lunch,” one-tailed tests come at a cost.
In a one-tailed test, you test whether offer B is better than offer A. The direction of the test has to be determined before the test commences, or “a priori” in statistics-speak. In other words, you must decide whether to test for B being better than A or A being better than B before initiating the test. However, if you look at the results of the A/B test and see that B is doing better than A and then decide to do a one-tailed test to see whether that difference is statistically significant, you are violating the assumptions behind the statistical test. Violating the assumptions of the test means that your confidence intervals are unreliable and the test has a higher false positive rate than you would expect.
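The sketch below illustrates the point with assumed conversion counts: the same observed difference yields roughly half the p-value when the direction is fixed in advance, which is exactly why choosing the direction after peeking at the data inflates the false positive rate.

```python
# A small sketch (assumed counts) showing why one-tailed tests call winners
# more easily, using statsmodels' two-proportion z-test.
from statsmodels.stats.proportion import proportions_ztest

conversions = [560, 500]       # offer B, offer A
visitors = [10_000, 10_000]

_, p_two_sided = proportions_ztest(conversions, visitors, alternative="two-sided")
_, p_one_sided = proportions_ztest(conversions, visitors, alternative="larger")

print(f"Two-tailed p-value: {p_two_sided:.3f}")   # ~0.06, not significant at 5%
print(f"One-tailed p-value: {p_one_sided:.3f}")   # ~0.03, appears "significant"
# Picking the one-tailed direction only after seeing that B leads effectively
# doubles the false positive rate you think you are running at.
```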
You might view a one-tailed test as putting an offer on trial with a judge who has already made up his or her mind. In a one-tailed test, you’ve already decided what the winning offer is and want to prove it, rather than giving each experience an equal chance to prove itself as the winner. One-tailed tests should only be used in the rare situations in which you are only interested in whether one offer is better than the other and not the other way around. To avoid the issue of the one-tailed test, use an A/B testing solution that always uses two-tailed tests, such as Adobe Target.
Pitfall 5: Monitoring tests
Marketers frequently monitor A/B tests until the test shows a significant result. After all, why keep testing after you’ve achieved statistical significance?
Unfortunately, it’s not that simple. Not to throw a wrench in the works, but it turns out that monitoring the results adversely impacts the effective statistical significance of the test. It greatly increases the likelihood of false positives and makes your confidence intervals untrustworthy.
This might seem confusing. It sounds like we are saying that just from looking at your results mid-test, you can cause them to lose their statistical significance. That’s not exactly what’s going on. The following example explains why.
Let’s say you simulate 10,000 observations (visitors) for each of two offers, with both offers having 10% conversion rates. Because the conversion rates are the same, you should detect no difference in conversion lift when you test the two offers against each other. Using a 95% confidence level, the test results in the expected 5% false positive rate when it is evaluated after collecting all 10,000 observations. So if we run 100 of these tests, on average we get five false positives (in actuality, all positives are false in this example because there is no difference in conversion rate between the two offers). However, if we evaluate the test ten times during the test, every 1,000 observations, it turns out that the false positive rate jumps up to 16%. Monitoring the test has more than tripled the risk of false positives! How can this be?
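You can reproduce this effect with a short simulation. The sketch below is an assumed setup, not Adobe Target code: both offers share a true 10% conversion rate, so every “significant” result is a false positive, and checking at every 1,000 visitors is compared with a single check at the end.

```python
# Simulation sketch: how peeking at an A/B test inflates the false positive
# rate. Both "offers" share a true 10% conversion rate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)
n_tests = 2000          # number of simulated A/B tests
n_visitors = 10_000     # visitors per offer in each test
checkpoints = range(1_000, n_visitors + 1, 1_000)  # "peek" every 1,000 visitors
p_true = 0.10
alpha = 0.05

def significant(conv_a, n_a, conv_b, n_b, alpha):
    """Two-proportion z-test; True if the difference looks 'significant'."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return False
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - stats.norm.cdf(abs(z))) < alpha

end_only_fp = 0   # false positives when evaluating once, at the end
peeking_fp = 0    # false positives when peeking at every checkpoint

for _ in range(n_tests):
    a = rng.random(n_visitors) < p_true   # conversion outcomes for offer A
    b = rng.random(n_visitors) < p_true   # conversion outcomes for offer B
    if significant(a.sum(), n_visitors, b.sum(), n_visitors, alpha):
        end_only_fp += 1
    if any(significant(a[:n].sum(), n, b[:n].sum(), n, alpha)
           for n in checkpoints):
        peeking_fp += 1

print(f"False positive rate, single look at the end: {end_only_fp / n_tests:.1%}")
print(f"False positive rate, peeking every 1,000 visitors: {peeking_fp / n_tests:.1%}")
```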
To understand why this occurs, you must consider the different actions taken when a significant result is detected and when it is not detected. When a statistically significant result is detected, the test is stopped and a winner is declared. However, if the result is not statistically significant, we allow the test to continue. This situation strongly favors the positive outcome, and hence, distorts the effective significance level of the test.
To avoid this problem, you should determine an adequate length of time the test runs before initiating the test. Although it’s fine to look at the test results during the test to make sure that you implemented the test correctly, do not draw conclusions or stop the test before the required number of visitors is reached. In other words, no peeking!
Pitfall 6: Stopping tests prematurely
It is tempting to stop a test if one of the offers performs better or worse than the others in the first few days of the test. However, when the number of observations is low, there is a high likelihood that a positive or negative lift will be observed just by chance because the conversion rate is averaged over a low number of visitors. As the test collects more data points, the conversion rates converge toward their true long-term values.
The figure below shows five offers that have the same long-term conversion rate. Offer B had a poor conversion rate for the first 2,000 visitors, and it takes a long time before the estimated conversion rate returns to the true long-term rate.
This phenomenon is known as “regression to the mean,” and can lead to disappointment when an offer that performed well during the initial days of a test fails to keep up this level of performance in the end. It can also lead to lost revenue when a good offer is not implemented because it happened to under-perform in the early days of a test just by chance.
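A tiny simulation makes this visible; the setup below is an assumption for illustration, with five offers that all share the same 10% true conversion rate, yet whose running estimates diverge widely over the first few thousand visitors before settling down.

```python
# A small sketch (assumed setup, not from the article) showing why early test
# results are noisy even when all offers have the same true conversion rate.
import numpy as np

rng = np.random.default_rng(seed=3)
p_true = 0.10
n_visitors = 20_000

for offer in "ABCDE":
    outcomes = rng.random(n_visitors) < p_true
    running_rate = np.cumsum(outcomes) / np.arange(1, n_visitors + 1)
    print(f"Offer {offer}: rate after 500 visitors = {running_rate[499]:.1%}, "
          f"after 2,000 = {running_rate[1999]:.1%}, "
          f"after 20,000 = {running_rate[-1]:.1%}")
```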
Much like the pitfall of monitoring your test, the best way to avoid these issues is to determine an adequate number of visitors before running the test and then let the test run until this number of visitors has been exposed to the offers.
Pitfall 7: Changing the traffic allocation during the testing period
We recommend that you do not change the traffic allocation percentages during the testing period because this can skew your test results until the data normalizes.
For example, suppose you have an A/B test in which 80% of the traffic is assigned to Experience A (the control) and 20% of the traffic is assigned to Experience B. During the testing period, you change the allocation to 50% for each experience. A few days later, you change the traffic allocation to 100% to Experience B.
In this scenario, how are users assigned to experiences?
If you manually change the allocation split to 100% for Experience B, visitors who were originally allocated to Experience A (the control) remain in their initially assigned experience (Experience A). The change in traffic allocation impacts new entrants only.
If you want to change percentages or greatly affect the flow of visitors into each experience, we recommend that you create a new activity or copy the activity, and then edit the traffic allocation percentages.
If you change the percentages for different experiences during the testing period, it takes a few days for the data to normalize, especially if many purchasers are returning visitors.
As another example, if your A/B test’s traffic allocation is split 50/50 and you then change the split to 80/20, the results might look skewed for the first few days after the change. If the average time to conversion is high, meaning it takes someone several hours or even days to make a purchase, these delayed conversions can affect your reports. In the experience whose allocation increased from 50% to 80%, with an average time to conversion of two days, the conversions recorded on the first day after the change still come from the period when only 50% of the population entered the experience, even though 80% of the population is now entering it. This makes it look like the conversion rate plummeted, but the rate normalizes again once the visitors from the 80% allocation have had their two days to convert.
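A toy back-of-the-envelope sketch of this effect follows. The numbers are assumptions, and it further assumes the report divides conversions recorded today by visitors entering today, which is a simplification rather than a statement about how any particular reporting tool counts.

```python
# Toy sketch (assumed numbers and simplified reporting) of why a bigger
# allocation looks like a conversion-rate drop when purchases lag by two days.
true_rate = 0.10
entrants_before_change = 500   # daily entrants at the 50% allocation
entrants_after_change = 800    # daily entrants at the 80% allocation

# On the first day after the change, the conversions being recorded still come
# from the smaller cohort that entered two days earlier.
conversions_today = entrants_before_change * true_rate        # 50
apparent_rate = conversions_today / entrants_after_change     # 0.0625

print(f"True conversion rate: {true_rate:.1%}")
print(f"Apparent conversion rate right after the change: {apparent_rate:.2%}")
# The apparent rate recovers to ~10% once the larger cohort has had its two
# days to convert.
```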
Pitfall 8: Not considering novelty effects
Other unexpected things can happen if we don’t allow enough time for running a test. This time the problem is not a statistics problem; it’s simply a reaction to change by the visitors. If you change a well-established part of your website, returning visitors might at first engage less fully with the new offer because of changes to their usual workflow. This can temporarily cause a superior new offer to underperform until returning visitors become accustomed to it, a small price to pay given the long-term gains that the superior offer delivers.
To determine if the new offer under-performs because of a novelty effect or because it’s truly inferior, you can segment your visitors into new and returning visitors and compare the conversion rates. If it’s just the novelty effect, the new offer wins with new visitors. Eventually, as returning visitors get accustomed to the new changes, the offer wins with them, too.
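A hypothetical sketch of this segmentation check follows; the file and column names (visits.csv, offer, visitor_type, converted) are assumptions for illustration, not an export format from any specific tool.

```python
# A hypothetical sketch (column names assumed) of the segmentation check
# described above: compare conversion rates by offer for new vs. returning
# visitors to see whether a new offer only lags with returning visitors.
import pandas as pd

# visits.csv is assumed to have one row per visitor with these columns:
#   offer        -> "control" or "new_offer"
#   visitor_type -> "new" or "returning"
#   converted    -> 0 or 1
visits = pd.read_csv("visits.csv")

rates = (visits.groupby(["offer", "visitor_type"])["converted"]
               .agg(conversion_rate="mean", visitors="size"))
print(rates)
# A novelty effect is suggested when the new offer wins among new visitors
# but trails the control among returning visitors.
```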
The novelty effect can also work in reverse. Visitors often react positively to a change just because it introduces something new. After a while, as the new content becomes stale or less exciting to the visitor, the conversion rate drops. This effect is harder to identify, but carefully monitoring changes in the conversion rate is key to detecting this.
Pitfall 9: Not considering differences in the consideration period
The consideration period is the time period from when the A/B testing solution presents an offer to a visitor to when the visitor converts. This can be important with offers that affect the consideration period substantially, for example, an offer that implies a deadline, such as “Time-limited offer. Purchase by this Sunday.”
Such offers nudge visitors to convert sooner and will be favored if the test is stopped immediately after the offer expires, because the alternative offer might have a longer deadline or no deadline, and therefore, a longer consideration period. The alternative offer would have received conversions in the period after the test ended, but if you stop the test at the end of the deadline, those later conversions do not get counted toward its test conversion rate.
The figure below shows two offers that two different visitors see at the same time on a Sunday afternoon. The consideration period for offer A is short, and the visitor converts later that day. However, offer B has a longer consideration period, and the visitor who saw offer B thinks about the offer for a while and ends up converting Monday morning. If you stop the test Sunday night, the conversion associated with offer A is counted toward offer A’s conversion metric, whereas the conversion associated with offer B is not counted toward offer B’s conversion metric. This puts offer B at a significant disadvantage.
To avoid this pitfall, allow some time for visitors who were exposed to the test offers to convert after new entries to the test have been stopped. This step gives you a fair comparison of the offers.
Pitfall 10: Using metrics that do not reflect business objectives
Marketers might be tempted to use high-traffic, low-variance conversion metrics from the upper funnel, such as click-through rate (CTR), to reach an adequate number of test conversions faster. However, carefully consider whether CTR is an adequate proxy for the business goal that you want to attain. Offers with higher CTRs can easily lead to lower revenue. This can happen when offers attract visitors with a lower propensity to buy, or when the offer itself (a discount offer, for example) simply leads to lower revenue.
Consider the skiing offer below. It generates a higher CTR than the cycling offer, but because visitors spend more money on average when they follow the cycling offer, the expected revenue of putting the cycling offer in front of a given visitor is higher. Therefore, an A/B test with CTR as the metric would pick an offer that does not maximize revenue, which might be the fundamental business objective.
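The arithmetic is simple enough to sketch with illustrative numbers (these figures are assumptions, not data from the figure): expected revenue per visitor is CTR times the conversion rate after the click times the average order value, and the higher-CTR offer can still come out behind.

```python
# Illustrative numbers only: the skiing offer has the higher click-through
# rate, but the cycling offer earns more expected revenue per visitor because
# its buyers spend more on average.
offers = {
    #           CTR    conversion rate after click   average order value
    "skiing":  (0.080, 0.050, 120.0),
    "cycling": (0.050, 0.060, 400.0),
}

for name, (ctr, conv_given_click, aov) in offers.items():
    revenue_per_visitor = ctr * conv_given_click * aov
    print(f"{name}: CTR = {ctr:.1%}, "
          f"expected revenue per visitor = ${revenue_per_visitor:.2f}")
# skiing:  CTR = 8.0%, expected revenue per visitor = $0.48
# cycling: CTR = 5.0%, expected revenue per visitor = $1.20
```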
To avoid this issue, monitor your business metrics carefully to identify the business impact of the offers, or better yet, use a metric that is closer to your business goal, if possible.
Conclusion: Success with A/B testing by recognizing and stepping around the pitfalls
After learning about the common A/B testing pitfalls, we hope you can identify when and where you might have fallen prey to them. We also hope we’ve armed you with a better understanding of some of the statistics and probability concepts involved in A/B testing that often feel like the domain of people with math degrees.
The steps below help you avoid these pitfalls and focus on achieving better results from your A/B testing:
- Carefully consider the right metric for the test based on relevant business goals.
- Decide on a confidence level before the test starts, and adhere to this threshold when evaluating the results after the test ends.
- Calculate the sample size (number of visitors) before the test is started.
- Wait for the calculated sample size to be reached before stopping the test.
- Adjust the confidence level when doing post-test segmentation or evaluating more than one alternative, for example, by using the Bonferroni correction.