Before performing an A/A test on your site using Adobe Target, it is important to understand what an A/A test is, why you might want to perform an A/A test, how long you should run the test, and how to interpret the results.
Before explaining A/A testing, it is good to review A/B testing so we can then discuss the differences.
In a standard A/B test, traffic is allocated to two or more different experiences. One experience is typically the “control,” and variations of the experience are tested against the control to see which experience creates the most lift in a given metric.
A/A testing, however, allocates traffic to two identical experiences, normally with a 50/50 split. In a standard A/B test, you typically want to discover a lift in conversion. In an A/A test, the goal is usually the opposite: to confirm that there is no statistically significant difference between the identical experiences.
Some organizations perform A/A testing when implementing a new testing tool, such as Target, to determine whether:
Although few organizations run A/A tests, it is good practice to run them as “sanity” experiments to build trust after implementing the tool or before performing A/B tests that could impact conversion and revenue.
There are numerous reasons why you might see lift in one experience over another (identical) experience:
A common problem in running any kind of test, including an A/A test, is looking at the results continually, stopping the test as soon as you see statistical significance, and declaring a winning experience. This practice is often called “data peeking”: looking at the test data early and frequently while trying to determine which experience is performing better. The risk is that you stop the test prematurely, which can invalidate the results.
In an A/A test, data peeking can often cause analysts to see lift in one experience when in fact there should be no difference, because the two experiences are identical. With continuous peeking and a long enough test, an A/A test is almost certain to show “statistical significance” (namely, a confidence above a certain threshold, such as 95%) at some point during the test.
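The effect of peeking can be illustrated with a small simulation. The sketch below is not how Target computes confidence; it assumes a pooled two-proportion z-test, a 10% conversion rate, 2,000 visitors per experience, and a peek every 100 visitors. All function names and parameters are hypothetical and chosen only for illustration.

```python
import math
import random
from statistics import NormalDist


def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation observed yet, so nothing to compare
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_test(rng, rate, visitors_per_arm, peek_every, alpha=0.05):
    """Run one A/A test.

    Returns (significant_at_any_peek, significant_at_planned_size).
    """
    conv_a = conv_b = 0
    significant_at_any_peek = False
    p_value = 1.0
    for n in range(1, visitors_per_arm + 1):
        # Both experiences are identical: same true conversion rate.
        conv_a += rng.random() < rate
        conv_b += rng.random() < rate
        if n % peek_every == 0 or n == visitors_per_arm:
            p_value = z_test_p_value(conv_a, n, conv_b, n)
            if p_value < alpha:
                significant_at_any_peek = True
    return significant_at_any_peek, p_value < alpha


def false_positive_rates(num_tests=500, rate=0.10, visitors_per_arm=2000,
                         peek_every=100, seed=7):
    """Share of simulated A/A tests that look 'significant' under each policy."""
    rng = random.Random(seed)
    peeking = fixed = 0
    for _ in range(num_tests):
        any_peek, at_end = simulate_aa_test(rng, rate, visitors_per_arm, peek_every)
        peeking += any_peek
        fixed += at_end
    return peeking / num_tests, fixed / num_tests


peek_rate, fixed_rate = false_positive_rates()
print(f"significant at some peek:        {peek_rate:.1%}")
print(f"significant at the planned size: {fixed_rate:.1%}")
```

Under these assumptions, the share of tests that cross the threshold at some peek should be several times larger than the share that are significant when evaluated once at the planned sample size, which should stay close to the 5% significance level.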
To avoid this, decide ahead of time what sample size to use, just as you would for a regular A/B test. Base the sample size on the minimum effect size (the minimum lift below which an effect is not important to your business), the power, and the significance level you find acceptable.
In an A/A test, the goal is then to see no statistically significant result once the test has reached the desired sample size.
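To show how effect size, power, and significance level combine into a sample size, here is a minimal sketch based on the common textbook approximation for comparing two conversion rates. It is not the formula used by the Adobe Target Sample Size Calculator, which might differ; the function name and its parameters are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_experience(baseline_rate, relative_lift,
                               alpha=0.05, power=0.80):
    """Approximate visitors needed per experience (two-sided z-test).

    baseline_rate: conversion rate of the control, for example 0.05
    relative_lift: minimum lift worth detecting, for example 0.10 for +10%
    alpha:         significance level (1 - confidence level)
    power:         probability of detecting a true lift of that size
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Example: 5% baseline conversion, 10% minimum relative lift,
# 95% confidence, 80% power.
needed = sample_size_per_experience(0.05, 0.10)
print(f"visitors needed per experience: {needed}")
```

In this approximation, halving the minimum lift roughly quadruples the required sample size, which is why the minimum effect size should be chosen before the test starts rather than adjusted afterward.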
The Adobe Target Sample Size Calculator is an important tool to help you determine what sample size you should aim for and how long you should run the test.
In addition, see the following articles for information about how long you should run an activity, and other helpful tips and tricks:
The significance level of a test determines how likely it is that the test reports a significant difference in conversion rates between two different offers when, in fact, there is no real difference. This is known as a false positive, or a Type I error. The significance level is a threshold that you specify. Choosing it involves a trade-off: the lower your tolerance for false positives, the more visitors the test must include.
A commonly used significance level in A/A and A/B testing is 5%, which corresponds to a confidence level of 95% (confidence level = 100% - significance level). A confidence level of 95% means that every time you perform a test, there is a 5% chance of detecting a statistically significant lift even if there is no difference between the experiences.
Suppose you want to achieve a 95% confidence level with your A/A test. With a 95% confidence level, 1 in 20 A/A tests could show statistically significant lift in conversions. With a 90% confidence level, 1 in 10 tests could show lift in conversions when testing identical experiences.
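The arithmetic behind these ratios can be sketched as follows, assuming each A/A test is independent and evaluated only once at its planned sample size. The function names are hypothetical.

```python
def expected_false_positives(confidence_level, num_tests):
    """Average number of A/A tests that show a significant lift by chance."""
    significance_level = 1 - confidence_level
    return significance_level * num_tests


def chance_of_any_false_positive(confidence_level, num_tests):
    """Probability that at least one of several independent A/A tests
    shows a significant lift by chance."""
    return 1 - confidence_level ** num_tests


for confidence, tests in [(0.95, 20), (0.90, 10)]:
    expected = expected_false_positives(confidence, tests)
    any_fp = chance_of_any_false_positive(confidence, tests)
    print(f"{confidence:.0%} confidence, {tests} tests: "
          f"{expected:.1f} expected false positives, "
          f"{any_fp:.0%} chance of at least one")
```

With 20 tests at a 95% confidence level, one false positive is expected on average, and the chance of seeing at least one is roughly 64%. A single "winning" experience in a series of A/A tests is therefore not, by itself, evidence of a problem.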
If you decide that an A/A test is necessary in your organization, be aware that one of the identical experiences might temporarily appear to outperform the other. This can be normal, depending on how long the test has been running. The difference should shrink as the test accumulates more time and visitors.
Best practice is to use regular A/B testing methodology: decide the sample size ahead of time, based on a minimum relevant effect size, desired power, and significance level, by using the Adobe Target Sample Size Calculator.
Then, allow adequate time and visitors before you reach any conclusions, and remember that depending on the significance level of your test, there is a chance that one experience will show a difference in lift, and even be declared the winner.