Why might you see lift for one experience when the experiences are identical?
There are several reasons why an A/A test might show lift in one experience over the other, even though the experiences are identical:
The A/A test was monitored continually
A common problem when running any kind of test, including an A/A test, is monitoring the results continually, stopping the test as soon as statistical significance appears, and declaring a winning experience. This practice is often called "data peeking": looking at the test data early and often in an attempt to determine which experience is performing better. The risk is that you stop the test prematurely, which can invalidate the results.
In an A/A test, data peeking can cause analysts to see lift in one experience when, in fact, there should be no difference because the two experiences are identical. Indeed, with continuous peeking, an A/A test is guaranteed to show "statistical significance" (a confidence above a chosen threshold, such as 95%) at some point during the test.
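The effect of peeking can be demonstrated with a small simulation. The sketch below (illustrative only, not Adobe Target code; the 10% conversion rate, sample sizes, and number of interim looks are arbitrary assumptions) runs many A/A tests where both experiences have the same true conversion rate, and compares how often a two-proportion z-test reports "significance" at any interim look versus only at the planned end of the test:

```python
# Monte Carlo sketch: how "data peeking" inflates false positives in an A/A test.
# All parameters (rate, n_per_arm, peeks, trials) are illustrative assumptions.
import math
import random

def z_significant(conv_a, n_a, conv_b, n_b, z_crit=1.96):
    """Two-proportion z-test; True if |z| exceeds the 95% two-sided threshold."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return False
    z = (conv_a / n_a - conv_b / n_b) / se
    return abs(z) > z_crit

def simulate(rate=0.10, n_per_arm=5000, peeks=50, trials=500, seed=0):
    rng = random.Random(seed)
    peek_hits = 0   # "significant" at ANY interim look
    final_hits = 0  # "significant" only at the planned end
    step = n_per_arm // peeks
    for _ in range(trials):
        conv_a = conv_b = 0
        seen_sig = False
        for i in range(1, n_per_arm + 1):
            # Both arms draw from the SAME conversion rate (A/A test).
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if i % step == 0 and z_significant(conv_a, i, conv_b, i):
                seen_sig = True  # a peek would have declared a winner here
        peek_hits += seen_sig
        final_hits += z_significant(conv_a, n_per_arm, conv_b, n_per_arm)
    return peek_hits / trials, final_hits / trials

peek_rate, final_rate = simulate()
print(f"false positive rate with peeking:  {peek_rate:.1%}")
print(f"false positive rate at fixed end:  {final_rate:.1%}")
```

The fixed-horizon check stays near the nominal 5% false positive rate, while declaring a winner at any of the interim looks produces a substantially higher rate, even though the two experiences are identical.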
To avoid this, as with a regular A/B test, decide ahead of time what sample size to use, based on the minimum effect size (the smallest lift that matters to your business) and the power and significance levels you find acceptable.
In an A/A test, the goal is then to see no statistically significant result once the test reaches the planned sample size.
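To make the sample size decision concrete, here is a textbook two-proportion approximation (not the formula used by the Adobe Target Sample Size Calculator; the baseline rate and lift in the example are hypothetical), relating baseline conversion rate, minimum detectable lift, significance, and power to the visitors needed per experience:

```python
# Standard two-proportion sample size approximation (illustrative sketch only).
import math

def sample_size_per_arm(baseline_rate, min_rel_lift, z_alpha=1.96, z_beta=0.84):
    """Approximate visitors needed per experience.

    baseline_rate: current conversion rate (e.g. 0.05 for 5%)
    min_rel_lift:  minimum relative lift worth detecting (e.g. 0.10 for 10%)
    z_alpha:       1.96 -> 5% two-sided significance level
    z_beta:        0.84 -> 80% power
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_rel_lift)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)

# Hypothetical example: 5% baseline, detect a 10% relative lift
# at 95% confidence and 80% power.
print(sample_size_per_arm(0.05, 0.10))
```

Note how a smaller minimum lift sharply increases the required sample size; this is why choosing the minimum effect size up front matters so much for planning test duration.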
The Adobe Target Sample Size Calculator is an important tool to help you determine what sample size you should aim for and how long you should run the test.
In addition, see the following articles for information about how long you should run an activity, and other helpful tips and tricks: