The A/B Testing Conundrum

Recently I decided to start feeding my cats grain-free dry food. Since there are so many different foods to choose from, I wanted to pick the one my cats would like best, so I decided to do some A/B testing. I purchased seven different foods in the smallest sizes available. Evo was the “control” food. Each day I would put out a small bowl of Evo alongside a different food, note which one “won,” and keep repeating this until I arrived at the ultimate, most scientifically desirable choice. But then something weird happened. On day five, during a head-to-head match with Blue Buffalo, the control food (Evo) was the winner. When I recorded this in my journal, I discovered I had made an error: I had already used Blue Buffalo on day two. Only on day two, Blue Buffalo was the winner. Why the different results?

You might suggest that cats are fickle. You might even suggest that the cats had grown familiar with the Evo and decided to stick with it. Or perhaps the different results mean the cats were looking for variety. Perhaps the testing should have been done differently, with different cats each time.

Which leads me, Brett Loebel, to Internet-related A/B testing. Most often I am asked to perform A/B testing on pay-per-click searches. Which keywords lead to the most click-throughs? Which keywords lead to the most conversions? Do more specific keywords with fewer click-throughs lead to more conversions?

During these tasks, I often run into differences in results that are not statistically significant. What would make a difference statistically significant? First we would need to split traffic exactly evenly between the two keywords. Then we would need a large enough sample to assure confidence. (While there is no single correct sample size, the larger the sample, the more precise the result.)

So now we look at the results. The results may or may not have given us our answer. We actually need to calculate the standard deviation of the measurement to see whether the observed difference falls inside or outside that margin. Results within the margin of error are not statistically different enough for us to determine whether A was better than B or vice versa.
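The check described above can be sketched in a few lines of Python. This is a minimal sketch using the normal approximation for two independent proportions; the function name and the traffic numbers are hypothetical, chosen just to illustrate the margin-of-error comparison:

```python
import math

def rate_difference_significant(clicks_a, shown_a, clicks_b, shown_b, z_crit=1.96):
    """Compare two click-through rates.

    Returns (difference, margin, significant). The difference is treated as
    "significant" at roughly the 95% level when it exceeds z_crit standard
    errors of the difference between the two proportions.
    """
    p_a = clicks_a / shown_a
    p_b = clicks_b / shown_b
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(p_a * (1 - p_a) / shown_a + p_b * (1 - p_b) / shown_b)
    diff = p_a - p_b
    margin = z_crit * se
    return diff, margin, abs(diff) > margin

# Hypothetical numbers: each keyword's ad shown 5,000 times.
diff, margin, significant = rate_difference_significant(300, 5000, 200, 5000)
print(f"difference={diff:.3f}, margin={margin:.3f}, significant={significant}")
```

With small samples the margin grows quickly, which is exactly why a gap that looks decisive can still fall inside it.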

And then we need to test A against C, then against D, then E, and on and on until we achieve the most desirable result.

But we are forgetting about the cats! The Internet is even more fickle than my temperamental cats. Oh, and don’t forget about all those repeat visitors who are looking for familiarity when they search. There is a mood on the Internet as well. The mood is different in the morning than it is at night. People surf differently from work than they do at home. Sundays are different from Tuesdays.

I have seen many companies throw out quality results based on differences that were not statistically significant. Here are a few quick numbers:

Hypothetically, we are displaying ads at a rate of 50% with keyword A and 50% with keyword B.

We calculate the first 500 click-throughs:

Keyword A results in 60% of the click-throughs whereas keyword B results in 40% of the click-throughs.

Statistically significant? YES. Throw out keyword B!

Same test with the same results, but only 20 click-throughs.

Statistically significant? No.
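These two outcomes can be checked directly. Since traffic is split 50/50, the question is whether keyword A's 60% share of click-throughs is meaningfully different from the expected 50%. A minimal sketch using the normal approximation to the binomial (the function name is my own, not from any library):

```python
import math

def share_z_score(successes, total, expected=0.5):
    """Z-score for an observed share vs. an expected 50/50 split,
    using the normal approximation to the binomial."""
    p_hat = successes / total
    se = math.sqrt(expected * (1 - expected) / total)
    return (p_hat - expected) / se

# 500 click-throughs, 60% went to keyword A:
print(share_z_score(300, 500))  # about 4.47 -- well past the 1.96 cutoff
# Same 60/40 split, but only 20 click-throughs:
print(share_z_score(12, 20))    # about 0.89 -- not significant
```

The same 60/40 split is overwhelming evidence at 500 click-throughs and essentially noise at 20, which is the whole point of the example above.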
