Hey friends,
I spent three years on Amazon's central experimentation team, where I analyzed over 100 A/B tests. My coworkers and I also hosted office hours to help product owners with A/B testing best practices. Today, I will share three common biases I observed and how you can avoid them.
I'm actually traveling while writing this article, and I'd love to meet you in person! If you live in NYC, meet me here this weekend. If you live in LA, meet me there next weekend.
If you can't meet me in person, here is a great virtual event on ChatGPT for Product Led Growth (PLG): Talk with Your Salesforce or Segment Data; you can register here. You'll learn how to build an app using LangChain and a vector database, best practices on data privacy, and how to handle multi-modal data.
Alright, let's get started.
#1 Bias in the data quality check
When the experiment has statistically significant positive results, everyone wants to celebrate. However, when the results show that the new feature is not a good idea, the experiment owner usually says: "Based on XYZ, we think this is a winning feature! So, is there a data quality issue here? Or is this just noise? Should we re-run the test?"
People like to believe their assumptions about the features are correct.
However, more than 60% of the time, they are wrong. And that's okay -- that's the reason we need experimentation.
The thing is, how do you know your "successful" experiment is a real success?
In large companies like Amazon, Google, and Facebook, an experiment can affect millions of customers, so most of the percentage lifts you see are small. It wouldn't surprise me if a lift is under 0.1%. So, when you see an experiment with a lift above 3%, that should raise an alarm.
The data quality check in A/B testing shouldn't only be triggered when the result is not as expected or "bad." In fact, there are no "good" or "bad" experiments. You always learn something about the customer.
What should you do?
The team should develop mechanisms to check data quality regardless of how the metrics move. Create an alarm if the absolute value of the percentage lift is too large. Don't celebrate too early when you see a 5% positive lift - it's probably too good to be true.
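As a sketch of what such a mechanism might look like (the 3% threshold, the numbers, and the function name are illustrative assumptions, not rules from any real system), you could add a simple guardrail to your results pipeline:

```python
def lift_sanity_check(control_mean: float, treatment_mean: float,
                      max_abs_lift: float = 0.03) -> str:
    """Flag suspiciously large lifts for a data-quality review before anyone celebrates."""
    lift = (treatment_mean - control_mean) / control_mean
    if abs(lift) > max_abs_lift:
        return (f"ALERT: {lift:+.2%} lift -- check bucketing, instrumentation, "
                f"and bot traffic before trusting this result.")
    return f"Lift of {lift:+.2%} is within the expected range."

# A 5% lift on a hypothetical conversion metric trips the alarm
print(lift_sanity_check(control_mean=0.050, treatment_mean=0.0525))
```

The exact threshold depends on how your metrics usually move; the point is that the check fires automatically, not only when someone dislikes the result.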
#2 Bias in choosing metrics
Here is an example. When a product owner designed an experiment, their key metric was the number of purchases. Later, they found that purchases didn't move much once the experiment finished. However, some segments they hadn't paid attention to showed significant results: more customers purchased shoes, or Android users had more sign-ups. Looks like a reason to launch the feature!
Let's take a step back.
An A/B test evaluates an assumption, or in some cases a few assumptions. That means the launch criteria need to be determined before you kick off the experiment.
Changing your success metric after observing the experimentation data to support the launch decision is called "cherry-picking." This invalidates the statistical tests.
What should you do?
If you slice the data thin enough, some small subgroup will almost always move in the direction you want. It could be totally random, so you should stick to your original launch criteria (see the quick simulation below). But what if those new observations are real effects?
Take this observation and create a new experiment with this assumption -- using the new metric as your launch criteria. But don't use this metric as an excuse to launch your current experiment.
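To see how easy it is to find a "winning" subgroup by chance, here is a small simulation sketch. Both groups are drawn from the same distribution, so the true effect is zero; the 5% conversion rate, traffic numbers, and 20 segments are all made up:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(7)
n_per_arm, n_segments = 200_000, 20

# Control and treatment share the same 5% conversion rate: the true lift is zero
control = rng.binomial(1, 0.05, n_per_arm)
treatment = rng.binomial(1, 0.05, n_per_arm)

# Pretend each user belongs to one of 20 segments (device, category, geo, ...)
seg_c = rng.integers(0, n_segments, n_per_arm)
seg_t = rng.integers(0, n_segments, n_per_arm)

spurious = 0
for s in range(n_segments):
    c, t = control[seg_c == s], treatment[seg_t == s]
    _, p_value = proportions_ztest([c.sum(), t.sum()], [len(c), len(t)])
    if p_value < 0.05:
        spurious += 1

print(f"{spurious} of {n_segments} segments crossed p < 0.05 despite zero true effect")
```

Run it with a few different seeds and, more often than not, at least one segment looks like a winner purely by chance.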
#3 Bias in boosting A/B testing productivity
An online experiment needs to run for at least a week; to have enough statistical power, some need almost a month. Sometimes, product leaders think that's too slow for innovation and want to see if they can "get more bang for the buck" from a single experiment.
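How long is "long enough"? A back-of-the-envelope power calculation makes it concrete. The 5% baseline conversion rate, the 2% relative lift, and the daily traffic below are made-up numbers, not figures from any real experiment:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05                   # hypothetical 5% conversion rate
target = baseline * 1.02          # smallest lift you care about: +2% relative
effect_size = proportion_effectsize(target, baseline)

# Users needed per arm for 80% power at alpha = 0.05 (two-sided test)
n_per_arm = NormalIndPower().solve_power(effect_size=effect_size,
                                         alpha=0.05, power=0.8,
                                         alternative='two-sided')

daily_users_per_arm = 50_000      # hypothetical traffic
print(f"~{n_per_arm:,.0f} users per arm, "
      f"about {n_per_arm / daily_users_per_arm:.0f} days of traffic")
```

Detecting a small lift on a modest metric can easily take hundreds of thousands of users per arm, which is why a week is usually the floor, not the ceiling.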
3.1 Testing more than one feature
There are no rules on how many features you can test in an experiment. So, if you have more than one idea about a feature, why not test them all together? Let's have 10 groups??
When you split your traffic, you might not have enough sample size for each feature, so you might not detect a change even if there is one. You also increase the chance of false positives.
In general, we don't recommend testing more than three features, and you need to run additional statistical tests and adjust the p-values to avoid false signals. Testing too many features at the same time is counterproductive.
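For the p-value adjustment, a Holm (or Bonferroni) correction is a common choice. Here is a sketch with made-up p-values from comparing five treatment variants against a control:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values, one per treatment variant vs. control
raw_p = [0.012, 0.034, 0.049, 0.21, 0.47]

reject, adjusted_p, _, _ = multipletests(raw_p, alpha=0.05, method='holm')
for raw, adj, significant in zip(raw_p, adjusted_p, reject):
    print(f"raw p = {raw:.3f}  adjusted p = {adj:.3f}  significant: {significant}")
```

Three variants look "significant" on their raw p-values, but none survive the correction -- exactly the kind of false signal the adjustment is there to catch.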
3.2 Stacking multiple changes in one group
Okay, so if I can't split into too many groups, how about I run a two-sample test where I combine all the changes I want into one feature for the treatment group? I'll move this button to a different location on the page, change its color, and add a banner.
I don't need to worry about not having enough traffic anymore!
However, what if this experiment doesn't show any significant changes? You only learn that this combo doesn't do anything for your customers, but you won't understand how customers react to the new color, the button location, or the banner individually.
You might get lucky that this experiment has significant results, but it's dangerous to launch it without understanding how each change contributed to the lift.
Instead, test one change at a time. Don't stack things you want to test.
In A/B testing, less is more. Be patient.
Design the experiment so that you learn something either way, instead of wasting time betting on which feature will win.
I know it's hard to overcome those biases when you have worked on a product for so long and don't want to "kill" your baby.
When you find it hard to stick to the best practices, think about the end goal -- you want to create something useful for your customers. Even if you manipulate the data to launch the product and report a "win" for the team, your business metrics will eventually suffer because you launched something customers don't want.
That's it for today. I have many more stories about A/B testing; stay tuned!
Hope to see you in NYC or LA! If not, reply to this email to say hi or check out the Gen AI workshop.
Until next time,
Daliana