Please, Please Don’t A/B Test That

I confess: I actively discourage teammates from running experiments.

Huh? Don’t you do growth product? Aren’t experiments, like, your shtick? Yes, and yet I’m often the loudest voice arguing against testing a product change as an experiment. Let me explain:

A/B testing has become so easy that I often hear people suggest “let’s test it!” Although it sounds fantastic, it’s often not the right thing to do.

A/B testing is not an insurance policy for critical thinking or knowing your users. Inappropriately suggesting to A/B test is a good way to sound smart in a meeting at best, and cargo cult science at worst.

Checklist for Running an A/B Test

So, when is it right to run an A/B test? Here’s a quick checklist that you can use to push back on anyone advocating an inappropriate A/B test:

[Image: A/B test checklist visualization]

Thanks to Brian Balfour and Susan Su at Reforge for turning this post into a visualization.

A “yes” to either of these:

  • Do we need precise quantification of the change?
  • Is there a plausible downside to this?

And a “yes” to all of these:

  • Do you have a well-formed hypothesis?
  • Is there any possible outcome that will change the course of our actions?
  • Does this metric really mean anything?

Pro tip: Add this checklist to your experiment doc template!
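If your experiment doc template lives near code, the checklist can be written down as a tiny function. This is only a sketch of the "yes to either, yes to all" logic above; the class name, field names, and `should_ab_test` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ChecklistAnswers:
    # A "yes" to either of these is enough.
    needs_precise_quantification: bool
    has_plausible_downside: bool
    # A "yes" to all of these is required.
    has_well_formed_hypothesis: bool
    outcome_could_change_actions: bool
    metric_is_meaningful: bool


def should_ab_test(answers: ChecklistAnswers) -> bool:
    """Return True only when the checklist says a test is worth running."""
    any_of = (
        answers.needs_precise_quantification
        or answers.has_plausible_downside
    )
    all_of = (
        answers.has_well_formed_hypothesis
        and answers.outcome_could_change_actions
        and answers.metric_is_meaningful
    )
    return any_of and all_of
```

A change with no plausible downside and no need for precision fails the check, no matter how good the hypothesis is.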

Let’s talk about where this checklist comes from.

A/B Testing is Expensive

Expensive? Tal, you’re behind the times. There’s like, virtually no code involved these days.

I don’t want to hear how slick your experiment tooling is, or how few lines of code are required. Running a test will always consume time and energy.

Planning Time and Energy

No tool will tell you what metrics are important to your business. No tool will help you think through your eligibility criteria and your product edge cases. No tool will articulate a hypothesis for you, nor will any tool guess at the best-case scenario based on your team’s learnings. A skilled person has to sit and think about this.

Decision Time and Energy

Then there’s the cost of analysis: digging into segments, cleaning data, and solidifying conclusions, to name a few. This, too, will consume a skilled person’s expensive time.

Analysis shmanalysis, my software does that! Not so fast. Product teams are notorious for poor interpretation of experiments, biases, p-hacking, and using results to confirm pre-existing beliefs. I’ve been guilty of a few of those myself. Without the input of skilled professionals, running an experiment may often do more damage than doing nothing at all.
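To see why digging into segments is risky without a skilled eye, here is a back-of-the-envelope sketch. It assumes the change has no real effect, that each segment is checked independently, and that each check uses the conventional 5% significance level. The function name is hypothetical.

```python
def chance_of_false_positive(num_comparisons: int, alpha: float = 0.05) -> float:
    """Probability that at least one of several independent comparisons
    looks "significant" purely by chance, when there is no real effect."""
    return 1 - (1 - alpha) ** num_comparisons
```

Under those assumptions, slicing a flat result into ten segments gives about a 40% chance that at least one of them looks like a win.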

Complexity and Delays

Testing means you have two co-existing versions of your product for a period of time. That makes it harder for everyone to move fast in that area of the product:

  • Your immediate product team
  • Other product teams
  • Customer-facing teams

It also delays the moment when all of your users experience the change (if it’s beneficial). No matter how slick the tools become, they won’t beat the laws of physics and mathematics.
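The mathematics in question is mostly sample size. Here is a rough sketch using the standard normal-approximation formula for comparing two conversion rates. The baseline rate, the lift, the traffic figures, and both function names are hypothetical, and your own stats tooling may use a different method.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline: float, relative_lift: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate visitors needed in each arm to detect a relative lift
    in a conversion rate with a two-sided test (normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def days_to_run(n_per_arm: int, daily_eligible_visitors: int, arms: int = 2) -> int:
    """Days needed to fill every arm, assuming steady daily traffic."""
    return math.ceil(n_per_arm * arms / daily_eligible_visitors)


# Hypothetical numbers: 5% baseline conversion, hoping for a 5% relative lift.
n = sample_size_per_arm(baseline=0.05, relative_lift=0.05)
days = days_to_run(n, daily_eligible_visitors=10_000)
```

With these made-up numbers, that works out to roughly 120,000 visitors per arm, or about 25 days at 10,000 eligible visitors a day. No tooling shortens that.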

A/B Testing Can Be Utterly Useless

I’ve witnessed many intelligent people turn this powerful tool into a whimpering distraction.

1) A/B Testing Is Useless When None of the Outcomes Matter

Say you have an experiment that can turn out positive, flat, or negative. If you end up rolling out the feature regardless of the result, why run the test in the first place?

It’s important to have at least one scenario in which you take different action. It can be as simple as this example:

  • If your metric is positive or neutral, roll it out
  • If your metric is negative (down by more than 0.5%), keep iterating

The simple act of mapping outcomes to different actions can save a lot of A/B tests from happening. An even more disciplined way to do it is using percentage thresholds. Writing this down ahead of time will raise the credibility and integrity of your experiments.
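Writing the mapping down can be as literal as a few lines of code agreed on before the test starts. This sketch encodes the two-bullet example above. The function name is hypothetical, and treating a dip smaller than the threshold as "neutral" is my assumption.

```python
def planned_action(relative_lift: float, negative_threshold: float = -0.005) -> str:
    """Map an observed relative lift (e.g. -0.012 for -1.2%) to the action
    the team committed to before the experiment started."""
    if relative_lift < negative_threshold:
        # Worse than the agreed threshold: do not ship this version.
        return "keep iterating"
    # Positive, flat, or a dip smaller than the threshold (assumed neutral).
    return "roll out"
```

If every branch of a function like this returns "roll out", that is a strong hint the test isn't worth running.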

2) A/B Testing Is Useless When Your Metric Doesn’t Mean Anything

The most important part of planning an A/B test is deciding what dependent variable you are measuring. Ideally this metric will be as close as possible to the business’s top-level metrics. That’s not always possible, and using more accessible proxy metrics is totally fine.

The problem arises when the metric you’re measuring is so far from the bottom line that it just doesn’t matter. No matter what the result of the test, it doesn’t really tell us anything. The metric could go down, and it could still be a good thing for the business. This is often called the streetlight fallacy.

One example is account signups. Account signups don’t keep the lights on for most businesses. While they’re shared publicly as a vanity metric, they’re rarely a top-level strategic metric. Recently we discussed changes to our home page, and someone suggested A/B testing for account signups. We realized, however, that testing our change that way would not shed much light on whether we’re accomplishing our mission of getting creators paid.

A/B Testing Can Be Incredibly Valuable When…

Lemme take a break from raining all over the experiment parade, and talk about two conditions where A/B testing is extremely valuable.

1) You Need Precision

An A/B test is perfect when you need to know precisely how much better or worse something fared.

Precision for learning: As a product team, you’re constantly hoping to learn and hone your hypotheses. Early on, you’ll have no idea what actually matters. It’s hard to predict what changes lead to dramatic improvements, and which fall flat. A/B testing is great for learning.

On the Patreon growth team, we’ve run dozens of experiments on our payment checkout flow, hoping to convert more of a creator’s fans to patrons. Like on any growth team, the vast majority of our experiments have fallen flat. With each failed experiment, our “aim” gets better and better. When we have a win, we know it’s not because of “one simple trick,” but because of a long line of iteration.

Another example is how our Payments team has precisely characterized the confusion around various payment types. In one A/B test, they measured the number of support tickets received in both groups, and learned exactly what kind of clarity and education spoke to patrons.

Precision for tradeoffs: Often, you’ll need to make a tradeoff decision. Even if something was obviously a positive change, was it worth the variable costs? Was it worth the tradeoff in other metrics? Is working in this area worth the team’s time? In these cases, we need to precisely know the magnitude of the effect.

In one experiment, the Patreon growth team worked with the Trust & Safety team to help weed out fake and fraudulent users from real patrons. This required adding friction to our checkout flow, and we expected conversion to go down. We ran it as an A/B test because we wanted to know precisely how much so that we could decide whether the benefit was worth the cost.
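"Precisely how much" usually means an estimate with an interval around it, not just a direction. Here is a minimal sketch using a simple normal-approximation (Wald) interval for the difference between two conversion rates. The function name and any counts you feed it are illustrative; this is not the analysis we actually ran.

```python
import math
from statistics import NormalDist


def conversion_diff_ci(conversions_a: int, visitors_a: int,
                       conversions_b: int, visitors_b: int,
                       confidence: float = 0.95) -> tuple[float, float, float]:
    """Return (difference, lower, upper) for rate_b - rate_a,
    using a normal-approximation (Wald) confidence interval."""
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    diff = p_b - p_a
    std_err = math.sqrt(p_a * (1 - p_a) / visitors_a + p_b * (1 - p_b) / visitors_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return diff, diff - z * std_err, diff + z * std_err
```

The width of the interval is what tells you whether you know the cost well enough to weigh it against the benefit.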

2) There’s a Plausible Downside

Most of the time, you’ll be testing something you hope will be positive. In many cases, even the best ideas will still be risky. If you can plausibly imagine a change backfiring, A/B testing provides blessed optionality. Whether that’s a hit to a financial metric or an unintended consequence to a related behavioral metric, transparency is power. You only have to roll out changes that have positive impact. Magic!

Even if you think a negative result is unlikely, you may be working in a high stakes area of the product. In those cases, such as a checkout flow, or high-traffic area, even a small chance of a very bad thing is enough to keep you up at night.

Downsides can be more than quantitative. If something has the potential to be an awkward experience, or complex engineering liability, it should really be worth it. At Patreon, we tested a time-delayed “abandoned cart email” that has become a staple of e-commerce sites. While in a strictly quantitative sense there could be no downside to shooting off the email (as it would only be sent to visitors who didn’t convert earlier that day), we felt it might be an awkward experience for creators and patrons, since Patreon is not your average e-commerce site. There was a downside to rolling it out blind, and we needed to verify that it really got creators paid (it didn’t).

In many cases however, I hear people advocating for a test when it’s hard to imagine a real downside to a change.

Here’s an example: When Patreon released a rebranded website and mobile app, we changed creators’ navigation bar. We moved creators’ “make a post” button from one side of the window to another. There was a ton of queasiness from the executive team — and requests for A/B testing.

As PMs, our response was: hey, take a step back. Think about who is taking this action, and in what context? How much intent do they have? What do we know about them from all our research and experience? We zoomed out to what we knew about our creators, and decided that changing the location of this button wouldn’t hurt creators. We went ahead without an A/B test, simply monitoring the metric directionally (nothing changed).

Of course, any change can cause unintended consequences, but the majority do not. Product teams can’t both move fast and also constantly pre-empt the 99.9th percentile of scenarios.

When you can imagine a plausible downside to your change, A/B testing is a great way to ensure a good night’s sleep.

Be Confident in Your Team’s Knowledge

The question of whether to A/B test comes down to how confident your team is in knowing your users. How well do you know how they think, how they feel in this funnel, what mindset they’re in, and what’s important to them?

To quote Des Navadeh, product manager at Stack Overflow:

“If we are confident that the change aligns with our product strategy and creates a better experience for users, we may forgo an A/B test.”

Early on, a product team won’t know much. Experiments yield plenty of surprises, and making assumptions feels dangerous. This is a great time to test things and learn — A/B tests are a great way to invest in future knowledge and speed.

Dozens of experiments and weeks of qualitative research later, your team will have a preeeetty good understanding of what makes your customers tick in different situations. You don’t have to test every product change as if you don’t know anything about your end user.

How to Build Confident Knowledge

  • Experience spent working with a specific user persona and area of the product
  • Failed experiments (and even the occasional success)
  • Qualitative research, talking to customers and end users
  • Experience working with the same customer and context in past companies

What DOESN’T Count as Knowledge

  • Reading blog posts and case studies of what worked for other companies
  • “Best practices” and growth hacks from Slide Decks on The Internet
  • Generic pop psychology research and articles

As you can tell, confidence comes from focusing on your users and spending extended, focused time on one persona in your product. For example, the motivations and behaviors of someone using expense reporting software at work are not the same as their motivations and behaviors in the flash sale they have open in another tab, or in the mobile game they play on their commute.

Build knowledge and confidence with sustained, long-term focus and learning.

You Will Be Wrong, and That’s Okay

On the Patreon growth team, we often remind ourselves that, “we’re a business, not a laboratory.” If there’s a product quality improvement we want to make, we have a confident hypothesis, and we’ve been running experiments for multiple quarters, then it’s sometimes okay to roll out rather than A/B test.

A/B testing is not a free insurance policy against having to take risks. In many cases, speed matters more than learnings. Save time and energy for the A/B tests that matter, and do more of those.

Of course, A/B testing is just one of many tools available to product teams. To quote Des Navadeh once more:

“Product thinking is critical here… If we are confident that the change aligns with our product strategy and creates a better experience for users, we may forgo an A/B test. In these cases, we may take qualitative approaches to validate ideas such as running usability tests or user interviews to get feedback from users.”

“It’s a judgement call. If A/B tests aren’t practical for a given situation, we’ll use another tool in the toolbox to make progress.”

Now, Push Back

Hopefully, the next time someone inappropriately suggests an A/B test, you can elegantly explain to them why it doesn’t meet the criteria for an A/B test. A bunch of your skilled teammates — and customers — will thank you.

Thank you to Buster Benson, Maura Church, Adam Fishman, Mike Jonas, and Wyatt Jenkins for their invaluable feedback on this post.

Join us at Patreon — we’re hiring.