P-Hacking in Startups
Science has a problem.
It’s kind of broken.
Well. Not all of it. Mostly the social sciences and medicine. And I don’t just mean the fact that they consider Freud canon.
It started with a trickle. A retracted paper here. A study that couldn’t be repeated, there.
Then someone decided to get systematic. It opened the floodgates. A 2016 survey by Nature found that more than 70% of researchers had tried and failed to reproduce another scientist’s experiments, and more than half had failed to reproduce their own.
Reproducibility is fundamental to the scientific method - it’s supposed to be a study of the natural world, which doesn’t change all that often - so what does its absence mean? Are we incompetent? Can we trust anything? Do we know anything?
The high failure rate of venture-backed startups is its own kind of replication crisis: “How could my company fail? I followed the growth-hacking, blitz-scaling advice from the founders who made it big!” I don’t mean to give blogs and podcasts the weight of peer-reviewed science. But our industry seems to trust them as if they deserve it.
What does it mean if a founder can’t get similar results when following the practices of another?
Science has begun to heal itself. It’s time for startups to go through their own reckoning. Their methods are failing most people. It’s time to learn why and how to get better.
What’s wrong with science?
The crisis in science has multiple, interconnected causes. A lot of them come down to taking techniques from simpler systems and applying them to the far more complex study of humans. The practices useful for studying minerals also worked great on metals, but with people? Not so much.
One of the most famous examples of these studies that fizzle under scrutiny is the marshmallow experiment, conducted at Stanford University in 1972 on the children of students enrolled there. It produced original, important conclusions on the ability of children to endure delayed gratification, and later studies showed that ability was highly correlated with success later in life. Suddenly we’ve got a new tool for predicting, at a very young age, how successful you’ll be.
Or… maybe not. Further studies showed the original work was actually just exposing the socioeconomic background of the kids. If your family is well off, you are comfortable with delayed gratification and, just coincidentally, are also likely to be well off when you’re older. If you’re from a poor family, delayed gratification is harder to accept and, huh, you’re also more likely to be poor than those kids of rich parents.
Once someone reran the study with a larger group of kids (900 instead of 90) and controlled for socioeconomic background… the effect largely disappeared. It’s not all that surprising that kids with no food insecurity are better at delaying gratification and also will be more successful in life. It certainly doesn’t grab the headlines like announcing that kids who can wait five minutes to eat a marshmallow will earn more money than those who can’t. No HBR article for that one.
It’s been almost fifty years since this study was published. That’s five decades of science based on flawed work, five decades of science that has to be unwound and retried. The longer these mistakes last, the more expensive they are to fix. And like that HBR article above, many conclusions never get retracted.
One particular “technique” has helped trigger the crisis in science. Many a growth-hacking product manager has fallen into the same trap. They can only be rescued through discipline and rigor.
The how and why of P-hacking
Abusing data is a sure way to get bad results. Unlike startups, scientists rarely just make up their data. They make more subtle mistakes, like p-hacking. This probably sounds pretty cool, but it’s actually a common form of data misuse. Wikipedia describes it this way:
…performing many statistical tests on the data and only reporting those that come back with significant results.
It works like this:
A researcher comes up with an idea for a study. He collects a bunch of data, runs the experiment and… no dice. The idea didn’t pan out.
Hmm. “I have all this data. I can’t just throw it away.”
So he starts slicing the data looking for something that stands out. After a while, sure enough, he finds some correlation that is strong enough to stand up - usually its P-value is under 0.05, and thus considered statistically significant. He publishes this in a paper and looks like a genius. It gets big exposure in the press. Journalists love weird and surprising science. They can report on it without understanding it.
But no one can reproduce the work. The paper gets retracted. He gets uninvited from the big conferences. (Don’t worry. The newspapers never follow up and publish the retraction.)
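To see why slicing until something turns up is so dangerous, here is a minimal simulation sketch. It assumes a hypothetical study in which nothing has any real effect, cuts the data 20 different ways, and runs a two-proportion z-test (normal approximation) on each slice. The function names, the slice count, and the sample sizes are all illustrative assumptions, and the slices are treated as independent, which slices of one real data set never quite are.

```python
import math
import random


def two_proportion_p_value(hits_a, n_a, hits_b, n_b):
    """Two-sided p-value for a difference in proportions (normal approximation)."""
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (hits_a / n_a - hits_b / n_b) / std_err
    return math.erfc(abs(z) / math.sqrt(2))


def familywise_error(num_tests, alpha=0.05):
    """Chance that at least one of num_tests independent null tests looks significant."""
    return 1 - (1 - alpha) ** num_tests


def sliced_study_finds_something(rng, num_slices=20, n_per_group=200, alpha=0.05):
    """Simulate one study with NO real effect, sliced num_slices ways."""
    for _ in range(num_slices):
        # Both groups share the same 50% outcome rate: any difference is noise.
        hits_a = sum(rng.random() < 0.5 for _ in range(n_per_group))
        hits_b = sum(rng.random() < 0.5 for _ in range(n_per_group))
        if two_proportion_p_value(hits_a, n_per_group, hits_b, n_per_group) < alpha:
            return True  # a publishable "finding" that is pure chance
    return False


def simulate(num_studies=400, seed=1):
    """Fraction of no-effect studies that still turn up a 'significant' slice."""
    rng = random.Random(seed)
    found = sum(sliced_study_finds_something(rng) for _ in range(num_studies))
    return found / num_studies


if __name__ == "__main__":
    print(f"expected by arithmetic: {familywise_error(20):.2f}")
    print(f"observed in simulation: {simulate():.2f}")
```

With 20 independent looks at a 0.05 threshold, the arithmetic is 1 - 0.95^20, or roughly a 64% chance of at least one "significant" result in data that contains nothing at all. The simulation should land in the same neighborhood.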
What went wrong?
He left out one key piece: How he got the data.
Let’s say he thinks breastfed kids are healthier than bottle-fed kids. He sets up a study that tries to isolate just these variables, which means he wants his population to be reasonably homogeneous (similar quality of life, similar locations, etc). Put simply, the difference being researched should be the only material one in the population (unlike in the marshmallow experiment).
But then he looks at the data and - like most of these studies - finds there’s no significant difference in health outcomes between breastfed and bottle-fed kids.
He could just toss the data. But, well, he’s already paid to collect it. He’s got all these graduate students who are working nearly for free. He might as well try something. So he puts a student or two on trying to find useful results.
They nearly always do, but… that success kills his work. All those controls to make it work for his original experiment fatally bias it for other studies.
Let’s say he discovers that the study participants who were bottle-fed tended to move around a lot more than people who were breastfed. He concludes, oh, wow, getting bottle-fed causes you to hate your parents and move away. (Yes, this is exactly the kind of headline that would get picked for a result like this.)
He has not proven that. All he has shown is that, in this particular (probably small, and certainly narrow) data set, that happens to be the case.
He should throw away all existing data. Start from scratch, controlling for everything except this new variable under test. Only then can he look for correlations between how a baby was fed and mobility.
But he was too lazy or scared to do that. He found a match in that smaller, biased data set, and then published the results without admitting the problems in either his data or his methods. A few decades ago he would have gotten away with it: A big splashy result on publication, and then everyone just assuming this was true, with no attempt to reproduce and no real questioning of the result.
Today, no chance. Science has developed defenses against this kind of malpractice.
Preregistration of experiments is a key tool.
Researchers register with a central database that they are going to study the health of breastfed vs. bottle-fed babies. When they get results, they point to that registration and say, see, this is what led to my data collection.
If they then wanted to publish some other study, people would say, no, you didn’t preregister this, which makes us suspect you’re p-hacking, so we’re going to do a deep dive on how you got your data. On second thought, we’re just going to reject your paper. Come back when the results hold on a clean dataset.
From social science to startups
This might not initially seem to have anything to do with startups. Product managers and marketers aren’t commissioning studies - and they certainly aren’t controlling for variables!
Hmm. If you look at it a bit funny… Every data-backed marketing campaign and feature launch is an experiment.
Let’s build an analogous example.
A product manager builds a new feature, and because he’s growth hacking, he has lots of telemetry to tell him exactly how people are using it.
His theory is that people will use this new feature in some specific way. But he builds it, ships it, and observes, well, hmm, no, almost no one is using it. It’s a bust. I’m sure you’ve never worked on a project like this, but trust me, it happens.
Except… hey, there’s this small group that is using it, and widely. He looks into it more closely, and realizes they’re using it at 10x the rate people use the rest of the product. So he changes plans, and he rebuilds the feature around the specific thing those few people were doing with it.
Wait, what? No one uses that feature, either, and even worse, the people who originally used it aren’t any more, now that it’s focused on their actual usage!
What went wrong?
You got caught p-hacking
The data set from his failed feature is bad data. He got the most important result: This feature did not work well for his users. He wasn’t willing to let go of failed work. Just like the scientists, he went looking for some other way to reuse it. And instead of developing new hypotheses and running new experiments, he took his biased data and tried to find new correlations cheaply.
Unfortunately for him, he did.
But when he ships the new feature, he is faced with a harsh truth: Those few people who were using the feature in unexpected ways don’t look like the rest of his users. A new feature built for that purpose doesn’t help everyone else. And because he relied on data to make his decisions instead of talking to actual users, he learned too late that those unrepresentative users were doing something even weirder. His simplified feature actually removed that weirdness in the name of a simplicity that everyone can use.
So now he’s two features in with nothing to show for it. So much for growth-hacking.
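Part of this trap can be sketched as a simulation. It assumes a hypothetical product with 30 user segments in which the feature has exactly the same true adoption rate everywhere, picks whichever segment looks best after launch, and then measures that same segment again on fresh data. The segment count, the rates, and the function names are illustrative assumptions, not real telemetry.

```python
import random


def observe_adoption(rng, num_segments, users_per_segment, true_rate):
    """Observed adoption rate per segment when every segment shares true_rate."""
    return [
        sum(rng.random() < true_rate for _ in range(users_per_segment)) / users_per_segment
        for _ in range(num_segments)
    ]


def winner_then_retest(rng, num_segments=30, users_per_segment=50, true_rate=0.10):
    """Pick the best-looking segment, then measure that segment again on new data."""
    first_look = observe_adoption(rng, num_segments, users_per_segment, true_rate)
    best = max(range(num_segments), key=lambda i: first_look[i])
    fresh_look = observe_adoption(rng, num_segments, users_per_segment, true_rate)
    return first_look[best], fresh_look[best]


def average_winner_and_retest(trials=500, seed=7):
    """Average the 'winning' rate and its fresh-data retest over many launches."""
    rng = random.Random(seed)
    results = [winner_then_retest(rng) for _ in range(trials)]
    avg_winner = sum(first for first, _ in results) / trials
    avg_retest = sum(fresh for _, fresh in results) / trials
    return avg_winner, avg_retest


if __name__ == "__main__":
    winner, retest = average_winner_and_retest()
    print(f"best segment at launch:   {winner:.1%} adoption")
    print(f"same segment, fresh data: {retest:.1%} adoption")
```

The winning segment looks far more enthusiastic than it is, simply because it was chosen for being extreme. On fresh data it drifts back toward the true rate. This covers only the selection effect; it does not model the other half of the story, where the niche users were doing something the telemetry never captured.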
How do I fix it?
The solution is very similar to what science has done.
Connect your data to experiments. With discipline. You must get new, clean data for each new test. I know this is anathema to modern data-oriented product management. But it’s the only real way to trust your results.
That word discipline is key. You don’t need to build some international central registry. Whatever your mission statement says, you’re not really saving the world, and you’re not actually doing science. You’re just trying to build a product people love. What you need is rigorous internal practices, and to hold each other accountable so you can’t cheat at statistics.
Unfortunately, this requires you to let go of one of Silicon Valley’s most cherished and wrong beliefs.
Experiments fail. This might be an important part of the process, but it’s not very valuable. Congratulations. Of all the possible ways you could fail, you’ve discovered one of them. Don’t let it go to your head.
Don’t work too hard to salvage that failure. You’re p-hacking, and just making it worse. Yes, obviously, you get personal lessons. You might be lucky enough to learn something that triggers your next experiment. But you have to go run that separately.
You can’t build on the detritus of failure.
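The internal discipline described above (write the hypothesis down first, then collect fresh data for it) can be sketched in a few lines. This is a hypothetical in-house log, not a real tool or library. The class names, fields, and example dates are assumptions made for illustration, and a shared document that everyone agrees to respect would serve the same purpose.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Registration:
    """A hypothesis written down BEFORE any data for it is collected."""
    hypothesis: str
    metric: str
    registered_on: date


class ExperimentLog:
    """Hypothetical in-house registry of planned experiments."""

    def __init__(self):
        self._registrations = {}

    def register(self, name, hypothesis, metric, today):
        if name in self._registrations:
            raise ValueError(f"{name!r} is already registered; a new test needs a new entry")
        self._registrations[name] = Registration(hypothesis, metric, today)

    def check_analysis(self, name, metric, data_collected_from):
        """Raise unless the analysis matches a registration made before data collection."""
        reg = self._registrations.get(name)
        if reg is None:
            raise ValueError(f"{name!r} was never registered: this is exploration, not a test")
        if metric != reg.metric:
            raise ValueError(f"registered metric was {reg.metric!r}, not {metric!r}")
        if data_collected_from < reg.registered_on:
            raise ValueError("data predates the registration; collect fresh data")
        return reg


if __name__ == "__main__":
    log = ExperimentLog()
    log.register("bulk-export", "Admins will export weekly", "weekly_exports", date(2024, 3, 1))
    # Passes: registered metric, data gathered after registration.
    log.check_analysis("bulk-export", "weekly_exports", date(2024, 3, 4))
    try:
        # Fails: a different metric dug out of the same launch is a new hypothesis.
        log.check_analysis("bulk-export", "power_user_sessions", date(2024, 3, 4))
    except ValueError as err:
        print(f"rejected: {err}")
```

The code itself matters less than the habit it encodes: an interesting pattern found in old data becomes a new registration with its own fresh data, never a conclusion.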
So my data is now worthless?!
Of course not. I still rely on data for all kinds of problems. One of the great things about building a company today is how easily you can get information at scale.
But never let yourself forget that your data is heavily biased, especially by how it was collected. One of my favorite examples is from when YouTube dramatically reduced response time. Their average response times went up! Suddenly people with much worse connectivity found it worth using, making the average worse. The developers thought they were helping existing users, but the biggest impact was in creating new ones.
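The arithmetic behind that surprise is worth working through once. The numbers below are hypothetical, chosen only to make the sums easy to follow, and are not YouTube’s actual figures.

```python
def average_load_time(groups):
    """Weighted average load time over (user_count, seconds_per_load) groups."""
    total_users = sum(count for count, _ in groups)
    return sum(count * seconds for count, seconds in groups) / total_users


# Hypothetical numbers, for illustration only.
before = [(1000, 5.0)]              # existing users on good connections
after = [(1000, 2.0), (500, 20.0)]  # same users, now faster, plus new users on slow links

if __name__ == "__main__":
    print(average_load_time(before))  # 5.0
    print(average_load_time(after))   # 8.0
```

Every existing user got faster, yet the average rose from 5 to 8 seconds, because the population being measured changed. The metric moved in the wrong direction for the best possible reason.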
You have to recognize your job isn’t to find some way to make the data valuable. Your job is to make high-quality decisions. Use data when you can. If you don’t have data, go get it.
But the job of the data is to inform you, not give you answers. Use it to hone your instinct, to improve your decision-making. When something doesn’t add up, go talk to the actual humans who are the source of the data. And even spend some time with people not represented in it.
If you’re working at a software startup, you’re not doing science (even if, like me, you have a science degree). But you should still take advantage of its discipline and practices.
Don’t stop at protecting yourself from p-hacking. One founder’s success might be hard to replicate for many reasons. Gain what lessons you can. But don’t blindly trust others’ stories of their work.
Because failure on your part won’t be paired with the retraction of a Nature paper; it’ll be an announcement of layoffs in TechCrunch.