I found myself wondering the other day - hey, didn’t Google used to predict flu season? It sure seems like that sort of thing could be useful these days. It turns out that the story of Google Flu Trends was a sort of ur-story of data science, making all the core mistakes of Big Data before most people had ever heard of the phrase.
Approximately a million years ago (2008), Google claimed they could use search data to accurately predict the spread and severity of the seasonal flu. If you have ever heard of it, you may be wondering right now why they’re not using that technology to predict the future path of Covid-19. The answer is that while it was easy to build a model that predicted past flu seasons, predicting the future turned out to be much harder.
What was Google Flu Trends?
Google Flu Trends was a pretty cool application of the unstructured data that Google has to offer. For most people, Google is the first stop for medical advice - “sudden headache”, “constant fatigue”, “weird stuff oozing out of weird place”. This is true for chronic and acute conditions alike, and a lot of people search for things like “high fever” or “cough + joint pain”. Some researchers had the pretty sensible idea of indexing search terms that could be related to the flu and using them to predict where the flu was spreading and how bad flu season was going to be.
The results were amazing. The researchers published them in Nature in early 2009 [Google Research]. The mathematical foundation was pretty simple (and would be regarded as laughable in the modern era of ever-more-shiny-and-baroque machine learning models), but the source data was the key. The Google team trawled through their unique search terms to figure out which ones were most strongly related to physician visits for influenza-like illnesses and worked as effective leading indicators. They settled on 45 unique search terms to create an index of expected future flu severity.
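For the curious, here’s a minimal sketch of that kind of approach in Python. The file and column names are hypothetical, the pooling of the top terms is simplified, and the real pipeline scored vastly more candidate queries - but the bones of the method really were this simple:

```python
# A minimal sketch of a GFT-style model, NOT Google's actual pipeline.
# Assumes two hypothetical tables: weekly per-query search fractions,
# and CDC ILI data (% of physician visits for influenza-like illness).
import numpy as np
import pandas as pd

def logit(p):
    return np.log(p / (1 - p))

queries = pd.read_csv("weekly_query_fractions.csv", index_col="week")
ili = pd.read_csv("cdc_ili.csv", index_col="week")["ili_pct"] / 100.0

# Score every candidate query by its correlation with ILI on the logit scale.
scores = queries.apply(lambda col: np.corrcoef(logit(col), logit(ili))[0, 1])

# Keep the 45 best-correlated terms and pool them into a single variable.
top_terms = scores.nlargest(45).index
q = queries[top_terms].sum(axis=1)

# Fit the simple univariate model: logit(ILI) = b0 + b1 * logit(query share).
beta1, beta0 = np.polyfit(logit(q), logit(ili), deg=1)

def predict_ili(query_share: float) -> float:
    """Estimate the ILI fraction from the pooled flu-query share for a week."""
    z = beta0 + beta1 * logit(query_share)
    return 1 / (1 + np.exp(-z))
```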
The model proved genuinely useful on future data as well: the team used a preliminary version in the 2007-2008 flu season to help the CDC prepare. As the paper put it:
Across the nine regions, we were able to consistently estimate the current ILI percentage 1-2 weeks ahead of the publication of reports by the CDC’s U.S. Influenza Sentinel Provider Surveillance Network.
It was useful, intuitive, and accurate. It’s exactly the sort of thing that would be incredibly helpful in triaging the coronavirus outbreak, keeping the authorities a step ahead of the virus - or at least fewer steps behind.
So why aren’t we using it today?
It turned out that Google Flu Trends worked until it didn’t. In the 2013 flu season it went wildly off-course, predicting a flu season twice as bad as the eventual result. By 2015 Google Flu Trends had gone offline [Mobi Health News]; Google began sending the underlying data directly to the CDC instead, which I suppose is the algorithmic equivalent of donating your body to science.
The failure was public, generated a lot of attention, and ultimately got a good postmortem. A multidisciplinary research team found that the failure was multifaceted and the failure modes complex [Science]. Two of them struck me as particularly enlightening:
Spurious relationships: The dataset consisted of all the unique search terms, as I mentioned. Well, it turned out that some of the strongest predictors of the seasonal flu were also just predictors of…the winter season. For example, high school basketball searches popped up as one of the top predictors. While the Google team cleaned out many of those obviously spurious results, it’s not clear how good a job they really did at removing that bias - Google Flu Trends completely missed the 2009 H1N1 pandemic, which struck in the spring rather than the winter.
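A toy simulation (made-up numbers, not GFT’s data) shows how easily this happens: generate a flu curve and a “basketball searches” curve that both peak in winter, and the latter looks like a superb predictor - right up until an out-of-season outbreak:

```python
# Toy illustration of a spurious seasonal correlation - all data made up.
import numpy as np

rng = np.random.default_rng(0)
weeks = np.arange(52 * 5)                # five years of weekly data
winter = np.cos(2 * np.pi * weeks / 52)  # peaks every winter

flu = 2 + winter + rng.normal(0, 0.2, weeks.size)             # seasonal flu
basketball = 5 + 3 * winter + rng.normal(0, 0.5, weeks.size)  # also seasonal

# High correlation, despite basketball having nothing to do with the flu.
print(np.corrcoef(basketball, flu)[0, 1])  # ~0.9

# Now an outbreak that peaks outside of winter, as H1N1 did:
flu[26:32] += 4                            # off-season spike in year one
print(np.corrcoef(basketball[20:40], flu[20:40])[0, 1])  # correlation collapses
```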
“Blue-teaming”: The model was concrete poured on shifting sand. It used historical Google data, but Google’s algorithms change all the time. By 2013 Google was starting to provide more information in the search results themselves - for example, if you googled flu symptoms, it would suggest options for flu treatment. This and similar changes fundamentally altered how prevalent flu-related searches were, and the past became a terrible predictor of the present. The authors call this blue-teaming: when you train a model on data you generate yourself and then change the way that data is generated, you can easily wreck your model.
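Here’s the same idea as a toy sketch, with made-up numbers: fit the model on the old relationship between query share and flu activity, then feed it query volumes inflated by a product change:

```python
# Toy sketch of "blue team" drift - the data-generating process changes
# underneath a model trained on the old process. All numbers are made up.
import numpy as np

rng = np.random.default_rng(1)

# Historical period: flu-query share is roughly proportional to true ILI%.
ili_past = rng.uniform(1, 5, 200)
share_past = 0.002 * ili_past + rng.normal(0, 0.0005, 200)

# Fit a simple linear model on the historical relationship.
slope, intercept = np.polyfit(share_past, ili_past, deg=1)

# After a UI change (say, auto-suggest steering users to flu queries),
# the same actual flu level now generates roughly twice the query share.
ili_now = 2.5
share_now = 2.0 * (0.002 * ili_now)

predicted = slope * share_now + intercept
print(f"actual ILI: {ili_now:.1f}%, model says: {predicted:.1f}%")  # ~2x too high
```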
Mental models > mathematical models
The problem with the Google model is obvious in retrospect: they didn’t think about how the underlying relationships would change over time. This is the kind of model that needs to be updated periodically to stay “fresh” - not just dropping in new data, but revisiting how the features are constructed. That might mean, for example, dropping search terms affected by “blue team” problems, such as queries populated by Google’s auto-suggest.
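In code, a seasonal refresh step might look something like this sketch - the names are illustrative, not from any real pipeline:

```python
# Sketch of a periodic model refresh: re-select features against recent
# ground truth and exclude terms distorted by product changes.
import numpy as np
import pandas as pd

def refresh_model(queries: pd.DataFrame, ili: pd.Series,
                  suggest_affected: set, top_k: int = 45):
    """Re-rank candidate terms on recent data and refit the linear model."""
    # Drop terms whose volume is driven by the product, not by the users.
    tainted = list(suggest_affected & set(queries.columns))
    candidates = queries.drop(columns=tainted)

    # Re-score against the most recent ground truth, not decade-old data.
    scores = candidates.apply(lambda col: np.corrcoef(col, ili)[0, 1])
    selected = scores.nlargest(top_k).index

    # Refit the simple linear model on the surviving terms.
    q = candidates[selected].sum(axis=1)
    slope, intercept = np.polyfit(q, ili, deg=1)
    return selected, (slope, intercept)
```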
This may sound like a lot of work - and it is. But it’s required to make sure the mathematical model continues to reflect the real world. Without that attention the model will fail - and unlike most computer programs, which “fail loudly” by refusing to work, a machine learning model will “fail quietly”. That is to say, it will keep confidently putting out predictions that diverge further and further from reality.
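One way to make a quiet failure loud is to close the loop: when the ground truth eventually arrives (CDC reports lag by a week or two), compare it to what the model said and raise an alarm on sustained divergence. A minimal sketch, with made-up thresholds:

```python
# Minimal drift check: alert when recent predictions diverge from the
# ground truth that eventually arrives. Window and tolerance are made up.
import numpy as np

def check_drift(predicted: np.ndarray, actual: np.ndarray,
                window: int = 8, tolerance: float = 0.5):
    """Fail loudly if the median relative error over the window is too high."""
    rel_err = np.abs(predicted[-window:] - actual[-window:]) / actual[-window:]
    if np.median(rel_err) > tolerance:
        raise RuntimeError(
            f"model drifting: median relative error {np.median(rel_err):.0%} "
            f"over the last {window} weeks - time to retrain"
        )
```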
The machine learning part of this model was simple - laughably so by the standards of 2020. But it’s entirely possible to produce a model that’s just as computationally simple and doesn’t fall into the same trap. The key to getting it right is having a strong mental model of how the source data is created and what problems might cause your model to fail quietly.
All computational models are built to predict the past - it’s the job of the modeler to make sure they also predict the future!