This overestimation by Google’s data-aggregating tool is part of a trend, and according to Ryan Kennedy, a University of Houston political science professor, it illustrates “where ‘big data’ analysis can go wrong.”
Google Flu Trends (GFT) was created in 2008 to monitor flu cases around the globe. It was based on the assumption that there was a direct relationship between Google searches for flu-related terms and actual cases of influenza, and it used search algorithms to estimate the number of flu cases each week of the year.
Although Kennedy calls Google Flu Trends “an amazing piece of engineering,” vast inaccuracies can result if the data it provides is analyzed incorrectly or if “improper policies” are put into place by other private big data-collecting companies. Those companies, like Google, are “constantly changing their service in accordance with their business model.”
According to Kennedy, failing to understand how these changes affect the data can lead people to draw “incorrect conclusions” and adopt “improper policies.”
In the case of Google Flu Trends, the number of flu cases during the 2011-12 and 2012-13 seasons was overestimated by more than 50 percent. From August 2011 to September 2013, Google Flu Trends overestimated flu prevalence in 100 out of 108 weeks.
Last winter, at the peak of the flu season, Google Flu Trends estimated that 11 percent of the U.S. population had the flu. The CDC, by comparison, reported the more accurate figure: the flu affected 6 percent of the U.S. population.
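To put the figures quoted above in proportion, here is a minimal Python sketch of the arithmetic. It uses only the numbers reported in this article; the function name `relative_overestimate` and the variable names are illustrative, and the study’s “over 50 percent” figure may have been computed differently (for example, averaged over whole seasons rather than taken at the peak).

```python
def relative_overestimate(estimate, reference):
    """Return how far `estimate` exceeds `reference`, as a fraction of `reference`."""
    return (estimate - reference) / reference


# Peak-season shares of the U.S. population with the flu, as quoted in the article.
gft_peak_share = 11.0  # percent, per Google Flu Trends
cdc_peak_share = 6.0   # percent, per the CDC

# Relative overshoot of the Google Flu Trends peak estimate versus the CDC figure.
overshoot = relative_overestimate(gft_peak_share, cdc_peak_share)

# Fraction of weeks (August 2011 to September 2013) in which the tool ran high.
weeks_high_share = 100 / 108

print(f"Peak overestimate relative to CDC: {overshoot:.0%}")
print(f"Share of weeks overestimated: {weeks_high_share:.1%}")
```

Read this way, the peak estimate was roughly 83 percent above the CDC figure, and the tool ran high in about 93 percent of the weeks examined.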
Why did Google Flu Trends overestimate the number of influenza cases by so much?
There are a number of reasons why Google Flu Trends overestimated the number of flu cases by over 50 percent, and why the overestimation is part of a continuing trend.
Besides questioning the tool’s reliance on the number of Google searches for flu-related terms, and on information collected from private companies that might follow “improper policies,” the researchers called into question data Google Flu Trends collected from Twitter and Facebook.
Data collected from these two social media platforms was based on market popularity and polling trends rather than on actual counts of influenza cases. Marketing campaigns and private companies can manipulate these sorts of social media platforms to create the impression that the products they’re selling are trending.
Also, the search algorithms that Google uses are not constant but always changing, and Google Flu Trends doesn’t, or can’t, take into account influenza outbreaks that might seem to be aberrations, like the non-seasonal 2009 H1N1 flu.
The research paper notes that data gathered by companies like Twitter and Google can provide useful insight into the prevalence of flu cases, but this data should not be taken out of context. It should be combined with more traditional sources of data, such as the actual numbers of influenza cases reported by hospitals, medical clinics, and the CDC.
Kennedy and his co-author, David Lazar, a professor of political science and computer science at Northeastern University, suggest that a more complete understanding of the data can be gained by “combining information and techniques from both sources.”
Lazar said the inaccuracies had been going on for several years and that the way Google Flu Trends collected and analyzed its data “went off the rails years ago.”
A popular theory holds that media-stoked panics led to increased searches for flu-related terms, which threw off the search algorithms. However, Kennedy and Lazar noted that Google Flu Trends also overestimated the number of influenza cases when no media-stoked hysteria was occurring.
The study by Kennedy, Lazar, and other researchers, which was published in the journal Science, points out that “big data” doesn’t always mean more accurate data. Big data definitely has a lot of value, but only if it’s gathered and analyzed accurately. As Lazar states: “This case proves you can do big data badly.”
One way Google Flu Trends could be improved, Lazar suggested, would be to make its research and search algorithms available to the general public. He also stated that, despite a lag time of often several weeks, the data collected by the CDC was more accurate than that of the Google Flu Trends system.
Google Flu Trends is a tool that can be very useful, but Kennedy and Lazar’s study points out that the data it collects should be taken in context with information that sources like the CDC provide about actual numbers of influenza cases.
Written by: Douglas Cobb