It all started a few years ago with Google Flu Trends.
The idea was to look at correlations between what people were searching for on Google (for example: “flu symptoms”, “bulk Kleenex home delivery”) and the spread of the epidemic.
In Google’s own words: “We have found a close relationship between how many people search for flu-related topics and how many people actually have flu symptoms. Of course, not every person who searches for “flu” is actually sick, but a pattern emerges when all the flu-related search queries are added together. We compared our query counts with traditional flu surveillance systems and found that many search queries tend to be popular exactly when flu season is happening. By counting how often we see these search queries, we can estimate how much flu is circulating in different countries and regions around the world.”
Wonderful, isn’t it? Just data churning: no theory, fast, cheap.
Big data was born!
Actually, it should be called “Found Data”, because that is exactly what is used: public data found or scavenged from the web, mostly unverified, collected by devices for some purpose that may or may not be compatible with the use the “big data” algorithm is designed for.
Believing that erroneous information and mistakes will disappear just because you have a lot of data is wishful thinking. Random noise may average out, but systematic errors do not: the bigger the data volume, the more mistakes it carries, and there is no reason to believe otherwise.
When I fill in a questionnaire and get bored, I often alter my income, age, etc. just for fun. I am sure I am not the only one to do so, for one reason or another. Well, all those “fun data” are gobbled up by the algorithm as good stuff. Also think about potential biases: if the big data comes from a specific type of user (LinkedIn, Twitter, etc.), it may well be strongly biased in one way or another.
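To make the point about bias concrete, here is a minimal Python sketch. All figures are invented for illustration: I assume the whole population has an average income of 40,000, while the users of some platform average 60,000. The function name `biased_sample_mean` is hypothetical as well.

```python
import random
import statistics

TRUE_MEAN = 40_000  # assumed average income of the whole population (made up)

def biased_sample_mean(n, seed=0):
    """Average income estimated from n respondents drawn only from a skewed user base."""
    rng = random.Random(seed)
    # The platform's users average 60,000 (made-up figure): a systematic
    # bias in who gets sampled, not random noise.
    return statistics.fmean(rng.gauss(60_000, 15_000) for _ in range(n))

for n in (100, 10_000, 100_000):
    estimate = biased_sample_mean(n)
    print(f"n={n:>7,}  estimate={estimate:>9,.0f}  error={estimate - TRUE_MEAN:>8,.0f}")
```

More respondents make the estimate more stable, but it settles around the platform's average rather than the population's: the error stays close to 20,000 however much data is collected.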
Furthermore, big data algorithms are based on correlation, not causation. For example, a word may become “fashionable”, and many people will search for it even though they are not really concerned by it. Let’s say there is a disease called Gang Nam-Fever and the rapper Psy comes out with his Gangnam hit that year: obviously the algorithm will pick up correlations that do not reflect any real relationship!
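A toy version of that scenario, sketched in Python. Every number is made up, and the one-parameter model (`fit_ratio`, a hypothetical helper) is only a stand-in for whatever a real search-based estimator does: it learns how many cases correspond to one search during a period when only sick people search, then gets applied to a period in which fans of a similarly named song search too.

```python
def fit_ratio(searches, cases):
    """Least-squares slope through the origin: cases ~ ratio * searches."""
    return sum(s * c for s, c in zip(searches, cases)) / sum(s * s for s in searches)

# Calibration period: searches are driven only by people who are actually sick.
past_cases = [100, 200, 400, 300, 150]
past_searches = [5 * c for c in past_cases]   # assumed: 5 searches per case
ratio = fit_ratio(past_searches, past_cases)  # cases per search

# Hype period: the true number of cases is unchanged, but curious fans
# add as many searches again without anyone being sick.
true_cases = 300
hype_searches = 5 * true_cases + 1500         # 1500 extra searches from fans
estimated_cases = ratio * hype_searches

print(f"true cases: {true_cases}, estimated: {estimated_cases:.0f}")
```

The model has no way of knowing why people search, so the extra searches are counted as illness and the estimate comes out at twice the true figure.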
Think I am crazy?
That is what occurred in 2012: a lot of “healthy” people were concerned about the flu without being sick, and Google Flu Trends overestimated the epidemic twofold! The algorithm will certainly be recalibrated, but the point is made, right?
And that brings us to Risk Assessments. In our courses we always stress that the past does not equal the future, and that statistics may be good (the real ones, where time and money are spent on verifying data sets, checking correlations, etc.) but cannot be the sole basis for risk assessments, as they depict the past, not the future.
Using big data, aside from all the limitations described above, certainly does not shed light on what could happen in the future, and does not help us imagine unexpected scenarios or think about the unthinkable.