Big data will make it possible to find correlations in all sorts of areas and use them for forecasting and policy analysis. This is inevitable but not wholly a good thing. There are new dangers that lurk for the unwary.
There is a serious risk of mistaking correlation for causality when making business decisions or conducting policy analysis. The huge amounts of data available on consumer trends, lifestyles, and online habits are valuable to firms and provide information that can be used to target specific groups through advertising and sales pitches. In relation to policy, there is a vast trove of behavioral data that can inform policymakers about how people will react to tax changes or other policy changes. The irresistible lure is that such huge amounts of data provide seemingly precise estimates. Large data sets typically imply less uncertain estimates, and when the whole population is measured, no sampling uncertainty remains. More and more firms have access to big data or sell access to it. But the apparent precision of such analysis may be a castle built on sand and may not withstand a change in conditions.
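The gap between precision and causal meaning can be illustrated with a small simulation. The sketch below is purely hypothetical: the variable names, the hidden factor `z`, and the sample sizes are assumptions made for illustration, not drawn from any real data set. By construction `x` has no effect on `y`, yet both are driven by `z`, so a regression of `y` on `x` returns a slope near 0.5 whose standard error shrinks toward zero as the sample grows.

```python
import math
import random


def ols_slope_with_se(x, y):
    """Return the OLS slope of y on x and its conventional standard error."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, se


def simulate(n, seed=0):
    """Hypothetical data: a hidden factor z drives both x and y.

    x does NOT appear in the equation for y, so its causal effect is zero.
    """
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)            # unobserved common driver
        x.append(z + rng.gauss(0, 1))  # e.g. an online habit
        y.append(z + rng.gauss(0, 1))  # e.g. a purchase outcome
    return x, y


results = {}
for n in (100, 10_000, 100_000):
    x, y = simulate(n)
    results[n] = ols_slope_with_se(x, y)
    slope, se = results[n]
    print(f"n={n:>7}: slope={slope:.3f}, standard error={se:.4f}")
```

More data makes the estimated slope more precise, but it does not make it causal: intervening on `x` in this setup would leave `y` unchanged, however narrow the confidence interval.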
Those conducting empirical work need to be mindful not to interpret correlations as showing causality. Google's experience of using search frequencies to predict the onset of flu illustrates how fragile correlations can be. When conditions change, consumer behavior can change. The danger of over-interpreting correlations is aggravated by the fact that correlations can be stable for a long time and then change suddenly due to some unforeseeable trigger. For example, banks that provided loans in the US prior to the financial crisis used vast amounts of past foreclosure data to predict future risk. But the many sub-prime foreclosures, although very small compared to the total market, created a domino effect that cascaded into the rest of the market, throwing historical correlations into the trashcan. The same may happen to consumer surveys and other analysis based on big data.
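A correlation that holds for a long time and then breaks can be sketched in the same spirit. The series below are simulated and entirely hypothetical: `searches` stands in for a search-frequency indicator and `outcome` for the quantity being predicted, and the break at period 500 is an assumption built into the simulation. Correlations computed over successive 100-period windows look reassuringly stable until the regime changes.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_series(n_periods=600, break_at=500, seed=1):
    """Hypothetical series whose relationship disappears at `break_at`."""
    rng = random.Random(seed)
    searches, outcome = [], []
    for t in range(n_periods):
        s = rng.gauss(0, 1)
        if t < break_at:
            o = 0.9 * s + rng.gauss(0, 0.3)  # outcome tracks searches closely
        else:
            o = rng.gauss(0, 1)              # behavior changed: no link left
        searches.append(s)
        outcome.append(o)
    return searches, outcome


searches, outcome = simulate_series()
window = 100
window_correlations = [
    pearson(searches[start:start + window], outcome[start:start + window])
    for start in range(0, len(searches), window)
]
for i, r in enumerate(window_correlations):
    print(f"periods {i * window:>3}-{(i + 1) * window - 1:>3}: correlation={r:.2f}")
```

Nothing in the first five windows warns of the sixth. A forecaster relying only on the historical correlation would have no statistical signal of the break until after it had occurred.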
Presuming causality when none exists is thus not merely an academic faux pas; it can have serious economic consequences for society and lead to bad business decisions. One area where this danger lurks is the growing field of “now-casting,” which uses vast hoards of online data (such as people’s search habits) as input for macroeconomic forecasts. For example, do more online searches for unemployment benefits imply that unemployment may be on the rise? Such a correlation may well seem strong, but just as with Google's flu predictions, people’s behavior may alter over time and the correlation may weaken. The point is not to deny the value of drawing inferences from online searches but to highlight the importance of combining them with other types of corroborating data and models.
See Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani (2014). “The Parable of Google Flu: Traps in Big Data Analysis.” Science 343(6176): 1203–1205.