Let the data work it out itself

Chris Anderson’s article in Wired argues that vast amounts of data (on the order of petabytes) will render models superfluous. The rationale is that in very complex systems for which vast data can easily be collected, it is more efficient to let the data build the model than to devise it ourselves; or, in the words of Google’s research director Peter Norvig: “All models are wrong, and increasingly you can succeed without them.”

Anderson believes we have passed a critical point where the computing and storage power we have is enough to do this:

At the petabyte scale, information is not a matter of simple three- and four-dimensional taxonomy and order but of dimensionally agnostic statistics. It calls for an entirely different approach, one that requires us to lose the tether of data as something that can be visualized in its totality. It forces us to view data mathematically first and establish a context for it later. For instance, Google conquered the advertising world with nothing more than applied mathematics. It didn’t pretend to know anything about the culture and conventions of advertising — it just assumed that better data, with better analytical tools, would win the day. And Google was right.

The problem is that when the model’s causality is unknown (since we didn’t design it in the first place; the data did), we can never be sure when it will misfire: the black swan problem. While such models may work in domains like marketing, biology, etc., where the cost of mistakes is low, they cannot be trusted in mission-critical functions (see quant funds and the subprime crisis).


 
 
 
