Sometimes, it's helpful to trade accuracy for execution speed.
There's so much talk of big data these days: all those mountains of records and all those machine cycles used to distill every last drop of information. That's what computers are good for -- keeping track of a zillion tiny details and doing arithmetic with supreme accuracy -- right?
Yet it turns out that there are situations where great execution speed can be obtained by essentially throwing away accuracy, by making do with rough answers. And in certain business situations, a rough answer in hand is all that's needed to get on with things and make money.
I first encountered this idea on the always-interesting DBMS2.com website written by the always-knowledgeable Curt Monash. (Roughly speaking, Curt has forgotten more about DBMS technology than 97.675 percent of IT people will ever know.) And he was referencing a blog item by a BI consultant named Steve Miller, writing at Information-Management.com.
It turns out that in many business situations, having a pretty good answer right now can be more useful -- more profitable, that is -- than having a more precise answer a day or even an hour from now. But this may be a difficult idea for those of us, myself included, who grew up believing that the whole point of computing is accuracy and mastering the ever-finer granularity of data. That data may be arriving for analysis in ever-more-giant volumes from sensor networks, IT infrastructure elements, and all those people twittering at each other on the Web, but no problem -- big-data software and endlessly scaling infrastructure are available to sift, sort, and interpret all of it in any amount of detail required.
As I think about it, I realize that it wasn't always this way. Many of us grew up in a world pretty much built with slide rules, good for about three, maybe four, places of arithmetic accuracy. Just think how many great bridges, buildings, airplanes, ships, and other structures -- even the computer itself -- were designed by engineers working only with slide rules and log tables. (And the atom bomb, too, though the Los Alamos scientists did rely on banks of Marchant calculating machines, which they worked extremely hard and ended up fixing themselves, and some specially-rigged punch-card tabulators from IBM.)
And we do hear lots of talk about fuzzy logic, and here at Datacenter Acceleration, we've written about some new computing architectures that work with probabilistic data as a way to gain sizable processing speedups.
Still, too many businesspeople are suckers for significant digits. It's more impressive, they seem to believe, to talk about sales being up 63.42 percent in Europe instead of just saying "up nearly two-thirds."
It turns out that a number of companies working in databases and big-data analytics see opportunities to gain speed at the cost of accuracy. A data-warehouse software maker called Infobright, for instance, has developed something it calls Rough Query. If I understand it correctly, by working with compact summaries of the data rather than the raw rows, Infobright's code can provide useful insights several orders of magnitude faster than a conventional query would.
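If that reading is right, the flavor of the idea is easy to sketch. The toy Python below is my own illustration, not Infobright's actual Rough Query implementation: it answers a range-count question using only per-block min/max summaries, returning lower and upper bounds instead of scanning every row.

```python
# Toy illustration: answer a query from block summaries instead of raw rows.
# Hypothetical structures -- not Infobright's actual implementation.

def build_summaries(values, block_size=65536):
    """Precompute (min, max, count) for each block of raw values."""
    summaries = []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        summaries.append((min(block), max(block), len(block)))
    return summaries

def rough_count(summaries, low, high):
    """Bound COUNT(*) WHERE low <= value <= high using summaries only."""
    lower = upper = 0
    for blk_min, blk_max, blk_count in summaries:
        if blk_max < low or blk_min > high:
            continue                  # block can't contribute any rows
        upper += blk_count            # block may contribute up to all its rows
        if low <= blk_min and blk_max <= high:
            lower += blk_count        # block lies entirely inside the range
    return lower, upper               # the true count lies in [lower, upper]
```

If the bounds come back tight enough for the decision at hand, the raw data never gets read at all; that is where the orders-of-magnitude speedup comes from.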
Meanwhile, big-data software developer Datameer is out to get more interactive performance out of Hadoop, which is generally thought of as a batch-processing tool. To help with that, Datameer makes it possible to run queries against samples of a large dataset and also against intermediate Hadoop results.
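The sampling half of that is also easy to picture. Here is a minimal sketch -- my example, not Datameer's code -- of estimating a total from a random sample, with a rough error bar so you can see exactly how much accuracy was traded away.

```python
import random
import statistics

def estimate_total(values, sample_size=1000, seed=42):
    """Estimate sum(values) from a simple random sample, with a ~95% error bar."""
    random.seed(seed)
    n = len(values)
    sample = random.sample(values, min(sample_size, n))
    mean = statistics.mean(sample)
    stderr = statistics.stdev(sample) / len(sample) ** 0.5
    estimate = mean * n                   # scale the sample mean up to the population
    margin = 1.96 * stderr * n            # rough 95-percent confidence half-width
    return estimate, margin
```

The appeal is that the error of the estimate shrinks with the sample size, not the dataset size, so the same thousand-row sample works whether the table holds a million rows or a billion.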
Executives from a company called Attivio were interviewed in MIT Sloan Management Review, talking about "Why Companies Have to Trade 'Perfect Data' for 'Fast Info'." "Analytics don't have to be based on super-precise data," they assert. "The report doesn't have to be perfect. It needs to capture the behavior, not the totality of it."
Even Oracle, a leader in relational DBMSes, has seen the light. Its researchers have published a paper showing the benefits of putting a limit on the time spent processing standard SQL queries, thus obtaining useful, if approximate, results faster than usual.
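I haven't seen the details of Oracle's approach, but the general shape of a time-bounded query is straightforward to sketch. The Python below is purely illustrative, not Oracle's method: it stops scanning when a deadline hits and scales up whatever partial aggregate it has, which assumes the rows aren't ordered in a way that biases the early ones.

```python
import time

def sum_within_deadline(rows, budget_seconds=0.5):
    """Scan rows until the time budget runs out, then extrapolate the partial SUM."""
    deadline = time.monotonic() + budget_seconds
    partial, seen = 0.0, 0
    for value in rows:
        partial += value
        seen += 1
        if seen % 10000 == 0 and time.monotonic() > deadline:
            break                          # out of time: settle for a rough answer
    if seen == 0:
        return 0.0, 0.0
    fraction = seen / len(rows)
    return partial / fraction, fraction    # scaled estimate, plus how much was read
```

The caller gets both the estimate and the fraction of data actually read, so "how rough is this answer?" stays visible instead of hidden.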
Miller has coined the term "approximate BI" for this kind of thinking. It doesn't refer to any specific kind of technology, just a general idea. And I suspect it will be a lasting idea, for even as there is more computing power available to analyze data at high speed, the volumes of data available will grow even faster. And so, the need for fast observations, quick summaries, and rapid trending will only grow stronger.