Thursday, March 8, 2012

Big Data + CEP = percolator?

The question of what lies at the intersection of Complex Event Processing (CEP) and Big Data keeps coming up lately, and I have not seen an adequate answer so far. Let me try to come up with one. Big Data is about solving really big problems (like back-rubbing the entire Web to determine the rank of every page and recalibrating the ranking at the same time, or assessing the risk of the entire market). Most real-life problems are well-posed, or stable: small changes to the problem yield commensurately small changes to the solution. If the input data for the big problem we are trying to solve is continuously undergoing small changes, the typical approach is to freeze it at arbitrary intervals, solve the frozen problem, and use that solution for the next period, hoping that things are not changing too fast (the overnight/weekly/monthly/quarterly batch).
Google built Percolator to go from re-indexing the entire Web every couple of days to processing each change detected by the crawler separately, in a matter of minutes. CEP is about reacting to a vast stream of discrete small changes to the state of the universe or, to put it another way, about the continuously varying input of a big data problem. Am I the only one who finds this problem both general and utterly fascinating? Is anyone aware of anything being done in this space outside Google and Microsoft (who were building their own percolator when I talked to them a year ago)? Is there a paying need for a platform that would solve a continuously changing big problem? I can think of at least two possible candidates:
· Market-in-a-box: a model of the systemic risk of the entire financial market that can be used both to simulate strategies and their combinations and to conduct joint “maneuvers” or simulation exercises by major players.
· Omniscient (way beyond smart) Grid: giving utilities the ability to continuously solve the distribution problem based on real-time feeds from a myriad of smart meters.

A lot of people have pointed me to Storm as the answer. Based on a quick look at https://github.com/nathanmarz/storm/wiki/Rationale, it does not appear to be drastically different from regular CEP/stream processing platforms: I see nothing there to bootstrap or reset the percolation, that is, the ability to solve a static Big Problem in the first place.
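
To make "bootstrap, then percolate" a bit more concrete, here is a toy sketch in Python. It is not how Percolator or Storm works. The problem (counting inbound links per page as a crude stand-in for ranking) and the names `solve_batch` and `Percolation` are made up purely for illustration. The point is only the shape: a full batch solve to start, small in-place patches per change event, and a reset that re-solves from scratch.

```python
from collections import Counter


def solve_batch(links):
    """Solve the static problem from scratch: count inbound links per page."""
    return Counter(dst for _src, dst in links)


class Percolation:
    """Hypothetical helper: bootstrap with a batch solve, then apply small changes."""

    def __init__(self, links):
        self.links = set(links)
        self.solution = solve_batch(self.links)  # bootstrap: the static Big Problem

    def apply(self, op, link):
        """Apply one change event ('add' or 'remove') and patch the solution in place."""
        _src, dst = link
        if op == "add" and link not in self.links:
            self.links.add(link)
            self.solution[dst] += 1
        elif op == "remove" and link in self.links:
            self.links.remove(link)
            self.solution[dst] -= 1
            if self.solution[dst] == 0:
                del self.solution[dst]

    def reset(self):
        """Re-solve from scratch, e.g. if the patched solution is no longer trusted."""
        self.solution = solve_batch(self.links)


# Bootstrap on a tiny "web", then percolate two crawler-style changes.
p = Percolation([("a", "b"), ("c", "b"), ("b", "a")])
p.apply("add", ("c", "a"))
p.apply("remove", ("a", "b"))
```

After the two events, the patched `p.solution` should match what `solve_batch(p.links)` returns when run from scratch. That equivalence is what a real platform would have to preserve at scale, and a stream processor alone does not supply the first step.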
