I’ve been interested in distributed stream processing since the Storm and S4 announcements. Storm was open sourced by Twitter and S4, now an Apache Incubator project, was open sourced by Yahoo! Now, LinkedIn has open sourced Samza.
The idea is to process data as a stream of messages, with nodes that consume messages from a stream, execute code, and/or produce a stream of messages. A distributed stream processing network is a directed acyclic graph (DAG / link).
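As a minimal sketch of the idea (the `Node` class and its names are illustrative, not from Storm, S4, or Samza), a node can be modeled as a function from an input stream of messages to an output stream, and wiring such nodes together forms the DAG:

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// A node consumes a stream of messages, executes code, and produces a
// stream of messages. Chaining nodes forms a simple (linear) DAG.
public class Node {
    // Apply one node's logic to an input stream, producing an output stream.
    public static List<String> process(List<String> input, Function<String, String> fn) {
        return input.stream().map(fn).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> source = List.of("buy", "sell");
        // node 1 normalizes; node 2 tags -- two nodes wired in sequence
        List<String> normalized = process(source, String::toUpperCase);
        List<String> tagged = process(normalized, s -> "txn:" + s);
        System.out.println(tagged); // [txn:BUY, txn:SELL]
    }
}
```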
I’m interested in distributed stream processing in the context of big data:
- real time big data processing
- asynchronous big data processing
Real Time Big Data Processing
It is more efficient…
- to process data in motion, not at rest.
- to process data in increments, not in full.
Applications and/or services get the number of transactions by reading the transaction count, not by counting the transactions.
The source could be a JMS queue in JBoss Enterprise Application Platform.
The state could be a cache in JBoss Data Grid (JDG).
In a complex example, the transactions would be grouped and counted by transaction type.
A node would consume the input stream and produce multiple output streams keyed by transaction type, thus increasing parallelism. The output streams would be consumed by multiple nodes that count the number of transactions per transaction type.
The transactions could be counted in batches in order to reduce the number of writes to the state (e.g. JDG).
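A rough sketch of the batching idea (the `TransactionCounter` class is hypothetical, and its `state` map stands in for a store such as JDG): counts are accumulated in memory and flushed once per batch, so the state sees one write per transaction type per batch instead of one write per transaction.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: "state" stands in for the data grid (e.g. JDG).
public class TransactionCounter {
    private final Map<String, Long> pending = new HashMap<>(); // in-memory batch
    private final Map<String, Long> state = new HashMap<>();   // stand-in for JDG
    private final int batchSize;
    private int buffered = 0;

    public TransactionCounter(int batchSize) {
        this.batchSize = batchSize;
    }

    // Consume one transaction message; grouping counts by type is what
    // lets multiple downstream nodes count types in parallel.
    public void consume(String transactionType) {
        pending.merge(transactionType, 1L, Long::sum);
        if (++buffered >= batchSize) {
            flush();
        }
    }

    // One write per transaction type per batch reduces writes to the state.
    public void flush() {
        pending.forEach((type, n) -> state.merge(type, n, Long::sum));
        pending.clear();
        buffered = 0;
    }

    public long count(String transactionType) {
        return state.getOrDefault(transactionType, 0L);
    }
}
```

The trade-off is that counts in the state lag by at most one batch, which is usually acceptable for a near real-time dashboard.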
I think this slide from the Introduction to Storm presentation (link) is perfect.
Asynchronous Big Data Processing
While it is more efficient to process data in motion than data at rest, it may not be appropriate to do so on the critical path, for example for performance reasons. In that case, data in motion should be processed asynchronously.
Customers trade stocks and mutual funds by placing buy and sell orders via a web application.
A customer sends a request to place the order to the web application, and the order is persisted to a database for durability. The risk management team analyzes orders. However, the web application cannot meet the service level agreement (SLA) if the order is persisted to a database, a data grid, and an Apache Hadoop distribution within the request, before the response is returned.
The solution is to persist the order to the database and to send the order to a message queue within the same transaction. The distributed stream processing platform consumes orders in the form of messages from the queue, processes the orders as messages, and persists the orders to NoSQL and Big Data platforms such as JBoss Data Grid and Apache Hadoop. For example, the risk management team might rely on a near real-time dashboard that accesses JBoss Data Grid for short term analysis and it might rely on an analytic application that integrates with Apache Hadoop for long term analysis.
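A minimal sketch of the hand-off, with the `OrderService` class and its in-memory `database` and `queue` lists as hypothetical stand-ins; in the stack described above this would be a JTA transaction spanning the database and a JMS queue:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins: in practice the store and the queue would be
// enlisted in the same (XA/JTA) transaction.
public class OrderService {
    final List<String> database = new ArrayList<>();
    final List<String> queue = new ArrayList<>();

    // Persist the order and enqueue it for the stream processor within
    // one logical transaction: either both succeed or neither does.
    public void placeOrder(String order) {
        List<String> dbBackup = new ArrayList<>(database);
        List<String> qBackup = new ArrayList<>(queue);
        try {
            database.add(order); // durable store (fast, meets the SLA)
            queue.add(order);    // async hand-off; JDG/Hadoop writes happen downstream
        } catch (RuntimeException e) {
            // simulate rolling back both resources together
            database.clear(); database.addAll(dbBackup);
            queue.clear(); queue.addAll(qBackup);
            throw e;
        }
    }
}
```

The key property is that the stream processor never sees an order that was not durably stored, while the expensive JDG and Hadoop writes stay off the request path.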
I have seen use cases with JDG where customers bulk load data into JDG to a) use the data grid as the primary source of record or b) use the data grid for processing data (e.g. map/reduce). The problem is performance and administration: it takes time to bulk load data into a data grid, and the processing would be faster if the data were in the data grid in the first place. In addition, bulk loading increases administration overhead. A better architecture would integrate a distributed stream processing platform that asynchronously and incrementally pushes data to the data grid.
Common Patterns via the Storm Wiki (link)
Storm at Twitter (link)
Storm at Nedium (link)
In particular, their use cases.
Dempsy via InfoQ (link)
In particular, the creator’s thoughts on CEP.
Storm – The Hadoop of Processing (link)
In particular, the comments on CEP and Hadoop.