Topic of Interest – Distributed Stream Processing

August 29, 2013

Big Data, Data Grid, NOSQL

I’ve been interested in distributed stream processing since the Storm and S4 announcements. Storm was open sourced by Twitter and S4, now an Apache Incubator project, was open sourced by Yahoo! Now, LinkedIn has open sourced Samza.

The idea is to process data as a stream of messages with nodes consuming messages from a stream, executing code, and / or producing a stream of messages. A distributed stream processing network is a directed acyclic graph (DAG / link).

I’m interested in distributed stream processing in the context of big data:
real time big data processing, asynchronous big data processing

Real Time Big Data Processing

It is more efficient…

  • to process data in motion, not at rest.
  • to process data in increments, not in full.

EXAMPLE

Applications and / or services get the the number of transactions by getting the transaction count, not by counting the number of transactions.

stream_ex_count

The source could be a JMS queue in JBoss Enterprise Application Platform.

The state could be a cache in JBoss Data Grid (JDG).

In a complex example, the transactions would be grouped and counted by transaction type.

A node would consume the input stream and produce multiple output streams using the transaction type, and thus increase parallelism. The output streams would be consumed by multiples nodes that count the number of transactions per transaction type.

The transactions could be counted in batches in order to reduce the number of writes to the state (e.g. JDG).

I think this slide from the Introduction to Storm presentation (link) is perfect.

hadoop_v_storm

Asynchronous Big Data Processing

While it is more efficient to process data in motion that data at rest, it may not be appropriate to do it on the critical path. For example, due to performance reasons. In this case, data in motion should be processed asynchronously.

EXAMPLE

Customers trade stocks and mutual funds by placing buy and sell orders via a web application.

A customer sends a request to place the order to the web application, and the order is persisted to a database for durability. The risk management team analyzes orders. However, the web application can not meet the service level agreement (SLA) if the order is persisted to a database, a data grid, and an Apache Hadoop distribution within the request and before the response is returned.

The solution is to persist the order to the database and to send the order to a message queue within the same transaction. The distributed stream processing platform consumes orders in the form of messages from the queue, processes the orders as messages, and persists the orders to NoSQL and Big Data platforms such as JBoss Data Grid and Apache Hadoop.  For example, the risk management team might rely on a near real-time dashboard that accesses JBoss Data Grid for short term analysis and it might rely on an analytic application that integrates with Apache Hadoop for long term analysis.

stream_ex_bd

Note

I have seen use cases with JDG where customers bulk load data into JDG to a) use the data grid as a primary source of record or b)  use the data grid for processing data (e.g. map / reduce). The problem is performance and administration. It takes time to load bulk data into a data grid. The processing would be faster if the data was in the data grid in the first place. In addition, administration overhead is increased in order to bulk load data into a data grid. A better architecture would integrate a distributed stream processing platform for asynchronous and incrementally pushing data to the data grid.


Recommended Reading

Common Patterns via the Storm Wiki (link)

Storm at Twitter (link)

Storm at Nedium (link)
In particular, their use cases.

Dempsy via InfoQ (link)
In particular, the creator’s thoughts on CEP.

Storm – The Hadoop of Processing (link)In particular, the comments on CEP and Hadoop.

, , ,

About Shane K Johnson

Technical Marketing Manager, Red Hat Inc.

View all posts by Shane K Johnson

Trackbacks/Pingbacks

  1. Topic of Interest - Distributed Stream Processi... - August 29, 2013

    […] I've been interested in distributed stream processing since the Storm and S4 announcements. Storm was open sourced by Twitter and S4, now an Apache Incubator project, was open sourced by Yahoo! Now…  […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 117 other followers

%d bloggers like this: