AMP’d for Hadoop Alternatives

September 4, 2013

Big Data, Data Grid, NOSQL

Apache Hadoop is not the only game in town.

I’ve been following AMPLab (UC Berkeley – link) for some time now. The goal of this collaboration is to tame big data by integrating algorithms, machines, and people (AMP).

If an image is worth a thousand words…

[Image: AMPLab stack diagram (algorithms, machines, and people)]

The Apache Hadoop Alternative

The AMPLab Stack:

  • Apache Hadoop YARN > Mesos (Apache Incubator – link)
  • Apache Hadoop MapReduce > Spark (Apache Incubator – link)
  • Apache Hive > Shark (link)
  • Apache HDFS > Tachyon (link)

Mesos

Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks.

The idea is to deploy multiple distributed systems to a shared pool of nodes in order to increase resource utilization.
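The offer model can be sketched with a toy example. This is an illustration only; the class and method names here are my assumptions, not the real Mesos API. The master presents resource offers to each framework's scheduler, which launches tasks on the resources it accepts and leaves the rest for the next framework.

```python
# Toy sketch of Mesos-style two-level scheduling (illustrative only; the real
# Mesos API and offer contents differ). The master presents resource offers
# to each framework scheduler, which launches tasks on resources it accepts.

class Offer:
    def __init__(self, slave, cpus):
        self.slave = slave
        self.cpus = cpus

class FrameworkScheduler:
    """Accepts offered CPUs until its task demand is satisfied."""
    def __init__(self, name, tasks_needed, cpus_per_task):
        self.name = name
        self.tasks_needed = tasks_needed
        self.cpus_per_task = cpus_per_task
        self.launched = []  # (slave, task id) pairs

    def resource_offer(self, offer):
        # Launch as many tasks as this offer can hold; decline the rest.
        while self.tasks_needed > 0 and offer.cpus >= self.cpus_per_task:
            offer.cpus -= self.cpus_per_task
            self.tasks_needed -= 1
            self.launched.append((offer.slave, f"{self.name}-{self.tasks_needed}"))

def master_offer_loop(offers, frameworks):
    # The master hands each offer to the frameworks in turn; resources one
    # framework declines remain available to the next.
    for offer in offers:
        for fw in frameworks:
            fw.resource_offer(offer)

spark = FrameworkScheduler("spark", tasks_needed=3, cpus_per_task=2)
jdg = FrameworkScheduler("jdg", tasks_needed=2, cpus_per_task=1)
master_offer_loop([Offer("slave-1", cpus=4), Offer("slave-2", cpus=3)], [spark, jdg])
print(len(spark.launched), len(jdg.launched))  # 3 1
```

Resources one framework declines stay available to the next, which is the mechanism that lets multiple distributed systems share one pool of nodes.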

[Image: Mesos architecture example]

I can see the value in porting JBoss Data Grid (JDG) to run on Mesos.

Use Case

Processing all of the day’s financial transactions at the close of business. A JDG scheduler for Mesos would accept resources offered by the Mesos master daemon based on the volume of data and the complexity of the computation: the more data, or the more complex the computation, the more JDG executor tasks (i.e. JDG instances) it launches on the Mesos slave daemons.
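The sizing described above can be sketched as a simple heuristic. Every name and number here is a hypothetical assumption for illustration, not a real JDG or Mesos API.

```python
import math

# Hypothetical sizing heuristic (illustration only, not a real JDG/Mesos API):
# scale the number of executor tasks with data volume and computation
# complexity, capped by the CPUs offered by the Mesos master.

def executor_tasks(data_gb, complexity_factor, offered_cpus, cpus_per_task=2):
    # Assume one task per 10 GB of transactions, scaled by complexity.
    wanted = math.ceil(data_gb / 10) * complexity_factor
    return min(wanted, offered_cpus // cpus_per_task)

print(executor_tasks(100, 1, offered_cpus=64))  # 10: more data, more tasks
print(executor_tasks(100, 3, offered_cpus=64))  # 30: more complexity, more tasks
print(executor_tasks(100, 5, offered_cpus=64))  # 32: capped by the offer
```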

Note

Jay Kreps (LinkedIn) provided a detailed response to a question on Quora regarding the comparison of Apache Hadoop YARN and Mesos (link).

Spark

Spark is a distributed system for in-memory, parallel processing. At a high level, it is similar to JDG: the idea is to process data stored in memory, in parallel. In fact, Spark was developed for iterative algorithms and interactive data mining, use cases that traditional map / reduce implementations (e.g. Apache Hadoop) are not well suited for because they have to read and write to disk (input / intermediate output / output). However, the two implementations are different.

Spark relies on external stable storage (e.g. Apache HDFS) for input data and applies coarse-grained transformations (e.g. map / filter / join) and actions (e.g. count / save) via a generic parallel processing model instead of a map / reduce model. Data is partitioned and processed with one or more transformations. The result of a transformation is read-only data: an RDD, a resilient distributed dataset. Instead of storing the data of an RDD (though it can be cached and / or saved), Spark stores the transformation history of the RDD (i.e. its lineage). In the event of a node failure, Spark can recreate the data of an RDD from its lineage and the input data, hence the need for external stable storage for the input data. Spark does not handle fault tolerance of the input data; it handles fault tolerance of transformed data.
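The lineage idea can be illustrated with a toy sketch in plain Python. This is not the Spark API; every class and method here is an illustrative assumption. An RDD records how it was derived rather than its data, so a lost partition can be recomputed by replaying the transformation chain from stable input.

```python
# Minimal sketch of lineage-based recovery, in the spirit of Spark's RDDs
# (illustrative only, not the Spark API). An RDD records how it was derived;
# if cached data is lost, it is recomputed by replaying the lineage.

class RDD:
    def __init__(self, compute, parent=None):
        self._compute = compute      # how to derive this RDD from its parent
        self._parent = parent        # lineage pointer
        self._cache = None

    @staticmethod
    def from_stable_storage(records):
        # Input data is assumed to live in external stable storage (e.g. HDFS).
        return RDD(compute=lambda _: list(records))

    def map(self, f):
        return RDD(compute=lambda data: [f(x) for x in data], parent=self)

    def filter(self, pred):
        return RDD(compute=lambda data: [x for x in data if pred(x)], parent=self)

    def collect(self):
        if self._cache is None:
            parent_data = self._parent.collect() if self._parent else None
            self._cache = self._compute(parent_data)
        return self._cache

    def lose_partition(self):
        # Simulate node failure: drop this RDD's cached data; lineage survives.
        self._cache = None

base = RDD.from_stable_storage(range(10))
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(evens_squared.collect())   # [0, 4, 16, 36, 64]
evens_squared.lose_partition()
print(evens_squared.collect())   # recomputed from lineage: [0, 4, 16, 36, 64]
```

Note that only the transformed data is recoverable this way; the input records themselves must come from stable storage, which matches the division of responsibility described above.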

I can see the value in integrating JDG with Spark. First, enabling Spark to retrieve input data from JDG. Second, enabling Spark to cache and / or save an RDD in JDG.

[Image: Spark and JDG integration]

In addition, I think there is value in incorporating some of the functionality of Spark in JDG. While JDG includes a map / reduce model, it is built on a generic distributed task model. I think the concept of resilient distributed datasets would work well in JDG. It would need a new parallel transformation model built on top of the distributed task model, along with the ability to cache the transformation history and results. In fact, JDG can cache the intermediate results of map tasks right now. In addition, JDG supports external partitioning (e.g. grouping) right now.

The Spark research paper is a great read (link).

Shark

Shark is a data warehouse built on top of Spark just as Apache Hive is built on top of Apache Hadoop. In addition, Shark supports HiveQL.

I think that importing a subset of data from a data warehouse into JDG for near real-time analysis is a great use case. In fact, we have customers doing just that: they import a subset of data from their data warehouse into JDG and use distributed tasks and / or map / reduce jobs to analyze it in near real time.

I have suggested that we add support for HiveQL to JDG.

Tachyon

Tachyon is an in-memory file system that checkpoints to the underlying file system. It implements the Apache HDFS API and can be used instead of Apache HDFS. In addition, it supports Apache HDFS as the underlying file system to checkpoint to.
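The core idea can be sketched with a toy write-through store. This is an illustration only, not the Tachyon API; here the "underlying file system" is just a dict standing in for HDFS.

```python
# Toy sketch of Tachyon's idea (illustrative only, not the Tachyon API):
# serve reads from memory while checkpointing writes to an underlying file
# system, so data survives the loss of the in-memory tier.

class InMemoryCheckpointedFS:
    def __init__(self, underlying_fs):
        self.memory = {}
        self.underlying = underlying_fs   # e.g. HDFS in the real system

    def write(self, path, data):
        self.memory[path] = data
        self.underlying[path] = data      # checkpoint to stable storage

    def read(self, path):
        if path in self.memory:           # fast path: in-memory hit
            return self.memory[path]
        data = self.underlying[path]      # miss: reload from the checkpoint
        self.memory[path] = data
        return data

    def evict_all(self):
        # Simulate losing the in-memory tier (e.g. a restart).
        self.memory.clear()

hdfs = {}
fs = InMemoryCheckpointedFS(hdfs)
fs.write("/data/part-0", b"transactions")
fs.evict_all()
print(fs.read("/data/part-0"))  # b'transactions', reloaded from the checkpoint
```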

Tachyon on top of Red Hat Storage? Interesting.

Infinispan, the upstream project for JDG, features the grid file system. However, it does not implement the Apache HDFS API. Customers have asked about the best way to store large files in JDG, so it might be worth looking into.


About Shane K Johnson

Technical Marketing Manager, Red Hat Inc.

View all posts by Shane K Johnson

One Comment on “AMP’d for Hadoop Alternatives”

  1. Dr. Vijay Srinivas Agneeswaran Says:

    Good explanation of Shark, Spark and Mesos. The BLink DB from AmpLabs is interesting stuff to keep track of. Here is the link:
    http://blinkdb.org/

