Apache Hadoop is not the only game in town.
I’ve been following AMPLab (UC Berkeley – link) for some time now. The goal of this collaboration is to tame big data by integrating algorithms, machines, and people (AMP).
If an image is worth a thousand words…
The Apache Hadoop Alternative
The AMPLab Stack:
- Apache Hadoop YARN > Mesos (Apache Incubator – link)
- Apache Hadoop MapReduce > Spark (Apache Incubator – link)
- Apache Hive > Shark (link)
- Apache HDFS > Tachyon (link)
Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks.
The idea is to deploy multiple distributed systems to a shared pool of nodes in order to increase resource utilization.
I can see the value in porting JBoss Data Grid (JDG) to run on Mesos.
Consider processing all of the day’s financial transactions at the close of business. A JDG scheduler for Mesos would accept resources offered by the Mesos master daemon based on the volume of data and the complexity of the computation: more data, or more computational complexity, means more JDG executor tasks (i.e. JDG instances) launched on the Mesos slave daemons.
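To make the idea concrete, here is a hypothetical sketch of the offer-acceptance logic such a JDG scheduler might use. Everything here (function names, resource units, the shape of an offer) is illustrative; it is not a real Mesos or JDG API.

```python
# Hypothetical sketch: scale JDG executor tasks with data volume and
# computational complexity, then accept Mesos resource offers until
# enough tasks are placed. Not a real Mesos or JDG API.

def desired_tasks(data_gb, complexity, gb_per_task=8, tasks_per_unit=2):
    """More data and more computational complexity -> more executor tasks."""
    return max(1, data_gb // gb_per_task + complexity * tasks_per_unit)

def accept_offers(offers, needed, cpus_per_task=2, mem_per_task=4096):
    """Consume resource offers from the master until `needed` tasks fit."""
    launched = []
    for offer in offers:
        while needed > 0 and offer["cpus"] >= cpus_per_task and offer["mem"] >= mem_per_task:
            offer["cpus"] -= cpus_per_task
            offer["mem"] -= mem_per_task
            launched.append({"slave": offer["slave"],
                             "cpus": cpus_per_task,
                             "mem": mem_per_task})
            needed -= 1
    return launched

offers = [{"slave": "s1", "cpus": 8, "mem": 16384},
          {"slave": "s2", "cpus": 4, "mem": 8192}]
tasks = accept_offers(offers, desired_tasks(32, 1))
print(len(tasks))   # 6 executor tasks placed across both slaves
```

The point is only the shape of the interaction: the framework decides how much it wants, and Mesos decides what to offer.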
Jay Kreps (LinkedIn) provided a detailed response to a question on Quora regarding the comparison of Apache Hadoop YARN and Mesos (link).
Spark is a distributed system for in-memory, parallel processing. At a high level, it is similar to JDG. The idea is to process data stored in memory, in parallel. In fact, Spark was developed for iterative algorithms and interactive data mining: use cases that traditional map / reduce implementations (e.g. Apache Hadoop) are not well suited for because they have to read and write to disk (e.g. input / intermediate output / output). However, the implementations are different.
Spark relies on external stable storage (e.g. Apache HDFS) for input data and provides coarse-grained transformations (e.g. map / filter / join) and actions (e.g. count / save) via a generic parallel processing model instead of a map / reduce model. The idea is that data is partitioned and processed with one or more transformations. The result of a transformation is read-only data: an RDD, a resilient distributed dataset. Instead of storing the data for an RDD (though it can be cached and / or saved), Spark stores the transformation history of the RDD (i.e. its lineage). In the event of a node failure, Spark can recreate the data for an RDD from its lineage and the input data; hence the need for external stable storage for input data. Spark does not handle fault tolerance of the input data; it handles fault tolerance of transformed data.
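The lineage idea can be illustrated with a toy sketch (this is a simplification, not Spark's actual implementation): an RDD records the transformation that produced it rather than its data, so lost results can be recomputed from the stable input.

```python
# Toy RDD: stores how to (re)produce its data, not the data itself.
# A lost / evicted result is recomputed from the lineage on demand.

class RDD:
    def __init__(self, compute):
        self._compute = compute   # lineage: a recipe to rebuild this dataset
        self._cache = None

    def map(self, f):             # transformation: returns a new read-only RDD
        return RDD(lambda: [f(x) for x in self._compute()])

    def filter(self, pred):       # transformation
        return RDD(lambda: [x for x in self._compute() if pred(x)])

    def collect(self):            # action: materialize (or recompute) the data
        if self._cache is None:
            self._cache = self._compute()
        return self._cache

def from_stable_storage(data):
    # Input comes from external stable storage (e.g. HDFS); it must
    # remain re-readable so lineage replay has something to start from.
    return RDD(lambda: list(data))

nums = from_stable_storage([1, 2, 3, 4, 5])
evens_doubled = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * 2)
print(evens_doubled.collect())   # [4, 8]

evens_doubled._cache = None      # simulate losing the computed partition
print(evens_doubled.collect())   # [4, 8] again, replayed from lineage
```

Note that the second `collect()` does not need any saved copy of the transformed data, only the input and the recorded transformations.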
I can see the value in integrating JDG with Spark. First, enabling Spark to retrieve input data from JDG. Second, enabling Spark to cache and / or save an RDD in JDG.
In addition, I think there is value in incorporating some of the functionality of Spark in JDG. While JDG includes a map / reduce model, it is built on a generic distributed task model. I think the concept of resilient distributed datasets would work well in JDG. It would need a new parallel transformation model built on top of the distributed task model, and it would need the ability to cache the transformation history and results. In fact, JDG can cache the intermediate results of map tasks right now. In addition, JDG supports external partitioning (e.g. grouping) right now.
The Spark research paper is a great read (link).
Shark is a data warehouse built on top of Spark just as Apache Hive is built on top of Apache Hadoop. In addition, Shark supports HiveQL.
I think that importing a subset of data from within a data warehouse into JDG for near real-time analysis is a great use case. In fact, we have customers doing just that. They import a subset of data from within their data warehouse into JDG and use distributed tasks and / or map / reduce jobs to analyze it in near real-time.
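As a rough sketch of that pattern (the rows, field names, and date filter below are all made up for illustration): pull a subset out of the warehouse, hold it in memory, and run a simple map / reduce job over it.

```python
# Hypothetical sketch: import today's rows from the warehouse into an
# in-memory grid, then map / reduce over them. Data and fields are invented.
from collections import defaultdict

warehouse = [  # stand-in for rows in a data warehouse
    {"region": "EMEA", "amount": 100, "day": "2013-05-01"},
    {"region": "APAC", "amount": 250, "day": "2013-05-01"},
    {"region": "EMEA", "amount": 75,  "day": "2013-04-30"},
]

# 1. Import only the subset of interest (today's rows) into the grid.
grid = [row for row in warehouse if row["day"] == "2013-05-01"]

# 2. Map: emit (region, amount) pairs. Reduce: sum the amounts per region.
totals = defaultdict(int)
for region, amount in ((r["region"], r["amount"]) for r in grid):
    totals[region] += amount

print(dict(totals))   # {'EMEA': 100, 'APAC': 250}
```

The analysis runs entirely against the in-memory subset, which is what makes the near real-time part possible.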
I have suggested that we add support for HiveQL to JDG.
Tachyon is an in-memory file system that checkpoints to the underlying file system. It implements the Apache HDFS API and can be used instead of Apache HDFS. In addition, it supports Apache HDFS as the underlying file system to checkpoint to.
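The checkpointing idea can be shown with a toy sketch (again a simplification, not Tachyon's implementation): writes land in memory first, a checkpoint flushes them to the underlying file system, and lost in-memory state can be reloaded from there.

```python
# Toy in-memory file system that checkpoints to an underlying store.
# The underlying file system (e.g. HDFS) is modeled here as a plain dict.

class InMemoryFS:
    def __init__(self, underlying):
        self.memory = {}              # hot, in-memory copies
        self.underlying = underlying  # stable storage to checkpoint to

    def write(self, path, data):
        self.memory[path] = data      # fast path: memory only

    def checkpoint(self):
        self.underlying.update(self.memory)   # persist to stable storage

    def read(self, path):
        if path not in self.memory:           # e.g. after a restart
            self.memory[path] = self.underlying[path]
        return self.memory[path]

hdfs = {}                             # stand-in for the underlying file system
fs = InMemoryFS(hdfs)
fs.write("/logs/today", b"txn data")
fs.checkpoint()
fs.memory.clear()                     # simulate losing all in-memory state
print(fs.read("/logs/today"))         # b'txn data', recovered from checkpoint
```

Anything written after the last checkpoint would be lost in this toy version, which is exactly the trade-off a real system has to manage.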
Tachyon on top of Red Hat Storage? Interesting.
Infinispan, the upstream project for JDG, features the grid file system. However, it does not implement the HDFS API. Customers have asked about the best way to store large files in JDG, so it might be worth looking into.