Big Data in Theory
What is it? It’s big data. Right?
I’m not sure if I like the term Big Data. I think it’s right up there with the term Cloud.
I do, however, like the framework created by Doug Laney: Volume, Velocity, and Variety. It’s the de facto description of Big Data, and it predates the Big Data phenomenon. That, and I like both alliteration and the KISS principle. Who doesn’t?
Here is my interpretation of the 3Vs, short as it is.
Volume – More data.
Velocity – Data (in), faster. Information (out), faster.
Variety – More data sources and / or formats.
What about the Flying V?
Thinking about the 3Vs reminded me of the Flying V.
Then it occurred to me…
The Flying V worked in The Mighty Ducks. Yes, I watched The Mighty Ducks. It did not work in D2. Yes, I watched the sequel. No, I did not watch D3. I can only hope that it did not do to The Mighty Ducks what Alien 3 did to Alien.
Update: It’s come to my attention that not everyone has seen The Mighty Ducks. The Ducks are a youth ice hockey team. I’ve been told that ice hockey is not the only hockey. Really? The Flying V is their trick play. It’s like how the option offense in college football (NCAA) does not work in professional football (NFL).
The 3Vs are a valid description of Big Data in theory, but they are not a valid description of Big Data in practice. Perhaps it is because they state the obvious, hint at the problem, and do not mention the solution.
Big Data in Practice
Volume is addressed with distributed storage using a shared nothing architecture on commodity hardware.
- Distributed File System – Red Hat Storage, Hadoop Distributed File System
- NoSQL – MongoDB
- In-Memory Data Grid – JBoss Data Grid
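For the curious, here is a minimal sketch of what "shared nothing" means. It assumes simple hash partitioning, and the ShardedStore class and its methods are hypothetical names I made up for illustration. It is not how any of the products above are implemented. Each node owns its own slice of the data, and the key alone decides which node that is.

```python
import hashlib


class ShardedStore:
    """Toy shared-nothing store: every node owns a private dict."""

    def __init__(self, node_count):
        # Each dict stands in for one commodity machine (an assumption).
        self.nodes = [dict() for _ in range(node_count)]

    def _node_for(self, key):
        # A stable hash of the key picks the owning node.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.nodes)

    def put(self, key, value):
        self.nodes[self._node_for(key)][key] = value

    def get(self, key):
        return self.nodes[self._node_for(key)].get(key)


store = ShardedStore(node_count=4)
for i in range(100):
    store.put(f"k{i}", i)
```

Adding nodes adds capacity, which is the point. Real systems also deal with replication and rebalancing, both of which this sketch ignores.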
On the velocity side, outgoing information is generated faster with parallel processing in the form of batch processing (e.g. map / reduce), near real-time processing (e.g. distributed tasks), and real-time processing (e.g. stream processing).
- Map / Reduce Tasks – JBoss Data Grid, NoSQL, Hadoop MapReduce
- Distributed Tasks – JBoss Data Grid
- Stream Processing – Storm / S4
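The batch flavor is the easiest to sketch. Below is a toy word count in the map / reduce style, assuming the data is already split into partitions. The function names are my own, and a thread pool stands in for a cluster. This is not the Hadoop MapReduce or JBoss Data Grid API.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_partition(lines):
    """Map step: count the words in one partition, independently."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial results."""
    return left + right


def word_count(partitions):
    # The map steps run in parallel; the reduce step combines them.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_partition, partitions))
    return reduce(reduce_counts, partials, Counter())


totals = word_count([["big data", "Big Deal"], ["data in", "data out"]])
```

The map step never needs to see another partition, which is why it parallelizes so well.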
Volume and velocity are often two sides of the same coin. Incoming data is stored faster using distributed storage. While outgoing information is generated faster with parallel processing, it is often done in conjunction with distributed storage via data locality. The parallel processes are executed on the distributed storage nodes.
- Apache Hadoop (HDFS + MapReduce), JBoss Data Grid
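Data locality can be sketched in a few lines, too. The idea is to send the task to the data instead of the data to the task. The Node class and run_where_data_lives function are hypothetical names for illustration only, and a plain loop stands in for what would be parallel execution on a real cluster.

```python
class Node:
    """Toy storage node that can also execute a task on its own data."""

    def __init__(self, name, local_data):
        self.name = name
        self.local_data = list(local_data)

    def run(self, task):
        # The task runs next to the data; only the result leaves the node.
        return task(self.local_data)


def run_where_data_lives(nodes, task, combine):
    # Ship the (small) task to every node, then combine the (small) results.
    partials = [node.run(task) for node in nodes]
    return combine(partials)


nodes = [Node("a", [1, 2, 3]), Node("b", [4, 5])]
total = run_where_data_lives(nodes, task=sum, combine=sum)
```

What crosses the network is the task and a partial result per node, not the data set itself.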
Variety is addressed with NoSQL for structured / semi-structured data and distributed file systems for unstructured data.
- Key / Value Store – JBoss Data Grid
- Document Store – MongoDB
- Column Oriented Store – Apache HBase (Hadoop)
- Hierarchical Store – ModeShape
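To make the four store types a bit more concrete, here is the same record shaped four ways. This is a loose sketch using plain Python structures. The record, the key names, and the find_by and resolve helpers are all made up, and none of it reflects the actual APIs of the products listed above.

```python
import json

record = {"id": 42, "name": "Ada", "city": "London"}

# Key / value store: one key, one opaque value.
key_value = {"user:42": json.dumps(record)}

# Document store: the value is a structured document, queryable by field.
documents = [record]


def find_by(field, value):
    return [doc for doc in documents if doc.get(field) == value]


# Column oriented store: row key -> column family -> column -> value.
columns = {"42": {"profile": {"name": "Ada", "city": "London"}}}

# Hierarchical store: values are addressed by a path through a tree.
tree = {"users": {"42": {"name": "Ada", "city": "London"}}}


def resolve(path):
    node = tree
    for part in path.strip("/").split("/"):
        node = node[part]
    return node
```

Same data, different questions it is good at answering: lookup by key, query by field, read a column, or walk a path.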
It’s true. I liked The Mighty Ducks. I was a kid. That being said, it’s not The Goonies. If The Goonies is on television, I watch it for the nth time. If The Mighty Ducks is on television, I put in Serenity (BD) and watch it for the nth time.
Alien and Aliens are two of the greatest films ever. Period.