Big Data in Theory
What is it? It’s big data. Right?
I’m not sure if I like the term Big Data. I think it’s right up there with the term Cloud.
I do, however, like the framework created by Doug Laney: Volume, Velocity, and Variety. It’s the de facto description of Big Data, and it predates the Big Data phenomenon. That, and I like both alliteration and the KISS principle. Who doesn’t?
Here is my, albeit short, interpretation of the 3Vs.
Volume – More data.
Velocity – Data (in), faster. Information (out), faster.
Variety – More data sources and / or formats.
What about the Flying V?
Thinking about the 3Vs reminded me of the Flying V.
Then it occurred to me…
The Flying V worked in The Mighty Ducks. Yes, I watched The Mighty Ducks. It did not work in D2. Yes, I watched the sequel. No, I did not watch D3. I can only hope that it did not do to The Mighty Ducks what Alien 3 did to Alien.
Update: It’s come to my attention that not everyone has seen The Mighty Ducks. The Ducks are a youth ice hockey team. I’ve been told that ice hockey is not the only hockey. Really? The Flying V is their trick play. It’s like how the option offense in college football (NCAA) does not work in professional football (NFL).
The 3Vs are a valid description of Big Data in theory, but they are not a valid description of Big Data in practice. Perhaps it is because they state the obvious, hint at the problem, and do not mention the solution.
Big Data in Practice
Volume
Volume is addressed with distributed storage using a shared-nothing architecture on commodity hardware: each node owns its own slice of the data and shares no disk or memory with its peers.
Examples
- Distributed File System – Red Hat Storage, Hadoop Distributed File System
- NoSQL – MongoDB
- In-Memory Data Grid – JBoss Data Grid
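Here is a minimal sketch of the shared-nothing idea, in plain Python rather than any of the products above. The Node and ShardedStore classes, the node names, and the hash-modulo routing are my own assumptions for illustration; real systems use more robust schemes (consistent hashing, replication) that I'm leaving out.

```python
import hashlib


class Node:
    """One storage node. It owns its data and shares nothing with its peers."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class ShardedStore:
    """Routes each key to exactly one node (hypothetical, illustration only)."""

    def __init__(self, node_names):
        self.nodes = [Node(name) for name in node_names]

    def _owner(self, key):
        # A stable hash, so the same key always lands on the same node.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, value):
        self._owner(key).data[key] = value

    def get(self, key):
        return self._owner(key).data.get(key)


store = ShardedStore(["node-a", "node-b", "node-c"])
for i in range(9):
    store.put(f"user:{i}", {"id": i})

for node in store.nodes:
    print(node.name, sorted(node.data))
```

More data? Add more nodes. That's the whole pitch, although with this naive modulo routing, adding a node would reshuffle most of the keys.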
Velocity
Outgoing information is generated faster with parallel processing in the form of batch processing (e.g. map / reduce), near real-time processing (e.g. distributed tasks), and real-time processing (e.g. stream processing).
Examples
- Map / Reduce Tasks – JBoss Data Grid, NoSQL, Hadoop MapReduce
- Distributed Tasks – JBoss Data Grid
- Stream Processing – Storm / S4
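To make the batch flavor concrete, here is a rough sketch of map / reduce as a word count in plain Python. It is not the Hadoop or JBoss Data Grid API. The function names and the sample lines are made up, and everything runs in one process, whereas a real framework would spread the map and reduce calls across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data", "more data faster", "Big Data in practice"])
print(counts)
```

Each map call only looks at its own line, and each reduce call only looks at its own key. That independence is what lets the work run in parallel.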
Data Locality
Volume and velocity are often two sides of the same coin. Incoming data is stored faster using distributed storage. While outgoing information is generated faster with parallel processing, it is often done in conjunction with distributed storage via data locality. The parallel processes are executed on the distributed storage nodes.
Examples
- Apache Hadoop (HDFS + MapReduce)
- JBoss Data Grid
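Data locality can be sketched the same way: send the task to the node that holds the data, and send back only the small result. The StorageNode class, the local_sum task, and the sample records below are assumptions of mine for illustration, not how Hadoop or JBoss Data Grid schedule work.

```python
class StorageNode:
    """A storage node that can also run a task against its own local records."""

    def __init__(self, name, records):
        self.name = name
        self.records = records  # data that lives on this node

    def run_local(self, task):
        # The task travels to the data; only the (small) result travels back.
        return task(self.records)


def local_sum(records):
    """Hypothetical task: total the bytes recorded on one node."""
    return sum(record["bytes"] for record in records)


nodes = [
    StorageNode("node-a", [{"bytes": 120}, {"bytes": 80}]),
    StorageNode("node-b", [{"bytes": 300}]),
    StorageNode("node-c", []),
]

partials = [node.run_local(local_sum) for node in nodes]  # one result per node
total = sum(partials)                                     # cheap final combine
print(partials, total)
```

Three numbers cross the (pretend) network instead of every record. The larger the volume, the more that matters.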
Variety
Variety is addressed with NoSQL for structured / semi-structured data and distributed file systems for unstructured data.
Examples
- Key / Value Store – JBoss Data Grid
- Document Store – MongoDB
- Column Oriented Store – Apache HBase (Hadoop)
- Hierarchical Store – ModeShape
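To show what those four store types mean for the shape of the data, here is the same made-up record modeled four ways with plain Python structures. The keys, field names, and the get_path helper are all hypothetical. None of this is the actual API of JBoss Data Grid, MongoDB, HBase, or ModeShape.

```python
import json

# Key / value: an opaque value looked up by key; the store does not inspect it.
key_value = {"player:1": json.dumps({"name": "Sample Player", "team": "Ducks"})}

# Document: a nested, self-describing record that can be queried by field.
document = {"_id": 1, "name": "Sample Player", "team": {"name": "Ducks"}}

# Column oriented: row key -> column family -> column qualifier -> value.
columns = {"player:1": {"info": {"name": "Sample Player"}, "team": {"name": "Ducks"}}}

# Hierarchical: a tree of nodes addressed by path.
hierarchy = {"teams": {"ducks": {"players": {"1": {"name": "Sample Player"}}}}}


def get_path(tree, path):
    """Walk a slash-separated path down a nested dict (hypothetical helper)."""
    node = tree
    for part in path.strip("/").split("/"):
        node = node[part]
    return node


print(json.loads(key_value["player:1"])["name"])
print(document["team"]["name"])
print(columns["player:1"]["info"]["name"])
print(get_path(hierarchy, "/teams/ducks/players/1")["name"])
```

Same information, four shapes. Which one fits depends on how the data arrives and how it will be read.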
Additional Thoughts
It’s true. I liked The Mighty Ducks. I was a kid. That being said, it’s not The Goonies. If The Goonies is on television, I watch it for the nth time. If The Mighty Ducks is on television, I put in Serenity (BD) and watch it for the nth time.
Alien and Aliens are two of the greatest films ever. Period.
January 30, 2013 at 6:28 am
I’ve recently gone through an article from IBM which mentioned a 4th V, Veracity.
Bottom line, it is about how much you can trust certain data, and it also deals with data that constantly changes, such as weather forecasts.
Caught my attention.
https://www14.software.ibm.com/webapp/iwm/web/signup.do?source=csuite-NA&S_PKG=Q412IBVBigData
“Veracity, the fourth ‘V’
Some data is inherently uncertain, for example: sentiment and truthfulness in humans; GPS sensors bouncing among the skyscrapers of Manhattan; weather conditions; economic factors; and the future. When dealing with these types of data, no amount of data cleansing can correct for it. Yet despite uncertainty, the data still contains valuable information. The need to acknowledge and embrace this uncertainty is a hallmark of big data.”
January 30, 2013 at 10:38 am
Right, and if I’m not mistaken, McKinsey added Value.