Overview on Big Data
May 11, 2015
So I just watched Big Data: The Big Picture on Pluralsight. Sharing my learnings here and hoping this overview saves time and helps someone:
Example everyday scenarios
- Lots and lots of email
- Social network info like check-ins
Why now?
- Cost of storage — much cheaper
- Open source — driving momentum to analyze big data
What is Big Data?
Size matters
- Today, petabyte-scale data is what counts as a big-scale application
- If it is too big for OLTP, then it is big data
- If the data has to be processed in parallel or across distributed machines
Three V’s
- Volume — how much data?
- Velocity — how quickly is it arriving?
- Variety — how is it structured?
MapReduce
- Developed at Google
- Hadoop — open-source implementation of Google’s MapReduce
- Step 1: Map step, pre-processing by splitting the data into key:value format
- Step 2: Reduce step, aggregating the values for each key into a single row
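To make the two steps concrete, here is a minimal single-machine sketch of a word count in plain Python. It is not Hadoop code: the function names (`map_step`, `shuffle`, `reduce_step`, `word_count`) and the sample input are made up for illustration, and the grouping that a real framework does across many machines is simulated here with a dictionary.

```python
from collections import defaultdict

def map_step(line):
    """Map: split one line of input into (key, value) pairs, here (word, 1)."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key. A real framework does this between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_step(key, values):
    """Reduce: collapse all values for one key into a single (key, total) row."""
    return key, sum(values)

def word_count(lines):
    # Run the map step over every line, then reduce each group of values.
    pairs = [pair for line in lines for pair in map_step(line)]
    return dict(reduce_step(key, values) for key, values in shuffle(pairs).items())

# Hypothetical sample input, only for illustration.
counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

In a real cluster the map calls run in parallel on different chunks of the data, and each reducer only sees the keys routed to it. The sketch keeps everything in one process to show the data flow.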
Technologies
- MPP — Massively parallel processing
- Hadoop — Apache project that combines open-source implementations of MapReduce and GFS (Google File System)
Hadoop Stack
- Hadoop — MapReduce, HDFS
- Database — HBase, Cassandra
- Query — HiveQL, Pig Latin
- Data transfer to/from relational DBs and warehouses — Sqoop (import/export framework)
- Machine learning / data mining — Mahout
- Log collection — Flume
Hive
- HiveQL is SQL-like (schema-based, unlike NoSQL)
- Can work with HBase though
- Glue that binds together BI (visualization) and big data
R programming language
- Open source for data programming
- rmr, rhdfs, rhbase libraries (from the RHadoop project) to create MapReduce jobs using R
- Lucene — open-source library for creating full-text indexes
MPP (Massively Parallel Processing)
- Using SQL
- Returns a single data set
- Data is stored column-wise, which compresses well and makes queries fast
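Why does column-wise storage compress so well? A rough Python sketch, under a few assumptions: the sample rows are invented, and run-length encoding is just one simple compression scheme picked for illustration (the course notes above do not say which scheme any particular MPP product uses).

```python
from itertools import groupby

# Hypothetical sample rows: (order_id, country, amount)
rows = [
    (1, "US", 10.0),
    (2, "US", 25.0),
    (3, "US", 5.0),
    (4, "DE", 40.0),
    (5, "DE", 15.0),
]

def to_columns(rows):
    """Pivot row tuples into one list per column."""
    return [list(column) for column in zip(*rows)]

def run_length_encode(column):
    """Store each run of repeated values as (value, run_length)."""
    return [(value, len(list(run))) for value, run in groupby(column)]

order_ids, countries, amounts = to_columns(rows)

# Similar values sit next to each other in a column, so runs compress well.
encoded_countries = run_length_encode(countries)

# An aggregate only has to read the one column it needs.
total = sum(amounts)

print(encoded_countries)
print(total)
```

The country column shrinks from five values to two pairs, and the sum never touches the other columns. That is the intuition behind "column-wise, high compression, so fast".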