Overview of Big Data

Sunny Patneedi
1 min read · May 11, 2015

So I just watched Big Data: The Big Picture on Pluralsight. Sharing my learnings here and hoping this overview saves time and helps someone:

Example everyday scenarios

  • Lots and lots of email
  • Social network info like check-ins

Why now?

  • Cost of storage — much cheaper
  • Open source — driving momentum to analyze big data

What is Big Data?

Size matters

  • Today, petabyte-scale data counts as a big data application
  • If the data is too big for a traditional OLTP database, it is big data
  • If the data has to be processed in parallel or across distributed machines, it is big data

Three V’s

  • Volume — how much data?
  • Velocity — how quickly is it arriving?
  • Variety — how is it structured?

MapReduce

  • Developed at Google
  • Hadoop — open-source implementation of Google’s MapReduce
  • Step 1: Map step, which pre-processes the input by splitting it into key:value pairs
  • Step 2: Reduce step, which aggregates all the values for each key into a single row
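To make the two steps concrete, here is a minimal single-machine sketch of a word count in plain Python. It is not Hadoop code; the function names (map_step, shuffle, reduce_step, word_count) are made up for illustration, and the grouping that a real framework does across machines between the two steps is simulated with a dictionary.

```python
from collections import defaultdict


def map_step(line):
    """Map: split one input line into (key, value) pairs, here (word, 1)."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key; a real framework does this between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_step(key, values):
    """Reduce: collapse all values for one key into a single row."""
    return key, sum(values)


def word_count(lines):
    # Run map over every line, group by key, then reduce each group.
    pairs = [pair for line in lines for pair in map_step(line)]
    return dict(reduce_step(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

In a real cluster the map calls run in parallel on different chunks of the input, which is why each one only looks at its own line.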

Technologies

  • MPP — Massively parallel processing
  • Hadoop — Apache project that combines open-source implementations of MapReduce and of the Google File System (GFS); Hadoop's version of the file system is HDFS

Hadoop Stack

  • Hadoop — MapReduce, HDFS
  • Database — HBase, Cassandra
  • Query — HiveQL, Pig Latin
  • Moving data to and from relational databases and data warehouses — Sqoop (import/export framework)
  • Machine learning / data mining — Mahout
  • Log file collection — Flume

Hive

  • HiveQL is like SQL (schema-based, unlike NoSQL)
  • Can still work with HBase, though
  • Glue that binds together BI (visualization) and Big Data
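To give a feel for the "like SQL" point, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in for Hive. This is an assumption for illustration only: the checkins table and its rows are made up, and SQLite is not Hive. The SELECT statement, though, is the kind of query that would look much the same in HiveQL.

```python
import sqlite3

# SQLite in memory stands in for Hive here; the table and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE checkins (user_name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO checkins VALUES (?, ?)",
    [("ann", "Paris"), ("bob", "Paris"), ("ann", "Tokyo")],
)

# A schema-based, SQL-style aggregation: count check-ins per city.
rows = conn.execute(
    "SELECT city, COUNT(*) AS n FROM checkins GROUP BY city ORDER BY city"
).fetchall()
conn.close()

print(rows)
```

The difference with Hive is what happens underneath: the query runs over files in the Hadoop cluster rather than over a local database.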

R programming language

  • Open-source language for statistics and data programming
  • The rmr, rhdfs, and rhbase libraries let you create MapReduce jobs using R
  • Lucene — open-source library for creating full-text indexes (a Java library, not part of R)
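As a rough idea of what a full-text index is, here is a minimal inverted index in Python. This is not Lucene's API (Lucene is a Java library and does far more, such as tokenizing and ranking); build_index, search, and the sample documents are hypothetical names chosen for this sketch.

```python
from collections import defaultdict


def build_index(docs):
    """Map every lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Hadoop stores data in HDFS",
    2: "Hive queries data with HiveQL",
    3: "Flume collects log files",
}
index = build_index(docs)
hits = search(index, "data hive")
print(hits)
```

Looking a term up in the index is a dictionary lookup, so a search does not have to scan every document.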

MPP (Massively Parallel Processing)

  • Queried using SQL
  • Returns a single data set
  • Data is stored column-wise with high compression, which makes it fast
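Here is a small sketch of why column-wise storage compresses well. It assumes a made-up table of (country, amount) rows and uses run-length encoding as one simple example of compression; real MPP databases use their own storage formats and a mix of encodings.

```python
from itertools import groupby

# Hypothetical row-oriented data: (country, amount).
rows = [("US", 10), ("US", 20), ("US", 5), ("IN", 7), ("IN", 3)]


def to_columns(rows):
    """Pivot rows into one list per column."""
    country, amount = zip(*rows)
    return {"country": list(country), "amount": list(amount)}


def run_length_encode(values):
    """Replace each run of repeated values with a (value, run_length) pair."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# Repeated values sit next to each other in a column, so they compress well.
encoded = run_length_encode(columns["country"])

# An aggregate over one column only has to read that column.
total = sum(columns["amount"])

print(encoded, total)
```

Five country values shrink to two pairs here, and summing the amounts never touches the country column at all.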

To be continued...
