An Overview of Big Data

May 11, 2015

I just watched Big Data: The Big Picture on Pluralsight. I'm sharing what I learned here, hoping this overview saves someone some time:

Example everyday scenarios

Lots and lots of email
Social network info like check-ins

Why now?

Cost of storage — much cheaper
Open source — driving momentum to analyze big data

What is Big Data?

Size matters

Today, petabyte-scale data is what counts as a big-scale application
If the data is too big for a traditional OLTP database, it is big data
If the data has to be processed in parallel or across distributed machines, it is big data

Three V’s

Volume — how much data?
Velocity — how quickly is it arriving?
Variety — how is it structured?

MapReduce

Developed at Google
Hadoop — open-source implementation of Google’s MapReduce
Step 1: Map step, which pre-processes the input by splitting it into key:value pairs
Step 2: Reduce step, which aggregates all the values for a key into a single row
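The two steps above can be sketched as a word count in plain Python. This is only an illustration of the idea on one machine, not Hadoop's API: the function names (map_step, shuffle, reduce_step, word_count) and the sample sentences are made up for this example, and the grouping that a real framework does between the two steps is simulated with a dictionary.

```python
from collections import defaultdict

def map_step(line):
    """Map: emit a (word, 1) key:value pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key; a real framework does this between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_step(key, values):
    """Reduce: collapse all values for one key into a single row."""
    return key, sum(values)

def word_count(lines):
    # Each line could be mapped on a different machine; here it is sequential.
    pairs = [pair for line in lines for pair in map_step(line)]
    return dict(reduce_step(key, values) for key, values in shuffle(pairs).items())

# Hypothetical sample input, just for illustration.
counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both steps can be spread across many machines, which is what makes the model a fit for big data.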

Technologies

MPP — Massively parallel processing
Hadoop — Apache project that combines open-source implementations of MapReduce and of a GFS-style distributed file system (HDFS)

Hadoop Stack

Hadoop — MapReduce, HDFS
Database — HBase, Cassandra
Query — HiveQL, Pig Latin
Data warehouse integration — Sqoop (import/export framework for moving data between Hadoop and relational databases)
Machine learning/ data mining — Mahout
Log file — Flume

Hive

HiveQL is SQL-like (schema-based, unlike NoSQL)
It can still work with HBase, though
Acts as the glue that binds BI (visualization) tools and big data together

R programming language

Open-source language for data programming
rmr, rhdfs and rhbase libraries let you create MapReduce jobs using R
Lucene — open-source library for creating full-text indexes
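To make "full-text index" a bit more concrete, here is a toy inverted index in Python. It is a sketch of the general idea only and does not use or imitate Lucene's actual API: build_index, search and the sample documents are all hypothetical names invented for this example.

```python
from collections import defaultdict

def build_index(docs):
    """Map every term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical sample documents, just for illustration.
docs = {
    1: "Hadoop stores data in HDFS",
    2: "Hive queries data with HiveQL",
    3: "Lucene builds full text indexes",
}
index = build_index(docs)
hits = search(index, "data hive")
print(hits)
```

The lookup touches only the entries for the query terms instead of scanning every document, which is the reason to build the index up front.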

MPP (Massively Parallel Processing)

Queried using SQL
Returns a single data set
Data is stored column-wise, which compresses well and makes queries fast
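A small Python sketch of why column-wise storage compresses well: values within one column tend to repeat, so even a simple scheme like run-length encoding shrinks them. The table, the column names and the choice of run-length encoding are assumptions made for this illustration; real MPP databases use their own storage formats and compression schemes.

```python
from itertools import groupby

# Hypothetical row-oriented table: (date, country, amount).
rows = [
    ("2015-05-11", "US", 10),
    ("2015-05-11", "US", 12),
    ("2015-05-11", "US", 7),
    ("2015-05-11", "IN", 3),
    ("2015-05-11", "IN", 9),
]

def to_columns(rows):
    """Pivot a list of rows into one list per column."""
    return [list(column) for column in zip(*rows)]

def run_length_encode(column):
    """Replace each run of repeated values with a (value, run_length) pair."""
    return [(value, len(list(group))) for value, group in groupby(column)]

date_col, country_col, amount_col = to_columns(rows)

encoded_date = run_length_encode(date_col)
encoded_country = run_length_encode(country_col)

# An aggregate over one column only has to read that column.
total = sum(amount_col)
print(encoded_date, encoded_country, total)
```

Five date values collapse into one pair and five country values into two, and summing the amounts never touches the other two columns.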

To be continued...

Written by Sunny Patneedi

Tech, Travel, Food, Fitness enthusiast | Aspiring photog | Opinions are mine alone. @Travelstellar | @Salesforce
