This past week I dove into the Hadoop pool. It's certainly overwhelming at first. Have you seen the list of components?
Avro
Flume
HBase
HDFS
Hive
Hue
Impala
Mahout
MapReduce
Oozie
Pig
Spark
Sqoop
YARN
ZooKeeper
etc.
In addition to all the components there are multiple distributions to choose from:
Cloudera - CDH
Hortonworks
MapR
Roll your own
...and each of these distributions has various editions.
The Edureka resources were very helpful for understanding how all of this comes together:
http://www.edureka.co/blog/essential-hadoop-tools-for-big-data
The following screenshots came from this video:
https://www.youtube.com/watch?v=zjdN3IxUh6A
One of the first things I did was get this book and start reading:
I also started watching videos:
Apache VS Cloudera VS MapR VS Hortonworks : Which Hadoop Distribution To Use?
Hadoop Distributions - Cloudera vs Hortonworks vs MapR vs Intel
Hadoop - Just the Basics for Big Data Rookies
...and when I was ready to get hands-on (immediately), this Udemy Hadoop starter kit was great!
Hadoop Starter Kit
The instructor walks you through the basics and provides you with credentials to log into a Cloudera - CDH system and start running commands.
From here I think I'll start playing around in a sandbox. Each of the distributions offers a way to spin up a VM or log into a cloud-based environment. There are also Docker images out there (search for hadoop, cloudera, or hortonworks). Most of these Docker images look fairly new, so don't cut yourself.
I'm looking at the Hortonworks distro, so I'll probably set up a Hortonworks Sandbox.
One last note: I also set up a couple of RSS feeds, via the Hortonworks Community Connection, as a way to keep a finger on the pulse of what the experts are talking about.
If anyone else has a good list of Hadoop or Hortonworks feeds, perhaps for individual components, I'm interested.
There's also this mailing list: https://hadoop.apache.org/mailing_lists.html
Thanks for reading. I hope you find the links and videos useful.