What is Commodity Hardware?
"Hadoop runs on commodity hardware." This sentence is often heard in discussions about Hadoop, but what precisely does it mean?
Just as the definition of "big" in "Big Data" is relative to a company or industry, the definition of "commodity" in "commodity hardware" is relative to a given point in time and a given industry. Still, several general points can be made.
Commodity hardware in general
- Commodity hardware has an average amount of computing resources; it is not considered a "sports car" in its field.
- "Commodity hardware" does not imply low quality, but rather, affordability.
- A common, though not universal, feature of commodity hardware is that, over time, it is widely used in roles for which it was not specifically designed, in contrast to purpose-built hardware.
Commodity hardware in the context of Hadoop
- Hadoop clusters are run on servers; "commodity" here does not mean desktop-class machines.
- Most commodity servers used in production Hadoop clusters have a balanced (again, what counts as "balanced" changes over time) ratio of disk space to memory and CPU, as opposed to being specialized servers with unusually large amounts of memory or CPU.
- The servers are not designed specifically as parts of a distributed
storage and processing framework, but have been appropriated for this
role in Hadoop.
Examples of Commodity Hardware in Hadoop
An example of suggested hardware specifications for a production Hadoop cluster is:
- four 1TB hard disks in a JBOD (Just a Bunch Of Disks) configuration (see the configuration sketch after this list)
- two quad-core CPUs, running at 2-2.5 GHz or faster
- 16-24 GB of RAM (24-32 GB if you are also running HBase)
- 1 Gigabit Ethernet
(Source: Cloudera)
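To see how the JBOD layout above shows up in Hadoop's configuration, here is a minimal hdfs-site.xml sketch; the mount points /data/1 through /data/4 are hypothetical, one per physical disk:

  <!-- hdfs-site.xml (sketch): list every data disk separately, with no RAID across them -->
  <property>
    <!-- called dfs.data.dir in older Hadoop releases -->
    <name>dfs.datanode.data.dir</name>
    <value>/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn,/data/4/dfs/dn</value>
  </property>

Listing each disk as its own directory lets HDFS write to all spindles in parallel and rely on block replication, rather than RAID, for durability, which is a large part of why commodity disks are sufficient.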
Or, for a more powerful cluster:
- six 2TB hard disks, with RAID 1 across two of the disks (one possible layout is sketched below)
- two quad-core CPUs
- 32-64 GB of ECC (Error Correcting Code) RAM
- 2-4 Gigabit Ethernet
(Source: OpenLogic, slide 15)
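The slide does not say how the RAID 1 pair is used, but a common arrangement (an assumption here, not something the source states) is to put the operating system, and on a master node the NameNode's metadata directory, on the mirrored pair, leaving the other four disks as JBOD data directories. A rough hdfs-site.xml sketch, with /raid and /data/1 through /data/4 as hypothetical mount points:

  <!-- hdfs-site.xml (sketch): metadata on the mirrored volume, data on the JBOD disks -->
  <property>
    <!-- called dfs.name.dir in older Hadoop releases -->
    <name>dfs.namenode.name.dir</name>
    <value>/raid/dfs/nn</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn,/data/4/dfs/dn</value>
  </property>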
Additional Links