
Friday 15 August 2014

Building Applications With Hadoop

Avro

Avro is a data serialization format and project, similar to Thrift or Protocol Buffers. It's expressive: you can deal in terms of records, arrays, unions, and enums. It's efficient, with a compact binary representation, so one of the benefits of logging in Avro is that you get much smaller data files. All the traditional aspects of Hadoop data formats, such as being compressible and splittable, are true of Avro.
One of the reasons Doug Cutting (founder of the Hadoop project) created the Avro project was that a lot of the formats in Hadoop were Java only. It’s important for Avro to be interoperable across a lot of different languages – Java, C, C++, C#, Python, Ruby, etc. – and to be usable by a lot of tools.
One of the goals for Avro is a set of formats and serialization that's usable throughout the data platform that you're using, not just in a subset of the components. So MapReduce, Pig, Hive, Crunch, Flume, Sqoop, etc. all support Avro.
Avro is dynamic, and one of its neat features is that you can read and write data without generating any code. It will use reflection and look at the schema that you've given it to create classes on the fly. That's called the Avro generic format. You can also use the specific format, for which Avro will generate optimized code.
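For instance, here's a minimal sketch of writing and reading a record with Avro's generic Java API, with no generated code; the schema, field names, and file name are made up for illustration.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroGenericExample {
  public static void main(String[] args) throws Exception {
    // The schema is supplied at runtime -- no generated classes required.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
      + "{\"name\":\"user\",\"type\":\"string\"},"
      + "{\"name\":\"count\",\"type\":\"long\"}]}");

    GenericRecord event = new GenericData.Record(schema);
    event.put("user", "alice");
    event.put("count", 42L);

    File file = new File("events.avro");

    // Write a compact Avro data file; the schema is embedded in the file.
    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
    writer.create(schema, file);
    writer.append(event);
    writer.close();

    // Read it back generically, again without any generated classes.
    DataFileReader<GenericRecord> reader =
        new DataFileReader<GenericRecord>(file, new GenericDatumReader<GenericRecord>());
    while (reader.hasNext()) {
      GenericRecord r = reader.next();
      System.out.println(r.get("user") + " -> " + r.get("count"));
    }
    reader.close();
  }
}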
Avro was designed with the expectation that you would change your schema over time. That's an important attribute in a big-data system, because you generate lots of data and you don’t want to constantly reprocess it. You're going to generate data at one time and have tools process that data maybe two, three, or four years down the line. Avro has the ability to negotiate differences between schemata so that new tools can read old data and vice versa.
Avro forms an important basis for the following projects.

Crunch

You're probably familiar with Pig and Hive, how to process data with them, and how they integrate with other tools. However, not all of the data formats you use will fit Pig and Hive.
Pig and Hive are great for a lot of logged data or relational data, but other data types don’t fit as well. You can still process poorly fitting data with Pig and Hive, which don’t force you into a relational model or a log structure, but you have to do a lot of work around it. You might find yourself writing unwieldy user-defined functions or doing things that are not natural in the language. People sometimes just give up and start writing raw Java MapReduce programs because that seems easier.
Crunch was created to fill this gap. It's a higher-level API than MapReduce. It's in Java, but lower level than, say, Pig, Hive, Cascading, or other frameworks you might be used to. It's based on a paper that Google published called FlumeJava, and it has a very similar API. Crunch has you combine a small number of primitives with a small number of types, and effectively lets you create really lightweight UDFs – which are just Java methods and classes – to build complex data pipelines.
Crunch has a number of advantages.
  • It's just Java. You have access to a full programming language.
  • You don't have to learn Pig.
  • The type system is well-integrated. You can use Java POJOs, but there's also native support for Hadoop Writables and Avro. There's no impedance mismatch between the Java code you're writing and the data that you're analyzing.
  • It's built as a modular library for reuse. You can capture your pipelines in Crunch code in Java and later combine them with an arbitrary machine-learning program, so that someone else can reuse that algorithm.
The fundamental structure is the parallel collection, a distributed, unordered collection of elements. This collection has a parallel-do operator, which you can imagine turning into a MapReduce job. So if you have a bunch of data that you want to operate on in parallel, you use a parallel collection.
And there's something called the parallel table, a subinterface of the collection, which is a distributed, sorted map. It also has a group-by operator you can use to aggregate all the values for a given key. We'll go through an example that shows how that works.
Finally, there's a pipeline class; pipelines are for coordinating the execution of the MapReduce jobs that actually do the back-end processing for the Crunch program.
Let’s take an example for which you've probably seen all the Java code before, word count, and see what it looks like in Crunch.
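Here is a minimal sketch of that word-count pipeline using Crunch's Java API (the input and output paths are illustrative):

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class WordCount {
  public static void main(String[] args) throws Exception {
    // Create a pipeline; the planner turns it into MapReduce jobs later.
    Pipeline pipeline = new MRPipeline(WordCount.class);

    // A parallel collection of all the lines in the input file.
    PCollection<String> lines = pipeline.readTextFile("/user/foo/input.txt");

    // parallelDo with an anonymous function that splits each line into words.
    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());

    // Aggregate the counts for each word and write them out.
    PTable<String, Long> counts = words.count();
    pipeline.writeTextFile(counts, "/user/foo/wordcounts");

    // Nothing runs until this call -- the planner evaluates lazily.
    pipeline.run();
  }
}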
It’s a lot smaller and simpler. The first line creates a pipeline. We create a parallel collection of all the lines from a given file by using the pipeline class. And then we get a collection of words by running the parallel do operator on these lines.
We've defined an anonymous function here that processes the input; for word count, it splits each line into words and emits each word from the map task.
Finally, we want to aggregate the counts for each word and write them out. There's a line at the bottom, pipeline.run(). Crunch's planner does lazy evaluation: we're not going to create and run the MapReduce jobs until we've put the full pipeline together.
If you're used to programming Java and you've seen the Hadoop examples for writing word count in Java, you can tell that this is a more natural way to express that. This is among the simplest pipelines you can create, and you can imagine you can do many more complicated things.
If you want to go even one step easier than this, there's a Scala wrapper for Crunch. It's a very similar idea to the Scala wrapper built around Cascading. Since Scala runs on the JVM, it's an obvious, natural fit, and Scala's type inference ends up being really powerful in the context of Crunch.
The same program written in Scala is even shorter. We have the pipeline, and we can use Scala's built-in functions, which map really nicely to Crunch, so word count becomes essentially a one-line program. It’s pretty cool and very powerful if you're already writing Java code and want to build complex pipelines.

Cloudera ML

Cloudera ML (machine learning) is an open-source library and set of tools that help data scientists perform their day-to-day tasks, from data preparation through model evaluation.
Cloudera ML has built-in commands for summarizing, sampling, normalizing, and pivoting data, and it recently added a built-in k-means clustering implementation based on an algorithm developed just a year or two ago; there are a couple of other implementations as well. It's meant to be a home for these tools so you can focus on data analysis and modeling instead of on building or wrangling the tools.
It’s built using Crunch and leverages a lot of existing projects. For example, the vector formats: a lot of ML involves transforming raw data from a record format into vector formats for machine-learning algorithms, and Cloudera ML leverages Mahout's vector interface and classes for that purpose. The record format is just a thin wrapper around Avro and HCatalog records and schemas, so you can easily integrate with existing data sources.
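As a rough illustration of that record-to-vector step (this is not Cloudera ML's actual code, just the general idea using Mahout's vector classes):

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class RecordToVector {
  public static void main(String[] args) {
    // Hypothetical numeric fields (e.g. age, height, weight) pulled out of a
    // record and packed into a Mahout vector for a clustering algorithm.
    double[] features = { 34.0, 172.5, 68.2 };
    Vector v = new DenseVector(features);
    System.out.println("dimension = " + v.size() + ", L2 norm = " + v.norm(2));
  }
}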
For more information on Cloudera ML, visit the project’s GitHub page; there are a bunch of examples with datasets that can get you started.

Cloudera Development Kit

Like Cloudera ML, the Cloudera Development Kit (CDK) is a set of open-source libraries and tools that make writing applications on Hadoop easier. Unlike Cloudera ML, though, it's not focused on the data scientist's machine-learning tasks; it's directed at developers trying to build applications on Hadoop. It's really about the plumbing of a lot of different frameworks and pipelines and the integration of a lot of different components.
The purpose of the CDK is to provide higher level APIs on top of the existing Hadoop components in the CDH stack that codify a lot of patterns in common use cases.
CDK is prescriptive: it has an opinion on the way to do things and tries to make it easy for you to do the right thing by default, but it’s architected as a system of loosely coupled modules. You can use the modules independently of each other. It's not an uber-framework that you have to adopt wholesale; you can adopt it piecemeal. It doesn't force you into any particular programming paradigm, and it doesn't force you to adopt a ton of dependencies; you take on only the dependencies of the particular modules you want.
Let's look at an example. The first module in CDK is the data module, whose goal is to make it easier to work with datasets on Hadoop file systems. There are a lot of gory details to get right to make this work in practice: you have to worry about serialization, deserialization, compression, partitioning, directory layout, communicating that directory layout and partitioning to the other people who want to consume the data, and so on.
The CDK data module handles all this for you. It automatically serializes and deserializes data from Java POJOs, if that's what you have, or Avro records if you use them. It has built-in compression, and built-in policies around file and directory layouts so that you don't have to repeat a lot of these decisions and you get smart policies out of the box. It will automatically partition data within those layouts. It lets you focus on working on a dataset on HDFS instead of all the implementation details. It also has plugin providers for existing systems.
Imagine you're already using Hive and HCatalog as a metadata repository, and you've already got a schema for what these files look like. CDK integrates with that. It doesn't require you to define all of your metadata for your entire data repository from scratch. It integrates with existing systems.
You can learn more about the various CDK modules and how to use them in the documentation.
In summary, working with data from various sources, preparing and cleansing it, and processing it with Hadoop involves a lot of work. Tools such as Crunch, Cloudera ML, and the CDK make it easier to do this and to leverage Hadoop more effectively.

About the Author

Eli Collins is the tech lead for Cloudera's Platform team, an active contributor to Apache Hadoop and member of its project management committee (PMC) at the Apache Software Foundation. Eli holds Bachelor's and Master's degrees in Computer Science from New York University and the University of Wisconsin-Madison, respectively.

Thursday 19 June 2014

Cloudera Certification Practice Test



Cloudera Certification Practice Test Exam with answers
-------------------------------------------------------------------------------------

1. You have decided to develop a custom composite key class that you will use for keys emitted during the map phase to be consumed by a reducer. Which interface must this key implement?
Correct Answer: WritableComparable 
 
Explanation
All keys and values must implement the Writable interface so that they can be written to disk. In addition, keys emitted during the map phase must implement WritableComparable so that the keys can be sorted during the shuffle-and-sort phase. WritableComparable inherits from Writable; a sketch of such a composite key appears after the further-reading links below.

Further Reading
  • For more information, see the Yahoo! Developer Network Apache Hadoop Tutorial, Custom Key Types.
  • Hadoop: The Definitive Guide, chapter four, in the Serialization: The Writable Interface section.
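As a concrete illustration, a composite key holding a last name and a first name might look like the following (the class and field names are made up for this example):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Illustrative composite key: sorts by last name, then first name.
public class NameKey implements WritableComparable<NameKey> {
  private String lastName = "";
  private String firstName = "";

  public NameKey() {}  // Hadoop requires a no-argument constructor

  public NameKey(String lastName, String firstName) {
    this.lastName = lastName;
    this.firstName = firstName;
  }

  @Override
  public void write(DataOutput out) throws IOException {  // serialization (Writable)
    out.writeUTF(lastName);
    out.writeUTF(firstName);
  }

  @Override
  public void readFields(DataInput in) throws IOException {  // deserialization (Writable)
    lastName = in.readUTF();
    firstName = in.readUTF();
  }

  @Override
  public int compareTo(NameKey other) {  // drives sorting during shuffle and sort
    int cmp = lastName.compareTo(other.lastName);
    return (cmp != 0) ? cmp : firstName.compareTo(other.firstName);
  }

  @Override
  public int hashCode() {  // used by the default HashPartitioner
    return lastName.hashCode() * 163 + firstName.hashCode();
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof NameKey)) {
      return false;
    }
    NameKey k = (NameKey) o;
    return lastName.equals(k.lastName) && firstName.equals(k.firstName);
  }
}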
2. You’ve built a MapReduce job that denormalizes a very large table, resulting in an extremely large amount of output data. Which two cluster resources will your job stress? (Choose two).
Correct Answer: network I/O , disk I/O 
 
Explanation
When denormalizing a table, the amount of data written by the map phase will far exceed the amount of data read by the map phase. All of the data written during the map phase is first written to local disk and then transferred over the network to the reducers during the beginning of the reduce phase. Writing a very large amount of data in the map phase will therefore create a large amount of local disk I/O on the machines running map tasks, as well as network I/O. Because map output is stored in a fixed-size buffer that is written periodically to disk, this operation will not tax the memory of the machines running the map tasks. Denormalizing a table is not a compute-intensive operation, so it will not tax the processors of the machines running the map tasks either.

Further Reading
  • Hadoop: the Definitive Guide, chapter six, Shuffle and Sort: The Map Side section includes more information on the process for writing map output to disk.
  • Hadoop: the Definitive Guide, chapter six, Shuffle and Sort: The Reduce Side section explains further how data is transferred to the reducers
  • Denormalizing a table is a form of join operation. You can read more about performing joins in MapReduce in Join Algorithms using Map/Reduce
3. When is the earliest that the reduce() method of any reduce task in a given job called?
Correct Answer: Not until all map tasks have completed

Explanation
No reduce task's reduce() method is called until all map tasks have completed. Every reduce task's reduce() method expects to receive its data in sorted order. If the reduce() method were called before all of the map tasks had completed, it might receive data out of order.

Further Reading
  • Hadoop: The Definitive Guide, chapter six includes a detailed explanation of the shuffle and sort phase of a MapReduce job.
4.  You have 50 files in the directory /user/foo/example. Each file is 300MB. You submit a MapReduce job with /user/foo/example as the input path. 

How much data does a single Mapper process as this job executes?
Correct Answer: A single input split

Explanation
An input split is a unit of work (a chunk of data) that is processed by a single map task in a MapReduce program (represented by the Java interface InputSplit). The InputFormat you specify for the MapReduce program determines how input files are split into records and read. Each map task processes a single split; each split is in turn made up of records (key-value pairs), which the map task processes.

A MapReduce program run over a data set is usually called a MapReduce “job.” By splitting up input files, MapReduce can process a single file in parallel; if the file is very large, this represents a significant performance improvement. Also, because a single file is worked on in splits, MapReduce can schedule those tasks on nodes in the cluster that already have that piece of data stored locally, which also results in significant performance improvements. An InputSplit can span HDFS block boundaries.

Further Reading
  • Hadoop: The Definitive Guide, chapter two includes an excellent general discussion of input splits
Hadoop Administrator
5. In the execution of a MapReduce job, where does the Mapper place the intermediate data of each Map task?

Correct Answer: The Mapper stores the intermediate data on the underlying filesystem of the local disk of the machine which ran the Map task

Explanation
Intermediate data from a Mapper is stored on the local filesystem of the machine on which the Mapper ran. Each Reducer then copies its portion of that intermediate data to its own local disk. When the job terminates, all intermediate data is removed.

6. A client wishes to read a file from HDFS. To do that, it needs the block IDs (their names) which make up the file. From where does it obtain those block IDs?
Correct Answer: The NameNode reads the block IDs from memory and returns them to the client.

Explanation
When a client wishes to read a file from HDFS, it contacts the NameNode and requests the names and locations of the blocks which make up the file. For rapid access, the NameNode has the block IDs stored in RAM.

Further Reading
See Hadoop Operations, under the section "The Read Path." 
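For reference, the client side of that read path is only a few lines against the FileSystem API (the path below is illustrative); the NameNode lookup of block IDs and locations happens behind the open() call:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFromHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // open() asks the NameNode for the file's block IDs and locations;
    // the returned stream then reads the blocks from the DataNodes.
    FSDataInputStream in = fs.open(new Path("/user/foo/example/part-00000"));
    try {
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}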

7. Your cluster has slave nodes in three different racks, and you have written a rack topology script identifying each machine as being in rack1, rack2, or rack3. A client machine outside of the cluster writes a small (one-block) file to HDFS. The first replica of the block is written to a node on rack2. How is block placement determined for the other two replicas?
Correct Answer: Either both will be written to nodes on rack1, or both will be written to nodes on rack3.

Explanation
For the default threefold replication, Hadoop’s rack placement policy is to write the first copy of a block on a node in one rack, then the other two copies on two nodes in a different rack. Since the first copy is written to rack2, the other two will either be written to two nodes on rack1, or two nodes on rack3.

8. A slave node in your cluster has 24GB of RAM, and 12 physical processor cores on hyperthreading-enabled processors. You set the value of mapred.child.java.opts to -Xmx1G, and the value of mapred.tasktracker.map.tasks.maximum to 12. What is the most appropriate value to set for mapred.tasktracker.reduce.tasks.maximum?
Correct Answer: 6

Explanation
For optimal performance, it is important to avoid the use of virtual memory (swapping) on slave nodes. From the configuration shown, the slave node could run 12 Map tasks simultaneously, and each one will use 1GB of RAM, resulting in a total of 12GB used. The TaskTracker daemon itself will use 1GB of RAM, as will the DataNode daemon. This is a total of 14GB. The operating system will also use some RAM -- a reasonable estimate would be 1-2GB. Thus, we can expect to have approximately 8-9GB of RAM available for Reducers. So the most appropriate of the choices presented is that we should configure the node to be able to run 6 Reducers simultaneously.

Further Reading
Hadoop: The Definitive Guide, 3rd Edition, Chapter 9, under the section “Environment Settings”
HBase
9. Your client application needs to scan a region for the row key value 104. Given a store file that contains the following list of Row Key values:

100, 101, 102, 103, 104, 105, 106, 107
What would a Bloom filter return?
Correct Answer: Confirmation that 104 may be contained in the set
Explanation
A Bloom filter is a kind of probabilistic membership test -- it tells you whether an element might be a member of a set. It is quick and memory-efficient. The trade-off is that it is probabilistic: false positives are possible, though false negatives are not. Thus if your Bloom filter returns true, the key may be contained in the table; if it returns false, the key is definitely not contained in the table.

Enabling Bloom filters can save disk seeks and improve read latency.

Further Reading
  • The HBase documentation on Bloom filters (section 12.6.4, Bloom Filters) includes:
    Bloom Filters can be enabled per ColumnFamily. Use
    HColumnDescriptor.setBloomFilterType(NONE | ROW | ROWCOL) to enable blooms per ColumnFamily. Default = NONE for no bloom filters. If ROW, the hash of the row will be added to the bloom on each insert. If ROWCOL, the hash of the row + column family + column family qualifier will be added to the bloom on each key insert.
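As a minimal sketch, enabling a row-level Bloom filter when creating a table from the Java client might look like this (using the older HBaseAdmin/HColumnDescriptor API; the exact location of the BloomType enum varies between HBase versions, and the table and family names are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.regionserver.StoreFile;

public class CreateTableWithBloom {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor table = new HTableDescriptor("Products");
    HColumnDescriptor family = new HColumnDescriptor("info");
    family.setBloomFilterType(StoreFile.BloomType.ROW);  // ROW-level Bloom filter
    table.addFamily(family);

    admin.createTable(table);
  }
}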
10. You have two tables in an existing RDBMS. One contains information about the products you sell (name, size, color, etc.) The other contains images of the products in JPEG format. These tables are frequently joined in queries to your database. You would like to move this data into HBase. What is the most efficient schema design for this scenario?
Correct Answer: Create a single table with two column families

Explanation
Access patterns are an important factor in HBase schema design. Even though the two tables in this scenario have very different data sizes and formats, it is better to store them in one table if you are accessing them together most of the time.

Column families allow for separation of data. You can store different types and formats of data in different column families. Attributes such as compression, Bloom filters, and replication are set on a per-column-family basis. In this example, it is better to store the product information and the product images in two different column families within one table.

Further Reading
See the HBase documentation on column families (section 5.6), especially the part:
Physically, all column family members are stored together on the filesystem. Because tunings and storage specifications are done at the column family level, it is advised that all column family members have the same general access pattern and size characteristics.
11. You need to create a Blogs table in HBase. The table will consist of a single Column Family called Content and two column qualifiers, Author and Comment. What HBase shell command should you use to create this table?
Correct Answer: create 'Blogs', 'Content'

Explanation
When you create an HBase table, you need to specify the table name and a column family name.
In the Hbase Shell, you can create a table with the command:
create 'myTable', 'myColumnFamily'

For this example:
  • Table name: 'Blogs'
  • ColumnFamily: 'Content'
We can create the table and verify it with the describe command.
hbase> create 'Blogs', 'Content'
hbase> describe 'Blogs'
{NAME => 'Blogs', FAMILIES => [{NAME => 'Content', ....

Further Reading
see the HBase Shell commands for create

Create table; pass table name, a dictionary of specifications per column family, and optionally a dictionary of table configuration. Dictionaries are described below in the GENERAL NOTES section.
Examples:
hbase> create 't1', {NAME => 'f1', VERSIONS => 5}
hbase> create 't1', {NAME => 'f1'}, {NAME => 'f2'}, {NAME => 'f3'}
hbase> # The above in shorthand would be the following:
hbase> create 't1', 'f1', 'f2', 'f3'
hbase> create 't1', {NAME => 'f1', VERSIONS => 1, TTL => 2592000, BLOCKCACHE => true}

12. From within an HBase application you are performing a check-and-put operation with the following code:

table.checkAndPut(Bytes.toBytes("rowkey"), Bytes.toBytes("colfam"), Bytes.toBytes("qualifier"), Bytes.toBytes("barvalue"), newrow);

Which describes this check and put operation?
 
Correct Answer: Check if rowkey/colfam/qualifier exists and has the cell value "barvalue". If so, put the values in newrow and return "true".

Explanation
The checkAndPut method returns true if the cell at the given row, column family, and column qualifier matches the expected value; if so, it executes the put with the new value. If the expected value is not present in the row, it returns false and the put is not executed.
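A minimal sketch of that call in context (0.9x-era client API; the table name, column family, and values are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class CheckAndPutExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "myTable");

    // The Put that will be applied only if the check succeeds.
    Put newrow = new Put(Bytes.toBytes("rowkey"));
    newrow.add(Bytes.toBytes("colfam"), Bytes.toBytes("qualifier"),
               Bytes.toBytes("newvalue"));

    // Atomically: if rowkey/colfam/qualifier currently holds "barvalue",
    // apply the Put and return true; otherwise do nothing and return false.
    boolean applied = table.checkAndPut(
        Bytes.toBytes("rowkey"),
        Bytes.toBytes("colfam"),
        Bytes.toBytes("qualifier"),
        Bytes.toBytes("barvalue"),
        newrow);

    System.out.println("put applied: " + applied);
    table.close();
  }
}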


Data Science Essentials (DS-200)
13. What is the best way to evaluate the quality of the model found by an unsupervised algorithm like k-means clustering, given metrics for the cost of the clustering (how well it fits the data) and its stability (how similar the clusters are across multiple runs over the same data)?
Correct Answer: The lowest cost clustering subject to a stability constraint
 
Explanation
There is a tradeoff between cost and stability in unsupervised learning.  The more tightly you fit the data, the less stable the model will be, and vice versa.  The idea is to find a good balance with more weight given to the cost.  Typically a good approach is to set a stability threshold and select the model that achieves the lowest cost above the stability threshold.


14. A sandwich shop studies the number of men and women that enter the shop during the lunch hour, from noon to 1pm each day. They find that the number of men that enter can be modeled as a random variable with distribution Poisson(M), and likewise the number of women that enter as Poisson(W). What is likely to be the best model of the total number of customers that enter during the lunch hour?
Correct Answer: Poisson(M+W)

Explanation
The total number of customers is the sum of the number of men and women. The sum of two independent Poisson random variables also follows a Poisson distribution, with rate equal to the sum of their rates. The Normal and Binomial distributions can approximate the Poisson distribution in certain cases, but the other expressions offered do not approximate Poisson(M+W).

15. Which condition supports the choice of a support vector machine over logistic regression for a classification problem?
Correct Answer: The test set will be dense, and contain examples close to the decision boundary learned from the training set
 
Explanation
The SVM algorithm is a maximum margin classifier, and tries to pick a decision boundary that creates the widest margin between classes, rather than just any boundary that separates the classes. This helps generalization to test data, since it is less likely to misclassify points near the decision boundary, as the boundary maintains a large margin from training examples.

SVMs are not particularly better at multi-label classification. Linear separability is not required for either classification technique, and does not relate directly to an advantage of SVMs. SVMs are not particularly more suited to low-dimensional data.
