Cloudera Certification Practice Test Exam with answers
-------------------------------------------------------------------------------------
1. You have decided to develop a custom composite key class
that you will use for keys emitted during the map phase to be
consumed by a reducer. Which interface must this key implement?
Correct Answer: WritableComparable
Explanation
All keys and values must implement the Writable
interface so that they can be written to disk. In addition, keys
emitted during the map phase must implement WritableComparable
so that the keys can be sorted during the shuffle and sort phase.
WritableComparable
inherits from Writable.
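As a concrete illustration, here is a minimal sketch of such a composite key. The class name and fields (a customer ID paired with a timestamp) are invented for the example; what matters is that it supplies the Writable serialization methods and the compareTo() ordering used during shuffle and sort:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: a customer ID paired with a timestamp.
public class CustomerTimeKey implements WritableComparable<CustomerTimeKey> {

    private String customerId = "";
    private long timestamp;

    public CustomerTimeKey() {}   // no-arg constructor required by Hadoop

    public CustomerTimeKey(String customerId, long timestamp) {
        this.customerId = customerId;
        this.timestamp = timestamp;
    }

    @Override
    public void write(DataOutput out) throws IOException {     // Writable: serialize
        out.writeUTF(customerId);
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {  // Writable: deserialize
        customerId = in.readUTF();
        timestamp = in.readLong();
    }

    @Override
    public int compareTo(CustomerTimeKey other) {   // ordering used in shuffle and sort
        int cmp = customerId.compareTo(other.customerId);
        return cmp != 0 ? cmp : Long.compare(timestamp, other.timestamp);
    }

    @Override
    public int hashCode() {   // consulted by the default HashPartitioner
        return customerId.hashCode() * 163 + (int) (timestamp ^ (timestamp >>> 32));
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof CustomerTimeKey)) return false;
        CustomerTimeKey k = (CustomerTimeKey) o;
        return customerId.equals(k.customerId) && timestamp == k.timestamp;
    }
}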
2. You’ve built a MapReduce job that denormalizes
a very large table, resulting in an extremely large amount of output
data. Which two cluster resources will your job stress? (Choose two).
Correct Answer: network I/O, disk I/O
Explanation
When denormalizing a table, the amount of data written
by the map phase will far exceed the amount of data read by the map
phase. All of the data written during the map phase is first
written to local disk and then transferred over the network to the
reducers at the beginning of the reduce phase. Writing a very large
amount of data in the map phase will therefore create a large amount
of local disk I/O on the machines running map tasks, as well as
network I/O. Because map output is stored in a fixed-size buffer that
is periodically written to disk, this operation will not tax the
memory of the machines running the map tasks. Denormalizing a table
is not a compute-intensive operation, so it will not tax the
processors of the machines running the map tasks.
Further Reading
Hadoop: The Definitive Guide, chapter six, the "Shuffle and Sort: The
Map Side" section includes more information on the process for
writing map output to disk.
Hadoop: The Definitive Guide, chapter six, the "Shuffle and Sort: The
Reduce Side" section explains further how data is transferred to the
reducers.
3. When is the earliest that the reduce()
method of any reduce task in a given job called?
Correct Answer: Not until all map tasks have
completed
Explanation
No reduce task's reduce()
method is called until all map tasks have completed. Every reduce
task's reduce()
method expects to receive its data in sorted order. If the reduce()
method were called before all of the map tasks had completed, it
could receive its data out of order.
4. You have 50 files in the directory
/user/foo/example.
Each file is 300MB. You submit a MapReduce job
with /user/foo/example
as the input path.
How much data does a single
Mapper process as this job executes?
Correct Answer: A single input split
Explanation
An input split is a unit of work (a chunk of data) that
is processed by a single map task in a MapReduce program (represented
by the Java interface InputSplit).
The InputFormat you specify for a MapReduce program determines how
input files are split into records and read. Each map task processes
a single split; each split comprises records (key-value pairs) which
the map task processes.
A MapReduce program run over a data set is usually called a MapReduce
“job.” By splitting up input files, MapReduce can process a single
file in parallel; if the file is very large, this represents a
significant performance improvement. Also, because a single file is
processed as splits, MapReduce can schedule those map tasks on
different nodes in the cluster, ideally nodes that already store that
piece of data locally, which also results in significant performance
improvements. An InputSplit can span HDFS block boundaries.
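A minimal driver sketch illustrates the relationship. The class name, output path, and the 128MB block size below are assumptions made only for illustration; TextInputFormat (the default) creates roughly one split per HDFS block:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplitExampleDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split example");
        job.setJarByClass(SplitExampleDriver.class);
        job.setMapperClass(Mapper.class);   // identity mapper, for illustration only

        // With 50 files of 300MB each and, say, a 128MB block size, each file
        // yields about 3 splits, so roughly 150 map tasks run; each map task
        // processes exactly one split.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/user/foo/example"));
        FileOutputFormat.setOutputPath(job, new Path("/user/foo/example-out"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}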
Hadoop Administrator
5. In the execution of a MapReduce job, where does
the Mapper place the intermediate data of each Map task?
Correct Answer: The Mapper stores the intermediate
data on the underlying filesystem of the local disk of the machine
which ran the Map task
Explanation
Intermediate data from a Mapper is stored on the local
filesystem of the machine on which the Mapper ran. Each Reducer then
copies its portion of that intermediate data to its own local disk.
When the job terminates, all intermediate data is removed.
6. A client wishes to read a file from HDFS. To do
that, it needs the block IDs (their names) which make up the file.
From where does it obtain those block IDs?
Correct Answer: The NameNode reads the block IDs
from memory and returns them to the client.
Explanation
When a client wishes to read a file from HDFS, it
contacts the NameNode and requests the names and locations of the
blocks which make up the file. For rapid access, the NameNode has the
block IDs stored in RAM.
Further Reading
See
Hadoop Operations, under the section "The Read Path."
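To make the read path concrete, here is a minimal client-side sketch using the FileSystem API (the file path is hypothetical). The NameNode lookup described above happens behind the open() call; the data itself then streams directly from the DataNodes holding the blocks:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // open() asks the NameNode for the file's block IDs and locations;
        // reads are then served by the DataNodes that store those blocks.
        try (FSDataInputStream in = fs.open(new Path("/user/foo/data.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}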
7. Your cluster has slave nodes in three different
racks, and you have written a rack topology script identifying each
machine as being in rack1,
rack2,
or rack3.
A client machine outside of the cluster writes a small (one-block)
file to HDFS. The first replica of the block is written to a node on
rack2.
How is block placement determined for the other two replicas?
Correct Answer: Either both will be written to nodes
on rack1, or both will be written to nodes on rack3.
Explanation
For the default threefold replication, Hadoop’s rack
placement policy is to write the first copy of a block on a node in
one rack, then the other two copies on two nodes in a different rack.
Since the first copy is written to rack2,
the other two will either be written to two nodes on rack1,
or two nodes on rack3.
8. A slave node in your cluster has 24GB of RAM, and
12 physical processor cores on hyperthreading-enabled processors. You
set the value of mapred.child.java.opts
to -Xmx1G,
and the value of mapred.tasktracker.map.tasks.maximum to
12.
What is the most appropriate value to set for
mapred.tasktracker.reduce.tasks.maximum?
Correct Answer: 6
Explanation
For optimal performance, it is important to avoid the
use of virtual memory (swapping) on slave nodes. From the
configuration shown, the slave node could run 12 Map tasks
simultaneously, and each one will use 1GB of RAM, resulting in a
total of 12GB used. The TaskTracker daemon itself will use 1GB of
RAM, as will the DataNode daemon. This is a total of 14GB. The
operating system will also use some RAM -- a reasonable estimate
would be 1-2GB. Thus, we can expect to have approximately 8-9GB of
RAM available for Reducers. So the most appropriate of the choices
presented is that we should configure the node to be able to run 6
Reducers simultaneously.
Further Reading
Hadoop:
The Definitive Guide, 3rd Edition, Chapter 9, under the
section “Environment Settings”
HBase
9. Your client application needs to scan a region
for the row key value 104.
Given a store file that contains the following list of Row Key
values:
100, 101, 102, 103, 104, 105, 106, 107
What would a bloom filter return?
Correct Answer: Confirmation that 104 may be contained in the set
Explanation
A Bloom filter is a probabilistic membership test -- it tells you
whether an element might be a member of a set. It is quick and
memory-efficient. The trade-off is that it is probabilistic: false
positives are possible, though false negatives are not. Thus, if the
Bloom filter returns true, the key may be contained in the table; if
it returns false, the key is definitely not contained in the table.
Enabling Bloom filters can save disk seeks and improve read latency.
Further Reading
The HBase documentation on Bloom filters (section 12.6.4, Bloom
Filters) includes:
Bloom Filters can be enabled per ColumnFamily. Use
HColumnDescriptor.setBloomFilterType(NONE | ROW | ROWCOL) to enable
blooms per ColumnFamily. Default = NONE for no bloom filters. If ROW,
the hash of the row will be added to the bloom on each insert. If
ROWCOL, the hash of the row + column family + column family qualifier
will be added to the bloom on each key insert.
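As a sketch of the setBloomFilterType call quoted above, assuming the HBase 1.x Java client API (the table and column family names are placeholders):

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.regionserver.BloomType;

public class BloomFilterExample {
    // Enable a row-level Bloom filter on a column family at table-creation time.
    static void createTableWithBloom(Admin admin) throws Exception {
        HTableDescriptor table = new HTableDescriptor(TableName.valueOf("myTable"));
        HColumnDescriptor colFam = new HColumnDescriptor("myColumnFamily");
        colFam.setBloomFilterType(BloomType.ROW);  // ROW: row-key hash added on each insert
        table.addFamily(colFam);
        admin.createTable(table);
    }
}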
10. You have two tables in an existing RDBMS. One
contains information about the products you sell (name, size, color,
etc.) The other contains images of the products in JPEG format. These
tables are frequently joined in queries to your database. You would
like to move this data into HBase. What is the most efficient schema
design for this scenario?
Correct Answer: Create a single table with two
column families
Explanation
Access patterns are an important factor in HBase schema
design. Even though the two tables in this scenario have very
different data sizes and formats, it is better to store them in one
table if you are accessing them together most of the time.
Column families allow for separation of data. You can store
different types and formats of data in different column families.
Attributes such as compression, Bloom filters, and replication are
set on a per-column-family basis. In this example, it is better to
store product information and product images in two different column
families within a single table.
Further Reading
The HBase documentation on Column Family, section 5.6, especially
the part:
Physically, all column family
members are stored together on the filesystem. Because tunings and
storage specifications are done at the column family level, it is
advised that all column family members have the same general access
pattern and size characteristics.
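A possible sketch of this design, again assuming the HBase 1.x admin API; the table name, family names, and compression choices are illustrative assumptions, showing how per-family attributes can differ within one table:

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.io.compress.Compression;

public class ProductTableExample {
    // One table, two column families: small textual attributes and large JPEG blobs.
    static void createProductTable(Admin admin) throws Exception {
        HTableDescriptor table = new HTableDescriptor(TableName.valueOf("products"));

        HColumnDescriptor info = new HColumnDescriptor("info");
        info.setCompressionType(Compression.Algorithm.SNAPPY);  // text compresses well

        HColumnDescriptor image = new HColumnDescriptor("image");
        image.setCompressionType(Compression.Algorithm.NONE);   // JPEGs are already compressed

        table.addFamily(info);
        table.addFamily(image);
        admin.createTable(table);
    }
}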
11. You need to create a Blogs table in HBase. The table will
consist of a single Column Family called Content and two column
qualifiers, Author and Comment. What HBase shell command should you
use to create this table?
Correct Answer: create 'Blogs', 'Content'
Explanation
When you create an HBase table, you need to specify the table name
and the column family name.
In the HBase shell, you can create a table with the command:
create 'myTable', 'myColumnFamily'
For this example:
Table name: 'Blogs'
ColumnFamily: 'Content'
We can create the table and verify it with the describe command.
hbase> create 'Blogs', 'Content'
hbase> describe 'Blogs' {Name => 'Blogs', FAMILIES => [{NAME => 'Content', ....
Further Reading
See the HBase shell commands for create:
Create table; pass table
name, a dictionary of specifications per column family, and
optionally a dictionary of table configuration. Dictionaries are
described below in the GENERAL NOTES section.
Examples:
hbase> create 't1', {NAME => 'f1', VERSIONS => 5}
hbase> create 't1', {NAME => 'f1'}, {NAME => 'f2'}, {NAME => 'f3'}
hbase> # The above in shorthand would be the following:
hbase> create 't1', 'f1', 'f2', 'f3'
hbase> create 't1', {NAME => 'f1', VERSIONS => 1, TTL => 2592000, BLOCKCACHE => true}
12. From within an HBase application you are performing a check and
put operation with the following code:
table.checkAndPut(Bytes.toBytes("rowkey"), Bytes.toBytes("colfam"),
    Bytes.toBytes("qualifier"), Bytes.toBytes("barvalue"), newrow);
Which describes this check and put operation?
Correct Answer: Check if rowkey/colfam/qualifier
exists and has the cell value "barvalue". If so, put the
values in newrow and return "true".
Explanation
The method checkAndPut returns "true" if the cell at the given row,
column family, and column qualifier matches the expected value; when
the check succeeds, the put is executed with the new value. If the
expected value is not present in the row, checkAndPut returns "false"
and the put is not executed.
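For completeness, here is a self-contained sketch of the same call, assuming the HBase 1.x Table API; the Put named newrow and the value "newvalue" are filled in only for illustration:

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CheckAndPutExample {
    // Atomically update a row only if its current cell value matches "barvalue".
    static boolean conditionalUpdate(Table table) throws Exception {
        Put newrow = new Put(Bytes.toBytes("rowkey"));
        newrow.addColumn(Bytes.toBytes("colfam"), Bytes.toBytes("qualifier"),
                         Bytes.toBytes("newvalue"));

        // Returns true (and applies the Put) only when the current value of
        // rowkey/colfam/qualifier equals "barvalue"; otherwise returns false.
        return table.checkAndPut(Bytes.toBytes("rowkey"),
                                 Bytes.toBytes("colfam"),
                                 Bytes.toBytes("qualifier"),
                                 Bytes.toBytes("barvalue"),
                                 newrow);
    }
}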
Data Science Essentials (DS-200)
13. What is the best way to evaluate the quality of
the model found by an unsupervised algorithm like k-means
clustering, given metrics for the cost of the clustering (how well it
fits the data) and its stability (how similar the clusters are across
multiple runs over the same data)?
Correct Answer: The lowest cost clustering subject
to a stability constraint
Explanation
There is a tradeoff between cost and stability in
unsupervised learning. The more tightly you fit the data, the
less stable the model will be, and vice versa. The idea is to
find a good balance, with more weight given to the cost. A typical
approach is to set a stability threshold and select the model that
achieves the lowest cost among those meeting the threshold.
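The selection rule can be sketched in a few lines; the candidate models and their cost/stability numbers below are invented purely for illustration:

import java.util.ArrayList;
import java.util.List;

public class ModelSelectionExample {
    // Hypothetical summary of one clustering run: its cost and its stability score.
    static class Candidate {
        final String name; final double cost; final double stability;
        Candidate(String name, double cost, double stability) {
            this.name = name; this.cost = cost; this.stability = stability;
        }
    }

    // Pick the lowest-cost clustering among those meeting a stability threshold.
    static Candidate select(List<Candidate> runs, double minStability) {
        Candidate best = null;
        for (Candidate c : runs) {
            if (c.stability >= minStability && (best == null || c.cost < best.cost)) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Candidate> runs = new ArrayList<>();
        runs.add(new Candidate("k=3", 120.0, 0.95));
        runs.add(new Candidate("k=5",  80.0, 0.90));
        runs.add(new Candidate("k=8",  60.0, 0.60));  // tight fit, but unstable

        // With a 0.85 stability threshold, "k=5" is chosen:
        // the lowest cost among models that are stable enough.
        System.out.println(select(runs, 0.85).name);
    }
}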
14. A sandwich shop studies the number of men and women who enter
the shop during the lunch hour, from noon to 1pm, each day. They find
that the number of men who enter can be modeled as a random variable
with distribution Poisson(M), and likewise the number of women who
enter as Poisson(W). What is likely to be the best model of the total
number of customers that enter during the lunch hour?
Correct Answer: Poisson(M+W)
Explanation
The total number of customers is the sum of the number of men and
women. The sum of two independent Poisson random variables also
follows a Poisson distribution, with rate equal to the sum of their
rates. The Normal and Binomial distributions can approximate the
Poisson distribution in certain cases, but the expressions above do
not approximate Poisson(M+W).
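For the step the explanation glosses over, the standard convolution argument (assuming the counts of men and women are independent) shows why: with X ~ Poisson(M), Y ~ Poisson(W), and Z = X + Y,

P(Z = k) = \sum_{j=0}^{k} P(X = j)\,P(Y = k - j)
         = \sum_{j=0}^{k} \frac{e^{-M} M^{j}}{j!} \cdot \frac{e^{-W} W^{k-j}}{(k-j)!}
         = \frac{e^{-(M+W)}}{k!} \sum_{j=0}^{k} \binom{k}{j} M^{j} W^{k-j}
         = \frac{e^{-(M+W)} (M+W)^{k}}{k!},

which, by the binomial theorem, is exactly the Poisson(M+W) probability mass function.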
15. Which condition supports the choice of a support
vector machine over logistic regression for a classification problem?
Correct Answer: The test set will be dense, and
contain examples close to the decision boundary learned from the
training set
Explanation
The SVM algorithm is a maximum margin classifier, and
tries to pick a decision boundary that creates the widest margin
between classes, rather than just any boundary that separates the
classes. This helps generalization to test data, since it is less
likely to misclassify points near the decision boundary, as the
boundary maintains a large margin from training examples.
SVMs
are not particularly better at multi-label classification. Linear
separability is not required for either classification technique, and
does not relate directly to an advantage of SVMs. SVMs are not
particularly better suited to low-dimensional data.