Monday, 20 October 2014

Let's Discuss Hbase - Row key Design

HBase Row-Key Design: Design Solutions and Pros/Cons


Row-Key Design
Try to keep row keys short. They are stored with each cell in an HBase table, so noticeably reducing 
row-key size reduces the amount of storage HBase data needs. This advice also applies to column family names.

A common problem is choosing between sequential row keys and randomly distributed row keys. The two 
options, with their pros and cons, are listed below.

Some mixed-design approaches allow fast range scans while still distributing data across all region servers 
in the cluster when writing data that is sequential by nature.

Design Solution: Using sequential row keys (e.g. time-series data with a row key built from the timestamp)
Pros: Makes fast range scans possible by setting start/stop keys on the Scanner
Cons: Creates hotspotting on a single RegionServer when writing data (because row keys arrive in sequence, 
all new records are written into a single region at a time)
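
To make the range-scan advantage concrete, here is a minimal sketch in plain Java (no HBase dependency). It assumes row keys are non-negative timestamps encoded as 8 big-endian bytes, and it models a scan as a filter over a list of keys with an inclusive start key and an exclusive stop key. The class and method names are hypothetical and only illustrate the ordering idea; they are not part of the HBase API.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

final class SequentialKeySketch {

    // Encode a timestamp as 8 big-endian bytes. For non-negative values,
    // byte-wise order of the keys then matches chronological order.
    static byte[] rowKey(long timestampMillis) {
        return ByteBuffer.allocate(Long.BYTES).putLong(timestampMillis).array();
    }

    // Unsigned lexicographic comparison of two byte arrays,
    // the kind of ordering HBase applies to row keys.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Simplified model of a range scan: start key inclusive, stop key exclusive.
    static List<byte[]> scan(List<byte[]> keys, byte[] start, byte[] stop) {
        List<byte[]> hits = new ArrayList<>();
        for (byte[] key : keys) {
            if (compare(key, start) >= 0 && compare(key, stop) < 0) {
                hits.add(key);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<byte[]> keys = new ArrayList<>();
        for (long ts = 1000; ts <= 5000; ts += 1000) {
            keys.add(rowKey(ts));
        }
        // Keys 2000 and 3000 fall inside [2000, 4000), so this prints 2.
        System.out.println(scan(keys, rowKey(2000), rowKey(4000)).size());
    }
}
```

Because the keys sort in time order, the matching rows sit next to each other, which is also why all fresh writes land in the same region.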

Design Solution: Using randomly distributed row keys (e.g. UUIDs)
Pros: Aims for the fastest write performance by distributing new records over random regions
Cons: Does not allow fast range scans over the written data
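
One mixed design mentioned above can be sketched as a "salted" key: a small bucket prefix spreads writes, while the timestamp that follows keeps each bucket range-scannable. The sketch below is plain Java and rests on several assumptions that are not from the original post: a bucket count of 8, a bucket derived from a record's source id, and a `bucket|timestamp|sourceId` string layout. All names are hypothetical.

```java
import java.util.Locale;

final class SaltedRowKeySketch {

    // Hypothetical bucket count; a real value would be tuned to the cluster.
    static final int BUCKETS = 8;

    // The bucket comes from a stable attribute of the record (here a source id),
    // so the same source always lands in the same bucket.
    static int bucketOf(String sourceId) {
        return Math.floorMod(sourceId.hashCode(), BUCKETS);
    }

    // Key layout: <2-digit bucket>|<13-digit zero-padded millis>|<source id>.
    // Assumes a non-negative timestamp that fits in 13 digits.
    static String rowKey(String sourceId, long timestampMillis) {
        return String.format(Locale.ROOT, "%02d|%013d|%s",
                bucketOf(sourceId), timestampMillis, sourceId);
    }

    // Start (inclusive) and stop (exclusive) keys for one bucket's slice of a time range.
    static String[] scanRange(int bucket, long fromMillis, long toMillis) {
        return new String[] {
            String.format(Locale.ROOT, "%02d|%013d", bucket, fromMillis),
            String.format(Locale.ROOT, "%02d|%013d", bucket, toMillis)
        };
    }

    public static void main(String[] args) {
        long t = 1413763200000L; // an arbitrary example timestamp
        System.out.println(rowKey("sensor-1", t));
        // A time-range query needs one scan per bucket.
        for (int bucket = 0; bucket < BUCKETS; bucket++) {
            String[] range = scanRange(bucket, t, t + 60_000);
            System.out.println(range[0] + " .. " + range[1]);
        }
    }
}
```

The trade-off is that a time-range read becomes one scan per bucket instead of a single scan, in exchange for writes that are spread over several regions.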


Column Families
Currently, HBase does not do well with more than two or three column families per table, so 
keep the number of column families in your schema low. Try to make do with one column family 
if you can. Only introduce a second or third column family when data access is usually 
column-scoped, i.e. you usually query no more than a single column family at a time.

You can also set a TTL (in seconds) for a column family. HBase automatically deletes data in that family 
once it reaches the expiration time.

Versions
The maximum number of row versions that can be stored is configured per column family (the default was 3 in 
older releases and is 1 from HBase 0.96 onward). This is an important parameter because HBase does not 
overwrite row values, but rather stores different values per row by time (and qualifier). Setting the maximum 
number of versions to an exceedingly high level (e.g., hundreds or more) is not a good idea because it 
greatly increases StoreFile size.

The minimum number of row versions to keep can also be configured per column family (the default is 0, meaning 
the feature is disabled). This parameter is used together with the TTL and maximum row versions parameters to allow 
configurations such as “keep the last T minutes worth of data, at least M versions and at most N versions.” 
It should only be set when TTL is enabled for a column family, and it must be less than the maximum number of row versions.
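
The interaction of TTL, minimum versions and maximum versions can be illustrated with a small model in plain Java. This is a simplified sketch of the rule quoted above, not HBase's actual implementation; the class name, method name and parameters are hypothetical, and it assumes the version timestamps of one cell are supplied newest first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class VersionRetentionSketch {

    // Returns the version timestamps that survive, given timestamps sorted newest first.
    // Rule modelled: keep at most maxVersions; drop versions older than the TTL,
    // except that the newest minVersions are kept even when expired.
    static List<Long> retained(List<Long> newestFirst, long now, long ttlMillis,
                               int minVersions, int maxVersions) {
        List<Long> kept = new ArrayList<>();
        for (long ts : newestFirst) {
            if (kept.size() >= maxVersions) {
                break; // never more than N versions
            }
            boolean expired = now - ts > ttlMillis;
            if (expired && kept.size() >= minVersions) {
                break; // expired, and M versions are already kept
            }
            kept.add(ts);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Long> versions = Arrays.asList(95_000L, 90_000L, 50_000L, 30_000L, 10_000L);
        long now = 100_000L;
        long ttl = 60_000L;
        // Three versions are within the TTL, so min=2 and max=5 keeps those three.
        System.out.println(retained(versions, now, ttl, 2, 5));
        // With min=4, one expired version is kept as well to reach the minimum.
        System.out.println(retained(versions, now, ttl, 4, 5));
    }
}
```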

Data Types
HBase supports a “bytes-in/bytes-out” interface via Put and Result, so anything that can be converted to an 
array of bytes can be stored as a value. Input can be strings, numbers, complex objects, or even images, as long as they can be rendered as bytes.
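
As a small illustration of the "bytes-in/bytes-out" idea, the sketch below converts a string and a number to byte arrays and back using only the Java standard library. The class and method names are hypothetical; HBase ships its own helper class, org.apache.hadoop.hbase.util.Bytes, for this kind of conversion, which this sketch does not use so that it stays self-contained.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class BytesSketch {

    // String -> UTF-8 bytes, suitable as a stored value.
    static byte[] toBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // UTF-8 bytes -> String, for reading the value back.
    static String asString(byte[] b) {
        return new String(b, StandardCharsets.UTF_8);
    }

    // long -> 8 big-endian bytes.
    static byte[] toBytes(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    // 8 big-endian bytes -> long.
    static long asLong(byte[] b) {
        return ByteBuffer.wrap(b).getLong();
    }

    public static void main(String[] args) {
        byte[] name = toBytes("student-42");
        byte[] score = toBytes(97L);
        // Prints "student-42 97" after the round trip through byte arrays.
        System.out.println(asString(name) + " " + asLong(score));
    }
}
```

The reader of a value has to know which conversion was used when writing it, since the stored bytes carry no type information.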

One supported data type that deserves special mention is the “counters” type. 
This type enables atomic increments of numbers.

Some Case study for designing the row key-

Impala JDBC Connection

Cloudera Impala is an open source Massively Parallel Processing (MPP) query engine that runs natively on Apache Hadoop. With Impala, analysts and data scientists now have the ability to perform real-time, “speed of thought” analytics on data stored in Hadoop via SQL or through Business Intelligence (BI) tools. The result is that large-scale data processing (via MapReduce) and interactive queries can be done on the same system using the same data and metadata – removing the need to migrate data sets into specialized systems and/or proprietary formats simply to perform analysis.

The following sample program connects to impalad from a client machine through JDBC.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class DriverTest {

    private static final String SQL_STATEMENT = "select sid from student limit 5";

    // set the impalad host (host name only, without an ssh-style "user@" prefix)
    private static final String IMPALAD_HOST = "ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com";

    // port 21050 is the default impalad JDBC port
    private static final String IMPALAD_JDBC_PORT = "21050";

    private static final String CONNECTION_URL =
            "jdbc:hive2://" + IMPALAD_HOST + ':' + IMPALAD_JDBC_PORT + "/;auth=noSasl";

    private static final String JDBC_DRIVER_NAME = "org.apache.hive.jdbc.HiveDriver";

    /**
     * @param args unused
     * @throws ClassNotFoundException if the Hive JDBC driver is not on the classpath
     */
    public static void main(String[] args) throws ClassNotFoundException {

        System.out.println("\n=============================================");
        System.out.println("Cloudera Impala JDBC Example");
        System.out.println("Using Connection URL: " + CONNECTION_URL);
        System.out.println("Running Query: " + SQL_STATEMENT);

        // load the driver class before asking DriverManager for a connection
        Class.forName(JDBC_DRIVER_NAME);

        // try-with-resources closes the result set, statement and connection,
        // and avoids calling close() on a connection that was never opened
        try (Connection con = DriverManager.getConnection(CONNECTION_URL);
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(SQL_STATEMENT)) {

            System.out.println("\n== Begin Query Results ======================");

            // print the results to the console
            while (rs.next()) {
                // the example query returns one column, read here as a String
                System.out.println(rs.getString(1));
            }

            System.out.println("== End Query Results =======================\n\n");

        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}