Friday 31 October 2014

HBase via Hive

One of the things I'm frequently asked about is how to use HBase from Apache Hive. Not just how to do it, but what works, how well it works, and how to make good use of it. I've done a bit of research in this area, so hopefully this will be useful to someone besides myself. This is a topic we did not get to cover in HBase in Action; perhaps these notes will become the basis for the 2nd edition ;) These notes are applicable to Hive 0.11.x used in conjunction with HBase 0.94.x. They should be largely applicable to 0.12.x + 0.96.x, though I haven't tested everything yet.

The Hive project includes an optional library for interacting with HBase. This is where the bridge layer between the two systems is implemented. The primary interface you use when accessing HBase from Hive queries is called the HBaseStorageHandler. You can also interact with HBase tables directly via input and output formats, but the handler is simpler and works for most uses.

HBase tables from Hive

Use the HBaseStorageHandler to register HBase tables with the Hive metastore. You can optionally specify the HBase table as EXTERNAL, in which case Hive will not create or drop that table directly – you'll have to use the HBase shell to do so.
CREATE [EXTERNAL] TABLE foo(...)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
TBLPROPERTIES ('hbase.table.name' = 'bar');
The above statement registers the HBase table named bar in the Hive metastore, accessible from Hive by the name foo.
Under the hood, HBaseStorageHandler is delegating interaction with the HBase table to HiveHBaseTableInputFormat and HiveHBaseTableOutputFormat. You can register your HBase table in Hive using those classes directly if you desire. The above statement is roughly equivalent to:
CREATE TABLE foo(...)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat'
TBLPROPERTIES ('hbase.table.name' = 'bar');
Also provided is the HiveHFileOutputFormat, which means it should be possible to generate HFiles for bulkloading from Hive as well. In practice, I haven't gotten this to work end-to-end (see HIVE-4627).
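For reference, the bulk-load path would register a table whose output format writes HFiles to HDFS rather than writing through the RegionServers. The following is only a rough sketch of what that registration might look like; the hfile.family.path property and the paths shown are assumptions on my part, and as noted above I haven't verified this end-to-end.
-- hypothetical sketch: write HFiles for column family 'f' to HDFS
-- (property name and paths are assumptions, not verified)
CREATE TABLE foo_hfiles(rowkey STRING, a STRING, b STRING)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHFileOutputFormat'
TBLPROPERTIES ('hfile.family.path' = '/tmp/foo_hfiles/f');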

Schema mapping

Registering the table is only the first step. As part of that registration, you also need to specify a column mapping. This is how you link Hive column names to the HBase table’s rowkey and columns. Do so using the hbase.columns.mapping SerDe property.
CREATE TABLE foo(rowkey STRING, a STRING, b STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:c1,f:c2')
TBLPROPERTIES ('hbase.table.name' = 'bar');
The values provided in the mapping property correspond one-for-one with column names of the Hive table. HBase column names are fully qualified by column family, and you use the special token :key to represent the rowkey. The above example makes rows from the HBase table bar available via the Hive table foo. The foo column rowkey maps to the HBase table's rowkey, a to c1 in the f column family, and b to c2, also in the f family.

You can also associate Hive’s MAP data structures to HBase column families. In this case, only the STRING Hive type is used. The other Hive type currently supported is BINARY. See the wiki page for more examples.
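For example, a whole column family can be exposed as a Hive MAP by leaving the column qualifier off in the mapping. A minimal sketch (the Hive table name here is made up; the HBase table and family follow the earlier example):
CREATE TABLE foo_map(rowkey STRING, f MAP<STRING, STRING>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:')
TBLPROPERTIES ('hbase.table.name' = 'bar');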

Interacting with data

With the column mappings defined, you can now access HBase data just like you would any other Hive data. Only simple query predicates are currently supported.
SELECT * FROM foo WHERE ...;
You can also populate an HBase table using Hive. This works with both INTO and OVERWRITE clauses.
FROM source_hive_table INSERT INTO TABLE my_hbase_table
SELECT source_hive_table.* WHERE ...;
Be advised that there is a regression in Hive 0.12.0 which breaks this feature, see HIVE-5515.

In practice

There's still a little finesse required to get everything wired up properly at runtime. The HBase interaction module is completely optional, so you have to make sure it and its HBase dependencies are available on Hive's classpath.
$ export HADOOP_CLASSPATH=...
$ hive -e "CREATE TABLE ... STORED BY 'org.apache...HBaseStorageHandler'"
The installation environment could do a better job of handling this for users, but for the time being you must manage it yourself. Ideally, the hive bin script would detect the presence of HBase and automatically make the necessary CLASSPATH adjustments; this enhancement appears to be tracked in HIVE-2055. The last mile is provided by the distribution itself, ensuring the environment variables are set for hive; this functionality is provided by BIGTOP-955.
You also need to make sure the necessary jars are shipped out to the MapReduce jobs when you execute your Hive statements. Hive provides a mechanism for shipping additional job dependencies via the auxjars feature.
$ export HIVE_AUX_JARS_PATH=...
$ hive -e "SELECT * FROM ..."
I did discover a small bug in HDP-1.3 builds which masks user-specified values of HIVE_AUX_JARS_PATH. With administrative rights, this is easily fixed by correcting the line in hive-env.sh to respect an existing value. The work-around in user scripts is to use the SET statement to provide a value once you’ve launched the Hive CLI.
SET hive.aux.jars.path = ...
Hive should be able to detect which jars are necessary and add them itself. HBase provides the TableMapReduceUtil#addDependencyJars methods for this purpose. It appears that this is done in Hive 0.12.0, at least according to HIVE-2379.

Future work

Much has been said about proper support for predicate pushdown (HIVE-1643, HIVE-2854, HIVE-3617, HIVE-3684) and data type awareness (HIVE-1245, HIVE-2599). These go hand-in-hand as predicate semantics are defined in terms of the types upon which they operate. More could be done to map Hive's complex data types like Maps and Structs onto HBase column families as well (HIVE-3211). Support for HBase timestamps is a bit of a mess; they're not made available to Hive applications with any level of granularity (HIVE-2828, HIVE-2306). The only interaction a user has is via a storage handler setting for writing a custom timestamp with all operations.

From a performance perspective, there are things Hive can do today (ie, not dependent on data types) to take advantage of HBase. There's also the possibility of an HBase-aware Hive to make use of HBase tables as an intermediate storage location (HIVE-3565), facilitating map-side joins against dimension tables loaded into HBase. Hive could make use of HBase's natural indexed structure (HIVE-3634, HIVE-3727), potentially saving huge scans.

Currently, the user doesn't have (any?) control over the scans which are executed. Configuration on a per-job, or at least per-table, basis should be enabled (HIVE-1233). That would enable an HBase-savvy user to provide Hive with hints regarding how it should interact with HBase. Support for simple split sampling of HBase tables (HIVE-3399) could also be easily done because HBase manages table partitions already.

Other access channels

Everything discussed thus far has required Hive to interact with online HBase RegionServers. Applications may stand to gain significant throughput and enjoy greater flexibility by interacting directly with HBase data persisted to HDFS. This also has the benefit of preventing Hive workloads from interfering with online SLA-bound HBase applications (at least, until we see HBase improvements in QOS isolation between tasks, HBASE-4441).

As mentioned earlier, there is the HiveHFileOutputFormat. Resolving HIVE-4627 should make Hive a straightforward way to generate HFiles for bulk loading. Once you've created the HFiles using Hive, there's still the last step of running the LoadIncrementalHFiles utility to copy and register them in the regions. For this, the HiveStorageHandler interface will need some kind of hook to influence the query plan as it's created, allowing it to append steps. Once in place, it should be possible to SET a runtime flag, switching an INSERT operation to use bulkload.
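For reference, the completion step looks something like the following today, run by hand once the HFiles have landed on HDFS (the path and table name here are placeholders); this is the manual step a storage handler hook could eventually fold into the query plan.
$ hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/foo_hfiles bar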

HBase recently introduced the table snapshot feature. This allows a user to create a point-in-time view of a table, persisted to HDFS. HBase is able to restore a table from a snapshot to a previous state, and to create an entirely new table from an existing snapshot. Hive does not currently support reading from an HBase snapshot. For that matter, HBase doesn't yet support MapReduce jobs over snapshots, though the feature is a work in progress (HBASE-8369).

Conclusions

The interface between HBase and Hive is young, but has nice potential. There's a lot of low-hanging fruit that can be picked up to make things easier and faster. The most glaring issue barring real application development is the impedance mismatch between Hive's typed, dense schema and HBase's untyped, sparse schema. This is as much a cognitive problem as a technical issue. Solutions here would allow a number of improvements to fall out, including much in the way of performance improvements. I'm hopeful that continuing work to add data types to HBase (HBASE-8089) can help bridge this gap.

Basic operations mostly work, at least in a rudimentary way. You can read data out of and write data back into HBase using Hive. Configuring the environment is an opaque and manual process, one which likely stymies novices from adopting the tools. There’s also the question of bulk operations – support for writing HFiles and reading HBase snapshots using Hive is entirely lacking at this point. And of course, there are bugs sprinkled throughout. The biggest recent improvement is the deprecation of HCatalog’s interface, removing the necessary upfront decision regarding which interface to use.

Hive provides a very usable SQL interface on top of HBase, one which integrates easily into many existing ETL workflows. That interface requires simplifying some of the BigTable semantics HBase provides, but the result will be to open up HBase to a much broader audience of users. The Hive interop complements the experience provided by Phoenix extremely well. Hive has the benefit of not requiring the deployment complexities currently required by that system. Hopefully the common definition of types will allow a complementary future.


Grab some data and register it in Hive

We'll need some data to work with. For this purpose, grab some traffic stats from Wikipedia. Once we have some data, copy it up to HDFS.
$ mkdir pagecounts ; cd pagecounts
$ for x in {0..9} ; do wget "http://dumps.wikimedia.org/other/pagecounts-raw/2008/2008-10/pagecounts-20081001-0${x}0000.gz" ; done
$ hadoop fs -copyFromLocal $(pwd) ./
For reference, this is what the data looks like.
$ zcat pagecounts-20081001-000000.gz | head -n5
aa.b Special:Statistics 1 837
aa Main_Page 4 41431
aa Special:ListUsers 1 5555
aa Special:Listusers 1 1052
aa Special:PrefixIndex/Comparison_of_Guaze%27s_Law_and_Coulomb%27s_Law 1 4332
As I understand it, each record is a count of page views of a specific page on Wikipedia. The first column is the language code, second is the page name, third is the number of page views, and fourth is the size of the page in bytes. Each file contains an hour’s worth of aggregated data. None of the above pages were particularly popular that hour.
Now that we have data and understand its raw schema, create a Hive table over it. To do that, we’ll use a DDL script that looks like this.
$ cat 00_pagecounts.ddl
-- define an external table over raw pagecounts data
CREATE TABLE IF NOT EXISTS pagecounts (projectcode STRING, pagename STRING, pageviews STRING, bytes STRING)
ROW FORMAT
  DELIMITED FIELDS TERMINATED BY ' '
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/ndimiduk/pagecounts';
Run the script to register our dataset with Hive.
$ hive -f 00_pagecounts.ddl
OK
Time taken: 2.268 seconds
Verify that the schema mapping works by calculating a simple statistic over the dataset.
$ hive -e "SELECT count(*) FROM pagecounts;"
Total MapReduce jobs = 1
Launching Job 1 out of 1
...
OK
36668549
Time taken: 25.31 seconds, Fetched: 1 row(s)
Hive says the 10 files we downloaded contain just over 36.5 million records. Let's just confirm things are working as expected by getting a second opinion. This isn't that much data, so confirm on the command line.
$ zcat * | wc -l
36668549
The record counts match up – excellent.

Transform the schema for HBase

The next step is to transform the raw data into a schema that makes sense for HBase. In our case, we’ll create a schema that allows us to calculate aggregate summaries of pages according to their titles. To do this, we want all the data for a single page grouped together. We’ll manage that by creating a Hive view that represents our target HBase schema. Here’s the DDL.
$ cat 01_pgc.ddl
-- create a view, building a custom hbase rowkey
CREATE VIEW IF NOT EXISTS pgc (rowkey, pageviews, bytes) AS
SELECT concat_ws('/',
         projectcode,
         concat_ws('/',
           pagename,
           regexp_extract(INPUT__FILE__NAME, 'pagecounts-(\\d{8}-\\d{6})\\..*$', 1))),
       pageviews, bytes
FROM pagecounts;
The SELECT statement uses Hive to build a compound rowkey for HBase. It concatenates the project code, page name, and date, joined by the '/' character. A handy trick: it uses a simple regex to extract the date from the source file names. Run it now.
$ hive -f 01_pgc.ddl
OK
Time taken: 2.712 seconds
This is just a view, so the SELECT statement won't be evaluated until we query it for data. Registering it with Hive doesn't actually process any data. Again, make sure it works by querying Hive for a subset of the data.
$ hive -e "SELECT * FROM pgc WHERE rowkey LIKE 'en/q%' LIMIT 10;"
Total MapReduce jobs = 1
Launching Job 1 out of 1
...
OK
en/q:Special:Search/Blues/20081001-090000       1       1168
en/q:Special:Search/rock/20081001-090000        1       985
en/qadam_rasul/20081001-090000  1       1108
en/qarqay/20081001-090000       1       933
en/qemu/20081001-090000 1       1144
en/qian_lin/20081001-090000     1       918
en/qiang_(spear)/20081001-090000        1       973
en/qin_dynasty/20081001-090000  1       1120
en/qinghe_special_steel_corporation_disaster/20081001-090000    1       963
en/qmail/20081001-090000        1       1146
Time taken: 40.382 seconds, Fetched: 10 row(s)

Register the HBase table

Now that we have a dataset in Hive, it’s time to introduce HBase. The first step is to register our HBase table in Hive so that we can interact with it using Hive queries. That means another DDL statement. Here’s what it looks like.
$ cat 02_pagecounts_hbase.ddl
-- create a table in hbase to host the view
CREATE TABLE IF NOT EXISTS pagecounts_hbase (rowkey STRING, pageviews STRING, bytes STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:c1,f:c2')
TBLPROPERTIES ('hbase.table.name' = 'pagecounts');
This statement will tell Hive to go create an HBase table named pagecounts with the single column family f. It registers that HBase table in the Hive metastore by the name pagecounts_hbase with 3 columns: rowkey, pageviews, and bytes. The SerDe property hbase.columns.mapping makes the association from Hive column to HBase column. It says the Hive column rowkey is mapped to the HBase table’s rowkey, the Hive column pageviews to the HBase column f:c1, and bytes to the HBase column f:c2. To keep the example simple, we have Hive treat all these columns as the STRING type.

In order to use the HBase library, we need to make the HBase jars and configuration available to the local Hive process (at least until HIVE-5518 is resolved). Do that by specifying a value for the HADOOP_CLASSPATH environment variable before executing the statement.
$ export HADOOP_CLASSPATH=/etc/hbase/conf:/usr/lib/hbase/hbase-0.94.6.1.3.2.0-111-security.jar:/usr/lib/zookeeper/zookeeper.jar
$ hive -f 02_pagecounts_hbase.ddl
OK
Time taken: 4.399 seconds

Populate the HBase table

Now it’s time to write data to HBase. This is done using a regular Hive INSERT statement, sourcing data from the view with SELECT. There’s one more bit of administration we need to take care of though. This INSERT statement will run a mapreduce job that writes data to HBase. That means we need to tell Hive to ship the HBase jars and dependencies with the job.
Note that this is a separate step from the classpath modification we did previously. Normally you can do this with an export statement from the shell, the same way we specified the HADOOP_CLASSPATH. However there’s a bug in HDP-1.3 that requires me to use Hive’s SET statement in the script instead.
$ cat 03_populate_hbase.hql
-- ensure hbase dependency jars are shipped with the MR job
-- Should export HIVE_AUX_JARS_PATH but this is broken in HDP-1.3.x
SET hive.aux.jars.path = file:///etc/hbase/conf/hbase-site.xml,file:///usr/lib/hive/lib/hive-hbase-handler-0.11.0.1.3.2.0-111.jar,file:///usr/lib/hbase/hbase-0.94.6.1.3.2.0-111-security.jar,file:///usr/lib/zookeeper/zookeeper-3.4.5.1.3.2.0-111.jar;

-- populate our hbase table
FROM pgc INSERT INTO TABLE pagecounts_hbase SELECT pgc.* WHERE rowkey LIKE 'en/q%' LIMIT 10;
Note there’s a big ugly bug in Hive 0.12.0 which means this doesn’t work with that version. Never fear though, we have a patch in progress. Follow along at HIVE-5515.
If you choose to use a different method for setting Hive’s auxpath, be advised that it’s a tricky process – depending on how you specify it (HIVE_AUX_JARS_PATH, --auxpath), Hive will interpret the argument differently. HIVE-2349 seeks to remedy this unfortunate state of affairs.
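For completeness, the command-line equivalent of the SET statement above would look roughly like the following, using the same jars; whether the values need a file:// prefix depends on how each mechanism interprets its argument, which is exactly the inconsistency HIVE-2349 describes. In this walkthrough the SET statement inside the script handles it, so we can run the script as-is.
$ hive --auxpath /usr/lib/hive/lib/hive-hbase-handler-0.11.0.1.3.2.0-111.jar,/usr/lib/hbase/hbase-0.94.6.1.3.2.0-111-security.jar,/usr/lib/zookeeper/zookeeper-3.4.5.1.3.2.0-111.jar -f 03_populate_hbase.hql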
$ hive -f 03_populate_hbase.hql
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
...
OK
Time taken: 40.296 seconds
Be advised also that this step is currently broken on secured HBase deployments. Follow along at HIVE-5523 if that's of interest to you.

Query data from HBase-land

40 seconds later, you now have data in HBase. Let’s have a look using the HBase shell.
$ echo "scan 'pagecounts'" | hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.94.6.1.3.2.0-111, r410a7a1c151ca953553eae68aa84e2a9f0d6e4ca, Mon Aug 19 19:00:12 PDT 2013

scan 'pagecounts'
ROW                                                 COLUMN+CELL
 en/q:Pan%27s_Labyrinth/20081001-080000             column=f:c1, timestamp=1381534232485, value=1
 en/q:Pan%27s_Labyrinth/20081001-080000             column=f:c2, timestamp=1381534232485, value=1153
 en/q:Special:Search/Jazz/20081001-080000           column=f:c1, timestamp=1381534232485, value=1
 en/q:Special:Search/Jazz/20081001-080000           column=f:c2, timestamp=1381534232485, value=980
 en/q:Special:Search/peinture/20081001-080000       column=f:c1, timestamp=1381534232485, value=1
 en/q:Special:Search/peinture/20081001-080000       column=f:c2, timestamp=1381534232485, value=989
 en/q:Special:Search/rock/20081001-080000           column=f:c1, timestamp=1381534232485, value=1
 en/q:Special:Search/rock/20081001-080000           column=f:c2, timestamp=1381534232485, value=980
 en/qadi/20081001-080000                            column=f:c1, timestamp=1381534232485, value=1
 en/qadi/20081001-080000                            column=f:c2, timestamp=1381534232485, value=1112
 en/qalawun%20complex/20081001-080000               column=f:c1, timestamp=1381534232485, value=1
 en/qalawun%20complex/20081001-080000               column=f:c2, timestamp=1381534232485, value=942
 en/qalawun/20081001-080000                         column=f:c1, timestamp=1381534232485, value=1
 en/qalawun/20081001-080000                         column=f:c2, timestamp=1381534232485, value=929
 en/qari'/20081001-080000                           column=f:c1, timestamp=1381534232485, value=1
 en/qari'/20081001-080000                           column=f:c2, timestamp=1381534232485, value=929
 en/qasvin/20081001-080000                          column=f:c1, timestamp=1381534232485, value=1
 en/qasvin/20081001-080000                          column=f:c2, timestamp=1381534232485, value=921
 en/qemu/20081001-080000                            column=f:c1, timestamp=1381534232485, value=1
 en/qemu/20081001-080000                            column=f:c2, timestamp=1381534232485, value=1157
10 row(s) in 0.4960 seconds
Here we have 10 rows with two columns each containing the data loaded using Hive. It’s now accessible in your online world using HBase. For example, perhaps you receive an updated data file and have a corrected value for one of the stats. You can update the record in HBase with a regular PUT command.
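For instance, a correction could be applied from the HBase shell with something like the following; the new value here is made up purely for illustration.
$ echo "put 'pagecounts', 'en/qemu/20081001-080000', 'f:c1', '2'" | hbase shell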

Verify data from Hive

The HBase table remains available to your Hive world; Hive's HBaseStorageHandler works both ways, after all.
Note that this command expects that HADOOP_CLASSPATH is still set, and HIVE_AUX_JARS_PATH as well if your query is complex enough to require a MapReduce job.
$ hive -e "SELECT * from pagecounts_hbase;"
OK
en/q:Pan%27s_Labyrinth/20081001-080000  1       1153
en/q:Special:Search/Jazz/20081001-080000        1       980
en/q:Special:Search/peinture/20081001-080000    1       989
en/q:Special:Search/rock/20081001-080000        1       980
en/qadi/20081001-080000 1       1112
en/qalawun%20complex/20081001-080000    1       942
en/qalawun/20081001-080000      1       929
en/qari'/20081001-080000        1       929
en/qasvin/20081001-080000       1       921
en/qemu/20081001-080000 1       1157
Time taken: 2.554 seconds, Fetched: 10 row(s)

Continue using Hive for analysis

Since the HBase table is accessible from Hive, you can continue to use Hive for your ETL processing with MapReduce. Keep in mind that the auxpath considerations apply here too, so I've scripted out the query instead of just running it directly at the command line.
$ cat 04_query_hbase.hql
-- ensure hbase dependency jars are shipped with the MR job
-- Should export HIVE_AUX_JARS_PATH but this is broken in HDP-1.3.x
SET hive.aux.jars.path = file:///etc/hbase/conf/hbase-site.xml,file:///usr/lib/hive/lib/hive-hbase-handler-0.11.0.1.3.2.0-111.jar,file:///usr/lib/hbase/hbase-0.94.6.1.3.2.0-111-security.jar,file:///usr/lib/zookeeper/zookeeper-3.4.5.1.3.2.0-111.jar;

-- query hive data
SELECT count(*) from pagecounts_hbase;
Run it the same way we did the others.
$ hive -f 04_query_hbase.hql
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
...
OK
10
Time taken: 19.473 seconds, Fetched: 1 row(s)
There you have it: a hands-on, end-to-end demonstration of interacting with HBase from Hive. You can learn more about the nitty-gritty details in Enis's deck on the topic, or see the presentation he and Ashutosh gave at HBaseCon. If you're inclined to make the intersection of these technologies work better (faster, stronger), I encourage you to pick up any of the JIRA issues mentioned in this post or the previous one.

Happy hacking!

Why are Facebook, Digg, and Twitter so hard to scale?

Real-time social graphs (connectivity between people, places, and things). That's why scaling Facebook is hard, says Jeff Rothschild, Vice President of Technology at Facebook. Social networking sites like Facebook, Digg, and Twitter are simply harder than traditional websites to scale. Why is that? Why would social networking sites be any more difficult to scale than traditional web sites? Let's find out.

Traditional websites are easier to scale than social networking sites for two reasons:
  • They usually access only their own data and common cached data.
  • Only 1-2% of users are active on the site at one time.
Imagine a huge site like Yahoo. When you come to Yahoo they can get your profile record with one get and that's enough to build your view of the website for you. It's relatively straightforward to scale systems based around single records using distributed hashing schemes. And since only a few percent of the people are on the site at once it takes comparatively little RAM cache to handle all the active users.

Now think what happens on Facebook. Let's say you have 200 friends. When you hit your Facebook account it has to go gather the status of all 200 of your friends at the same time so you can see what's new for them. That means 200 requests need to go out simultaneously, the replies need to be merged together, other services need to be contacted to get more details, and all this needs to be munged together and sent through PHP and a web server so you see your Facebook page in a reasonable amount of time. Oh my.
There are several implications here, especially given that on social networking sites a high percentage of users are on the system at one time (that's the social part, people hang around):
  • All data is active all the time. 
  • It's hard to partition this sort of system because everyone is connected. 
  • Everything must be kept in RAM cache so that the data can be accessed as fast as possible.
Partitioning means you would like to find some way to cluster commonly accessed data together so it can be accessed more efficiently. Facebook, because of the interconnectedness of the data, didn't find any clustering scheme that worked in practice. So instead of partitioning and denormalizing data Facebook keeps data normalized and randomly distributes data amongst thousands of databases.

This approach requires a very fast cache. Facebook uses memcached as their caching layer. All data is kept in cache and they've made a lot of modifications to memcached to speed it up and to help it handle more requests (all contributed back to the community).

Their caching tier services 120 million queries every second and it's the core of the site. The problem is memcached is hard to use because it requires programmer cooperation. It's also easy to corrupt. They've developed a complicated system to keep data in the caching tier consistent with the database, even across multiple distributed data centers. Remember, they are caching user data here, not HTML pages or page fragments. Given how much their data changes, it would be hard to make page caching work.

We see similar problems at Digg. Digg, for example, must deal with the problem of sending out updates to 40,000 followers every time Kevin Rose diggs a link. Digg, and I think Twitter too, have taken a different approach than Facebook.

Facebook takes a Pull on Demand approach. To recreate a page or a display fragment they run the complete query. To find out if one of your friends has added a new favorite band, Facebook actually queries all your friends to find what's new. They can get away with this only because of their awesome infrastructure.

But if you've ever wondered why Facebook has a 5,000 user limit on the number of friends, this is why. At a certain point it's hard to make Pull on Demand scale.

Another approach to find out what's new is the Push on Change model. In this model when a user makes a change it is pushed out to all the relevant users and the changes (in some form) are stored with each user. So when a user wants to view their updates all they need to access is their own account data. There's no need to poll all their friends for changes.

With security and permissions it can be surprisingly complicated to figure out who should see an update. And if a user has 2 million followers it can be surprisingly slow as well. There's also an issue of duplication. A lot of duplicate data (or references) is being stored, so this is a denormalized approach which can make for some consistency problems. Should permission be consulted when data is produced or consumed, for example? Or what if the data is deleted after it has already been copied around?

While all these consistency and duplication problems are interesting, Push on Change seems the more scalable approach for really large numbers of followers. It does take a lot of work to push all the changes around, but that can be handled by a job queuing system so the work is distributed across a cluster.

The challenges will only grow as we get more and more people, more and deeper inter-connectivity, faster and faster change, and a greater desire to consume it all in real-time. We are a long way from being able to handle this brave new world.

What the heck are you actually using NoSQL for?

It's a truism that we should choose the right tool for the job. Everyone says that. And who can disagree? The problem is this is not helpful advice without being able to answer more specific questions like: What jobs are the tools good at? Will they work on jobs like mine? Is it worth the risk to try something new when all my people know something else and we have a deadline to meet? How can I make all the tools work together?

In the NoSQL space this kind of real-world data is still a bit vague. When asked, vendors tend to give very general answers like NoSQL is good for BigData or key-value access. What does that mean for the developer in the trenches, faced with the task of solving a specific problem, a dozen confusing choices, and no obvious winner? Not a lot. It's often hard to take that next step and imagine how their specific problems could be solved in a way that's worth taking the trouble and risk.

Let's change that. What problems are you using NoSQL to solve? Which product are you using? How is it helping you? Yes, this is part of the research for my webinar on December 14th, but I'm a huge believer that people learn best by example, so if we can come up with real specific examples I think that will really help people visualize how they can make the best use of all these new product choices in their own systems.

Here's a list of use cases I came up with after some trolling of the interwebs. The sources are so varied I can't attribute every one; I'll put a list at the end of the post. Please feel free to add your own. I separated the use cases out for a few specific products simply because I had a lot of use cases for them and they were clearer on their own. This is not meant as an endorsement of any sort. Here's a master list of all the NoSQL products. If you would like to provide a specific set of use cases for a product I'd be more than happy to add that in.

General Use Cases

These are the general kinds of reasons people throw around for using NoSQL. Probably nothing all that surprising here.
  • Bigness. NoSQL is seen as a key part of a new data stack supporting: big data, big numbers of users, big numbers of computers, big supply chains, big science, and so on. When something becomes so massive that it must become massively distributed, NoSQL is there, though not all NoSQL systems are targeting big. Bigness can be across many different dimensions, not just using a lot of disk space. 
  • Massive write performance. This is probably the canonical usage based on Google's influence. High volume. Facebook needs to store 135 billion messages a month. Twitter, for example, has the problem of storing 7 TB of data per day with the prospect of this requirement doubling multiple times per year. This is the "data is too big to fit on one node" problem. At 80 MB/s it takes a day to store 7 TB, so writes need to be distributed over a cluster, which implies key-value access, MapReduce, replication, fault tolerance, consistency issues, and all the rest. For faster writes, in-memory systems can be used.
  • Fast key-value access. This is probably the second most cited virtue of NoSQL in the general mindset. When latency is important it's hard to beat hashing on a key and reading the value directly from memory or in as little as one disk seek. Not every NoSQL product is about fast access; some are more about reliability, for example. But what people have wanted for a long time was a better memcached, and many NoSQL systems offer that.
  • Flexible schema and flexible datatypes. NoSQL products support a whole range of new data types, and this is a major area of innovation in NoSQL. We have: column-oriented, graph, advanced data structures, document-oriented, and key-value. Complex objects can be easily stored without a lot of mapping. Developers love avoiding complex schemas and ORM frameworks. Lack of structure allows for much more flexibility. We also have program- and programmer-friendly datatypes like JSON.
  • Schema migration. Schemalessness makes it easier to deal with schema migrations without so much worrying. Schemas are in a sense dynamic, because they are imposed by the application at run-time, so different parts of an application can have a different view of the schema.
  • Write availability. Do your writes need to succeed no matter what? Then we can get into partitioning, CAP, eventual consistency and all that jazz.
  • Easier maintainability, administration and operations. This is very product specific, but many NoSQL vendors are trying to gain adoption by making it easy for developers to adopt them. They are spending a lot of effort on ease of use, minimal administration, and automated operations. This can lead to lower operations costs as special code doesn't have to be written to scale a system that was never intended to be used that way.
  • No single point of failure. Not every product is delivering on this, but we are seeing a definite convergence on relatively easy to configure and manage high availability with automatic load balancing and cluster sizing. A perfect cloud partner.
  • Generally available parallel computing. We are seeing MapReduce baked into products, which makes parallel computing something that will be a normal part of development in the future.
  • Programmer ease of use. Accessing your data should be easy. While the relational model is intuitive for end users, like accountants, it's not very intuitive for developers. Programmers grok keys, values, JSON, Javascript stored procedures, HTTP, and so on. NoSQL is for programmers. This is a developer led coup. The response to a database problem can't always be to hire a really knowledgeable DBA, get your schema right, denormalize a little, etc., programmers would prefer a system that they can make work for themselves. It shouldn't be so hard to make a product perform. Money is part of the issue. If it costs a lot to scale a product then won't you go with the cheaper product, that you control, that's easier to use, and that's easier to scale?
  • Use the right data model for the right problem. Different data models are used to solve different problems. Much effort has been put into, for example, wedging graph operations into a relational model, but it doesn't work. Isn't it better to solve a graph problem in a graph database? We are now seeing a general strategy of trying to find the best fit between a problem and solution.
  • Avoid hitting the wall. Many projects hit some type of wall. They've exhausted all options to make their system scale or perform properly and are wondering: what next? It's comforting to select a product and an approach that can jump over the wall by linearly scaling using incrementally added resources. At one time this wasn't possible. It took custom built everything, but that's changed. We are now seeing usable out-of-the-box products that a project can readily adopt.
  • Distributed systems support. Not everyone is worried about scale or performance over and above that which can be achieved by non-NoSQL systems. What they need is a distributed system that can span datacenters while handling failure scenarios without a hiccup. NoSQL systems, because they have focused on scale, tend to exploit partitions, tend not to use heavy strict consistency protocols, and so are well positioned to operate in distributed scenarios.
  • Tunable CAP tradeoffs. NoSQL systems are generally the only products with a "slider" for choosing where they want to land on the CAP spectrum. Relational databases pick strong consistency which means they can't tolerate a partition failure. In the end this is a business decision and should be decided on a case by case basis. Does your app even care about consistency? Are a few drops OK? Does your app need strong or weak consistency? Is availability more important or is consistency? Will being down be more costly than being wrong? It's nice to have products that give you a choice.

More Specific Use Cases

  • Managing large streams of non-transactional data: Apache logs, application logs, MySQL logs, clickstreams, etc.
  • Syncing online and offline data. This is a niche CouchDB has targeted
  • Fast response times under all loads.
  • Avoiding heavy joins when the query load for complex joins becomes too large for an RDBMS.
  • Soft real-time systems where low latency is critical. Games are one example.
  • Applications where a wide variety of different write, read, query, and consistency patterns need to be supported. There are systems optimized for 50% reads and 50% writes, 95% writes, or 95% reads. Read-only applications needing extreme speed and resiliency, with simple queries, that can tolerate slightly stale data. Applications requiring moderate performance, read/write access, simple queries, and completely authoritative data. Read-only applications with complex query requirements.
  • Load balance to accommodate data and usage concentrations and to help keep microprocessors busy.
  • Real-time inserts, updates, and queries.
  • Hierarchical data like threaded discussions and parts explosion.
  • Dynamic table creation.
  • Two tier applications where low latency data is made available through a fast NoSQL interface, but the data itself can be calculated and updated by high latency Hadoop apps or other low priority apps.
  • Sequential data reading. The right underlying data storage model needs to be selected. A B-tree may not be the best model for sequential reads.
  • Slicing off part of a service that may need better performance/scalability onto its own system. For example, user logins may need to be high performance and this feature could use a dedicated service to meet those goals.
  • Caching. A high-performance caching tier for web sites and other applications. An example is a cache for the Data Aggregation System used by the Large Hadron Collider.
  • Voting.
  • Real-time page view counters.
  • User registration, profile, and session data.
  • Document, catalog, and content management systems. These are facilitated by the ability to store complex documents as a whole rather than organized as relational tables. Similar logic applies to inventory, shopping carts, and other structured data types.
  • Archiving. Storing a large continual stream of data that is still accessible on-line. Document-oriented databases with a flexible schema that can handle schema changes over time.
  • Analytics. Use MapReduce, Hive, or Pig to perform analytical queries and scale-out systems that support high write loads.
  • Working with heterogeneous types of data, for example, different media types at a generic level.
  • Embedded systems. They don't want the overhead of SQL and servers, so they use something simpler for storage.
  • A "market" game, where you own buildings in a town. You want the building list of someone to pop up quickly, so you partition on the owner column of the building table, so that the select is single-partitioned. But when someone buys the building of someone else you update the owner column along with price.
  • JPL is using SimpleDB to store rover plan attributes. References are kept to a full plan blob in S3. 
  • Federal law enforcement agencies tracking Americans in real-time using credit cards, loyalty cards and travel reservations.
  • Fraud detection by comparing transactions to known patterns in real-time.
  • Helping diagnose the typology of tumors by integrating the history of every patient.
  • In-memory database for high update situations, like a web site that displays everyone's "last active" time (for chat maybe). If users are performing some activity once every 30 sec, then you will pretty much be at your limit with about 5000 simultaneous users.
  • Handling lower-frequency multi-partition queries using materialized views while continuing to process high-frequency streaming data.
  • Priority queues.
  • Running calculations on cached data, using a program friendly interface, without having to go through an ORM.
  • Uniquing (de-duplicating) a large dataset using simple key-value columns.
  • To keep querying fast, values can be rolled-up into different time slices.
  • Computing the intersection of two massive sets, where a join would be too slow.
  • A timeline à la Twitter.

Redis Use Cases

Redis is unique in the repertoire as it is a data structure server, with many fascinating use cases that people are excited to share.
  • Calculating whose friends are online using sets. 
  • Memcached on steroids.
  • Distributed lock manager for process coordination.
  • Full text inverted index lookups.
  • Tag clouds.
  • Leaderboards. Sorted sets for maintaining high score tables.
  • Circular log buffers.
  • Database for university course availability information. If the set contains the course ID it has an open seat. Data is scraped and processed continuously and there are ~7200 courses.
  • Server-side sessions. A session (a random cookie value associated with a larger chunk of serialized data on the server) is a very poor fit for relational databases. Sessions are often created for every visitor, even those who stumble in from Google and then leave, never to return again. They then hang around for weeks taking up valuable database space, and they are never queried by anything other than their primary key.
  • Fast, atomically incremented counters are a great fit for offering real-time statistics.
  • Polling the database every few seconds. Cheap in a key-value store. If you're sharding your data you'll need a central lookup service for quickly determining which shard is being used for a specific user's data. A replicated Redis cluster is a great solution here - GitHub uses exactly that to manage sharding their many repositories between different backend file servers.
  • Transient data. Any transient data used by your application is also a good fit for Redis. CSRF tokens (to prove a POST submission came from a form you served up, and not a form on a malicious third-party site) need to be stored for a short while, as does handshake data for various security protocols.
  • Incredibly easy to set up and ridiculously fast (30,000 reads or writes a second on a laptop with the default configuration).
  • Share state between processes. Run a long-running batch job in one Python interpreter (say loading a few million lines of CSV into a Redis key/value lookup table) and run another interpreter to play with the data that's already been collected, even as the first process is streaming data in. You can quit and restart your interpreters without losing any data.
  • Create heat maps of the BNP’s membership list for the Guardian
  • Redis semantics map closely to Python native data types, so you don't have to think for more than a few seconds about how to represent data.
  • A simple capped log implementation (similar to a MongoDB capped collection): push items onto the tail of a 'log' key and use LTRIM to retain only the last X items. You could use this to keep track of what a system is doing right now without having to worry about storing ever-increasing amounts of logging information.
  • An interesting example of an application built on Redis is Hurl, a tool for debugging HTTP requests built in 48 hours by Leah Culver and Chris Wanstrath. 
  • It's common to use MySQL as the backend for storing and retrieving what are essentially key/value pairs. I've seen this over and over when someone needs to maintain a bit of state, session data, counters, small lists, and so on. When MySQL isn't able to keep up with the volume, we often turn to memcached as a write-through cache. But there's a bit of a mismatch at work here.
  • With sets, we can also keep track of ALL of the IDs that have been used for records in the system.
  • Quickly pick a random item from a set. 
  • API limiting. This is a great fit for Redis as a rate limiting check needs to be made for every single API hit, which involves both reading and writing short-lived data.  
  • A/B testing is another perfect task for Redis - it involves tracking user behaviour in real-time, making writes for every navigation action a user takes, storing short-lived persistent state and picking random items.
  • Implementing the inbox method with Redis is simple: each user gets a queue (a capped queue if you're worried about memory running out) to work as their inbox and a set to keep track of the other users who are following them. Ashton Kutcher has over 5,000,000 followers on Twitter - at 100,000 writes a second it would take less than a minute to fan a message out to all of those inboxes.
  • Publish/subscribe is perfect for broadcasting updates (such as election results) to hundreds of thousands of simultaneously connected users. Blocking queue primitives mean message queues without polling.
  • Have workers periodically report their load average into a sorted set.
  • Redistribute load. When you want to issue a job, grab the three least loaded workers from the sorted set and pick one of them at random (to avoid the thundering herd problem).
  • Multiple GIS indexes. 
  • Recommendation engine based on relationships.
  • Web-of-things data flows.
  • Social graph representation. 
  • Dynamic schemas so schemas don't have to be designed up-front. Building the data model in code, on the fly by adding properties and relationships, dramatically simplifies code. 
  • Reducing the impedance mismatch because the data model in the database can more closely match the data model in the application.

VoltDB Use Cases

VoltDB, as a relational database, is not traditionally thought of as being in the NoSQL camp, but I feel that, based on its radical design perspective, it is so far away from Oracle-type systems that it is much more in the NoSQL tradition.
  • Application: Financial trade monitoring
    1. Data source: Real-time markets
    2. Partition key: Market symbol (ticker, CUSIP, SEDOL, etc.)
    3. High-frequency operations: Write and index all trades, store tick data (bid/ask)
    4. Lower-frequency operations: Find trade order detail based on any of 20+ criteria, show TraderX's positions across all market symbols
  •  Application: Web bot vulnerability scanning (SaaS application)
    1. Data source: Inbound HTTP requests
    2. Partition key: Customer URL
    3. High-frequency operations: Hit logging, analysis and alerting
    4. Lower-frequency operations: Vulnerability detection, customer reporting
  • Application: Online gaming leaderboard 
    1. Data source: Online game 
    2. Partition key: Game ID 
    3. High-frequency operations: Rank scores based on defined intervals and player personal best
    4. Lower-frequency transactions: Leaderboard lookups
  • Application: Package tracking (logistics)
    1. Data source: Sensor scan
    2. Partition key: Shipment ID
    3. High-frequency operations: Package location updates
    4. Lower-frequency operations: Package status report (including history), lost package tracking, shipment rerouting
  •  Application: Ad content serving
    1. Data source: Website or device, user or rule triggered
    2. Partition key: Vendor/ad ID composite
    3. High-frequency operations: Check vendor balance, serve ad (in target device format), update vendor balance
    4. Lower-frequency operations: Report live ad view and click-thru stats by device (vendor-initiated)
  •  Application: Telephone exchange call detail record (CDR) management
    1. Data source: Call initiation request
    2. Partition key: Caller ID
    3. High-frequency operations: Real-time authorization (based on plan and balance)
    4. Lower-frequency operations: Fraud analysis/detection
  • Application: Airline reservation/ticketing
    1. Data source: Customers (web) and airline (web and internal systems)
    2. Partition key: Customer (flight info is replicated)
    3. High-frequency operations: Seat selection (lease system), add/drop seats, baggage check-in
    4. Lower-frequency operations: Seat availability/flight, flight schedule changes, passenger re-bookings on flight cancellations

Analytics Use Cases

Kevin Weil at Twitter is great at providing Hadoop use cases. At Twitter this includes counting big data with standard counts, min, max, std dev; correlating big data with probabilities, covariance, influence; and research on Big data. Hadoop is on the fringe of NoSQL, but it's very useful to see what kind of problems are being solved with it.
  • How many requests do we serve each day?
  • What is the average latency? 95% latency?
  • Grouped by response code: what is the hourly distribution?
  • How many searches happen each day at Twitter?
  • Where do they come from?
  • How many unique queries?
  • How many unique users?
  • Geographic distribution?
  • How does usage differ for mobile users?
  • How does usage differ for 3rd party desktop client users?
  • Cohort analysis: all users who signed up on the same day—then see how they differ over time.
  • Site problems: what goes wrong at the same time?
  • Which features get users hooked?
  • Which features do successful users use often?
  • Search corrections and suggestions (not done now at Twitter, but coming in the future).
  • What can we tell about a user from their tweets?
  • What can we tell about you from the tweets of those you follow?
  • What can we tell about you from the tweets of your followers?
  • What can we tell about you from the ratio of your followers/following?
  • What graph structures lead to successful networks? (Twitter’s graph structure is interesting since it’s not two-way)
  • What features get a tweet retweeted?
  • When a tweet is retweeted, how deep is the corresponding retweet tree?
  • Long-term duplicate detection (short term for abuse and stopping spammers)
  • Machine learning. About not quite knowing the right questions to ask at first. How do we cluster users?
  • Language detection (contact mobile providers to get SMS deals for users—focusing on the most popular countries at first).
  • How can we detect bots and other non-human tweeters?

Poor Use Cases

  • OLTP. Outside VoltDB, complex multi-object transactions are generally not supported. Programmers are supposed to denormalize, use documents, or use other complex strategies like compensating transactions.
  • Data integrity. Most of the NoSQL systems rely on applications to enforce data integrity where SQL uses a declarative approach. Relational databases are still the winner for data integrity.
  • Data independence. Data outlasts applications. In NoSQL, applications drive everything about the data. One argument for the relational model is as a repository of facts that can last for the entire lifetime of the enterprise, far past the expected lifetime of any individual application.
  • SQL. If you require SQL then very few NoSQL systems will provide a SQL interface, but more systems are starting to provide SQLish interfaces.
  • Ad-hoc queries. If you need to answer real-time questions about your data that you can’t predict in advance, relational databases are generally still the winner. 
  • Complex relationships. Some NoSQL systems support relationships, but a relational database is still the winner at relating.
  • Maturity and stability. Relational databases still have the edge here. People are familiar with how they work, what they can do, and have confidence in their reliability. There are also more programmers and toolsets available for relational databases. So when in doubt, this is the road that will be traveled.
