Friday, 17 October 2014

NoSQL or RDBMS? – Are we asking the right questions?

Most articles on the topic of NoSQL revolve around the theme of RDBMS vs. NoSQL. DBAs defend the RDBMS by claiming that NoSQL solutions are all dumb, immature data stores without any standards. Many NoSQL proponents counter that an RDBMS does not scale and that today everybody needs to deal with huge amounts of data.
I think NoSQL is sold short here. Yes, Big Data plays a big role, but it is not the primary driver behind all NoSQL solutions. There are no standards because there is no single NoSQL solution, but rather different types of solutions that cater to different use cases. In fact, nearly all of them state that they are not a replacement for a traditional RDBMS! When we compare an RDBMS against them, we need to do so on a use-case basis. There are very good reasons for choosing an RDBMS as long as the amount of data is not prohibitive. There are, however, equally good reasons not to do so and to choose one of the following solution types:
  • Distributed Key Value Stores
  • Distributed Column Family Stores
  • (Distributed) Document Databases
  • Graph Databases
It has to be said, however, that there are very simple and specific reasons why traditional RDBMS solutions cannot scale beyond a handful of database nodes, and even that is painful. But before we look at why NoSQL solutions tend not to have that problem, let's examine why and when you should choose an RDBMS and when you shouldn't.

When and Why you (should) choose an RDBMS

While data durability is an important aspect of an RDBMS, it is not a differentiator compared to other solutions. So I will concentrate first and foremost on the unique features of an RDBMS that also have an impact on application design and performance:
  • Table based
  • Relations between distinct Table Entities and Rows (the R in RDBMS)
  • Referential Integrity
  • ACID Transactions
  • Arbitrary Queries and Joins
If you really need all or most of these features, then an RDBMS is certainly right for you, although the amount of data you have might force you in another direction. But do you really need them? Let's take a closer look.
The table-based nature of an RDBMS is not a real feature; it is just the way it stores data. While I can think of use cases that specifically benefit from this, most of them are simple in nature (think of Excel spreadsheets). That nature, however, requires a relational concept between rows and tables in order to make up complex entities.

Data model showing two different kinds of relationships
There are genuine relations between otherwise standalone entities (like one person being married to another) and relationships that really define hierarchical context or ownership of some sort (a room is always part of a house). The first is a real feature; the second is a result of the storage model. It can be argued that a document (e.g. an XML document) stores such a "relation" more naturally, because the house document contains the room instead of keeping the room as a separate entity.
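To make that distinction concrete, here is a minimal sketch of the two modeling styles (the class and field names are made up for illustration):

import java.util.Arrays;
import java.util.List;

// Relational style: two flat "rows" tied together by a foreign key.
class HouseRow { long id; String address; }
class RoomRow  { long id; long houseId; String name; } // houseId references HouseRow.id

// Document style: the house document owns its rooms, so the hierarchy lives in one place.
class Room     { String name; Room(String name) { this.name = name; } }
class HouseDoc {
    String address;
    List<Room> rooms;
    HouseDoc(String address, List<Room> rooms) { this.address = address; this.rooms = rooms; }
}

public class ModelSketch {
    public static void main(String[] args) {
        // One logical entity, stored and retrieved as a whole.
        HouseDoc house = new HouseDoc("12 Main St",
                Arrays.asList(new Room("kitchen"), new Room("bedroom")));
        System.out.println(house.address + " has " + house.rooms.size() + " rooms");
    }
}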
Referential integrity is really one of the cornerstones of an RDBMS; it ensures the logical consistency of my domain model. Not only does it ensure consistency within a certain logical entity (which might span multiple rows/tables), but, more importantly, cross-entity consistency. If you access the same data via different applications and need to enforce integrity in a central location, this is the way to go. We could check this in the application as well, but the database often acts as the final authority on consistency.
The final aspect of consistency comes in the form of ACID transactions. They ensure that either all of my changes are seen by others in their entirety, or none of my changes is committed at all. Consistency really is the hallmark of an RDBMS. However, we often set commit points for reasons other than consistency. How often did I use a bulk update simply for increased performance? In many cases I did not care about the visibility of those changes, I just wanted them done fast. In other cases we would deliberately commit more often in order to decrease locking and increase concurrency. The question is: do I care whether Peter shows up as married while Alice is still seen as unmarried? The government certainly does; Facebook, on the other hand, does not!
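To make the commit-point argument concrete, here is a minimal JDBC sketch of a bulk update that commits in batches purely for throughput, not because each batch is a logically consistent unit (the connection URL, table and column names are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BulkUpdateSketch {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/demo", "user", "secret")) {
            con.setAutoCommit(false); // we choose our own commit points
            try (PreparedStatement ps = con.prepareStatement(
                    "UPDATE persons SET status = ? WHERE id = ?")) {
                for (int id = 1; id <= 100_000; id++) {
                    ps.setString(1, "migrated");
                    ps.setInt(2, id);
                    ps.addBatch();
                    if (id % 1_000 == 0) {
                        ps.executeBatch();
                        con.commit(); // commit every 1,000 rows: a performance choice;
                                      // other clients now see a partially migrated table
                    }
                }
                ps.executeBatch();
                con.commit();
            }
        }
    }
}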
SELECT count(e.isbn) AS "number of books", p.name AS publisher
FROM editions AS e INNER JOIN
 publishers AS p ON (e.publisher_id = p.id)
GROUP BY p.name;
The final defining feature of an RDBMS is its ability to execute arbitrary queries: SQL SELECTs. Very often NoSQL is understood as not being able to execute queries at all. While that is not true, it is true that RDBMS solutions offer a far superior query language. Especially the ability to group and join data from unrelated entities into a new view on the data makes an RDBMS a powerful tool. If your business is defined by the underlying structured data and you need the ability to ask different questions all the time, then this is a key reason to use an RDBMS.
However, if you know in advance how you will access the data, or you would change your application anyway whenever you need to access it differently, then a lot of that power is overkill.

Why an RDBMS might not be right for you

These features come at the price of complexity in terms of data model, storage, data retrieval and administration. And, as we will see shortly, a built-in limit on horizontal scalability. If you do not need any or most of these features, you should not use an RDBMS.
  • If you just want to store your application entities in a persistent and consistent way, then an RDBMS is overkill. A key-value store might be perfect for you. Note that the value can be a complex entity in itself (see the sketch after this list)!
  • If you have hierarchical application objects and need some query capability into them, then any of the NoSQL solutions might be a fit. With an RDBMS you can use an ORM to achieve the same, but at the cost of adding complexity to hide complexity.
  • If you have ever tried to store large trees or networks, you will know that an RDBMS is not the best solution here. Depending on your other needs, a graph database might suit you.
  • You are running in the cloud and need to run a distributed database for durability and availability. This is what Dynamo- and BigTable-based data stores were built for. An RDBMS, on the other hand, does not do well here.
  • You might already use a data warehouse for your analytics. This is not too dissimilar from a Column Family database. If your data grows too large to be processed on a single machine, you might look into Hadoop or any other solution that supports distributed MapReduce.
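To illustrate the first point, here is a minimal sketch of the key-value access pattern, with a plain in-memory map standing in for the store and the value being a whole serialized entity (the JSON shape is made up):

import java.util.HashMap;
import java.util.Map;

public class KeyValueSketch {
    public static void main(String[] args) {
        // Stand-in for a distributed key-value store: one opaque value per key.
        Map<String, String> store = new HashMap<>();

        // The value is a complete application entity, serialized as a document.
        store.put("person:42",
                  "{\"name\":\"Peter\",\"married\":true,\"addresses\":[{\"city\":\"Boston\"}]}");

        // Reads and writes always address one entity as a whole; no joins, no cross-entity queries.
        String person = store.get("person:42");
        System.out.println(person);
    }
}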
There are many scenarios where a fully ACID-driven, relational, table-based database is simply not the best or simplest option to go with. Now that we have that out of the way, let's look at the big one: amount of data and scalability.

Why an RDBMS does not scale and many NoSQL solutions do

The real problem with an RDBMS is the horizontal distribution of load and data. The fact is that RDBMS solutions cannot easily achieve automatic data sharding. Data sharding requires distinct data entities that can be distributed and processed independently. An ACID-based relational database cannot do that due to its table-based nature. This is where NoSQL solutions differ greatly. They do not distribute a logical entity across multiple tables; it is always stored in one place. A logical entity can be anything from a simple value to a complex object or even a full JSON document. They do not enforce referential integrity between these logical entities. They only enforce consistency inside a single entity, and sometimes not even that.
NoSQL differs from an RDBMS in the way entities get distributed and in that no consistency is enforced across those entities
This is what allows them to automatically distribute data across a large number of database nodes and write to them independently. If I were to write 20 entities to a database cluster with 3 nodes, chances are I can spread the writes evenly across all of them. The database does not need to synchronize between the nodes for that to happen, and there is no need for a two-phase commit, with the visible effect that Client 1 might see changes on Node 1 before Client 2 has finished writing all 20 entities. A distributed RDBMS solution, on the other hand, needs to enforce ACID consistency across all three nodes. That means Client 1 will either not see any changes until all three nodes have acknowledged a two-phase commit, or will be blocked until that has happened. In addition to that synchronization, the RDBMS also needs to read data from other nodes in order to ensure referential integrity; all of that happens during the transaction and blocks Client 2. NoSQL solutions, for the most part, do no such thing.
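A rough sketch of the idea behind this kind of independent placement: the entity's key alone decides which node it lands on, so no node needs to coordinate with another to accept a write. Simple modulo hashing is shown here for illustration; real systems typically use consistent hashing and replication on top of it.

import java.util.Arrays;
import java.util.List;

public class PlacementSketch {
    static final List<String> NODES = Arrays.asList("node1", "node2", "node3");

    // The key alone determines the target node; no cross-node lookup or two-phase commit needed.
    static String nodeFor(String key) {
        int bucket = Math.floorMod(key.hashCode(), NODES.size());
        return NODES.get(bucket);
    }

    public static void main(String[] args) {
        for (int i = 1; i <= 20; i++) {
            String key = "entity:" + i;
            System.out.println(key + " -> " + nodeFor(key)); // 20 writes spread across 3 nodes
        }
    }
}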
The fact that such a solution can scale horizontally also means that it can leverage its distributed nature for high availability. This is very important in the cloud, where every single node might fail at any moment.
Another key factor is that these solutions do not allow joins and groupings across entities, as that would not be possible in a scalable way once your data ranges in the millions and is distributed across 10 nodes or more. I think this is something that a lot of us have trouble with: we have to start thinking about how we will access the data and store it accordingly, and not the other way around.
So it is true that NoSQL solutions lack some of the features that define an RDBMS solution. They do so for the sake of scalability. That does not mean, however, that they are dumb data stores; Document, Column Family and Graph databases are far from unstructured and simple.

What about Application Performance?

The fact that all these solutions scale in principle does not, however, mean that they do so in practice, or that your application will perform better because of it! Indeed, the overall performance depends to a very large degree on choosing the right implementation for your use case. Key-value stores are very simple, but you can still use them wrong. Column Family stores are very interesting and also very different from a table-based design. Because of this it is easy to end up with a bad data model design, and that will kill your performance.
Besides the obvious factors of disk I/O, network and caching (which you must of course take into consideration), both application performance and scalability depend heavily on the data itself; more specifically on the distribution across the database cluster. This is something that you need to monitor in live systems and take into consideration during the design phase as well. I will talk more about this and specific implementations in the coming months.
There is one other factor that will play a key role in the choice between NoSQL and more traditional databases. Companies are used to RDBMS; they have experts and DBAs for them. NoSQL is new and not well understood yet. The administration is different. Performance tuning and analysis are different, as are the problem patterns that we see. More importantly, performance and setup are more than ever governed by the applications that use these databases and not by index tuning.
Application Performance Management as a discipline is well equipped to deal with this. In fact, by looking at end-to-end application performance, it can handle the different NoSQL solutions just like any other database.

Speeding up a Pig+HBase MapReduce job by a factor of 15

The other day I ran a Pig script. Nothing fancy; I loaded some data into HBase and then ran a second Pig job to do some aggregations. I knew the data loading would take some time as it was multiple gigabytes of data, but I expected the second aggregation job to run much faster. It ran for over 15 hours and was not done at that time. This was too long in my mind, and I terminated it. I was using Amazon Elastic MapReduce as my Hadoop environment, so I could have added more resources, but I wasn't sure whether that would do me any good. I took a look at the performance breakdowns, and within minutes I found the problem. Not only that, I realized that even an expert would have had a hard time identifying the problem. What's more, the go-to fix for more performance in Hadoop, which is adding more hardware, would not have helped at all!

Pig vs. Hive

Pig is one of the two predominant ways to access, transform and query data in Hadoop. Whereas Hive uses a SQL-like approach that appeals to database people and analysts, Pig is a scripting language. Its appeal lies in its procedural approach and its ability to transform data in a pipeline fashion, which appeals to ETL (Extract, Transform, Load) experts. Pig is typically used to load data into Hadoop and to perform complex transformations and aggregations that would be hard to do in a single Hive query. Both systems generate one or more MapReduce jobs, and therein lies their power and complexity.

My MapReduce Job

For demonstration purposes I simply used a kmer index job to reproduce the issue; if you are interested in running it yourself, get it from: https://github.com/lila/emr-hbase-sample
The job consists of two Amazon EMR steps in a single job flow. The first Pig script loads multiple gigabytes of genome sequence data into a single HBase table (I added some extra FASTA sample genome files compared to the original sample). The second Pig job produces a size-based index. The respective Pig Latin looks like this:
A = LOAD 'hbase://kmer'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'sequence:*', '-loadKey true')
AS (id:bytearray, sequence:map[]);
B = foreach A generate id, SIZE(sequence);
C = group B by $1 PARALLEL $p;
D = foreach C generate group, COUNT(B);
I ran it on an empty HBase table and looked at the performance results.

The Loading Job

As mentioned, the first step was a Pig job that loaded several gigabytes of data into a simple HBase table. When looking at it we can see a couple of things immediately:
The MapReduce Task is sending inserts to multiple HBase Region servers, thus not storing the data locally
First of all, the HBase data is not stored locally. This is not much of a surprise. The Pig script loads data from a file and gets distributed evenly across the Hadoop cluster based on that input data. That distribution is of course not the same as the resulting region distribution in HBase. Secondly, we also notice that for the first 10 minutes only a single HBase region server is hit, then a second, and only after 14 minutes do all 5 HBase region servers process data.
At the beginning of the job only a single HBase region Server consumes CPU while all others remain idle
This is also well documented and has to do with HBase regions and splits (read this very good blog post by Hortonworks if you want to know more).
And lastly we also see that this was a map-only job.
When looking at the break down in time we see that the kmerLoad jobs do not spend any time in reducing or shuffling but only in the mapping phase
More specifically, we spend nearly all of the time in HBase. As we are loading data into HBase, we would expect it to be the primary contributor.
This break down shows that all MapTasks are processing and spend nearly all their time sending data to HBase
Next I took a look at the HBase inserts. We can see how many items we processed and inserted into HBase:
This shows the inserts per region server and HBase table/column family. We also see the number of roundtrips compared to the number of rows inserted
A couple of things are interesting here. First of all, we see that the number of inserts is not equally distributed across the region servers. That is again due to the fact that the table was empty and HBase started with a single region (see the previous screenshots and the Hortonworks blog post). We can also compare the number of inserted rows with the number of roundtrips, making sure that we leverage batching correctly.
When comparing this with the number of mapped records we see that we are doing roughly 27 inserts per input record and each output record corresponds to a single HBase row.
This shows Details about the MapReduce Job and several counters. In this case we see the mapped records.
In summary, we can say that the loading is already pretty optimal: the CPU load on the cluster is fine, and we insert ~1700 rows into HBase per roundtrip.
This shows a single task attempt inserting rows into HBase, we can see that it inserts many rows in a single roundtrip
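For reference, this many-rows-per-roundtrip behaviour is what you get when the HBase client is allowed to batch mutations. Here is a minimal sketch of the idea using the plain (older-style) HBase client API of the time; HBaseStorage does something like this internally, and the row content below is made up:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedPutSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "kmer");

        List<Put> batch = new ArrayList<Put>();
        for (int i = 0; i < 1700; i++) {
            Put put = new Put(Bytes.toBytes("row-" + i));
            put.add(Bytes.toBytes("sequence"), Bytes.toBytes("kmer"), Bytes.toBytes("ACGT"));
            batch.add(put);
        }
        // One call, many rows: the client groups the Puts per region server,
        // so the whole batch travels in very few roundtrips.
        table.put(batch);
        table.flushCommits();
        table.close();
    }
}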
Together with the fact that HBase really was the main contributor here, this means there was not much I could do to speed things up, at least not without some serious expert help (suggestions on how to improve this beyond pre-splitting are welcome). So let's move on to the index job.

The Index Job

The second job creates a size-based index over all inserted data; hence it has to read every row in the table. This second step ran for 15 hours and did not complete, because I terminated it at that point. By analyzing it, I saw the root cause almost immediately.
The Database Overview shows a single Scan that makes a roundtrip to the HBase region server for every single row it retrieves
Look at the roundtrips and the row counts processed. There is one more roundtrip than there are rows! That means the MapReduce job is calling out to the HBase region server for every single row, which is very inefficient.
The default scan caching on a Pig HBase job should be 100, but that was obviously not the case. The performance breakdown showed me that we were indeed spending all of our time waiting on HBase.
The HBase region Server contributes over 95% to our job execution time
This flow also shows that we spend 95% of our time waiting on HBase
Because we are making one network roundtrip for every single row, the utilization of my cluster was miserable.
The CPU utilization in the hadoop cluster is bad, it only has a couple of spikes which seem to coincide with HBase region splits
Therefore adding more hardware would not have helped much – it was not the lack of resources that made my indexing job slow.

Pig really doesn’t like HBase…

According to the documentation, HBaseStorage has an option to set the scanner caching, which should lead to fewer roundtrips, with more rows fetched per roundtrip. This is how it looks in the Pig script:
A = LOAD 'hbase://kmer'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'sequence:*', '-loadKey true -caching 500')
AS (id:bytearray, sequence:map[]);
B = foreach A generate id, SIZE(sequence);
C = group B by $1 PARALLEL $p;
D = foreach C generate group, COUNT(B);
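For context, the -caching option is supposed to end up as the scanner caching on the underlying HBase scan. Roughly speaking, it should have the same effect as the following plain HBase client sketch (this is not Pig's internal code, just the idea, using the client API of the time):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanCachingSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Client-wide default; this is the property targeted later via SET and the EMR bootstrap action.
        conf.setInt("hbase.client.scanner.caching", 500);

        HTable table = new HTable(conf, "kmer");
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("sequence"));
        scan.setCaching(500); // per-scan override: fetch 500 rows per roundtrip instead of 1

        ResultScanner scanner = table.getScanner(scan);
        for (Result row : scanner) {
            // process row ...
        }
        scanner.close();
        table.close();
    }
}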
I ran a smaller sample to test this option. However, when I looked at the result I saw that it was still reading one row at a time.
We see that the HBase scan is still doing a roundtrip for every single row it retrieves
Now I was really curious. I added a couple of sensors to get more visibility, and what I found was that HBaseStorage did pick up my caching configuration option as defined in the script, but the actual HBase scanner was still being configured with a cache size of 1!
The two selected rows show that while the Pig option (the first) retrieves the caching option defined in my pig script, the Scanner itself still gets Caching size 1 (second selected row)
I did not give up there though! Pig also has a way to set global MapReduce job options, which is what I tried next. This can be done by adding the following to the Pig script:
SET hbase.client.scanner.caching 1000;
It didn't have any effect either. I could verify, by looking at the job tracker, that the setting made it into the job.xml submitted to Hadoop. But I also verified that the map task itself was not aware of that setting at all.
This shows that the hbase.client.scanner.caching job option always returns 1, although the job.xml is set to 1000
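For what it's worth, one way to double-check what a map task actually sees is to read the value from the task's configuration, for example in a mapper's setup method. This is a rough sketch using the standard Hadoop mapper API, not the exact sensors I used:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Rough sketch: log the scanner caching value exactly as this map task sees it.
public class CachingCheckMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void setup(Context context) {
        int caching = context.getConfiguration().getInt("hbase.client.scanner.caching", -1);
        System.out.println("Effective hbase.client.scanner.caching: " + caching);
    }
}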
Finally, I launched a new Hadoop cluster via EMR and added the HBase option in the bootstrap script, making it a cluster-wide configuration setting:
--bootstrap-action s3://us-east-1.elasticmapreduce/bootstrap-actions/configure-hbase --args -s,hbase.client.scanner.caching=500
I ran the job again, and wow was it fast. My sample subset went down from 9 minutes to 2! My overall map time went down from 7 minutes to 25 seconds!
This shows the same job executed twice, once with caching option not working and once with a cluster wide setting that did work. The red rectangle shows the dramatic effect this has on map time
Satisfied that the scanner caching option finally worked, I ran my big job again. The improvement was massive. It went down from over 15 hours (at which point I had terminated it!) to 75 minutes! That is over 15 times faster on the same hardware!
Looking at the transaction flow, we see that we still spend most of our time in HBase, but it went down from over 95% to 72%. The shift meant that a lot more processing was done on the MapReduce side itself. To put this into context, it means that a lot more time was spent in the aggregation part of the Pig script.
This transaction flow shows that we now spend only 72% of the job time waiting on HBase and much more time doing actual processing
Thus we’ve turned this job from one mostly waiting on HBase and network to one that is actually leveraging the CPU. As a result the utilization of my cluster is much better now, albeit far from optimal.
The CPU load on the Cluster is much better now, compared to before.
There is still a lot of room for improvement: CPU utilization of my Hadoop nodes is nowhere near 100%, so we might be able to make this even faster, but that is work for another day. What's important is that without the insight our APM solution gives us, I would never have known how to fix this!

Conclusion

While a Pig or Hadoop expert might have told me right away to set the caching option, even he would not have easily figured out why the option didn't take effect. With APM I was able to verify the number of roundtrips and immediately saw that the option didn't work; I didn't need to spend hours or days to realize that. While I do not know why the option doesn't work (any Hadoop developers here who can tell us?), I was able to remedy the situation and also verify that the site-wide option had the desired effect – with dramatic improvements!
When I ask customers how they deal with performance problems in MapReduce, they often tell me that, beyond generic Hadoop tuning, they don't bother. They simply add more hardware, because they cannot have experts spending days or weeks browsing through logs or looking at every job they execute. In this case, adding more hardware would not have helped, as the existing cluster was well underutilized. By leveraging an APM solution, I was able to figure out the root cause, realize that the documented fix was not working, and come up with an alternative solution within an hour! I sped up my job by a factor of 15, without adding any hardware to my cluster and without needing an expert to go through log files!
