Kalyan Hadoop Training in Hyderabad @ ORIEN IT, Ameerpet, 040 65142345 , 9703202345

Friday, 15 August 2014

The State of NoSQL

After at least four years of tough criticism, it's time to come to an intermediate conclusion about the state of NoSQL. So many things have happened around NoSQL that it is hard to get an overview and value what goals have been achieved and where NoSQL failed to deliver.

In many fields NoSQL has been more than successful in the industry and academics too. Universities are starting to understand that NoSQL must to be increasingly adopted by the curriculum. It is simply not enough to teach database normalization up and down. This, of course, does not mean that a profound relational foundation is wrong. To the contrary, NoSQL is certainly a perfect and important addition.

What happened?

The NoSQL Space has exploded in just 4-5 years to about 50 to 150 new databases. nosql-database.org lists about 150 such databases, including some quite old but still strong dinosaurs like Object Databases. And, of course, some interesting mergers have happened, such as the CouchDB and Membase deal leading to CouchBase. But we will discuss each major system later in this article.

Many people have been assuming a huge consolidation in the NoSQL space. However this has not happened. The NoSQL space simply exploded and is still exploding. As with all areas in computer science - like e.g. programming languages - there are more and more gaps opening up for a huge amount of databases. And this is all in line with the explosion of the Internet, big-data, sensors and many more technologies in the future, leading to more data and different requirements about their treatments. In the past four years we saw only one significant system leaving the stage: the German graph database Sones. The vast amount of NoSQL databases continues to live happily either in the open-source space, without any considerable money turnaround, or in the commercial space.

Visibility and Money

Another important point is the visibility and industry adoption. In this space we can see a huge difference between the old industry - protecting the investment - and the new industry: mostly startups. While nearly all of the hot web-startups such as Pinterest or Instagram do have a hybrid (SQL + NoSQL) architecture, the 'old' industry is still struggling with NoSQL adoption. But the observation here is that more and more companies like these are trying to cut out a part of their data streams to be processed and later on analyzed with NoSQL solutions like Hadoop, MongoDB, Cassandra, etc.

And this leads as well to a strong increased demand on developers and architects with NoSQL knowledge. A recent survey showed the following latest developer skills requested by the industry:

HTML5
MongoDB
iOS
Android
Mobile Apps
Puppet
Hadoop
jQuery
PaaS
Social Media

So there are 2 NoSQL databases in the top ten for technology requirements here. And even one before iOS. If this isn't a praise, what else?!

But NoSQL adoption is going faster and deeper as one might think at first glance. In a well known whitepaper Oracle stated in the summer of 2011 that NoSQL DBs feel like an ice cream flavor, but you should not get too attached because it may not be around for too long. Only a few months later Oracle showed its Hadoop integration into a Big Data Appliance. And even more, we saw the launch of their own NoSQL database, which was a revised BerkeleyDB. Since then, there has been a race for all vendors to integrate Hadoop. Microsoft, Sybase, IBM, Greenplum, Pervasive, and many more do already have a tight integration. A pattern that can be seen everywhere: can't fight it, embrace it.

But one of the strongest but silent signs of a broad NoSQL adoption is that NoSQL databases are getting a PaaS standard. Thanks to the easy setup and management of many NoSQL databases, DBs like Redis or MongoDB can be seen in dozens of Paa-Services as Cloud-Foundry, OPENSHIFT, dotCloud, Jelastic, etc. As everything moves more and more into the cloud this becomes a huge momentum for NoSQL to put pressure on classic relational databases. Having the choice to select either MySQL/PostGres or MongoDB/Redis, for example, will force them to think twice about their model, requirements and raise other important questions.

An interesting indicator for technologies is also the ThoughtWorks radar which always contains a lot of interesting stuff, even if you do not fully agree with everything contained in it. Let's have a look at their radar from October 2012 in picture 1:

Picture 1: ThoughtWorks Radar, October, 2012 - Platforms

In their platform quadrant they list five databases:

Neo4j (adopt)
MongoDB (tial but close to adopt)
Riak (trial)
CouchBase (trial)
Datomic (assess)

If you look at this you see that at least four of these have received a lot of venture capital. If you add up all the venture capital in the entire NoSQL Space you will surely count up to something in between 100M and a billion dollars! Neo4j is one of one of these examples for getting 11m $ in a series B funding. Other companies that received $10-30M in funding were Aerospike, Cloudera, DataStax, MongoDB, CouchBase, etc. But let's have a look at the list again: Neo4j, MongoDB, Riak and CouchBase have been in this space for the last four years and have constantly proven to be among market leaders for specific requirements. Then, DB number 5 –Datomic - is more than astonishing, a complete new database, with a complete new paradigm written by a small team. Must be really hot stuff and we will dig into it a bit later when discussing all DBs briefly.

Standards

Many people have asked for NoSQL standards, failing to see that NoSQL covers a really wide range of models and requirements. Hence unified languages for all major areas such as Wide Column, Key/Value, Document and Graph Databases will surely not be available for a long time because it's impossible to cover all areas. Several approaches, such as Spring Data, try to add a unified layer but it's up to the reader to test if this layer is a leap forward in building a polyglot persistence environment or not.

Mostly the graph and the document databases have come up with standards in their own domain. The graph world is more successful with its tinkerpop blueprints, Gremlin, Sparql, and Cypher. In the document space we have UnQL and jaql filling up some niches, although the first lacks real world support by a NoSQL database. But with the force of Hadoop many projects are working on bridging famous ETL languages such as Pig or Hive to other NoSQL databases. So the standards world is highly fragmented, but only due to the fact that NoSQL luckily is a very wide area.

Landscape

One of the best overviews of the database landscape has been given by Matt Aslett in a report of the 451 Group. He recently updated his picture giving us more insights to the categories he mentioned. As you can see in the following picture, the landscape is highly fragmented and overlapping:

(Click on the image to enlarge it)

Picture 2: The database landscape by Matt Aslett (451 group)

As you can see there are several dimensions in one picture. Relational vs. Non-relational, Analytic vs. Operational, NoSQL vs. NewSQL. The last two categories have the well known sub-categories Key-Value, Document, Graph and Big Tables for NoSQL and Storage-Engines, Clustering-Sharding, New Databases and Cloud Service Solutions. The interesting part of this picture is that it is increasingly difficult to put a database to an exact location. Everyone is now trying fiercely to integrate features from databases found in other spaces. NewSQL Systems implement core NoSQL features. NoSQL Systems try more and more to implement 'classic' features as SQL support or ACID or at least often configurable persistence.

It all started with the integration of Hadoop that tons of relational databases now offer. But there are many other examples: e.g. MarkLogic is now starting to ride the JSON wave and thus also hard to position. Furthermore more multi-model databases appear, such as ArangoDB, OrientDB or AlechemyDB (which is now a part of the promising Aerospike DB). They allow to start with one database model (e.g. document / JSON model) and add other models (graph or key-value) as new requirements pop up.

Books

Another wonderful sign of a beginning maturity is the book market. After two German books published in 2010 and 2011 we saw the Wiley book by Shashank Tiwari. Structured like a hurricane and full of great deep insights. The race continued with two nice books in 2012. The 'Seven Databases in Seven Weeks' is surely a masterpiece. Freshly written and full of practical 'hands-on' insights: it takes six famous NoSQL databases and adds PostGreSQL to the mix, Making it a top recommendation. On the other side P.J. Sandalage and Martin Fowler take a more holistic approach to cover all the characteristics and help evaluating your path and decisions with NoSQL.

But there is more to come. It is just a matter of time till a Manning book appears on the scene: Dan McCreary and Ann Kelly are writing a book called: "Making Sense of NoSQL" and the first MEAP chapters are already available.

After starting with concepts and patterns, their chapter 3 will surely look attractive:

Building NoSQL Big Data solutions
Building NoSQL search solutions
Building NoSQL high availability solutions
Using NoSQL to increase agility

This is a new fresh approach and will surely be worth reading.

State of the Leaders

Let's give each NoSQL leader a quick consideration. As one of the clear market leaders, Hadoop is a strange animal. On one hand it has an enormous momentum. As mentioned before, each classic database vendor is in a hurry to announce Hadoop support. Companies such as Cloudera and MapR continue to grow and new Hadoop extensions and successors are announced every week.
Even Hive and Pig continue to get even better acceptance. Nevertheless, there is a fly in the ointment: Companies still complain about an unstructured mess (reading and parsing files could be even faster), MapReduce is far 'too batch' (even Google goes away from it), management is still hard, stability issues, and local training/consultants are still hard to find. But even if you could address some of the issues it's still a hot question, if Hadoop will grow as it is or it will change dramatically.

The second leader, MongoDB, also suffers from flame wars, and it might be the nature of things that leading DBs get the most criticism. Nevertheless, MongoDB goes at a fast pace and criticism mostly is:

a) concerning old versions or
b) due to the lack of knowledge on how to deal with it in a correct way. This recently culminated in absurd complaints that the 32 bit version can only handle 2GB, although MongoDB states this clearly in the download section and recommends the 64 bit version.

Anyway, MongoDBs partnerships and funding rounds push ambitious roadmaps with hot stuff:

the industry called for some security / LDAP features which are currently being developed
full text search will be in soon
V8 for MapReduce is coming
even a finer level then collection level locking will come
and a Hash Shard Key is on the way

Especially this last point catches the interest of many architects. MongoDB was often blamed (also by competitors) for not implementing a concise consistent hashing which is not entirely correct because such a key can be easily defined. But in the future there will be a config for a hash shard key. This means the user is up to decide if a hash key for sharding is useful or if he needs some (perhaps even rare) advantages of selecting his own sharding key. Surely this increases the pressure on other vendors and will lead to fruitful discussion when to use a sharding key.

Cassandra is the next in line and quite doing well adding more and nicer features such as better querying. However rumors won't stop telling that running a Cassandra cluster is not piece of cake and requires some hard work. But the most attractive issue here is surely DataStax. The new Company on top of Cassandra - 25 Million round C funding - is mostly addressing analytics and some operational issues. Especially the analytics was a surprise for many because in the early days Cassandra was not known as a powerful query machine. But as this has changed in the latest version the query capabilities may be sufficient enough for some modern analytics.

The development speed of Redis is also remarkable. Despite Salvatore’s assertions that he would have achieved nothing without the community and the help of Pieter Noordhuis it still looks like a stunning one man show. The sentinel failover and server side scripting with the Lua programming language are the latest achievements. The decision for Lua was a bit of a shock for the community because everyone integrates JavaScript as a server-side language. Nevertheless, Lua is a neat language and will help Redis open up a new pandora of possibilities.

CouchBase also looks like a brilliant solution in terms of scalability and latency despite the strong winds that Facebook and hence Zynga are now facing. It's surely not a hot query machine but if they could improve querying in the future the portfolio would be quite complete. The merger with the CouchDB founders was definitely a strong step and it's worthwhile to see the great influences of CouchDB in CouchBase. On every database conference it's also funny to hear the discussions, if CouchDB is doing better or worse after Damien, Chris and Jan have left. One can only hear extreme opinions here. But who cares as long as the DB is doing fine. And it looks like it does.

The last NoSQL DB to be mentioned here is of course Riak, which also improved dramatically in functionality and monitoring. It continues to have a good reputation mostly in terms of stability: "rock solid, invisible and good for your sleep". The Riak CS fork also looks interesting in terms of the modularity of this technology.

Interesting Newcomers

Beside the market leaders, newcomers are always interesting to evaluate. Let's dig into some of them.

Elastic Search surely is one of the hottest new NoSQL products and just got a 10m $ in series A funding, and that for a good reason. As a scalable search engine on top of Lucene it brings many advantages: a) a company on top providing services and b) leveraging all the achievements that Lucene has conceived in the last years. It will surely infiltrate the industry now more than before, attacking all the big players in the semi-structured information space.

Google also send it's small but fast LevelDB into the field. And it serves as a basis for many usages with specific requirements such as compression integration. Even Riak integrated LevelDB. It remains to be seen when all the new Google internal databases such as Dremel or Spanner will find their way out as open-source projects (like Apache Drill or Cloudera Impala).

Another tectonic shift surely was DynamoDB at the start of 2012. They call it the fastest growing service ever launched at Amazon. It's the ultimate scaling machine. New features are coming slowly but the focus on SSDs and latency is quite amazing.

Multi-model databases are also a field worthwhile to have a look on. OrientDB, its famous representative, is by far not a newcomer but it is improving its capabilities quite fast. Perhaps too fast because some customers might now be happy that OrientDB has reached Version 1.0 and thus hopefully gained a lot more stability. Graph, Document, Key-Value support combined with transactions and SQL are reasons enough to give it second try. Especially the good SQL support makes it interesting for analytic solutions such as Penthao. Another newcomer in this space is ArangoDB. It is moving fast and it doesn't flinch from comparing itself in benchmarks against the established players.
However, again the native JSON and graph support saves a lot of effort if new requirements have to be implemented and the new data has a different model that must be persisted.

By far the biggest surprise this year was Datomic. Written by some rock stars of the Clojure programming language in an incredible short time, it unveils a whole bunch of new paradigms. Furthermore it has made its way into the ThoughtWorks radar with the recommendation to have a look at it. And although it is 'just' a layer on top of established databases it brings a huge amount of advantages, such as:

transactions
a time machine
a fresh and powerful query approach
a new schema approach
caching & scaling features

Currently, DynamoDB, Riak, CouchBase, Infinispan and SQL are supported as the underlying storage engine. It even allows you to mix and query different DBs simultaneously. Many veterans have been surprised that such a radical paradigm shift can be possible. Luckily it is.

Summary

To conclude, let us address three points:

Some new articles by Eric Brewer on the CAP theorem should have come several years earlier. In this article he states that "2 of 3" is misleading, explaining the reasons, why the world is more complicated than a simple CP/AP i.e. ACID/BASE choice. Nevertheless, thousands of talks and articles kept on praising the CAP theorem without any critical review for years. Michael Stonebraker was the strongest censor of the NoSQL movement (and the NoSQL space owes him a lot), pointing to these issues some years ago! Unfortunately, not many are listening. But now that Eric Brewer updated his theorem, the time of simple CAP statements is definitely over. Please be at the very front in pointing out the true and diverse CAP implications.
As we all know, the weaknesses of the classical relational databases have lead to the NoSQL field. But it was just a matter of time for the empire to strike back. Under the term "NewSQL" we can see a bunch of new engines (such as database.com, VoltDB, GenieDB, etc. see picture 2), improving classic solutions, sharding and cloud solutions. Thanks to the NoSQL movement.

But as many DBs try to implement every feature, clear frontiers vanish.
The decision for a database is getting more complicated than ever.
You have to know about 50 use cases, 50 DBs and you should answer at least 50 questions. The latter have been gathered by the author in over 2 years of NoSQL consulting and can be found here: Select the Right Database, Choosing between NoSQL and NewSQL.
It's common wisdom that every technology shift - since client-server and before - is about ten times more costly to switch to. For example, switching from Mainframe to Client-Server, Client-Server to SOA, SOA to WEB, RDBMS to Hybrid Persistence, etc. And as a consequence, many companies hesitate and struggle in adding NoSQL to their portfolio. But it is also known that the early adopters who are trying to get the best out from both worlds and thus integrate NoSQL fast will be better positioned for the future. In this regard, NoSQL solutions will be here to stay and always a gainful area for evaluations.

About the Author

Prof. Dr. Stefan Edlich is a senior lecturer at Beuth HS of Technology Berlin (University of App. Sc.). He wrote more than 10 IT books for publishers such as Apress, OReilly, Spektrum/Elsevier and others. He runs the NoSQL Archive, did NoSQL consulting, organizes NoSQL Conferences, wrote the world’s first two NoSQL books and is addicted to the Clojure programming language.

Transitioning from RDBMS to NoSQL. Interview with Couchbase’s Dipti Borkar

While relational databases have been used for decades to store data, and they still represent a viable solution for many use cases, NoSQL is being chosen today especially for scalability and performance reasons. This article contains an interview with Dipti Borkar, Director of Product Management at Couchbase, on the challenges, benefits and the process of migrating from RDBMS to NoSQL.

InfoQ: When is it the time to dump SQL for a NoSQL solution?

Dipti Borkar: OK, that title sounds a little harsh – and in truth, in most cases, it's not a matter of dumping SQL for a NoSQL solution, but rather, it’s about making a transition from one to the other, where application and use case dictate the need for a change. In general, this transition will be spurred by the need for flexibility – both in the scaling model and the data model - when building modern web and mobile applications.

Typical web applications are built with a three-tier architecture. To scale out the application, more commodity web servers are simply added behind a load balancer to support more users. The ability to scale out is a core tenet of the increasingly important cloud-computing model, in which virtual machine instances can be easily added or removed to match demand.

However, when it comes to the data layer, relational database (RDBMS) technology does not scale out and does not provide a flexible data model, presenting a number of challenges. Handling more users means adding a bigger server, and big servers are highly complex, proprietary, and disproportionately expensive, unlike the low-cost, commodity hardware in web- and cloud-based architectures. So, as organizations start seeing performance problems with their relational database for existing or new applications, particularly as the number of users grows, they realize the need for a faster, elastic database tier. This is the time to start evaluating NoSQL and adopting it as the database tier in their interactive web applications.

InfoQ: What would be the main steps required to transition from SQL to NoSQL?

Dipti Borkar: Organizations/projects can vary greatly in terms of what they are looking for in a NoSQL database; so much of the transition will depend on your use-case. Below are general guidelines for transitioning:

#1 Understand the key requirements for your application:

Some of the requirements that match the need for NoSQL are

Rapid application development
– Changing market needs
– Changing data needs
Scalability
– Unknown user demand
– Need for constantly growing throughput to access, add and update data
Consistent performance
– Low response time for better user experience
– High throughput to handle viral growth
Operational reliability
– High-availability to handle failures gracefully with minimal impact to the application
– Built-in monitoring APIs for easy ongoing maintenance

#2 Understand the various types of NoSQL offerings:

There is a common myth that all NoSQL databases are created equally - this is not true. Cassandra, for instance may be a solution you use for analytical use cases given its columnar data model. Neo4j, a graph database, for example, may be the database you use for applications that need access to relationships between entities.

I'll focus specifically on distributed document-oriented NoSQL database technology – with Couchbase and MongoDB being the two most visible and widely adopted examples.

#3 Execute a proof of concept

Once you have narrowed down on potential choices for the database tier, plan a proof of concept integrating the key characteristics of your application. Look for response time and throughput performance and the ability to scale out easily.

#4 Document modeling and development

For document databases, spend sometime on modeling your data from fixed tabular schemas to flexible document objects.

#5 Deploying to staging and production

Operational stability is a very important aspect for interactive web applications. Test and stage your application rollout as you would for applications that use traditional RDBMS systems. Ensure your selected database supports monitoring across the cluster, easy online scaling for adding capacity if needed and other database administrative tools.

#6 Stay up to date on newest trends

There is a plethora of quality, free training courses throughout the US that offer hands-on NoSQL training courses. The best way to ensure a successful NoSQL implementation is to have an educated developer team that is up to date on the latest server releases and vendor offerings.

Below are links to some of the biggest ones:

-CouchConf

-NoSQL Now

InfoQ: What are the main difficulties migrating from SQL to NoSQL?

Dipti Borkar: The main difficulty basically boils down to understanding the differences between the traditional RDBMS systems and document databases. The most important difference is the data model:

As shown above, each record in a relational database conforms to a schema – with a fixed number of fields (columns) each having a specified purpose and data type. Every record is the same. Data is denormalized across multiple tables. The upside is that there is less duplicated data in the database. The downside is that a change in the schema means performing several expensive “alter table” statements that requires locking down many tables simultaneously to ensure a change doesn’t leave the database in an inconsistent state.

With document databases on the other hand, each document can have a completely different structure from other documents. No additional management is required on the database to handle changes to document schemas.

InfoQ: What are the benefits of NoSQL document databases?

Dipti Borkar: The main benefits of document databases are:

Flexible data model
Data can be inserted without a defined schema, and the format of the data being inserted can change at any time—providing extreme flexibility in the application, which ultimately delivers substantial business agility.
Easy scalability
Some NoSQL databases automatically spreads data across servers, requiring no participation from the applications. Servers can be added and removed from the data layer without application downtime, with data and I/O spread across servers.
Consistent, high performance
Advanced NoSQL database technologies transparently cache data in system memory—a behavior that is completely transparent to the developer and the operations team.

InfoQ: How do developers react when you tell them about adopting NoSQL?

Dipti Borkar: Developers are extremely excited about NoSQL technologies particularly because of the ease of development some databases bring. Document databases have extremely flexible schemas and are easy to work with.

Developers can iterate over application changes faster without the need to change the schema of the underlying database. This is particularly useful when developers are building applications with sparse data or data that’s constantly changing or data from third-party providers they do not have control over.

InfoQ: Is it OK to work with existing developers and have them learn new skills or should you look for new ones that master NoSQL?

Dipti Borkar: Application developers will find it easy to adopt some NoSQL technologies, particularly those that support JSON as the document format. More and more developers are using JSON to model objects in their applications. Therefore storing the data directly as JSON in the database reduces the impedance mismatch across the stack.

Developers who heavily use SQL may need to adapt and learn about document modeling approaches. Rethinking how data can be structured in a logical way using documents rather than normalizing the data into a fixed database schema becomes an important aspect.

InfoQ: Have you had or heard of unsuccessful attempts to switch to NoSQL? If yes, what went wrong?

Dipti Borkar: Architects and developers should ensure that their key requirements are satisfied by the solution or database selected. For example, choosing a database that’s more suited towards analytical applications may not satisfy your latency and throughput needs for interactive applications. Projects that make a quick choice without investigating all requirements may find that they have slower response times for data access leading to a poor user experience. Users need to plan up front for scalability. Here’s a more drastic example of things going south. In some situations an app has gone viral but the database that was selected couldn’t keep up and scale out.

At the same time, using a database that is more suited towards an OLTP-like use case may not perform well for advanced analysis jobs or complex processing. A big data solution may be more suitable.

InfoQ: What are the key lessons migrating to NoSQL?

Dipti Borkar: There are a lot of benefits developers will see when moving to NoSQL. A more flexible data model and freedom from rigid schemas is a big one. You may also see significantly improved performance and the ability to horizontally scale out the data layer.

But most NoSQL products are in early stages of the product cycle. While functionality like complex joins or multi-document transactions can be simulated in the app, developers may be better off using a traditional RDBMS. And for some projects, a hybrid approach might be the best choice.

About the Interviewee

Dipti Borkar is the Director of Product Management at Couchbase where she is responsible for the product roadmap of Couchbase Server, a NoSQL database and works with customers and users to understand emerging requirements for low-latency, scalable data stores. Dipti has deep technical experience in the database industry having worked at IBM as a software engineer and Development Manager for the DB2 server team and then at MarkLogic as a Senior Product Manager. Dipti holds a Masters degree in Computer Science from the University of California, San Diego with a specialization in databases and holds an MBA from the Haas School of Business at University of California, Berkeley.

Pages