
Friday 31 October 2014

Why are Facebook, Digg, and Twitter so hard to scale?

Real-time social graphs (connectivity between people, places, and things). That's why scaling Facebook is hard, says Jeff Rothschild, Vice President of Technology at Facebook. Social networking sites like Facebook, Digg, and Twitter are simply harder to scale than traditional websites. Why is that? Why would social networking sites be any more difficult to scale than traditional websites? Let's find out.

Traditional websites are easier to scale than social networking sites for two reasons:
  • They usually access only their own data and common cached data.
  • Only 1-2% of users are active on the site at one time.
Imagine a huge site like Yahoo. When you come to Yahoo they can get your profile record with one get and that's enough to build your view of the website for you. It's relatively straightforward to scale systems based around single records using distributed hashing schemes. And since only a few percent of the people are on the site at once it takes comparatively little RAM cache to handle all the active users.

Now think what happens on Facebook. Let's say you have 200 friends. When you hit your Facebook account it has to go gather the status of all 200 of your friends at the same time so you can see what's new for them. That means 200 requests need to go out simultaneously, the replies need to be merged together, other services need to be contacted to get more details, and all this needs to be munged together and sent through PHP and a web server so you see your Facebook page in a reasonable amount of time. Oh my.
There are several implications here, especially given that on social networking sites a high percentage of users are on the system at one time (that's the social part, people hang around):
  • All data is active all the time. 
  • It's hard to partition this sort of system because everyone is connected. 
  • Everything must be kept in RAM cache so that the data can be accessed as fast as possible.
Partitioning means you would like to find some way to cluster commonly accessed data together so it can be accessed more efficiently. Facebook, because of the interconnectedness of the data, didn't find any clustering scheme that worked in practice. So instead of partitioning and denormalizing data Facebook keeps data normalized and randomly distributes data amongst thousands of databases.

This approach requires a very fast cache. Facebook uses memcached as their caching layer. All data is kept in cache and they've made a lot of modifications to memcached to speed it up and to help it handle more requests (all contributed back to the community).

Their caching tier services 120 million queries every second and it's the core of the site. The problem is memcached is hard to use because it requires programmer cooperation. It's also easy to corrupt. They've developed a complicated system to keep data in the caching tier consistent with the database, even across multiple distributed data centers. Remember, they are caching user data here, not HTML pages or page fragments. Given how much their data changes it would be hard to make page caching work.
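To make the pull model concrete, here's a minimal sketch of that kind of cache lookup in Python with the python-memcached client. The key scheme (status:<user_id>) and the database fallback are illustrative assumptions, not Facebook's actual code.

# Hypothetical pull-on-demand lookup: fetch friends' statuses from cache,
# fall back to the database only for the keys that are missing.
import memcache

mc = memcache.Client(["10.0.0.1:11211", "10.0.0.2:11211"])

def load_status_from_db(key):
    # stand-in for a query against one of the thousands of databases
    return {"status": "..."}

def friend_statuses(friend_ids):
    keys = ["status:%d" % fid for fid in friend_ids]
    cached = mc.get_multi(keys)            # one batched round trip per memcached server
    for key in (k for k in keys if k not in cached):
        value = load_status_from_db(key)   # cache miss: hit the database
        mc.set(key, value)                 # repopulate the cache for next time
        cached[key] = value
    return cached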

We see similar problems at Digg. Digg, for example, must deal with the problem of sending out updates to 40,000 followers every time Kevin Rose diggs a link. Digg, and I think Twitter too, have taken a different approach than Facebook.

Facebook takes a Pull on Demand approach. To recreate a page or a display fragment they run the complete query. To find out if one of your friends has added a new favorite band Facebook actually queries all your friends to find what's new. They can get away with this only because of their awesome infrastructure.

But if you've ever wondered why Facebook has a 5,000 user limit on the number of friends, this is why. At a certain point it's hard to make Pull on Demand scale.

Another approach to find out what's new is the Push on Change model. In this model when a user makes a change it is pushed out to all the relevant users and the changes (in some form) are stored with each user. So when a user wants to view their updates all they need to access is their own account data. There's no need to poll all their friends for changes.

With security and permissions it can be surprisingly complicated to figure out who should see an update. And if a user has 2 million followers it can be surprisingly slow as well. There's also an issue of duplication. A lot of duplicate data (or references) is being stored, so this is a denormalized approach which can make for some consistency problems. Should permission be consulted when data is produced or consumed, for example? Or what if the data is deleted after it has already been copied around?

While all these consistency and duplication problems are interesting, Push on Change seems the more scalable approach for really large numbers of followers. It does take a lot of work to push all the changes around, but that can be handled by a job queuing system so the work is distributed across a cluster.
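As a rough sketch of what Push on Change can look like, here's a toy fan-out worker, assuming Redis is used both as the job queue and as the per-user inbox store. The key names and follower lookup are invented for illustration; a real system would add batching, retries, and permission checks.

# Hypothetical push-on-change fan-out: a publish enqueues one job, and
# workers copy the update into every follower's inbox.
import json
import redis

r = redis.Redis()

def publish_update(author_id, update):
    # enqueue one fan-out job instead of writing to every follower inline
    r.rpush("fanout:jobs", json.dumps({"author": author_id, "update": update}))

def fanout_worker():
    while True:
        _, raw = r.blpop("fanout:jobs")                 # blocking pop; many workers can share the queue
        job = json.loads(raw)
        for follower in r.smembers("followers:%s" % job["author"]):
            inbox = "inbox:%s" % follower.decode()
            r.lpush(inbox, json.dumps(job["update"]))   # the follower reads only their own inbox
            r.ltrim(inbox, 0, 999)                      # cap each inbox so memory stays bounded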

The challenges will only grow as we get more and more people, more and deeper inter-connectivity, faster and faster change, and a greater desire to consume it all in real-time. We are a long way from being able to handle this brave new world.

What the heck are you actually using NoSQL for?

It's a truism that we should choose the right tool for the job. Everyone says that. And who can disagree? The problem is this is not helpful advice without being able to answer more specific questions like: What jobs are the tools good at? Will they work on jobs like mine? Is it worth the risk to try something new when all my people know something else and we have a deadline to meet? How can I make all the tools work together?

In the NoSQL space this kind of real-world data is still a bit vague. When asked, vendors tend to give very general answers like NoSQL is good for BigData or key-value access. What does that mean for the developer in the trenches, faced with the task of solving a specific problem when there are a dozen confusing choices and no obvious winner? Not a lot. It's often hard to take that next step and imagine how their specific problems could be solved in a way that's worth the trouble and risk.

Let's change that. What problems are you using NoSQL to solve? Which product are you using? How is it helping you? Yes, this is part of the research for my webinar on December 14th, but I'm a huge believer that people learn best by example, so if we can come up with real, specific examples I think that will really help people visualize how they can make the best use of all these new product choices in their own systems.

Here's a list of use cases I came up with after some trolling of the interwebs. The sources are so varied I can't attribute every one; I'll put a list at the end of the post. Please feel free to add your own. I separated out the use cases for a few specific products simply because I had a lot of use cases for them and they were clearer on their own. This is not meant as an endorsement of any sort. Here's a master list of all the NoSQL products. If you would like to provide a specific set of use cases for a product I'd be more than happy to add that in.

General Use Cases

These are the general kinds of reasons people throw around for using NoSQL. Probably nothing all that surprising here.
  • Bigness. NoSQL is seen as a key part of a new data stack supporting: big data, big numbers of users, big numbers of computers, big supply chains, big science, and so on. When something becomes so massive that it must become massively distributed, NoSQL is there, though not all NoSQL systems are targeting big. Bigness can be across many different dimensions, not just using a lot of disk space. 
  • Massive write performance. This is probably the canonical usage based on Google's influence. High volume. Facebook needs to store 135 billion messages a month. Twitter, for example, has the problem of storing 7 TB of data per day with the prospect of this requirement doubling multiple times per year. This is the data-is-too-big-to-fit-on-one-node problem. At 80 MB/s it takes a day to store 7 TB, so writes need to be distributed over a cluster, which implies key-value access, MapReduce, replication, fault tolerance, consistency issues, and all the rest. For faster writes in-memory systems can be used.
  • Fast key-value access. This is probably the second most cited virtue of NoSQL in the general mindset. When latency is important it's hard to beat hashing on a key and reading the value directly from memory or in as little as one disk seek. Not every NoSQL product is about fast access; some are more about reliability, for example. But what people have wanted for a long time was a better memcached, and many NoSQL systems offer that (a minimal sketch follows this list).
  • Flexible schema and flexible datatypes. NoSQL products support a whole range of new data types, and this is a major area of innovation in NoSQL. We have: column-oriented, graph, advanced data structures, document-oriented, and key-value. Complex objects can be easily stored without a lot of mapping. Developers love avoiding complex schemas and ORM frameworks. Lack of structure allows for much more flexibility. We also have program- and programmer-friendly datatypes like JSON.
  • Schema migration. Schemalessness makes it easier to deal with schema migrations without so much worrying. Schemas are in a sense dynamic, because they are imposed by the application at run-time, so different parts of an application can have a different view of the schema.
  • Write availability. Do your writes need to succeed no matter what? Then we can get into partitioning, CAP, eventual consistency and all that jazz.
  • Easier maintainability, administration and operations. This is very product specific, but many NoSQL vendors are trying to gain adoption by making it easy for developers to adopt them. They are spending a lot of effort on ease of use, minimal administration, and automated operations. This can lead to lower operations costs as special code doesn't have to be written to scale a system that was never intended to be used that way.
  • No single point of failure. Not every product is delivering on this, but we are seeing a definite convergence on relatively easy to configure and manage high availability with automatic load balancing and cluster sizing. A perfect cloud partner.
  • Generally available parallel computing. We are seeing MapReduce baked into products, which makes parallel computing something that will be a normal part of development in the future.
  • Programmer ease of use. Accessing your data should be easy. While the relational model is intuitive for end users, like accountants, it's not very intuitive for developers. Programmers grok keys, values, JSON, Javascript stored procedures, HTTP, and so on. NoSQL is for programmers. This is a developer led coup. The response to a database problem can't always be to hire a really knowledgeable DBA, get your schema right, denormalize a little, etc., programmers would prefer a system that they can make work for themselves. It shouldn't be so hard to make a product perform. Money is part of the issue. If it costs a lot to scale a product then won't you go with the cheaper product, that you control, that's easier to use, and that's easier to scale?
  • Use the right data model for the right problem. Different data models are used to solve different problems. Much effort has been put into, for example, wedging graph operations into a relational model, but it doesn't work. Isn't it better to solve a graph problem in a graph database? We are now seeing a general strategy of trying to find the best fit between a problem and solution.
  • Avoid hitting the wall. Many projects hit some type of wall in their project. They've exhausted all options to make their system scale or perform properly and are wondering what next? It's comforting to select a product and an approach that can jump over the wall by linearly scaling using incrementally added resources.  At one time this wasn't possible. It took custom built everything, but that's changed. We are now seeing usable out-of-the-box products that a project can readily adopt.
  • Distributed systems support. Not everyone is worried about scale or performance over and above that which can be achieved by non-NoSQL systems. What they need is a distributed system that can span datacenters while handling failure scenarios without a hiccup. NoSQL systems, because they have focussed on scale, tend to exploit partitions, tend not to use heavy strict consistency protocols, and so are well positioned to operate in distributed scenarios.
  • Tunable CAP tradeoffs. NoSQL systems are generally the only products with a "slider" for choosing where they want to land on the CAP spectrum. Relational databases pick strong consistency, which means they can't tolerate a partition failure. In the end this is a business decision and should be decided on a case by case basis. Does your app even care about consistency? Are a few dropped updates OK? Does your app need strong or weak consistency? Is availability more important, or is consistency? Will being down be more costly than being wrong? It's nice to have products that give you a choice.
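As promised in the key-value access point above, here's a minimal sketch of that pattern: a whole denormalized object stored under one key and read back in a single round trip, with no schema to declare. Redis and the key naming are just one possible, assumed choice.

# Hypothetical key-value profile storage: one SET to write, one GET to read.
import json
import redis

r = redis.Redis()

user = {
    "id": 42,
    "name": "Alice",
    "favorite_bands": ["Radiohead"],        # nested data needs no extra tables
    "last_login": "2014-10-31T12:00:00Z",
}
r.set("user:42", json.dumps(user))          # no schema or migration required

profile = json.loads(r.get("user:42"))      # one lookup rebuilds the object for the page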

More Specific Use Cases

  • Managing large streams of non-transactional data: Apache logs, application logs, MySQL logs, clickstreams, etc.
  • Syncing online and offline data. This is a niche CouchDB has targeted.
  • Fast response times under all loads.
  • Avoiding heavy joins when the query load for complex joins becomes too large for an RDBMS.
  • Soft real-time systems where low latency is critical. Games are one example.
  • Applications where a wide variety of different write, read, query, and consistency patterns need to be supported. There are systems optimized for 50% reads/50% writes, 95% writes, or 95% reads: read-only applications needing extreme speed and resiliency, simple queries, and tolerance of slightly stale data; applications requiring moderate performance, read/write access, simple queries, and completely authoritative data; read-only applications with complex query requirements.
  • Load balance to accommodate data and usage concentrations and to help keep microprocessors busy.
  • Real-time inserts, updates, and queries.
  • Hierarchical data like threaded discussions and parts explosion.
  • Dynamic table creation.
  • Two tier applications where low latency data is made available through a fast NoSQL interface, but the data itself can be calculated and updated by high latency Hadoop apps or other low priority apps.
  • Sequential data reading. The right underlying data storage model needs to be selected. A B-tree may not be the best model for sequential reads.
  • Slicing off part of a service that may need better performance/scalability onto its own system. For example, user logins may need to be high performance and this feature could use a dedicated service to meet those goals.
  • Caching. A  high performance caching tier for web sites and other applications. Example is a cache for the Data Aggregation System used by the Large Hadron Collider.
  • Voting.
  • Real-time page view counters.
  • User registration, profile, and session data.
  • Document, catalog management and content management systems. These are facilitated by the ability to store complex documents as a whole rather than organized as relational tables. Similar logic applies to inventory, shopping carts, and other structured data types.
  • Archiving. Storing a large continual stream of data that is still accessible on-line. Document-oriented databases with a flexible schema that can handle schema changes over time.
  • Analytics. Use MapReduce, Hive, or Pig to perform analytical queries and scale-out systems that support high write loads.
  • Working with heterogenous types of data, for example, different media types at a generic level.
  • Embedded systems. They don’t want the overhead of SQL and servers, so they use something simpler for storage.
  • A "market" game, where you own buildings in a town. You want the building list of someone to pop up quickly, so you partition on the owner column of the building table, so that the select is single-partitioned. But when someone buys the building of someone else you update the owner column along with price.
  • JPL is using SimpleDB to store rover plan attributes. References are kept to a full plan blob in S3. 
  • Federal law enforcement agencies tracking Americans in real-time using credit cards, loyalty cards and travel reservations.
  • Fraud detection by comparing transactions to known patterns in real-time.
  • Helping diagnose the typology of tumors by integrating the history of every patient.
  • In-memory database for high update situations, like a web site that displays everyone's "last active" time (for chat maybe). If users are performing some activity once every 30 sec, then you will pretty much be at your limit with about 5,000 simultaneous users.
  • Handling lower-frequency multi-partition queries using materialized views while continuing to process high-frequency streaming data.
  • Priority queues.
  • Running calculations on cached data, using a program-friendly interface, without having to go through an ORM.
  • Deduplicating a large dataset using simple key-value columns.
  • To keep querying fast, values can be rolled-up into different time slices.
  • Computing the intersection of two massive sets, where a join would be too slow.
  • A timeline à la Twitter.

Redis Use Cases

Redis is unique in the repertoire as it is a data structure server, with many fascinating use cases that people are excited to share.
  • Calculating whose friends are online using sets. 
  • Memcached on steroids.
  • Distributed lock manager for process coordination.
  • Full text inverted index lookups.
  • Tag clouds.
  • Leaderboards. Sorted sets for maintaining high score tables (see the sketch after this list).
  • Circular log buffers.
  • Database for university course availability information. If the set contains the course ID it has an open seat. Data is scraped and processed continuously and there are ~7200 courses.
  • Server-side session storage. Sessions (a random cookie value associated with a larger chunk of serialized data on the server) are a very poor fit for relational databases. They are often created for every visitor, even those who stumble in from Google and then leave, never to return again. They then hang around for weeks taking up valuable database space. They are never queried by anything other than their primary key.
  • Fast, atomically incremented counters are a great fit for offering real-time statistics.
  • Polling the database every few seconds is cheap in a key-value store. Also, if you're sharding your data you'll need a central lookup service for quickly determining which shard is being used for a specific user's data. A replicated Redis cluster is a great solution here; GitHub uses exactly that to manage sharding their many repositories between different backend file servers.
  • Transient data. Any transient data used by your application is also a good fit for Redis. CSRF tokens (to prove a POST submission came from a form you served up, and not a form on a malicious third party site) need to be stored for a short while, as does handshake data for various security protocols.
  • Incredibly easy to set up and ridiculously fast (30,000 reads or writes a second on a laptop with the default configuration).
  • Share state between processes. Run a long-running batch job in one Python interpreter (say loading a few million lines of CSV into a Redis key/value lookup table) and run another interpreter to play with the data that’s already been collected, even as the first process is streaming data in. You can quit and restart the interpreters without losing any data.
  • Create heat maps of the BNP’s membership list for the Guardian.
  • Redis semantics map closely to Python native data types, so you don’t have to think for more than a few seconds about how to represent data.
  • A simple capped log implementation (similar to a MongoDB capped collection): push items onto the tail of a 'log' key and use LTRIM to retain only the last X items. You could use this to keep track of what a system is doing right now without having to worry about storing ever-increasing amounts of logging information.
  • An interesting example of an application built on Redis is Hurl, a tool for debugging HTTP requests built in 48 hours by Leah Culver and Chris Wanstrath. 
  • It’s common to use MySQL as the backend for storing and retrieving what are essentially key/value pairs. I’ve seen this over-and-over when someone needs to maintain a bit of state, session data, counters, small lists, and so on. When MySQL isn’t able to keep up with the volume, we often turn to memcached as a write-thru cache. But there’s a bit of a mis-match at work here. 
  • With sets, we can also keep track of ALL of the IDs that have been used for records in the system.
  • Quickly pick a random item from a set. 
  • API limiting. This is a great fit for Redis as a rate limiting check needs to be made for every single API hit, which involves both reading and writing short-lived data (see the sketch after this list).
  • A/B testing is another perfect task for Redis - it involves tracking user behaviour in real-time, making writes for every navigation action a user takes, storing short-lived persistent state and picking random items.
  • Implementing the inbox method with Redis is simple: each user gets a queue (a capped queue if you're worried about memory running out) to work as their inbox and a set to keep track of the other users who are following them. Ashton Kutcher has over 5,000,000 followers on Twitter - at 100,000 writes a second it would take less than a minute to fan a message out to all of those inboxes.
  • Publish/subscribe is perfect for broadcasting updates (such as election results) to hundreds of thousands of simultaneously connected users. Blocking queue primitives mean message queues without polling.
  • Have workers periodically report their load average into a sorted set.
  • Redistribute load. When you want to issue a job, grab the three least loaded workers from the sorted set and pick one of them at random (to avoid the thundering herd problem).
  • Multiple GIS indexes. 
  • Recommendation engine based on relationships.
  • Web-of-things data flows.
  • Social graph representation. 
  • Dynamic schemas so schemas don't have to be designed up-front. Building the data model in code, on the fly by adding properties and relationships, dramatically simplifies code. 
  • Reducing the impedance mismatch because the data model in the database can more closely match the data model in the application.
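As referenced in the list above, here's a small sketch of two of these patterns with the redis-py client: a sorted-set leaderboard and a fixed-window API rate limiter. The key names and the 60-requests-per-minute limit are assumptions for illustration.

# Hypothetical leaderboard and rate limiter built on Redis primitives.
import time
import redis

r = redis.Redis()

def record_score(player, score):
    r.zadd("leaderboard", {player: score})            # sorted set keeps scores ordered

def top_players(n=10):
    return r.zrevrange("leaderboard", 0, n - 1, withscores=True)

def allow_request(user_id, limit=60):
    # one counter per user per minute; the key expires on its own
    key = "rate:%s:%d" % (user_id, int(time.time() // 60))
    count = r.incr(key)                               # atomic increment
    r.expire(key, 60)
    return count <= limit

The same INCR/EXPIRE shape also works for the real-time counters and A/B testing items mentioned above.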

VoltDB Use Cases

VoltDB, as a relational database, is not traditionally thought of as being in the NoSQL camp, but its radical design puts it so far away from Oracle-type systems that I feel it is much more in the NoSQL tradition.
  • Application: Financial trade monitoring
    1. Data source: Real-time markets
    2. Partition key: Market symbol (ticker, CUSIP, SEDOL, etc.)
    3. High-frequency operations: Write and index all trades, store tick data (bid/ask)
    4. Lower-frequency operations: Find trade order detail based on any of 20+ criteria, show TraderX's positions across all market symbols
  •  Application: Web bot vulnerability scanning (SaaS application)
    1. Data source: Inbound HTTP requests
    2. Partition key: Customer URL
    3. High-frequency operations: Hit logging, analysis and alerting
    4. Lower-frequency operations: Vulnerability detection, customer reporting
  • Application: Online gaming leaderboard 
    1. Data source: Online game 
    2. Partition key: Game ID 
    3. High-frequency operations: Rank scores based on defined intervals and player personal best
    4. Lower-frequency transactions: Leaderboard lookups
  • Application: Package tracking (logistics)
    1. Data source: Sensor scan
    2. Partition key: Shipment ID
    3. High-frequency operations: Package location updates
    4. Lower-frequency operations: Package status report (including history), lost package tracking, shipment rerouting
  •  Application: Ad content serving
    1. Data source: Website or device, user or rule triggered
    2. Partition key: Vendor/ad ID composite
    3. High-frequency operations: Check vendor balance, serve ad (in target device format), update vendor balance
    4. Lower-frequency operations: Report live ad view and click-thru stats by device (vendor-initiated)
  •  Application: Telephone exchange call detail record (CDR) management
    1. Data source: Call initiation request
    2. Partition key: Caller ID
    3. High-frequency operations: Real-time authorization (based on plan and balance)
    4. Lower-frequency operations: Fraud analysis/detection
  • Application: Airline reservation/ticketing
    1. Data source: Customers (web) and airline (web and internal systems)
    2. Partition key: Customer (flight info is replicated)
    3. High-frequency operations: Seat selection (lease system), add/drop seats, baggage check-in
    4. Lower-frequency operations: Seat availability/flight, flight schedule changes, passenger re-bookings on flight cancellations

Analytics Use Cases

Kevin Weil at Twitter is great at providing Hadoop use cases. At Twitter this includes counting big data with standard counts, min, max, std dev; correlating big data with probabilities, covariance, influence; and research on big data. Hadoop is on the fringe of NoSQL, but it's very useful to see what kind of problems are being solved with it (a small sketch after the list below illustrates the counting-style questions).
  • How many requests do we serve each day?
  • What is the average latency? 95% latency?
  • Grouped by response code: what is the hourly distribution?
  • How many searches happen each day at Twitter?
  • Where do they come from?
  • How many unique queries?
  • How many unique users?
  • Geographic distribution?
  • How does usage differ for mobile users?
  • How does usage differ for 3rd party desktop client users?
  • Cohort analysis: all users who signed up on the same day—then see how they differ over time.
  • Site problems: what goes wrong at the same time?
  • Which features get users hooked?
  • Which features do successful users use often?
  • Search corrections and suggestions (not done now at Twitter, but coming in the future).
  • What can we tell about a user from their tweets?
  • What can we tell about you from the tweets of those you follow?
  • What can we tell about you from the tweets of your followers?
  • What can we tell about you from the ratio of your followers/following?
  • What graph structures lead to successful networks? (Twitter’s graph structure is interesting since it’s not two-way)
  • What features get a tweet retweeted?
  • When a tweet is retweeted, how deep is the corresponding retweet tree?
  • Long-term duplicate detection (short term for abuse and stopping spammers)
  • Machine learning. About not quite knowing the right questions to ask at first. How do we cluster users?
  • Language detection (contact mobile providers to get SMS deals for users—focusing on the most popular countries at first).
  • How can we detect bots and other non-human tweeters?
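As mentioned before the list, here's a toy sketch of the counting-style questions (daily request counts grouped by response code) as a single-process map/reduce in Python. The log format is an assumption, and at Twitter's scale this would be a Pig or Hive job running on Hadoop rather than a script.

# Toy map/reduce over web log lines: count requests per (day, response code).
from collections import Counter

def mapper(line):
    # assumed log format: "2014-10-31T12:00:01 GET /timeline 200 87ms"
    date = line.split("T")[0]
    status = line.split()[3]
    return (date, status)

def reduce_counts(lines):
    return Counter(mapper(line) for line in lines)    # {(day, code): count}

sample = [
    "2014-10-31T12:00:01 GET /timeline 200 87ms",
    "2014-10-31T12:00:02 GET /search 500 120ms",
]
print(reduce_counts(sample))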

Poor Use Cases

  • OLTP. Outside VoltDB, complex multi-object transactions are generally not supported. Programmers are supposed to denormalize, use documents, or use other complex strategies like compensating transactions.
  • Data integrity. Most of the NoSQL systems rely on applications to enforce data integrity where SQL uses a declarative approach. Relational databases are still the winner for data integrity.
  • Data independence. Data outlasts applications. In NoSQL, applications drive everything about the data. One argument for the relational model is that it serves as a repository of facts that can last for the entire lifetime of the enterprise, far past the expected lifetime of any individual application.
  • SQL. If you require SQL, note that very few NoSQL systems provide a SQL interface, though more systems are starting to provide SQL-ish interfaces.
  • Ad-hoc queries. If you need to answer real-time questions about your data that you can’t predict in advance, relational databases are generally still the winner. 
  • Complex relationships. Some NoSQL systems support relationships, but a relational database is still the winner at relating.
  • Maturity and stability. Relational databases still have the edge here. People are familiar with how they work, what they can do, and have confidence in their reliability. There are also more programmers and toolsets available for relational databases. So when in doubt, this is the road that will be traveled.

Friday 17 October 2014

NoSQL or RDBMS? – Are we asking the right questions?

Most articles on the topic of NoSQL are around the theme of RDBMS vs. NoSQL. DBAs are defending the RDBMS by stating that NoSQL solutions are all dumb, immature data stores without any standards. Many NoSQL proponents react with the argument that the RDBMS does not scale and that today everybody needs to deal with huge amounts of data.
I think NoSQL is sold short here. Yes, Big Data plays a big role, but it is not the primary driver in all NoSQL solutions. There are no standards because there really is no single NoSQL solution, but rather different types of solutions that cater to different use cases. In fact nearly all of them state that theirs is not a replacement for a traditional RDBMS! When we compare RDBMS against them we need to do so on a use case basis. There are very good reasons for choosing an RDBMS as long as the amount of data is not prohibitive. There are, however, equally good reasons not to do so and to choose one of the following solution types:
  • Distributed Key Value Stores
  • Distributed Column Family Stores
  • (Distributed) Document Databases
  • Graph Databases
It has to be said, however, that there are very simple and specific reasons why traditional RDBMS solutions cannot scale beyond a handful of database nodes, and even that is painful. But before we look at why NoSQL solutions tend not to have that problem, we will take a look at why and when you should choose an RDBMS and when you shouldn’t.

When and Why you (should) choose an RDBMS

While data durability is an important aspect of an RDBMS it is not a differentiator compared to other solutions. So I will concentrate first and foremost on unique features of an RDBMS that also have impact on the application design and performance.
  • Table based
  • Relations between distinct Table Entities and Rows (the R in RDBMS)
  • Referential Integrity
  • ACID Transactions
  • Arbitrary Queries and Joins
If you really need all or most of these features then an RDBMS is certainly right for you, although the amount of data you have might force you in another direction. But do you really need them? Let’s look closer.
The table-based nature of an RDBMS is not a real feature, it is just the way it stores data. While I can think of use cases that specifically benefit from this, most of them are simple in nature (think of Excel spreadsheets). That nature however requires a relational concept between rows and tables in order to make up complex entities.

Data model showing two different kinds of relationships
There are genuine relations between otherwise standalone entities (like one person being married to another) and relationships that really define hierarchical context or ownership of some sort (A room is always part of a house). The first one is a real feature, the second is a result of the storage nature. It can be argued that a Document (e.g. an XML) stores such a “relation” more naturally because the House Document contains the Room instead of having the Room as a separate document.
Referential integrity is really one of the cornerstones of an RDBMS; it ensures the logical consistency of my domain model. Not only does it ensure consistency within a certain logical entity (which might span multiple rows/tables) but, more importantly, cross-entity consistency. If you access the same data via different applications and need to enforce integrity in a central location, this is the way to go. We could check this in the application as well, but the database often acts as the final authority on consistency.
The final aspect of consistency comes in the form of ACID transactions. They ensure that either all of my changes are seen by others in their entirety, or none of my changes are committed at all. Consistency really is the hallmark of an RDBMS. However we often set commit points for reasons other than consistency. How often did I use a bulk update for the simple reason of increased performance? In many cases I did not care about the visibility of those changes, but just wanted to have them done fast. In other cases we would deliberately commit more often in order to decrease locking and increase concurrency. The question is: do I care whether Peter shows up as married while Alice is still seen as unmarried? The government for sure does; Facebook on the other hand does not!
-- An example of the arbitrary queries and joins discussed below:
SELECT count(e.isbn) AS "number of books", p.name AS publisher
FROM editions AS e INNER JOIN
 publishers AS p ON (e.publisher_id = p.id)
GROUP BY p.name;
The final defining feature of an RDBMS is its ability to execute arbitrary queries: SQL selects. Very often NoSQL is understood as not being able to execute queries. While this is not true, it is true that RDBMS solutions do offer a far superior query language. Especially the ability to group and join data from unrelated entities into a new view on the data is something that makes an RDBMS a powerful tool. If your business is defined by the underlying structured data and you need the ability to ask different questions all the time, then this is a key reason to use an RDBMS.
However if you know how to access the data in advance, or you need to change your application in case you want to access it differently, then a lot of that advantage is overkill.

Why an RDBMS might not be right for you

These features come at the price of complexity in terms of data model, storage, data retrieval and administration. And, as we will see shortly, a built-in limit on horizontal scalability. If you do not need most (or any) of these features you should not use an RDBMS.
  • If you just want to store your application entities in a persistent and consistent way then an RDBMS is overkill. A Key Value Store might be perfect for you. Note that the Value can be a complex entity in itself!
  • If you have hierarchical application objects and need some query capability into them then any of the NoSQL solutions might be a fit. With an RDBMS you can use ORM to achieve the same, but at the cost of adding complexity to hide complexity.
  • If you ever tried to store large trees or networks you will know that an RDBMS is not the best solution here. Depending on your other needs a Graph database might suit you.
  • You are running in the cloud and need to run a distributed database for durability and availability. This is what Dynamo- and BigTable-based datastores were built for. An RDBMS, on the other hand, does not do well here.
  • You might already use a data warehouse for your analytics. This is not too dissimilar from a Column Family database. If your data grows too large to be processed on a single machine, you might look into Hadoop or any other solution that supports distributed Map/Reduce.
There are many scenarios where a fully ACID-driven, relational, table-based database is simply not the best or simplest option to go with. Now that we have that out of the way, let’s look at the big one: amount of data and scalability.

Why an RDBMS does not scale and many NoSQL solutions do

The real problem with RDBMS is the horizontal distribution of load and data. The fact is that RDBMS solutions can not easily achieve automatic data sharding. Data Sharding would require distinct data entities that can be distributed and processed independently. An ACID based relational database cannot do that due to its table based nature. This is where NoSQL solutions differ greatly. They do not distribute a logical entity across multiple tables, it’s always stored in one place. A logical entity can be anything from a simple value, to a complex object or even a full JSON document. They do not enforce referential integrity between these logical entities. They only enforce consistency inside a single entity and sometimes not even that.
NoSQL differs from RDBMS in the way entities get distributed and in that no consistency is enforced across those entities
This is what allows them to automatically distribute data across a large number of database nodes and also write them independently. If I were to write 20 entities to a database cluster with 3 nodes, chances are I can evenly spread the writes across all of them. The database does not need to synchronize between the nodes for that to happen and there is no need for a two-phase commit, with the visible effect that Client 1 might see changes on Node 1 before Client 2 has written all 20 entities. A distributed RDBMS solution on the other hand needs to enforce ACID consistency across all three nodes. That means that Client 1 will either not see any changes until all three nodes have acknowledged a two-phase commit, or will be blocked until that has happened. In addition to that synchronization the RDBMS also needs to read data from other nodes in order to ensure referential integrity; all of that happens during the transaction and blocks Client 2. NoSQL solutions do no such thing for the most part.
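A toy sketch of that idea: because each logical entity lives in exactly one place, a client (or the database itself) can route an entity to a node simply by hashing its key, with no cross-node coordination. The node list and key scheme here are invented for illustration.

# Hypothetical hash-based routing of whole entities to nodes.
import hashlib

NODES = ["node1:6379", "node2:6379", "node3:6379"]

def node_for(entity_key):
    digest = hashlib.md5(entity_key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# 20 entities spread across 3 nodes; each write touches exactly one node.
for i in range(20):
    print("entity:%d -> %s" % (i, node_for("entity:%d" % i)))

Real systems use consistent hashing or range partitioning rather than a plain modulo, so that adding or removing a node does not reshuffle every key.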
The fact that such a solution can scale horizontally also means that it can leverage its distributed nature for high availability. This is very important in the cloud, where every single node might fail at any moment.
Another key factor is that these solutions do not allow joins and groups across entities, as that would not be possible in a scalable way when your data ranges into the millions of entities and is distributed across 10 or more nodes. I think this is something that a lot of us have trouble with. We have to start thinking about how to access data and store it accordingly, and not the other way around.
So it is true that NoSQL solutions lack some of the features that define an RDBMS. They do so for the sake of scalability. That does not, however, mean that they are dumb datastores; Document, Column Family and Graph databases are far from unstructured and simple.

What about Application Performance?

The fact that all these solutions scale in principle, does however not mean that they do so in practice or that your application will perform better because of it! Indeed the overall performance depends to a very large degree on choosing the right implementation for your use case. Key/Value stores are very simple, but you can still use them wrong. Column Family Stores are very interesting and also very different from a table based design. Due to this it is easy to have a bad data model design and this will kill your performance.
Besides the obvious factors of disk I/O, network and caching (which you must of course take into consideration), both application performance and scalability depend heavily on the data itself; more specifically on the distribution across the database cluster. This is something that you need to monitor in live systems and take into consideration during the design phase as well. I will talk more about this and specific implementations in the coming months.
There is one other factor that will play a key role in the choice between NoSQL and more traditional databases. Companies are used to the RDBMS; they have experts and DBAs for it. NoSQL is new and not well understood yet. The administration is different. Performance tuning and analysis are different, as are the problem patterns that we see. More importantly, performance and setup are more than ever governed by the applications that use the database and not by index tuning.
Application Performance Management as a discipline is well equipped to deal with this. In fact, by looking at end-to-end application performance it can handle the different NoSQL solutions just like any other database.

Friday 15 August 2014

The State of NoSQL

After at least four years of tough criticism, it's time to come to an intermediate conclusion about the state of NoSQL. So many things have happened around NoSQL that it is hard to get an overview and assess which goals have been achieved and where NoSQL has failed to deliver.
In many fields NoSQL has been more than successful, in industry and academia alike. Universities are starting to understand that NoSQL must increasingly be adopted into the curriculum. It is simply not enough to teach database normalization up and down. This, of course, does not mean that a profound relational foundation is wrong. On the contrary, NoSQL is certainly a perfect and important addition.

What happened?

The NoSQL Space has exploded in just 4-5 years to about 50 to 150 new databases. nosql-database.org lists about 150 such databases, including some quite old but still strong dinosaurs like Object Databases. And, of course, some interesting mergers have happened, such as the CouchDB and Membase deal leading to CouchBase. But we will discuss each major system later in this article.

Many people have been assuming a huge consolidation in the NoSQL space. However this has not happened. The NoSQL space simply exploded and is still exploding. As with all areas in computer science - programming languages, for example - there are more and more gaps opening up for a huge number of databases. And this is all in line with the explosion of the Internet, big data, sensors and many more technologies to come, leading to more data and different requirements for how it is treated. In the past four years we saw only one significant system leave the stage: the German graph database Sones. The vast number of NoSQL databases continues to live happily either in the open-source space, without any considerable money turnaround, or in the commercial space.

Visibility and Money

Another important point is visibility and industry adoption. In this space we can see a huge difference between the old industry - protecting its investment - and the new industry, mostly startups. While nearly all of the hot web startups such as Pinterest or Instagram have a hybrid (SQL + NoSQL) architecture, the 'old' industry is still struggling with NoSQL adoption. But the observation here is that more and more of these companies are trying to cut out a part of their data streams to be processed and later analyzed with NoSQL solutions like Hadoop, MongoDB, Cassandra, etc.
And this also leads to strongly increased demand for developers and architects with NoSQL knowledge. A recent survey showed the following developer skills most requested by the industry:
  1. HTML5
  2. MongoDB
  3. iOS
  4. Android
  5. Mobile Apps
  6. Puppet
  7. Hadoop
  8. jQuery
  9. PaaS
  10. Social Media
So there are two NoSQL databases in the top ten technology requirements here, and even one ahead of iOS. If this isn't praise, what is?!
But NoSQL adoption is going faster and deeper than one might think at first glance. In a well-known whitepaper in the summer of 2011, Oracle stated that NoSQL DBs feel like an ice cream flavor, but you should not get too attached because it may not be around for too long. Only a few months later Oracle showed its Hadoop integration in a Big Data Appliance. And even more, we saw the launch of their own NoSQL database, a revised BerkeleyDB. Since then, there has been a race among all vendors to integrate Hadoop. Microsoft, Sybase, IBM, Greenplum, Pervasive, and many more already have a tight integration. A pattern that can be seen everywhere: can't fight it, embrace it.
But one of the strongest yet quietest signs of broad NoSQL adoption is that NoSQL databases are becoming a PaaS standard. Thanks to the easy setup and management of many NoSQL databases, DBs like Redis or MongoDB can be seen in dozens of PaaS services such as Cloud Foundry, OpenShift, dotCloud, Jelastic, etc. As everything moves more and more into the cloud this gives NoSQL huge momentum to put pressure on classic relational databases. Having the choice of either MySQL/Postgres or MongoDB/Redis, for example, will force developers to think twice about their model and requirements, and raises other important questions.
An interesting indicator for technologies is also the ThoughtWorks radar which always contains a lot of interesting stuff, even if you do not fully agree with everything contained in it. Let's have a look at their radar from October 2012 in picture 1:
Picture 1: ThoughtWorks Radar, October, 2012 - Platforms
In their platform quadrant they list five databases:
  1. Neo4j (adopt)
  2. MongoDB (trial but close to adopt)
  3. Riak (trial)
  4. CouchBase (trial)
  5. Datomic (assess)
If you look at this you see that at least four of these have received a lot of venture capital. If you add up all the venture capital in the entire NoSQL space you will surely arrive at something between $100M and a billion dollars! Neo4j is one of these examples, receiving $11M in Series B funding. Other companies that received $10-30M in funding were Aerospike, Cloudera, DataStax, MongoDB, CouchBase, etc. But let's have a look at the list again: Neo4j, MongoDB, Riak and CouchBase have been in this space for the last four years and have constantly proven to be among the market leaders for specific requirements. Then DB number 5, Datomic, is more than astonishing: a completely new database, with a completely new paradigm, written by a small team. It must be really hot stuff, and we will dig into it a bit later when discussing all the DBs briefly.

Standards

Many people have asked for NoSQL standards, failing to see that NoSQL covers a really wide range of models and requirements. Hence unified languages for all major areas such as Wide Column, Key/Value, Document and Graph Databases will surely not be available for a long time because it's impossible to cover all areas. Several approaches, such as Spring Data, try to add a unified layer but it's up to the reader to test if this layer is a leap forward in building a polyglot persistence environment or not.
Mostly the graph and the document databases have come up with standards in their own domains. The graph world is more successful with TinkerPop Blueprints, Gremlin, SPARQL, and Cypher. In the document space we have UnQL and Jaql filling some niches, although the first lacks real-world support by a NoSQL database. But with the force of Hadoop, many projects are working on bridging famous ETL languages such as Pig or Hive to other NoSQL databases. So the standards world is highly fragmented, but only because NoSQL luckily covers a very wide area.

Landscape

One of the best overviews of the database landscape has been given by Matt Aslett in a report of the 451 Group. He recently updated his picture giving us more insights to the categories he mentioned. As you can see in the following picture, the landscape is highly fragmented and overlapping:
Picture 2: The database landscape by Matt Aslett (451 group)
As you can see there are several dimensions in one picture: Relational vs. Non-relational, Analytic vs. Operational, NoSQL vs. NewSQL. The last two categories have the well-known sub-categories Key-Value, Document, Graph and Big Tables for NoSQL, and Storage Engines, Clustering/Sharding, New Databases and Cloud Service Solutions for NewSQL. The interesting part of this picture is that it is increasingly difficult to put a database in an exact location. Everyone is now trying fiercely to integrate features from databases found in other spaces. NewSQL systems implement core NoSQL features. NoSQL systems try more and more to implement 'classic' features such as SQL support, ACID, or at least often configurable persistence.
It all started with the Hadoop integration that tons of relational databases now offer. But there are many other examples: MarkLogic, for instance, is now starting to ride the JSON wave and is thus also hard to position. Furthermore, more multi-model databases are appearing, such as ArangoDB, OrientDB or AlchemyDB (which is now a part of the promising Aerospike DB). They allow you to start with one database model (e.g. the document/JSON model) and add other models (graph or key-value) as new requirements pop up.

Books

Another wonderful sign of growing maturity is the book market. After two German books published in 2010 and 2011 we saw the Wiley book by Shashank Tiwari, structured like a hurricane and full of great deep insights. The race continued with two nice books in 2012. 'Seven Databases in Seven Weeks' is surely a masterpiece: freshly written and full of practical hands-on insights, it takes six famous NoSQL databases and adds PostgreSQL to the mix, making it a top recommendation. On the other side, Pramod J. Sadalage and Martin Fowler take a more holistic approach, covering all the characteristics and helping you evaluate your path and decisions with NoSQL.
But there is more to come. It is just a matter of time till a Manning book appears on the scene: Dan McCreary and Ann Kelly are writing a book called: "Making Sense of NoSQL" and the first MEAP chapters are already available.
After starting with concepts and patterns, their chapter 3 will surely look attractive:
  • Building NoSQL Big Data solutions
  • Building NoSQL search solutions
  • Building NoSQL high availability solutions
  • Using NoSQL to increase agility
This is a new fresh approach and will surely be worth reading.

State of the Leaders

Let's give each NoSQL leader a quick consideration. As one of the clear market leaders, Hadoop is a strange animal. On one hand it has an enormous momentum. As mentioned before, each classic database vendor is in a hurry to announce Hadoop support. Companies such as Cloudera and MapR continue to grow and new Hadoop extensions and successors are announced every week. 
Even Hive and Pig continue to gain better acceptance. Nevertheless, there is a fly in the ointment: companies still complain about an unstructured mess (reading and parsing files could be even faster), MapReduce is far 'too batch' (even Google is moving away from it), management is still hard, there are stability issues, and local training/consultants are still hard to find. But even if some of these issues can be addressed, it is still an open question whether Hadoop will grow as it is or will change dramatically.
The second leader, MongoDB, also suffers from flame wars, and it might be the nature of things that leading DBs get the most criticism. Nevertheless, MongoDB goes at a fast pace and criticism mostly is:
a) concerning old versions or
b) due to the lack of knowledge on how to deal with it in a correct way. This recently culminated in absurd complaints that the 32 bit version can only handle 2GB, although MongoDB states this clearly in the download section and recommends the 64 bit version.
Anyway, MongoDB's partnerships and funding rounds push ambitious roadmaps with hot stuff:
  • the industry called for some security / LDAP features which are currently being developed
  • full text search will be in soon
  • V8 for MapReduce is coming
  • locking at a finer level than collection-level locking will come
  • and a Hash Shard Key is on the way
This last point especially catches the interest of many architects. MongoDB was often blamed (also by competitors) for not implementing consistent hashing, which is not entirely correct because such a key can easily be defined. But in the future there will be a config option for a hashed shard key. This means it is up to the user to decide whether a hashed key for sharding is useful or whether he needs the (perhaps even rare) advantages of selecting his own sharding key. Surely this increases the pressure on other vendors and will lead to fruitful discussion about when to use which kind of sharding key.
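For illustration, this is roughly what declaring a hashed shard key could look like from pymongo, assuming a MongoDB version that ships the feature described above; the database and collection names are made up.

# Hypothetical setup: shard a collection by the hash of user_id via mongos.
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")

client.admin.command("enableSharding", "appdb")
client.admin.command("shardCollection", "appdb.users", key={"user_id": "hashed"})

# Documents are now distributed by the hash of user_id rather than its raw
# value, which avoids hot shards when the key itself increases monotonically.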
Cassandra is next in line and doing quite well, adding more and nicer features such as better querying. However, rumors persist that running a Cassandra cluster is not a piece of cake and requires some hard work. But the most interesting development here is surely DataStax. The new company on top of Cassandra - with $25 million in Series C funding - is mostly addressing analytics and some operational issues. The analytics especially was a surprise for many, because in the early days Cassandra was not known as a powerful query machine. But as this has changed in the latest versions, the query capabilities may be sufficient for some modern analytics.
The development speed of Redis is also remarkable. Despite Salvatore’s assertions that he would have achieved nothing without the community and the help of Pieter Noordhuis, it still looks like a stunning one-man show. The Sentinel failover and server-side scripting with the Lua programming language are the latest achievements. The decision for Lua was a bit of a shock for the community because everyone else integrates JavaScript as a server-side language. Nevertheless, Lua is a neat language and will open up a whole new range of possibilities for Redis.
CouchBase also looks like a brilliant solution in terms of scalability and latency, despite the strong winds that Facebook and hence Zynga are now facing. It's surely not a hot query machine, but if they can improve querying in the future the portfolio will be quite complete. The merger with the CouchDB founders was definitely a strong step, and it's worthwhile to see the great influence of CouchDB in CouchBase. At every database conference it's also funny to hear the discussions about whether CouchDB is doing better or worse since Damien, Chris and Jan left. One can only hear extreme opinions here. But who cares, as long as the DB is doing fine. And it looks like it does.
The last NoSQL DB to be mentioned here is of course Riak, which also improved dramatically in functionality and monitoring. It continues to have a good reputation mostly in terms of stability: "rock solid, invisible and good for your sleep". The Riak CS fork also looks interesting in terms of the modularity of this technology.

Interesting Newcomers

Beside the market leaders, newcomers are always interesting to evaluate. Let's dig into some of them.
Elastic Search surely is one of the hottest new NoSQL products and just got $10M in Series A funding, and for good reason. As a scalable search engine on top of Lucene it brings many advantages: a) a company on top providing services and b) leverage of everything Lucene has achieved in recent years. It will surely infiltrate the industry now more than before, attacking all the big players in the semi-structured information space.
Google also sent its small but fast LevelDB into the field. It serves as a basis for many uses with specific requirements such as compression integration. Even Riak integrated LevelDB. It remains to be seen when all the new Google-internal databases such as Dremel or Spanner will find their way out as open-source projects (like Apache Drill or Cloudera Impala).
Another tectonic shift surely was DynamoDB at the start of 2012. They call it the fastest growing service ever launched at Amazon. It's the ultimate scaling machine. New features are coming slowly but the focus on SSDs and latency is quite amazing.
Multi-model databases are also a field worth having a look at. OrientDB, their famous representative, is by no means a newcomer, but it is improving its capabilities quite fast. Perhaps too fast, because some customers might now be happy that OrientDB has reached version 1.0 and thus hopefully gained a lot more stability. Graph, document and key-value support combined with transactions and SQL are reasons enough to give it a second try. Its good SQL support especially makes it interesting for analytic solutions such as Pentaho. Another newcomer in this space is ArangoDB. It is moving fast and doesn't flinch from comparing itself in benchmarks against the established players.
However, again the native JSON and graph support saves a lot of effort if new requirements have to be implemented and the new data has a different model that must be persisted.
By far the biggest surprise this year was Datomic. Written by some rock stars of the Clojure programming language in an incredibly short time, it unveils a whole bunch of new paradigms. Furthermore, it has made its way into the ThoughtWorks radar with the recommendation to have a look at it. And although it is 'just' a layer on top of established databases it brings a huge number of advantages, such as:
  • transactions
  • a time machine
  • a fresh and powerful query approach
  • a new schema approach
  • caching & scaling features
Currently, DynamoDB, Riak, CouchBase, Infinispan and SQL are supported as the underlying storage engine. It even allows you to mix and query different DBs simultaneously. Many veterans have been surprised that such a radical paradigm shift can be possible. Luckily it is.

Summary

To conclude, let us address three points:
  1. Some new articles by Eric Brewer on the CAP theorem should have come several years earlier. In these articles he states that "2 of 3" is misleading, explaining the reasons why the world is more complicated than a simple CP/AP, i.e. ACID/BASE, choice. Nevertheless, thousands of talks and articles kept on praising the CAP theorem without any critical review for years. Michael Stonebraker was the strongest critic of the NoSQL movement (and the NoSQL space owes him a lot), pointing to these issues years ago! Unfortunately, not many were listening. But now that Eric Brewer has updated his theorem, the time of simple CAP statements is definitely over. Please be at the very front in pointing out the true and diverse CAP implications.
  2. As we all know, the weaknesses of classical relational databases have led to the NoSQL field. But it was just a matter of time before the empire struck back. Under the term "NewSQL" we can see a bunch of new engines (such as database.com, VoltDB, GenieDB, etc.; see picture 2) improving classic solutions, sharding, and cloud solutions, thanks to the NoSQL movement.
    But as many DBs try to implement every feature, clear frontiers vanish.
    The decision for a database is getting more complicated than ever.
    You have to know about 50 use cases, 50 DBs and you should answer at least 50 questions. The latter have been gathered by the author in over 2 years of NoSQL consulting and can be found here: Select the Right Database, Choosing between NoSQL and NewSQL.
  3. It's common wisdom that every technology shift - since client-server and before - is about ten times more costly to switch to: for example, switching from mainframe to client-server, client-server to SOA, SOA to the web, RDBMS to hybrid persistence, etc. As a consequence, many companies hesitate and struggle to add NoSQL to their portfolio. But it is also known that the early adopters who try to get the best out of both worlds, and thus integrate NoSQL quickly, will be better positioned for the future. In this regard, NoSQL solutions are here to stay and will always be a worthwhile area for evaluation.

About the Author

Prof. Dr. Stefan Edlich is a senior lecturer at Beuth University of Applied Sciences Berlin. He wrote more than 10 IT books for publishers such as Apress, O'Reilly, Spektrum/Elsevier and others. He runs the NoSQL Archive, did NoSQL consulting, organizes NoSQL conferences, wrote the world’s first two NoSQL books and is addicted to the Clojure programming language.
