Friday, 15 August 2014

Cassandra Mythology

Like the prophetess of Troy it was named for, Apache Cassandra has seen some myths accrue around it. Like most myths, these were once at least partly true, but have become outdated as Cassandra evolved and improved. In this article, I'll discuss five common areas of concern and clarify the confusion.

Myth: Cassandra is a map of maps

As applications using Cassandra became more complex, it became clear that schema and data typing make development and maintenance much easier at scale than "everything is a bytebuffer," or "everything is a string."
Today, the best way to think of Cassandra's data model is as tables and rows. As in a relational database, Cassandra columns are strongly typed and indexable.
Other things you might have heard:
  • "Cassandra is a column database."Column databases store all values for a given column together on disk. This makes them suitable for data warehouse workloads, but not for running applications that require fast access to specific rows.
  • "Cassandra is a wide-row database."There is a grain of truth here, which is that Cassandra's storage engine is inspired by Bigtable, the grandfather of wide-row databases. But wide-row databases tie their data model too closely to that storage engine, which is easier to implement but more difficult to develop against, and prevents manyoptimizations.
One of the reasons we shied away from "tables and rows" to start with is that Cassandra tables do have some subtle differences from the relational ones you're familiar with. First, the first element of a primary key is the partition key: rows in the same partition are stored by the same replicas, and within a partition, rows are clustered (ordered) by the remaining primary key columns.
Second, Cassandra does not support joins or subqueries, because joins across machines in a distributed system are not performant. Instead, Cassandra encourages denormalization to get the data you need from a single table, and provides tools like collections to make this easier.
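To make the first difference concrete, here is a small illustrative table (the names are made up for the example) whose primary key combines a partition key with a clustering column:
CREATE TABLE user_logins (
  user_id uuid,          -- partition key: all of a user's logins live on the same replicas
  login_time timestamp,  -- clustering column: rows within the partition are ordered by it
  ip_address text,
  PRIMARY KEY (user_id, login_time)
);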
For example, consider the users table shown in the following code example:
CREATE TABLE users (
  user_id uuid PRIMARY KEY,
  name text,
  state text,
  birth_year int
);
Most modern services understand now that users have multiple email addresses. In the relational world, we'd add a many-to-one relationship and correlate addresses to users with a join, like the following example:
CREATE TABLE users_addresses (
  user_id uuid REFERENCES users,
  email text
);

SELECT *
FROM users NATURAL JOIN users_addresses;
In Cassandra, we'd denormalize by adding the email addresses directly to the users table. A set collection is perfect for this job:
ALTER TABLE users ADD email_addresses set<text>;
We can then add addresses to a user record like this:
UPDATE users
SET email_addresses = {'jbe@gmail.com', 'jbe@datastax.com'}
WHERE user_id = 73844cd1-c16e-11e2-8bbd-7cd1c3f676e3;
See the documentation for more on the Cassandra data model, including self-expiring data and distributed counters.
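As a quick taste of those two features, here is a small sketch; the tables shown are made up for illustration:
-- Self-expiring data: this session row disappears automatically after 24 hours
CREATE TABLE user_sessions (
  session_id uuid PRIMARY KEY,
  user_id uuid
);
INSERT INTO user_sessions (session_id, user_id)
VALUES (09a2c7c2-aaaa-11e2-bbbb-7cd1c3f676e3, 73844cd1-c16e-11e2-8bbd-7cd1c3f676e3)
USING TTL 86400;

-- Distributed counters: increments are applied without a read-before-write
CREATE TABLE page_views (
  page text PRIMARY KEY,
  views counter
);
UPDATE page_views SET views = views + 1 WHERE page = '/home';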

Myth: Cassandra is slow at reads

Cassandra's log-structured storage engine makes it exceptionally fast at writes: it does not seek for updates on hard disks and does not cause write amplification on solid-state disks. But it is also fast at reads.
Consider the throughput numbers from the random-access read, random-access and sequential-scan, and mixed read/write workloads in the University of Toronto's NoSQL benchmark (see the published results for the charts).
The Endpoint benchmark comparing Cassandra, HBase, and MongoDB corroborates these results.
How does this work? At a very high level, Cassandra's storage engine looks similar to Bigtable's and uses some of the same terminology. Updates are appended to a commitlog, then collected into a "memtable" that is eventually flushed to disk and indexed as an "sstable."
Naive log-structured storage engines do tend to be slower at reads, for the same reason they are fast at writes: new values for a row do not overwrite the old ones in place, but must be merged in the background by compaction. So in the worst case, you have to check multiple sstables to retrieve all the columns of a "fragmented" row.
Cassandra layers several improvements on this basic design to achieve good read performance, including per-sstable bloom filters that let a read skip sstables that cannot contain the requested partition, key and row caches, and compaction strategies that bound how fragmented a row can become.

Myth: Cassandra is hard to run

There are three aspects to running a distributed system that tend to be more complicated than running a single-machine database:
  1. Initial deployment and configuration
  2. Routine maintenance such as upgrading, adding new nodes, or replacing failed ones
  3. Troubleshooting
Cassandra is a fully distributed system: every machine in a Cassandra cluster has the same role. There are no metadata servers that have to fit everything in memory. There are no configuration servers to replicate. There are no masters, and no failover. This makes every aspect of running Cassandra simpler than the alternatives. It also means that bringing up a single-node cluster to develop and test against is trivial, and behaves exactly the way a full cluster of dozens of nodes would.
Initial deployment is the least important concern here in one sense: other things being equal, even a relatively complex initial setup will be insignificant when amortized over the lifetime of the system, and automated installation tools can hide most of the gory details. But! If you can barely understand a system well enough to install it manually, there's going to be trouble when you need to troubleshoot a problem, which requires much more intimate knowledge of how all the pieces fit together.
Thus, my advice would be to make sure you understand what is going on during installation, as in this two-minute example of setting up a Cassandra cluster, before relying on tools like the Windows MSI installer, OpsCenter provisioning, or the self-configuring AMI.
Cassandra makes routine maintenance easy. Upgrades can be done one node at a time. While a node is down, other nodes will save updates that it missed and forward them when it comes back up. Adding new nodes is parallelized across the cluster; there is no need to rebalance afterwards.
Even dealing with longer, unplanned outages is straightforward. Cassandra's active repair is like rsync for databases, only transferring missing data and keeping network traffic minimal. You might not even notice anything happened if you're not paying close attention.
Cassandra's industry-leading support for multiple datacenters even makes it straightforward to survive an entire AWS region going down or losing a datacenter to a hurricane.
Finally, DataStax OpsCenter simplifies troubleshooting by making the most important metrics in your cluster available at a glance, allowing you to easily correlate historical activity with the events causing service degradation. The DataStax Community Edition Cassandra distribution includes a "lite" version of OpsCenter, free for production use. DataStax Enterprise also includes scheduled backup and restore, configurable alerts, and more.

Myth: Cassandra is hard to develop against

The original Cassandra Thrift API achieved its goal of giving us a cross-platform base for a minimum of effort, but the result was admittedly difficult to work with. CQL, Cassandra's SQL dialect, replaces that with an easier interface, a gentler learning curve, and an asynchronous protocol.
CQL has been available to early adopters since version 0.8, two years ago; with the release of version 1.2 in January, CQL is production ready, with many drivers available and better performance than Thrift. DataStax is also officially supporting the most popular CQL drivers, which helps avoid the sometimes indifferent support seen with the community Thrift drivers.
Patrick McFadin's Next Top Data Model presentations (one, two) are a good introduction to CQL beyond the basics in the documentation.

Myth: Cassandra is still bleeding edge

From an open source perspective, Apache Cassandra is now almost five years old and has many releases under its belt, with version 2.0 coming up in July. From an enterprise point of view, DataStax provides DataStax Enterprise, which includes a certified version of Cassandra that has been specifically tested, benchmarked, and approved for production environments.
Businesses have seen the value that Cassandra brings to their organizations. Over 20 of the Fortune 100 rely on Cassandra to power their mission-critical applications in nearly every industry, including financial, health care, retail, entertainment, online advertising and marketing.
The most common reason to move to Cassandra is when the existing technology can't scale to the demands of modern big data applications. Netflix, the largest cloud application in the world, moved 95% of their data from Oracle to Cassandra. Barracuda Networks replaced MySQL with Cassandra when MySQL couldn't handle the volume of requests needed to combat modern spammers. Ooyala handles two billion data points every day, on a Cassandra deployment of more than two petabytes.
Cassandra is also augmenting or replacing legacy relational databases that have proven too costly to manage and maintain. Constant Contact's initial project with Cassandra took three months and $250,000, compared to nine months and $2,500,000 on their traditional RDBMS. Today they have six clusters and more than 100TB of data trusted to Cassandra.
Many other examples can be found in DataStax's case studies and Planet Cassandra’s user interviews.

Not a myth: the 2013 Cassandra Summit in San Francisco

We just completed the best way to learn more about Cassandra, with over 1100 attendees and 65 sessions from Accenture, Barracuda Networks, Blue Mountain Capital, Comcast, Constant Contact, eBay, Fusion-io, Intuit, Netflix, Sony, Splunk, Spotify, Walmart, and more. Slides are already up; follow Planet Cassandra to be notified when videos are available. 
 

About the Author

Jonathan Ellis is CTO and co-founder at DataStax. Prior to DataStax, he worked extensively with Apache Cassandra while employed at Rackspace. Prior to Rackspace, Jonathan built a multi-petabyte, scalable storage system based on Reed-Solomon encoding for backup provider Mozy.

Apache MetaModel – Providing Uniform Data Access Across Various Data Stores

Recently, Human Inference and the Apache Software Foundation (ASF) announced the donation of the MetaModel project and its acceptance into the Apache Incubator. Previously, MetaModel had been available under the LGPL license and governed by Human Inference's product development; it is now moving to the ASF, where it gets a new license, community, and governance. So what is this project all about, and what is it useful for?
MetaModel is a Java library that aims to provide a single interface for interacting with any data store, be it a relational database, a NoSQL database, a spreadsheet file, or another file format. By interacting we mean exploring metadata, querying, and writing or changing the data contained in the store. Any abstraction like this will leave out details, so there is a risk of over-generalizing and losing important features. You would not want the functionality of your relational SQL database reduced to full-table-scan queries (SELECT * FROM [table]). On the other hand, you would not want to expose functionality that is unlikely to work on anything but your specific SQL server brand and version. Lastly, you would want to build on existing, common skills for interacting with data, such as SQL.

Dealing with Metadata

So what, then, is the approach to data store abstraction chosen by the MetaModel project? The project exposes a query model through Java interfaces (or, optionally, parsed from a String) that is very similar to SQL. Since the query is defined as a regular Java object, it can be interpreted easily and, depending on the underlying technology, the best strategy for actually executing it will be chosen. This means that MetaModel not only provides an interface; it also includes a full query engine that can be fitted to take care of some or all of the tasks involved in handling a query. In the case of relational JDBC databases, 99% of the query execution will still happen in the database's native engine. But with MetaModel you can also fire the same queries at a CSV file or an Excel spreadsheet, and thereby use MetaModel's query engine to properly slice and dice the data. You won't have to change the query at all.
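To make that concrete, here is a small sketch of the very same query string fired at a CSV file and at a JDBC database, using the DataContext types and the parseQuery method introduced below (the file name and table are made up for the example):
          // the same logical query: executed by MetaModel's own engine for the CSV file,
          // and pushed down to the database's native engine for JDBC
          UpdateableDataContext csv = new CsvDataContext(new File("order_lines.csv"));
          DataContext jdbc = new JdbcDataContext(dataSource); // dataSource obtained elsewhere

          String sql = "SELECT SUM(price), order_id FROM order_lines GROUP BY order_id";
          DataSet csvResult = csv.executeQuery(csv.parseQuery(sql));
          DataSet jdbcResult = jdbc.executeQuery(jdbc.parseQuery(sql));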

Of course, this is assuming the metadata and the structure of your data stores are compatible. Different data stores have different ways of exposing or inferring their metadata. JDBC databases typically expose theirs through the JDBC metadata API. File formats like CSV and Excel sheets are a bit less well-defined; their metadata is discovered by reading the header lines of the files. And, at the extreme end, several NoSQL databases explicitly have no metadata at all. MetaModel gives you the option of specifying the metadata programmatically or inferring it by inspecting the first N records of the data store.
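For a store with no metadata of its own, the programmatic route might look roughly like this; SimpleTableDef and the MongoDbDataContext constructor taking table definitions are assumptions about the API that may differ between MetaModel versions:
          // describe the structure of a MongoDB collection up front, since the store
          // itself has no schema to read (class and constructor names are assumptions)
          SimpleTableDef tweets = new SimpleTableDef("tweets",
              new String[] { "_id", "user", "text" },
              new ColumnType[] { ColumnType.VARCHAR, ColumnType.VARCHAR, ColumnType.VARCHAR });
          DataContext mongo = new MongoDbDataContext(mongoDb, new SimpleTableDef[] { tweets }); // mongoDb: a connected MongoDB handle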
MetaModel’s most central construct is the DataContext interface, which represents the data store and is used to explore and query it. Additionally, the UpdateableDataContext sub-interface is available for writeable data stores where updates to the data can be performed. The whole library can more or less be learned using basic code-completion, once you just ensure you have a DataContext instance. Here are a couple of examples of common DataContext implementations and how to instantiate them:
          // a DataContext for a CSV file
          UpdateableDataContext csv = new CsvDataContext(new File("data.csv"));

          // a DataContext for an Excel spreadsheet
          UpdateableDataContext excel = new ExcelDataContext(new File("spreadsheet.xlsx"));

          // a DataContext for a JDBC database (can use either DataSource or Connection)
          java.sql.DataSource dataSource = ...
          UpdateableDataContext jdbc = new JdbcDataContext(dataSource);

          // a DataContext for an XML file (where metadata is automatically inferred)
          DataContext xml = new XmlDomDataContext(new File("data.xml"));

          // a DataContext for connecting to Salesforce.com's data web services
          UpdateableDataContext salesforce =
              new SalesforceDataContext(username, pw, securityToken);

          // an in-memory DataContext for POJOs (useful for testing and mocking)
          Person record1 = ...
          Person record2 = ...
          TableDataProvider tableDataProvider = new ObjectTableDataProvider(
             "persons", Person.class, Arrays.asList(record1, record2));
          UpdateableDataContext pojos = new PojoDataContext("schema", tableDataProvider);
Metadata is important to MetaModel not only for exploring the data structure, but also for defining queries. A lot of effort went into making sure that queries are safe to fire, as long as you are working with proper metadata. So before querying, the first thing you need to do as a developer is get hold of the metadata objects. For instance, if you know there's a table called ORDER_LINES with a price column and an order_id column, then the metadata needed for querying it can be resolved in a hardcoded manner (which obviously only works when you know the data store):
          DataContext dataContext = ... // the DataContext object represents the 'connection'
          Table orderLines = dataContext.getTableByQualifiedLabel("ORDER_LINES");
          Column price = orderLines.getColumnByName("price");
          Column orderId = orderLines.getColumnByName("order_id");
However, the API also allows you to discover the metadata dynamically. This is useful for applications where you wish to present the user with the available tables, columns, etc. and let the user make choices that affect the query:
          Schema[] schemas = dataContext.getSchemas();
          Table[] tables = schemas[0].getTables();
          Column[] columns = tables[0].getColumns();
Another important aspect of MetaModel is to treat metadata, queries and other entities around data interaction as objects. A query in MetaModel is a regular Java object that you can manipulate and pass around before execution. This enables applications to create complex workflows where different pieces of the code participate in the building and optimization of the query plan without having to turn to e.g. tedious SQL string manipulation. It also helps with type-safety since e.g. the Query model is based on type-safe constructions like columns, tables, etc., instead of vague String literals.
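For instance, using only the query API shown in the next section, one piece of code can assemble a query and another can refine the same object before it is finally executed; a small sketch:
          // one component assembles the base query...
          Query q = dataContext.query().from(orderLines).select(orderId).toQuery();

          // ...and another component tweaks the same object later, e.g. capping the
          // result size, before anything is executed. No SQL strings are concatenated.
          q.setMaxRows(10);
          DataSet dataSet = dataContext.executeQuery(q);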

Querying a Data Store

So, let’s see how a query in MetaModel looks like.
There are three ways you can go about firing the same query:
1. Compose it from scratch: This is the traditional POJO-oriented way of doing it. It's quite verbose, but allows for all the flexibility you want.
         Query q = new Query();  
         q.select(SUM, price);  
         q.select(orderId);  
         q.from(orderLines);  
         q.groupBy(orderId);  
         q.setMaxRows(100);  
         DataSet dataSet = dataContext.executeQuery(q);
2. Use the fluent Builder API: The Builder API was added to allow another type-safe way of querying, but with less verbosity. Through the builder pattern, this API also gives the developer some guidance about which clauses of the query to logically fill in next. This is the preferred way of defining a query when a single component is responsible for building it:
         Query q = dataContext.query().from(orderLines)  
                             .select(SUM, price).and(orderId)  
                             .groupBy(orderId).maxRows(100).toQuery();  
         DataSet dataSet = dataContext.executeQuery(q);
3. Have it parsed from a String: 
Sometimes you might want to cut corners and fall back to a more traditional SQL-string approach. MetaModel can also parse queries from plain strings, but this comes with the risk of less type-safety, since string queries can only be validated at runtime.
         Query q = dataContext.parseQuery(  
             "SELECT SUM(price), order_id FROM order_lines GROUP BY order_id LIMIT 100");
         DataSet dataSet = dataContext.executeQuery(q);
As you can see, the end result of all three approaches is a DataSet, an object type representing the tabular query result. Without going into too much detail about all the DataSet features, you can simply iterate through it like this:
try {
   while (dataSet.next()) {
       Row row = dataSet.getRow();
       System.out.println(row.toString());
   }
} finally {
   dataSet.close();
}
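Individual cells can also be read in a typed, metadata-driven way, for example by reusing the orderId Column object from earlier inside that loop:
       // look up a cell by its Column object rather than by positional index
       Object id = row.getValue(orderId);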

Updating a Data Store

Executing updates with MetaModel follows a similarly type-safe and metadata-driven approach. As mentioned above, not all data stores are writeable, which is why you need a DataContext object that also implements the UpdateableDataContext interface. Given that, let's try to update the order data in our example:
         dataContext.executeUpdate(new UpdateScript() {  
             @Override  
             public void run(UpdateCallback cb) {  
                 // insert a new order line  
                 cb.insertInto(orderLines).value(orderId, 123).value(price, 395).execute();   

                 // update the price of orderlines where order_id = 122  
                 cb.update(orderLines).where(orderId).eq(122).value(price, 295).execute();  
             }  
         });
Notice here that the UpdateScript is the construct that sets the logical transaction boundaries. Depending on the underlying data technology, an appropriate transactional strategy will be applied; JDBC databases will apply ACID transactions, most file formats will use synchronized writing and so on. The net result is that you have a single syntax for writing data in all your data stores.
The syntax here isn't particularly pretty because of the anonymous inner class; this will obviously improve with closures in Java 8. But if you're only looking to do a single operation, a couple of convenient prebuilt UpdateScript classes are available out of the box:
         dataContext.executeUpdate(  
             new InsertInto(orderLines).value(orderId, 123).value(price, 395));
Furthermore, the executeUpdate method can be used to create and drop tables, as well as delete records.
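A sketch of what that might look like, assuming the UpdateCallback exposes createTable, deleteFrom, and dropTable builders analogous to the insertInto and update builders shown above (treat the exact method names and chaining as assumptions that may vary between versions):
         dataContext.executeUpdate(new UpdateScript() {
             @Override
             public void run(UpdateCallback cb) {
                 // create an archive table with two columns (assumed builder-style API)
                 Table archive = cb.createTable(orderLines.getSchema(), "order_lines_archive")
                     .withColumn("order_id").ofType(ColumnType.INTEGER)
                     .withColumn("price").ofType(ColumnType.INTEGER)
                     .execute();

                 // delete the order lines belonging to order 122
                 cb.deleteFrom(orderLines).where(orderId).eq(122).execute();

                 // drop the archive table again
                 cb.dropTable(archive).execute();
             }
         });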

Adding Support for New Data Stores

Finally, the expert user of MetaModel might ask 'What if I want to connect to [XYZ]?' (where XYZ is an exotic data store that we don't support yet). Obviously we want MetaModel to be easy to extend; that was part of the reason to make the query engine pluggable. What you need to do is build your own implementation of the DataContext interface, but that's not a trivial thing to do from the ground up. So we provide an abstract implementation with a number of extension points. Here's the walk-through (a skeletal code sketch follows the list):
  • Let your class extend the abstract class QueryPostprocessDataContext. You will see that you need to implement a few abstract methods:
    • getMainSchema() 
      This method should be implemented to provide the schema model that your DataContext exposes.
    • materializeMainSchemaTable(Table, Column[], int) 
      This method should be implemented to provide the equivalent of a full table scan for a particular table.
  • Your DataContext is now functional and you can start using MetaModel with your exotic data store!
  • But let's also optimize it a bit! Although our DataContext is now fully functional, it might not perform well for certain queries, since MetaModel's query engine has to rely on the materializeMainSchemaTable(...) method as the source for handling almost any query. Here are a couple of methods that you might additionally want to override:
    • executeCountQuery(...) 
      Many data stores have a simple way of determining the count of records for a particular table. Since this is also a common query type, overriding this method will often help.
    • materializeMainSchemaTable(Table, Column[], int, int) 
      Oftentimes people scan through table data in pages (from record X to record Y). The extra int arguments of this method allow you to optimize for that, for example by fetching a single page of data instead of the full table.
    • executeQuery(Query) 
      If you want to optimize further, for instance to handle WHERE or GROUP BY clauses yourself, override this method. Beware of the many corner cases though, since the Query argument is a rich object type. Good examples for inspiration include the SalesforceDataContext and MongoDbDataContext classes found in the source code.
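To make the walk-through concrete, here is a minimal skeleton of such an implementation. It is only a sketch: it assumes the extension points named above plus a getMainSchemaName() counterpart, and it uses MetaModel's mutable schema classes (MutableSchema, MutableTable, MutableColumn), whose exact constructors may differ between versions.
public class ExoticDataContext extends QueryPostprocessDataContext {

    @Override
    protected String getMainSchemaName() {
        // the name of the single schema this data store exposes
        return "exotic";
    }

    @Override
    protected Schema getMainSchema() {
        // describe the structure of the exotic store (constructors are assumptions)
        MutableSchema schema = new MutableSchema(getMainSchemaName());
        MutableTable table = new MutableTable("my_table", TableType.TABLE, schema);
        table.addColumn(new MutableColumn("id", ColumnType.INTEGER, table, 0, true));
        table.addColumn(new MutableColumn("name", ColumnType.VARCHAR, table, 1, true));
        schema.addTable(table);
        return schema;
    }

    @Override
    protected DataSet materializeMainSchemaTable(Table table, Column[] columns, int maxRows) {
        // the equivalent of a full table scan: return at most maxRows rows for the
        // requested columns, read from the underlying store (left unimplemented here)
        throw new UnsupportedOperationException("TODO: read rows from the exotic store");
    }
}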

Final Words

In this article we have introduced MetaModel, a library providing access to various data stores, explaining how to deal with metadata, how to interrogate the store, and how to perform updates.
In the future, we intend to develop MetaModel even further in its new home at Apache. We'll be adding more built-in DataContext implementations for HBase, Cassandra, and other modern databases, as well as further expanding what you can do with metadata. Some of the ideas we are working on are richer metadata for nested structures (maps and lists, as found in many NoSQL databases), the ability to create virtual tables (similar to VIEWs, but facilitated on the client instead of the server), support for mapping POJOs to DataSet rows, and plugging more functions into the query engine.
Apache MetaModel is currently undergoing incubation at The Apache Software Foundation. If you are interested in this project, you can find the mailing lists, bug tracking, and more on the Apache Incubator MetaModel page.

About the Author

Kasper Sørensen works as Principal Software Engineer and Product Manager at Human Inference. His professional expertise and interests are in product development for data-intensive applications. While graduating from Copenhagen Business School, he went on to found the DataCleaner and MetaModel open source projects as part of his Master's Thesis. You can read more on his blog: Kasper's Source.
