
Friday 31 October 2014

Facebook: The Underlying Technology of Messages

We're launching a new version of Messages today that combines chat, SMS, email, and Messages into a real-time conversation. The product team spent the last year building out a robust, scalable infrastructure. As we launch the product, we wanted to share some details about the technology.

The current Messages infrastructure handles over 350 million users sending over 15 billion person-to-person messages per month. Our chat service supports over 300 million users who send over 120 billion messages per month. Monitoring usage revealed two general data patterns:
  1. A short set of temporal data that tends to be volatile
  2. An ever-growing set of data that rarely gets accessed
When we started investigating a replacement for the existing Messages infrastructure, we wanted to take an objective approach to storage for these two usage patterns. In 2008 we open-sourced Cassandra, an eventual-consistency key-value store that was already in production serving traffic for Inbox Search. Our Operations and Databases teams have extensive knowledge in managing and running MySQL, so switching off of either technology was a serious concern. We either had to move away from our investment in Cassandra or train our Operations teams to support a new, large system.

We spent a few weeks setting up a test framework to evaluate clusters of MySQL, Apache Cassandra, Apache HBase, and a couple of other systems. We ultimately chose HBase. MySQL proved to not handle the long tail of data well; as indexes and data sets grew large, performance suffered. We found Cassandra's eventual consistency model to be a difficult pattern to reconcile for our new Messages infrastructure.

HBase comes with very good scalability and performance for this workload and a simpler consistency model than Cassandra. While we’ve done a lot of work on HBase itself over the past year, when we started we also found it to be the most feature rich in terms of our requirements (auto load balancing and failover, compression support, multiple shards per server, etc.). HDFS, the underlying filesystem used by HBase, provides several nice features such as replication, end-to-end checksums, and automatic rebalancing. Additionally, our technical teams already had a lot of development and operational expertise in HDFS from data processing with Hadoop. Since we started working on HBase, we've been focused on committing our changes back to HBase itself and working closely with the community. The open source release of HBase is what we’re running today.

Since Messages accepts data from many sources such as email and SMS, we decided to write an application server from scratch instead of using our generic Web infrastructure to handle all decision making for a user's messages. It interfaces with a large number of other services: we store attachments in Haystack, wrote a user discovery service on top of Apache ZooKeeper, and talk to other infrastructure services for email account verification, friend relationships, privacy decisions, and delivery decisions (for example, should a message be sent over chat or SMS). We spent a lot of time making sure each of these services is reliable, robust, and performant enough to handle a real-time messaging system.
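As an aside, a "user discovery service on top of Apache ZooKeeper" boils down to a familiar pattern. The sketch below is not Facebook's code, just a generic illustration of ZooKeeper-based discovery: each application server registers an ephemeral znode under a well-known path, and clients list that path to find the live servers (hosts, paths, and ports are made up).

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class DiscoveryExample {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper ensemble (placeholder addresses).
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, event -> { });

        // Register this application server; the ephemeral node disappears automatically
        // if the server's session dies. Assumes the parent path already exists.
        zk.create("/messages/servers/app-01", "10.0.0.5:8080".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // A client discovers the currently live servers by listing the children.
        List<String> servers = zk.getChildren("/messages/servers", false);
        System.out.println("live servers: " + servers);
        zk.close();
    }
}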

The new Messages will launch over 20 new infrastructure services to ensure you have a great product experience. We hope you enjoy using it.

Kannan is a software engineer at Facebook.

Friday 17 October 2014

Top Performance Problems discussed at the Hadoop and Cassandra Summits

This post summarizes performance discussions from the Hadoop and Cassandra Summits in the San Francisco Bay Area. It was rewarding to talk to so many experienced Big Data technologists in such a short time frame – thanks to our partners DataStax and Hortonworks for hosting these great events! It was also great to see that performance is becoming an important topic in the community at large. We got a lot of feedback on typical Big Data performance issues and were surprised by the performance-related challenges that were discussed. The practitioners here were definitely no novices, and the usual high-level generic patterns and basic cluster monitoring approaches were not on the hot list. Instead we found more advanced problem patterns – for both Hadoop and Cassandra.
I’ve compiled a list of the most interesting and most common issues for Hadoop and Cassandra deployments:

Top Hadoop Issues

Map Reduce data locality

Data locality is one of the key advantages of Hadoop Map/Reduce: the map code is executed on the same data node where the data resides. Interestingly, many people found that this is not always the case in practice. Some of the reasons they stated were:
  • Speculative execution
  • Heterogeneous clusters
  • Data distribution and placement
  • Data Layout and Input Splitter
The challenge becomes more prevalent in larger clusters: the more data nodes and data I have, the less locality I get. Larger clusters tend not to be completely homogeneous; some nodes are newer and faster than others, bringing the data-to-compute ratio out of balance. Speculative execution will attempt to use compute power even though the data might not be local. The nodes that contain the data in question might be busy computing something else, leading to another node doing non-local processing. The root cause might also lie in the data layout/placement and the Input Splitter used. Whatever the reason, non-local data processing puts a strain on the network, which poses a problem for scalability: the network becomes the bottleneck. Additionally, the problem is hard to diagnose because it is not easy to see the data locality.
To improve data locality, you need to first detect which of your jobs have a data locality problem or degrade over time. With APM solutions you can capture which tasks access which data nodes. Solving the problem is more complex and can involve changing the data placement and data layout, using a different scheduler or simply changing the number of mapper and reducer slots for a job. Afterwards, you can verify whether a new execution of the same workload has a better data locality ratio.
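As a rough illustration (not a full APM solution), Hadoop already records locality in its built-in job counters, so you can compute a data-locality ratio per job yourself. The sketch below assumes the job was submitted through the new org.apache.hadoop.mapreduce API; the class name is made up, and counter names can differ slightly between Hadoop versions.

import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobCounter;

public class LocalityCheck {
    // Fraction of map tasks that ran on a node holding their input split.
    public static double dataLocalRatio(Job job) throws Exception {
        Counters counters = job.getCounters();
        long dataLocal = counters.findCounter(JobCounter.DATA_LOCAL_MAPS).getValue();
        long totalMaps = counters.findCounter(JobCounter.TOTAL_LAUNCHED_MAPS).getValue();
        return totalMaps == 0 ? 1.0 : (double) dataLocal / totalMaps;
    }
}

Tracking this ratio per job over time is a simple way to spot jobs whose locality degrades after a cluster or scheduler change.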

Job code inefficiencies and “profiling” Hadoop workloads

The next item confirmed our own views and is very interesting: many Hadoop workloads suffer from inefficiencies. It is important to note that this is not a critique on Hadoop but on the jobs that are run on it. However “profiling” jobs in larger Hadoop clusters is a major pain point. Black box monitoring is not enough and traditional profilers cannot deal with the distributed nature of a Hadoop cluster. Our solution to this problem was well received by a lot of experienced Hadoop developers. We also received a lot of interesting feedback on how to make our Hadoop job “profiling” even better.

TaskTracker performance and the impact on shuffle time

It is well known that shuffle is one of the main performance critical areas in any Hadoop job. Optimizing the amount of map intermediate data (e.g. with combiners), shuffle distribution (with partitioners) and pure read/merge performance (number of threads, memory on the reducing side) are described in many Performance Tuning articles about Hadoop. Something that is less often talked about but is widely discussed by the long-term “Hadoopers” is the problem of a slowdown of particular TaskTrackers.
When particular compute nodes are under high pressure, have degrading hardware, or run into cascading effects, the local TaskTracker can be negatively impacted. To put it in simpler terms: in larger systems some nodes will degrade in performance!
The result is that the TaskTracker nodes cannot deliver the shuffle data to the reducers as fast as they should or may react with errors while doing so. This has a negative impact on virtually all reducers and because shuffle is a choke point the entire job time can and will increase. While small clusters allow us to monitor the performance of the handful of running TaskTrackers, real world clusters make that infeasible. Monitoring with Ganglia based on averages effectively hides which jobs trigger this, which are impacted and which TaskTrackers are responsible and why.
The solution to this is a baselining approach, coupled with a PurePath/PureStack model. Baselining of TaskTracker requests solves the averaging and monitoring problem and will tell us immediately if we experience a degradation of TaskTracker mapOutput performance. By always knowing which TaskTrackers slow down, we can correlate the underlying JVM host health and we are able to identify if that slowdown is due to infrastructure or Hadoop configuration issues or tied to a specific operating system version that we recently introduced. Finally, by tracing all jobs, task attempts as well as all mapOutput requests from their respective task attempts and jobs we know which jobs may trigger a TaskTracker slowdown and which jobs suffer from it.
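To make the baselining idea concrete, here is a deliberately simplified sketch (not the PurePath/PureStack implementation): keep a running mean and standard deviation of mapOutput serving times per TaskTracker, and flag a tracker whose latest sample drifts far above its own baseline. All names are illustrative.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TaskTrackerBaseline {
    private static class Stats { long n; double mean; double m2; }
    private final Map<String, Stats> byTracker = new ConcurrentHashMap<>();

    // Record one mapOutput response time (ms) and return true if it looks anomalous
    // relative to this tracker's own baseline (Welford's online mean/variance).
    public boolean recordAndCheck(String tracker, double millis) {
        Stats s = byTracker.computeIfAbsent(tracker, t -> new Stats());
        synchronized (s) {
            s.n++;
            double delta = millis - s.mean;
            s.mean += delta / s.n;
            s.m2 += delta * (millis - s.mean);
            double stddev = s.n > 1 ? Math.sqrt(s.m2 / (s.n - 1)) : 0.0;
            // Only flag once a baseline exists and the sample is 3 sigma above it.
            return s.n > 100 && millis > s.mean + 3 * stddev;
        }
    }
}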

NameNode and DataNode slowdowns

Similar to the TaskTrackers and their effect on job performance, a slowdown of the NameNode or slowdowns of particular DataNodes have a deteriorating effect on the whole cluster. Requests can easily be baselined, making the monitoring and degradation detection automatic. Similarly, we can see which jobs and clients are impacted by the slowdown and the reason for the slowdown, be it infrastructure issues, high utilization or errors in the services.

Top Cassandra Issues

One of the best presentations about Cassandra performance was given by Spotify at the Cassandra Summit. If you use Cassandra or plan to use it, I highly recommend watching it!

Read Time degradation over time

As it turns out, Cassandra is always fast when first deployed, but there are many cases where read time degrades over time. Virtually all of these cases center around the fact that, over time, rows get spread out over many SSTables and/or accumulate deletes, which lead to tombstones. These cases can usually be attributed to wrong access patterns and wrong schema design, and are often data specific. For example, if you write new data to the same row over a long period of time (several months), that row will be spread out over many SSTables. Access to it will become slow, while access to a “younger” row (which will reside in only one SSTable) will still be snappy. Even worse is a delete/insert pattern: adding and removing columns from the same row over time. Not only will the row be spread out, it will also be full of tombstones, and read performance will be quite horrible. The result is that average performance might degrade only slightly over time (averaging effect), when in reality the performance of the older rows degrades dramatically while the younger rows stay fast.
To avoid this, never delete data as a general pattern in your application, and never write to the same row over long periods of time. To catch such a scenario, you should baseline Cassandra read requests on a per-column-family basis. Baselining approaches, as compared to averages, will detect a change in distribution and will notify you if a percentage of your requests degrades while the rest stay super fast. In addition, by tying the Cassandra requests to the actual types of end-user requests, you will be able to quickly figure out where that access anti-pattern originates.
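One low-tech way to collect the per-column-family numbers you would baseline is to poll Cassandra's JMX metrics. The sketch below uses only the standard javax.management API; the MBean object name and attribute names follow the pre-2.0 Cassandra layout and may differ in your version, and the host, keyspace, and column family are placeholders.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class CfReadLatencyProbe {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://cassandra-host:7199/jmxrmi");
        JMXConnector jmxc = JMXConnectorFactory.connect(url, null);
        try {
            MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
            // Per-column-family MBean; adjust names to match your Cassandra version.
            ObjectName cf = new ObjectName(
                    "org.apache.cassandra.db:type=ColumnFamilies,keyspace=MyKeyspace,columnfamily=users");
            Object readLatency = mbs.getAttribute(cf, "RecentReadLatencyMicros");
            Object liveSstables = mbs.getAttribute(cf, "LiveSSTableCount");
            System.out.println("read latency (us): " + readLatency
                    + ", live sstables: " + liveSstables);
        } finally {
            jmxc.close();
        }
    }
}

Feeding samples like these into the kind of baselining sketched earlier, rather than into a plain average, is what surfaces the slowly degrading "old row" reads.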

Some slow Nodes can bring down the cluster

Like every real world application, Cassandra Nodes can slow down due to many issues (hardware, compaction, GC, network, disk etc.). Cassandra is a clustered database where every row exists multiple times in the cluster and every write request is sent to all nodes that contain the row (even on consistency level one). It is no big deal if a single node fails because others have the same data; all read and write requests can be fulfilled. In theory a super slow node should not be a problem unless we explicitly request data with consistency level “ALL,” because Cassandra would return when the required amount of nodes responded. However internally every node has a coordinator queue that will wait for all requests to finish, even if it would respond to the client before that has happened. That queue can fill up due to one super slow node and would effectively render a single node unable to respond to any requests. This can quickly lead to a complete cluster not responding to any requests.
The solution to this is twofold. If you can, use a token-aware client like Astyanax. By talking directly to the nodes containing the data item, this client effectively bypasses the coordinator problem. In addition, you should baseline the response time of Cassandra requests on the server nodes and alert yourself if a node slows down. Funnily enough, bringing down the slow node would solve the problem temporarily, because Cassandra deals with a downed node nearly instantaneously.
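For reference, here is roughly what enabling token-aware routing looks like with Astyanax, loosely following its getting-started examples; the cluster, keyspace, and seed values are placeholders, and exact package or method names may vary by Astyanax version.

import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolType;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;

public class TokenAwareClient {
    public static Keyspace connect() {
        AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
            .forCluster("MyCluster")
            .forKeyspace("MyKeyspace")
            .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE)
                // Route each request directly to a replica that owns the token.
                .setConnectionPoolType(ConnectionPoolType.TOKEN_AWARE))
            .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("MyPool")
                .setPort(9160)
                .setMaxConnsPerHost(3)
                .setSeeds("10.0.0.1:9160,10.0.0.2:9160"))
            .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
            .buildKeyspace(ThriftFamilyFactory.getInstance());
        context.start();
        // getClient() in recent Astyanax versions (getEntity() in older ones).
        return context.getClient();
    }
}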

Too many read round trips/Too much data read

Another typical performance problem with Cassandra reminds us of the SQL days and is typical for Cassandra beginners. It is a database design issue that leads to transactions making too many requests per end-user transaction or reading too much data. This is not a problem for Cassandra itself, but the simple fact of making many requests or reading more data slows down the actual transaction. While this issue can be easily monitored and discovered with an APM solution, the fix is not as trivial: in most cases it requires a change to the code and the data model.

Summary

Hadoop and Cassandra are both very scalable systems! But as often stated, scalability does not solve performance efficiency issues, and as such neither of these systems is immune to such problems, nor to simple misuse.
Some of the prevalent performance problems are very specific to these systems, and we have not seen them in more traditional systems. Other issues are not really new, except for the fact that they now occur in systems that are tremendously more distributed and bigger than before. The very scalability and size make these problems harder to diagnose (especially in Hadoop) while often having a very high impact (as in bringing down a Cassandra cluster). Performance experts can rejoice: they will have a job for a long time to come.

Friday 15 August 2014

Cassandra Mythology

Like the prophetess of Troy it was named for, Apache Cassandra has seen some myths accrue around it. Like most myths, these were once at least partly true, but have become outdated as Cassandra evolved and improved. In this article, I'll discuss five common areas of concern and clarify the confusion.

Myth: Cassandra is a map of maps

As applications using Cassandra became more complex, it became clear that schema and data typing make development and maintenance much easier at scale than "everything is a bytebuffer," or "everything is a string."
Today, the best way to think of Cassandra's data models is as tables and rows. Similar to a relational database, Cassandra columns are strongly typed and indexable.
Other things you might have heard:
  • "Cassandra is a column database."Column databases store all values for a given column together on disk. This makes them suitable for data warehouse workloads, but not for running applications that require fast access to specific rows.
  • "Cassandra is a wide-row database."There is a grain of truth here, which is that Cassandra's storage engine is inspired by Bigtable, the grandfather of wide-row databases. But wide-row databases tie their data model too closely to that storage engine, which is easier to implement but more difficult to develop against, and prevents manyoptimizations.
One of the reasons we shied away from "tables and rows" to start with is that Cassandra tables do have some subtle differences from the relational ones you're familiar with. First, the first element of the primary key is the partition key. Rows in the same partition will be owned by the same replicas, and rows within a partition are clustered by the remaining primary key columns.
Second, Cassandra does not support joins or subqueries, because joins across machines in a distributed system are not performant. Instead, Cassandra encourages denormalization to get the data you need from a single table, and provides tools like collections to make this easier.
For example, consider the users table shown in the following code example:
CREATE TABLE users (
  user_id uuid PRIMARY KEY,
  name text,
  state text,
  birth_year int
);
Most modern services understand now that users have multiple email addresses. In the relational world, we'd add a many-to-one relationship and correlate addresses to users with a join, like the following example:
CREATE TABLE users_addresses (
  user_id uuid REFERENCES users,
  email text
);

SELECT *
FROM users NATURAL JOIN users_addresses;
In Cassandra, we'd denormalize by adding the email addresses directly to the users table. A set collection is perfect for this job:
ALTER TABLE users ADD email_addresses set<text>;
We can then add addresses to a user record like this:
UPDATE users
SET email_addresses = {'jbe@gmail.com', 'jbe@datastax.com'}
WHERE user_id = 73844cd1-c16e-11e2-8bbd-7cd1c3f676e3;
See the documentation for more on the Cassandra data model, including self-expiring data and distributed counters.

Myth: Cassandra is slow at reads

Cassandra's log-structured storage engine means that updates do not require seeks on hard disks and do not cause write amplification on solid-state disks, which makes writes very fast. But Cassandra is also fast at reads.
Here are the throughput numbers from the random-access read, random-access and sequential-scan, and mixed read/write workloads in the University of Toronto's NoSQL benchmark results. The Endpoint benchmark comparing Cassandra, HBase, and MongoDB corroborates these results.
How does this work? At a very high level, Cassandra's storage engine looks similar to Bigtable and uses some of the same terminology. Updates are appended to a commitlog, then collected into a "memtable" that is eventually flushed to disk and indexed as an "sstable."
Naive log-structured storage engines do tend to be slower at reads, for the same reason they are fast at writes: new values in rows do not overwrite the old ones in place, but must be merged in the background by compaction. So in the worst case, you will have to check multiple sstables to retrieve all the columns for a "fragmented" row.
Cassandra makes several improvements to this basic design (bloom filters to skip sstables that cannot contain the requested row, key and row caches, and background compaction to limit row fragmentation) to achieve good read performance.

Myth: Cassandra is hard to run

There are three aspects to running a distributed system that tend to be more complicated than running a single-machine database:
  1. Initial deployment and configuration
  2. Routine maintenance such as upgrading, adding new nodes, or replacing failed ones
  3. Troubleshooting
Cassandra is a fully distributed system: every machine in a Cassandra cluster has the same role. There are no metadata servers that have to fit everything in memory. There are no configuration servers to replicate. There are no masters, and no failover. This makes every aspect of running Cassandra simpler than the alternatives. It also means that bringing up a single-node cluster to develop and test against is trivial, and behaves exactly the way a full cluster of dozens of nodes would.
Initial deployment is the least important concern here in one sense: other things being equal, even a relatively complex initial setup will be insignificant when amortized over the lifetime of the system, and automated installation tools can hide most of the gory details. But! If you can barely understand a system well enough to install it manually, there's going to be trouble when you need to troubleshoot a problem, which requires much more intimate knowledge of how all the pieces fit together.
Thus, my advice would be to make sure you understand what is going on during installation, as in this two-minute example of setting up a Cassandra cluster, before relying on tools like the Windows MSI installer, OpsCenter provisioning or the self-configuring AMI.
Cassandra makes routine maintenance easy. Upgrades can be done one node at a time. While a node is down, other nodes will save updates that it missed and forward them when it comes back up. Adding new nodes is parallelized across the cluster; there is no need to rebalance afterwards.
Even dealing with longer, unplanned outages is straightforward. Cassandra's active repair is like rsync for databases, only transferring missing data and keeping network traffic minimal. You might not even notice anything happened if you're not paying close attention.
Cassandra's industry leading support for multiple datacenters even makes it straightforward to survive an entire AWS region going down or losing a datacenter to a hurricane.
Finally, DataStax OpsCenter simplifies troubleshooting by making the most important metrics in your cluster available at a glance, allowing you to easily correlate historical activity with the events causing service degradation. The DataStax Community Edition Cassandra distribution includes a "lite" version of OpsCenter, free for production use. DataStax Enterprise also includes scheduled backup and restore, configurable alerts, and more.

Myth: Cassandra is hard to develop against

The original Cassandra Thrift API achieved its goal of giving us a cross-platform base for a minimum of effort, but the result was admittedly difficult to work with. CQL, Cassandra's SQL dialect, replaces that with an easier interface, a gentler learning curve, and an asynchronous protocol.
CQL has been available for early adopters beginning with version 0.8 two years ago; with the release of version 1.2 in January, CQL is production ready, with many drivers available and better performance than Thrift. DataStax is also officially supporting the most popular CQL drivers, which helps avoid the sometimes indifferent support seen with the community Thrift drivers.
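To give a feel for how little ceremony the CQL drivers need, here is a minimal sketch using the DataStax Java driver (1.x-era API assumed; the contact point and keyspace are placeholders, and the query reuses the users table from the earlier examples).

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CqlExample {
    public static void main(String[] args) {
        // Connect to the cluster and open a session against a keyspace.
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("mykeyspace");

        // Run a plain CQL query and iterate the rows.
        ResultSet rs = session.execute("SELECT name, email_addresses FROM users");
        for (Row row : rs) {
            System.out.println(row.getString("name"));
        }
        cluster.shutdown();  // close() in later driver versions
    }
}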
Patrick McFadin's Next Top Data Model presentations (one, two) are a good introduction to CQL beyond the basics in the documentation.

Myth: Cassandra is still bleeding edge

From an open source perspective, Apache Cassandra is now almost five years old and has many releases under its belt, with version 2.0 coming up in July. From an enterprise point of view, DataStax provides DataStax Enterprise, which includes a certified version of Cassandra that has been specifically tested, benchmarked, and approved for production environments.
Businesses have seen the value that Cassandra brings to their organizations. Over 20 of the Fortune 100 rely on Cassandra to power their mission-critical applications in nearly every industry, including financial, health care, retail, entertainment, online advertising and marketing.
The most common reason to move to Cassandra is when the existing technology can't scale to the demands of modern big data applications. Netflix, the largest cloud application in the world, moved 95% of their data from Oracle to Cassandra. Barracuda Networks replaced MySQL with Cassandra when MySQL couldn't handle the volume of requests needed to combat modern spammers. Ooyala handles two billion data points every day, on a Cassandra deployment of more than two petabytes.
Cassandra is also augmenting or replacing legacy relational databases that have proven too costly to manage and maintain. Constant Contact's initial project with Cassandra took three months and $250,000, compared to nine months and $2,500,000 on their traditional RDBMS. Today they have six clusters and more than 100TB of data trusted to Cassandra.
Many other examples can be found in DataStax's case studies and Planet Cassandra’s user interviews.

Not a myth: the 2013 Cassandra Summit in San Francisco

We just wrapped up the 2013 Cassandra Summit in San Francisco, the best way to learn more about Cassandra, with over 1,100 attendees and 65 sessions from Accenture, Barracuda Networks, Blue Mountain Capital, Comcast, Constant Contact, eBay, Fusion-io, Intuit, Netflix, Sony, Splunk, Spotify, Walmart, and more. Slides are already up; follow Planet Cassandra to be notified when videos are available.
 

About the Author

Jonathan Ellis is CTO and co-founder at DataStax. Prior to DataStax, he worked extensively with Apache Cassandra while employed at Rackspace. Prior to Rackspace, Jonathan built a multi-petabyte, scalable storage system based on Reed-Solomon encoding for backup provider Mozy.

Cassandra CLI Internals Using JArchitect

Relational Database Management Systems (RDBMS) are the most commonly used systems to store and use data, but for extremely large amounts of data, these databases don’t scale up well.
The concept of NoSQL has been gaining a lot of popularity in recent years due to the growing demand for relational database alternatives. The biggest motivation behind NoSQL is scalability. NoSQL database solutions offer a way to store and use extremely large amounts of data, but with less overhead, less work, better performance, and less downtime.
Apache Cassandra is a column-based NoSQL database. It was developed at Facebook to power its Inbox Search feature and later became an Apache open source project. Twitter, Digg, Reddit, and quite a few other organizations started using it.

Cassandra ships with a very basic interactive command line interface (CLI). Using the CLI you can connect to remote nodes in the cluster to create or update your schema and set and retrieve records.
The CLI is a useful tool for Cassandra administrators, and even though it provides only basic commands, it is a good example of how to implement a Cassandra client. Understanding how the CLI works internally helps us develop custom Cassandra clients or even extend the CLI tool.
In this article, we will explore the Cassandra CLI architecture using the JArchitect tool and the CQLinq language to analyze its code base. JArchitect is used to analyze code structure and specify design rules to achieve better code quality. With JArchitect, software quality can be measured using code metrics, visualized using graphs and treemaps, and enforced using standard and custom rules.
Here’s the dependency graph after analysis:
Cassandra uses some well-known jars like antlr, log4j, slf4j, and commons-lang, and also some less-known jars like the following:
  • Libthrift: an API spanning a variety of programming languages and use cases. The goal is to make reliable, performant communication and data serialization across languages as efficient and seamless as possible.
  • Snakeyaml: YAML is a data serialization format designed for human readability and interaction with scripting languages. Cassandra uses this format for its configuration files.
  • Jackson: a high-performance JSON processor.
  • Snappy: snappy-java is a Java port of Snappy, a fast compressor/decompressor written in C++, originally developed by Google.
  • High-scale-lib: a collection of concurrent and highly scalable utilities. These are intended as direct replacements for the java.util.* or java.util.concurrent.* collections, but with better performance when many CPUs are using the collection concurrently.
The Matrix view below gives us more details about the dependency weight between these JAR files.

Cassandra Command Line Interface

The command line interface logic is implemented in the org.apache.cassandra.cli package, and the entry point is the CliMain class.
Let’s search for the methods invoked from the main method by using the following CQLinq query: 
from m in Methods where m.IsUsedBy ("org.apache.cassandra.cli.CliMain.main(String[])") 
select new { m, m.NbBCInstructions } 
The main method uses JLine, which is a Java library for handling console input. It can be used to write nice CLI applications without much effort. It has out-of-the-box support for command history, tab completion, line editing, custom key bindings, and character masking.
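To show why the CLI leans on JLine, here is a tiny, self-contained console loop in the same spirit; it assumes the JLine 2.x API, and the completion commands are made up rather than taken from CliMain.

import jline.console.ConsoleReader;
import jline.console.completer.StringsCompleter;

public class MiniCli {
    public static void main(String[] args) throws Exception {
        ConsoleReader reader = new ConsoleReader();
        reader.setPrompt("[default@unknown] ");
        // Tab completion over a few sample commands (history comes for free).
        reader.addCompleter(new StringsCompleter("connect", "describe", "exit"));
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.trim().equals("exit")) break;
            System.out.println("you typed: " + line);
        }
    }
}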
Two interesting methods invoked from the main method are:
  • connect: The connect method is used to connect to the Cassandra database server.
  • processStatementInteractive: This method is used to execute the statements entered by the user.
 

Communication between CLI and Cassandra Server

Before interacting with the Cassandra server, the client must connect to it using the connect method.
Let’s search for all methods used directly or indirectly by the connect method:
from m in Methods
let depth0 = m.DepthOfIsUsedBy("org.apache.cassandra.cli.CliMain.connect(String,int)")
where depth0 >= 0 orderby depth0
select new { m, depth0 }
The CLI communicates with the server using the Thrift library, which allows you to define data types and service interfaces in a simple definition file. Taking that file as input, the compiler generates code that can be used to easily build RPC clients and servers that communicate seamlessly across programming languages. Instead of writing a lot of boilerplate code to serialize and transport your objects and invoke remote methods, you can get right down to business.
Here’s a simple example of an implementation of a Thrift server:
import org.apache.thrift.server.TServer;
import org.apache.thrift.server.TSimpleServer;
import org.apache.thrift.transport.TServerSocket;
import org.apache.thrift.transport.TServerTransport;

public class Server {

    // Handler implementing the Thrift-generated service interface.
    public static class SomethingHandler implements Something.Iface {
        public SomethingHandler() {}

        public int ping() {
            return 1;
        }
    }

    public static void main(String[] args) {
        try {
            SomethingHandler handler = new SomethingHandler();
            Something.Processor processor = new Something.Processor(handler);
            TServerTransport serverTransport = new TServerSocket(9090);
            TServer server = new TSimpleServer(processor, serverTransport);
            // Or use this for a multithreaded server:
            // TServer server = new TThreadPoolServer(processor, serverTransport);
            server.serve();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
A Thrift server extends the org.apache.thrift.server.TServer abstract class, and the constructor of its implementation takes a processor and a server transport specification as parameters. The processor needs a handler to treat the incoming requests.
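For completeness, here is the client-side counterpart to the server sketch above, using the same hypothetical Something service: open a socket transport, wrap it in a binary protocol, and call the generated client stub, much as the CLI does through its generated Cassandra.Client.

import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class Client {
    public static void main(String[] args) {
        TTransport transport = new TSocket("localhost", 9090);
        try {
            transport.open();
            TProtocol protocol = new TBinaryProtocol(transport);
            // Something.Client is the Thrift-generated stub for the sample service above.
            Something.Client client = new Something.Client(protocol);
            System.out.println("ping returned: " + client.ping());
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            transport.close();
        }
    }
}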
Let’s discover all these elements in the Cassandra server. For that we can begin by searching all classes that inherit from TServer class.
from t in Types
let depth0 = t.DepthOfDeriveFrom("org.apache.thrift.server.TServer")
where depth0 >= 0 orderby depth0
select new { t, depth0 }
Cassandra defines:
  • CustomTThreadPoolServer: a slightly modified version of the Apache Thrift TThreadPoolServer, which uses a thread pool to serve incoming requests.
  • CustomTHsHaServer: the goal of this server is to avoid sticking to one CPU for I/O; for better throughput, the I/O is spread across multiple threads. The number of selector threads can be the number of CPUs available.
  • CustomTNonBlockingServer: uses a nonblocking socket transport.
And here’s what happens when the ThriftServer is started:
A factory is used to create a TServer, and the CassandraServer handler is created to treat incoming requests. It implements Cassandra.Iface, which contains all the commands supported by Cassandra. The diagram below shows some of these methods:

As shown in the previous Thrift server example, we need the processors to process incoming requests; all these processors inherit from ProcessFunction.
Here are some Cassandra processors:
After discovering the Cassandra Thrift server parts, let's come back to the client and discover what happens when the connect method is invoked from the main method.
The org.apache.thrift.TServiceClient is used to communicate between the client and the server, and the method sendBase is invoked to send a message to the thrift server.
On the server, the login processor receives this request and invokes the login method.
And here’s the dependency graph showing some methods invoked from the login method.
Steps to extend the CLI by adding a new method MyMethod.
After discovering how the CLI works internally, we can easily add a new method to it, and here are the major steps needed to do it:
I – Extending the server:
  • Add the method to Cassandra.Iface.
  • Add the method implementation to the CassandraServer class.
  • Add a new class Cassandra.Processor.MyMethod<I> inheriting from ProcessFunction<I>.
  • Add an instance of the new processor to the Map returned by the Cassandra.Processor<I>.getProcessMap method.
II – Extending the client:
  • Add a new switch and process it in the CliOptions.processArgs method.
  • Add a method to the Cassandra.Client class and invoke the server by using the TServiceClient.sendBase method.

Conclusion

The command line interface is a good example for learning how to implement a Cassandra client, and learning from real projects is preferable to just searching for samples on the web. So, to develop a Cassandra client, don't hesitate to go inside its source code and enjoy.

About the Author

Dane Dennis is the JArchitect Product Manager. He works at CoderGears, a company developing tools for developers and architects.
