Saturday, 2 August 2014

Scheduling Workflows Using Oozie Coordinator


Introduction 

In the Hadoop ecosystem, most functionality such as MapReduce jobs, Pig scripts and Hive queries is executed as batch jobs, which creates a lot of overhead in the deployment and maintenance of Hadoop components. As a solution to this, Oozie provides workflows in XML format using which we can combine multiple Map/Reduce jobs into a logical unit of work accomplishing a larger task [4]. This helps in chaining related MapReduce jobs, which can also be Hive queries or Pig scripts.


Workflows work well when invoked on demand or manually. But to achieve a higher level of automation and effectiveness, it becomes necessary to run them based on one or more of the following: regular time intervals, data availability or external events. In such cases, we need more functionality than Oozie workflows alone provide.

In this post, Oozie Coordinator jobs will be discussed, which provide options to embed workflows and trigger them at regular time intervals or on the basis of data availability.

The Oozie Coordinator allows the conditions that trigger workflow execution to be expressed as predicates [1]. These predicates are conditional statements on parameters such as time, data and external events. The workflow job/action is started only if the predicate is satisfied.

Oozie Coordinator System

As stated on the Oozie documentation page [1], “Oozie is a Java Web-Application that runs in a Java servlet-container”. It uses XML for taking configuration inputs from the user and uses a database (the default is Derby, but MySQL, HSQLDB or any other RDBMS can also be used) to store:
  • Definitions of Workflow and Coordinator
  • Currently running workflow and Coordinator instances, including instance states, configuration variables and parameters.
Oozie Coordinator is a collection of predicates (conditional statements based on time frequency and data availability) and actions (i.e. Hadoop Map/Reduce jobs, Hadoop file system actions, Hadoop Streaming, Pig, Java and Oozie sub-workflows). Actions are recurrent workflow jobs invoked each time the predicate returns true.

Oozie version 2 and higher supports Coordinator jobs. A Coordinator job is defined in an XML process definition language.

Predicates are conditional statements, defined using the attributes frequency, start and end for time-based triggering, and the XML tags dataset and input-events for data-availability-based triggering of workflows.
 
Actions are the mechanism by which a workflow is triggered for the execution of a computation/processing task. An action contains the description of one or more workflows to be executed.

Oozie is lightweight as it uses existing Hadoop Map/Reduce framework for executing all tasks in a workflow. This approach allows it to leverage existing Hadoop installation for providing scalability, reliability, parallelism, etc.

On the basis of functionality, Coordinators can be sub-divided into two major groups [2]:
1. Time-Based Coordinator: This type of Coordinator definition is used for invoking the workflow repeatedly at a given interval within a specified period of time.

2. File-Based Coordinator: This type of Coordinator definition is used for invoking the workflow on the basis of data availability and data polling.

2.1 Simple File-Based Coordinator: The action is invoked whenever the data-availability predicate is true.

2.2 Sliding Window-Based Coordinator: The coordinator is invoked frequently and data is aggregated over multiple overlapping previous instances. For example, it may be invoked every 5 minutes, with each action aggregating the previous 4 instances of 15-minute data.

2.3 Rollups-Based Coordinator: The coordinator is invoked after a long period of time and data is aggregated over multiple previous instances since the last invocation. For example, it may run once a day and trigger a workflow that aggregates 24 instances of hourly data.
  
Oozie Coordinator Components and Variables
  • Coordinator-App: It is a wrapper component that defines the attributes of a coordinator and includes all other components.
Attributes are:
  • start, end: describe the start and end times in yyyy-mm-ddThh:mmZ format.
  • timezone: describes the time zone (the value of Z in the above time format), e.g. UTC.
  • Controls: It contains parameters like timeout, concurrency, etc. to configure the execution of the coordinator job.
  • Datasets: It contains the definition of multiple data sources and frequency of data polling.
Attributes are:
  • Frequency: interval of time at which data polling is done.
  • Initial-Instance: start time of data polling in yyyy-mm-ddThh:mmZ format.
  • Uri-Template: URI of the data source. The expression language can be used; for example, ${YEAR} corresponds to the current year. This helps in the dynamic selection of data source directories.
  • Done-flag: This flag denotes that an instance of the dataset is ready. It can name a file, in which case the presence of that file is checked before calling the action; if it is left empty, the existence of the dataset directory itself signals availability.
  • Input-Events: define the input dependencies (the dataset instances) that must be available before running the action.
  • Data-in: references a dataset and denotes the input data instances consumed by the action.
  • Start-instance and end-instance: the boundary of the data instances that need to be aggregated.
  • Output-Events: denote the output dataset instances produced after running the action.
  • Data-out: it denotes the output dataset.
  • Instance: the instance of the dataset that is to be used as the sink for the output.
  • Action: It includes the path of the workflow that has to be invoked when the predicate returns true.
It could also be configured to record the events required to evaluate SLA compliance.

Oozie Coordinator Lifecycle Operations


The lifecycle operations of a coordinator are similar to those of an Oozie workflow, except for the start operation: “start” is not applicable to coordinators.
  • Submit/Run: Both operations submit the coordinator job to oozie. The job will be in PREP state till the mentioned start-time of the coordinator. 
  • Suspend: suspends (pauses) the coordinator job. 
  • Resume: resumes the execution of the coordinator job. 
  • Kill: kills the coordinator job and ends its execution. 
  • Rerun: re-submits the coordinator job/actions with new parameters. 

Oozie Coordinator Example   

In this section, we will see how to use the Oozie Coordinator for scheduling and triggering workflows.

1. A Sample Workflow: First of all, we need an Oozie workflow job. For this example, I have taken the simple wordcount example provided by the Apache Hadoop distribution in hadoop-examples-0.20.2-cdh3u0.jar [6].
The workflow for wordcount is:
<workflow-app xmlns='uri:oozie:workflow:0.1' name='java-main-wf'>
<start to='mapreduce-wordcount-example' />
<action name='mapreduce-wordcount-example'>
<java>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>default</value>
</property>
</configuration>
<main-class>org.apache.hadoop.examples.ExampleDriver</main-class>
<arg>wordcount</arg>
<arg>${inputDir}</arg>
<arg>${outputDir}</arg>
</java>
<ok to="end" />
<error to="fail" />
</action>
<kill name="fail">
<message>Java failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name='end' />
</workflow-app>

Once the workflow is created, it has to be deployed correctly. A typical Oozie deployment is an HDFS directory containing workflow.xml and a lib subdirectory with the JAR files of the classes used by the workflow actions.
 
For example, the directory structure in HDFS will be as shown below (if user.name is training):

[hadoop@localhost ~]$ hadoop dfs -ls /user/training/oozie/workflow/wordcount
Found 2 items
drwxr-xr-x   - training supergroup          0 2012-09-18 12:05 /user/training/oozie/workflow/wordcount/lib
-rw-r--r--   1 training supergroup        918 2012-09-18 11:47 /user/training/oozie/workflow/wordcount/workflow.xml

The job.properties file will have following properties:
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
queueName=default
inputDir=${nameNode}/data.in 
outputDir=${nameNode}/out

user.name=training
oozie.wf.application.path=${nameNode}/user/${user.name}/oozie/workflow/wordcount/

With the job properties in place, this workflow can be invoked manually using the oozie job command from the command line:
[training@localhost Desktop]$ oozie job -oozie=http://localhost:11000/oozie/ -config oozie/wordcount-demo/workflow/job.properties -run;
job: 0000000-120918134457517-oozie-oozi-W

2. Oozie Coordinator Definition: As discussed above, coordinator definitions differ for different kinds of triggering and scheduling.
So, we will take each kind of Coordinator one by one and schedule the wordcount example on that basis.

Moreover, Oozie coordinators can be parameterized using variables like ${inputDir}, ${startTime}, etc. within the coordinator definition. When submitting a coordinator job, values for these parameters must be provided as input. As parameters are key-value pairs, they can be written in a job.properties file or an XML file. Parameters can also be provided in the form of a Java Properties or Map object when using the Java API to invoke a coordinator job, as sketched below.
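As an illustration, here is a minimal sketch of both approaches, using the org.apache.oozie.client API that also appears in the Java client section later in this post; the server URL, the properties file name and the parameter values are placeholders only.

import java.io.FileInputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.OozieClientException;

public class CoordinatorParameterExample {

    public static void main(String[] args) throws IOException, OozieClientException {
        // Placeholder Oozie server URL
        OozieClient client = new OozieClient("http://localhost:11000/oozie");

        // Option 1: key-value parameters loaded from a .properties file
        Properties conf = client.createConfiguration();
        conf.load(new FileInputStream("coordinatorjob.properties"));

        // Option 2: key-value parameters supplied as a java.util.Map
        Map<String, String> params = new HashMap<String, String>();
        params.put("startTime", "2012-08-31T20:20Z");
        params.put("frequency", "60");
        conf.putAll(params);

        // Submit the coordinator job; it stays in PREP state until its start time
        String jobId = client.submit(conf);
        System.out.println("Submitted coordinator job: " + jobId);
    }
}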
  • Time-Based Coordinator
The generic definition for this kind of coordinator is
<coordinator-app name="coordinator1" frequency="${frequency}" start="${startTime}" end="${endTime}" timezone="${timezone}" xmlns="uri:oozie:coordinator:0.1">
<action>
<workflow>
<app-path>${workflowPath}</app-path>
</workflow>
</action>           
</coordinator-app>

Save the file as coordinator.xml in an HDFS directory. (Please note that coordinator.xml is the only name that can be given to the file, as Oozie looks for this default name when reading the file from the HDFS directory.)
The coordinatorjob.properties can be defined as 
frequency=60
startTime=2012-08-31T20\:20Z
endTime=2013-08-31T20\:20Z
timezone=GMT+0530

workflowPath=${nameNode}/user/${user.name}/oozie/workflow/wordcount/
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
queueName=default
inputDir=${nameNode}/data.in 
outputDir=${nameNode}/out

oozie.coord.application.path=${nameNode}/user/${user.name}/coordOozie/coordinatorTimeBased

The coordinator application path must be specified in the file with the oozie.coord.application.path property. The specified path must be an HDFS path.
  • File-Based Coordinator
<coordinator-app name="coordinator1" frequency="${frequency}" start="${startTime}" end="${endTime}" timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
<datasets>
<dataset name="input1" frequency="${datasetfrequency}" initial-instance="${datasetinitialinstance}"
timezone="${datasettimezone}">
<uri-template>${dataseturitemplate}/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}</uri-template>
<done-flag> </done-flag>
</dataset>
</datasets>
<input-events>
<data-in name="coordInput1" dataset="input1">
<start-instance>${inputeventstartinstance}</start-instance>
<end-instance>${inputeventendinstance}</end-instance>
</data-in>
</input-events>
<action>
<workflow>
<app-path>${workflowPath}</app-path>
</workflow>
</action>     
</coordinator-app>

Save the file as coordinator.xml in an HDFS directory. (Please note that coordinator.xml is the only name that can be given to the file, as Oozie looks for this default name when reading the file from the HDFS directory.)

The coordinatorjob.properties can be defined as
frequency=60
startTime=2012-08-21T15:25Z
endTime=2012-08-22T15:25Z
timezone=UTC
datasetfrequency=15
datasetinitialinstance=2012-08-21T15:30Z
datasettimezone=UTC
dataseturitemplate=${nameNode}/user/hadoop/oozie/coordinator/in
inputeventstartinstance=${coord:current(0)}
inputeventendinstance=${coord:current(0)}

workflowPath=${nameNode}/user/${user.name}/oozie/workflow/wordcount/
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
queueName=default
inputDir= ${coord:dataIn('coordInput1')} 
outputDir=${nameNode}/out

oozie.coord.application.path=${nameNode}/user/${user.name}/coordOozie/coordinatorFileBased

The coordinator application path must be specified in the file with the oozie.coord.application.path property. The specified path must be an HDFS path.
  • Sliding-Window Based Coordinator
This is a specific use case of the File-Based Coordinator in which the coordinator is invoked frequently and data is aggregated over multiple overlapping previous instances.
 
The rule for this can be generalized as
Coordinator-frequency < DataSet-Frequency
For example, the coordinator job.properties will contain:
frequency=5

datasetfrequency=15
……
  • Rollups Based Coordinator
This is a specific use case of the File-Based Coordinator in which the coordinator is invoked after a long period of time and data is aggregated over multiple previous instances since the last invocation.

The rule for this can be generalized as
Coordinator-frequency > DataSet-Frequency
frequency=1440
….
datasetfrequency=60
…….

Running Coordinator Example from Command line

  • Submitting/Running the coordinator job
$ oozie job -oozie http://localhost:11000/oozie -config coordinatorjob.properties [-submit][-run]
job: 0000672-120823182447665-oozie-hado-C
The parameters for the job must be provided in a file, either a Java Properties file (.properties) or a Hadoop XML Configuration file (.xml). This file must be specified with the -config option.
  • Suspending the coordinator job
$ oozie job -oozie http://localhost:11000/oozie -suspend 0000673-120823182447665-oozie-hado-C
  • Resuming a Coordinator Job
$ oozie job -oozie http://localhost:11000/oozie -resume 0000673-120823182447665-oozie-hado-C
  • Killing a Coordinator Job
$ oozie job -oozie http://localhost:11000/oozie -kill 0000673-120823182447665-oozie-hado-C
  • Rerunning a Coordinator Action or Multiple Actions
$ oozie job -rerun 0000673-120823182447665-oozie-hado-C [-nocleanup] 
[-refresh][-action 1,3-5] [-date 2012-01-01T01:00Z::2012-05-31T23:59Z, 2012-11-10T01:00Z, 2012-12-31T22:00Z] 
Either -action or -date is required for a rerun. If neither -action nor -date is given, an exception will be thrown.
  • Checking the Status of a Coordinator/Workflow job or a Coordinator Action
$ oozie job -oozie http://localhost:11000/oozie -info 0000673-120823182447665-oozie-hado-C
The info option can display information about a workflow job or coordinator job or coordinator action.

Invoking Coordinator Jobs from Java Client

Oozie exposes a Java API for invoking and controlling workflows programmatically. The same API is also applicable to coordinators, with some changes, as coordinators and workflows differ in how they function.


// Imports below assume the standard Oozie client API (org.apache.oozie.client);
// depending on the client version, the coordinator status enum is exposed as
// Job.Status (used here) or as CoordinatorJob.Status.
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

import org.apache.oozie.client.CoordinatorJob;
import org.apache.oozie.client.Job.Status;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.OozieClientException;

//The service for executing coordinators on oozie
public class CoordinatorOozieService
{
    // Oozie Client
    OozieClient oozieClient = null; 

    public CoordinatorOozieService(String url){      
        oozieClient = new OozieClient(url);        
    }              

    //To submit the coordinator job on oozie
    public String submitJob(String jobPropertyFilePath) throws OozieClientException, IOException{                       

        // create an empty coordinator job configuration object
        //with just the USER_NAME set to the JVM user name
        Properties conf = oozieClient.createConfiguration();

        conf.setProperty("user.name", "training");

        //set the coordinator properties
        conf.load(new FileInputStream(jobPropertyFilePath));

        // submit the coordinator job    
        return oozieClient.submit(conf);
    } 

    //To submit the coordinator job on oozie
    public String submitJob(Properties workflowProperties) throws OozieClientException, IOException{                       

        // create an empty coordinator job configuration object
        //with just the USER_NAME set to the JVM user name
        Properties conf = oozieClient.createConfiguration();

        //set the coordinator properties
        conf.putAll(workflowProperties);

        conf.setProperty("user.name", "training");


        // submit the coordinator job    
        return oozieClient.submit(conf);
    }

    // To run (submit and start) the coordinator job on oozie
    public String runJob(String jobPropertyFilePath) throws OozieClientException, IOException{

        // create an empty coordinator job configuration object
        //with just the USER_NAME set to the JVM user name

        Properties conf = oozieClient.createConfiguration();

        conf.setProperty("user.name", "training");

        //set the coordinator properties
        conf.load(new FileInputStream(jobPropertyFilePath));

        // submit and start the coordinator job    
        return oozieClient.run(conf);
    } 

    // To suspend the coordinator job on oozie
    public void suspendJob(String jobId) throws OozieClientException {                    
        // suspend the coordinator job
        oozieClient.suspend(jobId);
    }

    // To resume the coordinator job on oozie
    public void resumeJob(String jobId) throws OozieClientException {                       
        // resume the coordinator job
        oozieClient.resume(jobId);
    }

    //To kill the coordinator job on oozie
    public void killJob(String jobId) throws OozieClientException {                       
        // kill the coordinator job
        oozieClient.kill(jobId);
    }

    //To get the status of the Coordinator Job with id <jobID>
    public Status getJobStatus(String jobID) throws OozieClientException{ 
        CoordinatorJob job = oozieClient.getCoordJobInfo(jobID);                
        return job.getStatus();
    }              
}
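
To round this off, a short, hypothetical usage of the class above might look like the following; the Oozie server URL and the properties file path are placeholders.

public class CoordinatorOozieServiceDemo {

    public static void main(String[] args) throws Exception {
        // Point the service at the Oozie server (placeholder URL)
        CoordinatorOozieService service =
                new CoordinatorOozieService("http://localhost:11000/oozie");

        // Submit and start a coordinator job using a job.properties file
        String jobId = service.runJob("coordinatorjob.properties");
        System.out.println("Coordinator job id: " + jobId);

        // Check the current status of the coordinator job
        System.out.println("Status: " + service.getJobStatus(jobId));

        // The job can later be suspended, resumed or killed by id, for example:
        // service.suspendJob(jobId);
        // service.resumeJob(jobId);
        // service.killJob(jobId);
    }
}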

Conclusion

The Oozie Coordinator can be used for efficient scheduling of Hadoop-related workflows, and it also helps in triggering them on the basis of data availability or external events. Moreover, it provides a lot of configurable and pluggable components, which helps in easy and effective deployment and maintenance of Oozie workflow jobs.
As the coordinator is specified in XML, it is easy to integrate with J2EE applications. Invoking coordinator jobs through Java has already been explained above.

Enhancements

Oozie provides a new component, “Bundle”, in its latest version, 3. It offers a higher-level abstraction that groups a set of coordinator applications, often called a data pipeline. Data dependencies can be inserted between multiple coordinator jobs to create an implicit data application pipeline. Oozie lifecycle operations (start/stop/suspend/resume/rerun) can also be applied at the bundle level, which results in better and easier operational control.

References

[1] Oozie Yahoo! Workflow Engine for Hadoop: http://incubator.apache.org/oozie/docs/3.1.3-incubating/docs/ 
[2] Oozie Coord Use Cases: https://github.com/yahoo/oozie/wiki/Oozie-Coord-Use-Cases
[3] Better Workflow Management in CDH Using Oozie 2: https://github.com/yahoo/oozie/wiki/Oozie-Coord-Use-Cases
[5] Apache Hadoop: http://hadoop.apache.org/
[6] Index of public/org/apache/hadoop/hadoop-examples/0.20.2-cdh3u0: 
https://repository.cloudera.com/artifactory/public/org/apache/hadoop/hadoop-examples/0.20.2-cdh3u0/

Thursday, 31 July 2014

Compare Big Data Platforms vs SQL Server


Problem

SQL is derided by modern developers as 'Scarcely Qualifies as a Language'. But just how efficient are the new wave of NoSQL languages touted by bleeding-edge skunk works? This tip is a defense of SQL and of the relational model, and argues for the efficiency and suitability of relational technology.  Check out this tip to learn more.

Solution

SQL, or Structured Query Language, is one of the most versatile tools that an information analyst can use when working with relational databases. Initially developed around 40 years ago as an application of the mathematical theories of Codd, Boyce and others, it has since been embraced by the information technology industry and transformed into a range of dialects for a variety of platforms. With the core syntax and features reflected in easy-to-remember keywords (SELECT, INSERT, UPDATE, DELETE), its popularity has soared and over the last twenty years, three vendors have come to the fore: Microsoft, with SQL Server; Oracle, with Oracle Database; and MySQL, originally developed by MySQL AB, later acquired by Sun and now owned by Oracle.

However, as astute readers of technology blogs may know, Big Data has come galloping onto the horizon. Arguably, the applications of Big Data originated (in recent times) with Google's MapReduce algorithms, and it has evolved with tremendous speed into a range of products: Node.js, Hadoop, MongoDB, RavenDB and more. Searching Google, one can find multitudes of articles expounding the benefits of Big Data. Marketers, long deprived of sexy sales slogans for 'boring' relational databases, have taken the term to heart, with the result that Big Data is now not only the 'next big thing' but is being touted as the only way forward in the shiny new 'Information Age'. Companies are being exhorted to use these new tools to leverage their existing data and expand their businesses in ways they have never been able to before.

Storage Format for Relational Databases vs. NoSQL

So, what's the difference between relational data and non-relational data - or SQL, and NoSQL (aka NewSQL)? Relational data is defined at the basic level by a series of table entities which contain columns and rows, linked to other table entities by shared attributes. So, for example, as the owner of a small online business you might have a MySQL database behind your website with a table recording the name and email address of your customers. Another table might record your product names and their prices. A third table might link the two, recording which customers bought which products, with additional information such as the date of purchase and whether or not any discount was applied.
You can quickly see how this information could be useful: some analysis will give you the average spend per customer, a list of regular customers and inactive ones, and a list of the most popular products. From this simple data you can make good business decisions: you can tempt back inactive customers with a targeted email campaign, or adjust your stock to prioritize the most popular products. The possibilities are endless. At the core level, these queries against your data are performed in SQL, although various tools exist to hide this from the casual user.
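As a purely illustrative sketch (the JDBC connection string, table names and column names below are invented, not taken from any real schema), the 'average spend per customer' analysis could be expressed as a single SQL aggregate over two joined tables:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class AverageSpendReport {

    public static void main(String[] args) throws Exception {
        // Hypothetical MySQL connection string for the small online business
        String url = "jdbc:mysql://localhost:3306/shop?user=report&password=secret";

        try (Connection con = DriverManager.getConnection(url);
             Statement stmt = con.createStatement();
             // Average spend per customer, computed by the database with a JOIN and GROUP BY
             ResultSet rs = stmt.executeQuery(
                     "SELECT c.name, AVG(o.amount) AS avg_spend "
                   + "FROM customers c JOIN orders o ON o.customer_id = c.id "
                   + "GROUP BY c.name")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + ": " + rs.getDouble("avg_spend"));
            }
        }
    }
}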

Non-relational data, however, is not (by and large) stored in tables. Often called 'unstructured data', this data consists of separate records with attributes that vary, often per record. Say, for example, your small online business was a dating website recording, from a web page, the interests and ambitions of your members. Due to the huge variety of interests, you might wish to store each member as a document with the interests, and any attributes of those interests, laid out hierarchically. Relational databases are no good for this purpose - an 'Interests' table might list every conceivable interest, but recording an unknown number and combination of interests against each member forces the relational model to expand in ways it wasn't designed for. Standards such as XML and JSON, however, with variable nodes and node attributes, can be ideal since each record is stored uniquely with complex algorithms used for analysis and presentation.
[Figure: Relational Data - Example Table Structure]
[Figure: NoSQL - Example Record in JSON]

Data Variation

So Big Data is all very well when dealing with unstructured data. When caching common Google queries, or recording masses of highly-variable customer-centric information, the technologies that use it can perform well. But what about the majority of businesses? How often will their data structures change? After all, most online merchants will have customers, products, sales, suppliers, etc. In terms of the variation of the data - the measure of how the structure of data is subject to change over time - most merchants, through the very nature of business, will find they're working to a fairly static data model.

After all, if you are a curtain manufacturer, your products will all have identical attributes - color, width, drop, price, fabric, style, thickness, etc. If you're a factory selling ball bearings, your ball bearings will have similar properties - diameter, density, material, price, quantity per batch, etc. These attributes are not likely to expand over time, and any expansion would take place within the data facts, rather than the data dimensions - say, for example, a new type of metal alloy was invented, this alloy would be added as a row in the materials table, rather than as a column in the products table.

So if we accept the argument that the rate of data variation is low amongst most businesses, then how important is the rise of Hadoop, Node.js, MongoDB/RavenDB and all the other NoSQL platforms? Is the rise of NoSQL simply the public perception of an inflated marketing bubble? We can argue that for OnLine Transaction Processing (OLTP) systems, which match the requirements for most businesses, NoSQL is not the right tool for the job.

Searching for Data

Let's now look at the algorithms for searching for data. Under a NoSQL data model, such as Hadoop, there exist a number of variations on a theme (and you can write your own) for searching for data using the MapReduce paradigm. This paradigm is that a typical analysis task, such as a series of read requests, is sent out to multiple nodes - servers - where the data resides. Each subset of the total request collects the data from the local data source, before a reduction algorithm aggregates and reduces the data to produce the desired result set. Some thought will show this approach works for certain types of unstructured data - for example, for string-searching unstructured tweets.
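
To make the paradigm concrete, here is a rough sketch, using Hadoop's Java MapReduce API, of the tweet string-search example: each mapper scans its local subset of the records and emits a key-value pair for every match, and the reducer aggregates the partial counts. The class names and the search term are illustrative only.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TweetSearch {

    public static class SearchMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final String TERM = "big data";          // placeholder search term
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text tweet, Context context)
                throws IOException, InterruptedException {
            // The map step runs on the node where this subset of the data resides
            if (tweet.toString().toLowerCase().contains(TERM)) {
                context.write(new Text(TERM), ONE);
            }
        }
    }

    public static class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text term, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            // The reduce step aggregates the partial results from all nodes
            int total = 0;
            for (IntWritable c : counts) {
                total += c.get();
            }
            context.write(term, new IntWritable(total));
        }
    }
}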

However, for more structured data, this approach is flawed. Structured data tends to be held in central repositories rather than 'sharded' across multiple servers, and under a relational model searching for data is extremely efficient. For example, in Microsoft SQL Server the search algorithm can approach a pre-sorted table (a table using a clustered index based on a balanced-tree format) and search for particular values using this index, and/or additional indexes (think of them like overlays to the data) to locate and return the data. I'll skip the mathematics here, but many millions of records can hence be reduced to a b-tree search problem so that a single- or double-digit number of reads are required to return the desired data. In addition, partitioning can replicate some of the performance benefits of sharding, since sharding is horizontal partitioning except distributed across servers.

Reads and Writes

Carrying on with this theme, Big Data platforms such as Hadoop are acknowledged to be quicker at writes than relational databases. Why? Because in Hadoop, writes are 'thrown over the fence' asynchronously with no wait on the commit from the database engine. In relational systems, a write is an atomic, consistent, isolated, durable (ACID) transaction and only completes on commit - hence synchronous in its very nature. This can cause unnecessary delays for applications that don't wish to wait for commit acknowledgement. Hadoop easily outperforms relational systems here, for the sheer asynchrony of the process.

But what about reads? As argued above, reads take much longer, since the result set must be located (through mapping key-value pairs) then reduced across multiple nodes. Take a look at the JSON record in the image above - locating every unique product bought in interval X to Y would require a search on every customer record. If there are thousands of customers for a dozen products, this is extremely wasteful. Using a relational table, the query is broken down, optimized, filtered and returned much quicker. With relational systems, efficient structured reads (even on heaps, which tend to have sequential data stored on adjacent disk pages) can take place resulting in returns in the order of milliseconds, even for large result sets. Indeed, often the client application takes longer to render the result set than the server takes to supply it!

Total Cost of Ownership

'But it's FREE!', comes the wail from the IT manager. Is it? Apache Hadoop, to take an example, is indeed free at the point of use. The Apache Foundation has released Hadoop as open source, so anyone can view the source code and see how it works. Anyone can download it. There are no core licensing fees. However, this doesn't mean it doesn't cost anything. The servers on which Hadoop must reside do indeed cost money. And as Big Data platforms are designed to scale out, rather than scale up, you'll need more and more servers to service your data requirements. Whereas production database systems servicing major businesses (and I can name a few) can sit on just a handful of relational database servers, Hadoop platforms are designed for large clusters of nodes - and if you can't afford them, Hadoop may not be for you.

'Just use commodity hardware!', is the next argument. Really - tie together hundreds of old PCs or battered low-rent servers in a crazy Beowulf-like cluster? Not only does this present major headaches when considering the long-term costs of storage, cooling and maintenance, it simply isn't viable for small- to mid-size businesses. The fact is that the TCO for a Big Data platform is simply too high, even when compared with Microsoft's frankly appalling licensing costs for SQL Server. There is a third way - using an open-source relational platform, such as MySQL - but there are some missing features when compared against larger offerings and there's still the cost of support. However, some notable names out there use MySQL - Facebook being the largest - and for certain applications this may be a suitable compromise.

Object-Relational Mapping

ORM, better known amongst DBAs and relational database developers as 'Object-Relational Mangling', is an umbrella term for a number of new tools that allow developers to treat a connection to a database as an object, and interact with that object using pre-set methods. The ORM tool (such as nHibernate) will then translate the method call into a SQL query, which is sent to the database for processing. The result set is then returned to the entity that called the method. Since object-oriented programming does not play nicely with the relational model, there are inevitably some challenges when ORM tools generate SQL code. For example, nHibernate will generate large numbers (sometimes hundreds) of parameters to feed into a parameterized statement. This makes the resulting automatically generated statements hard to optimize, filling the plan cache (an optimizer feature) in SQL Server with hundreds of ad-hoc plans and increasing server response times thanks to increased I/O.

On the flip side, modern developers normally use object-oriented methods to develop new tools, and calling a method on a connection object, passing in input parameters to return a result set, is natural and fits with the paradigm of method calls for the other 'black-box' features or libraries used in their languages. From the developer perspective, the ideal is to call a method with, for example, a customer number and return some attributes of that customer from the database into a result set. Under the relational-only model, they would need to either roll their own method that dynamically builds the query based on input parameters, or use in-line SQL queries, which is bad coding practice. Products such as nHibernate help bridge the gap by providing that level of translation to the relational layer.

Right Tool For The Job?

One consistent argument used by developers when attempting to win over the stubborn DBA with NoSQL tooling is that SQL and NoSQL are different toolsets to be used for different purposes. Confusing the two when trying to solve a problem that clearly belongs in the remit of just one platform is equivalent to using a screwdriver to hammer in a nail. But how true is this? Take server logs, for example. An extended log file for IIS 6.0 might contain one or more of the following fields: date, time, c-ip, cs-username, s-ip, s-port, cs-method, cs-uri-stem, cs-uri-query, sc-status, cs(User-Agent), etc. Although the fields are defined in a #comment at the beginning of the file, these fields are unlikely to change through the lifetime of the product. The data recorded is a mixture of date, time, IP address and character fields.

Aggregating results from these logs is simple, and relational systems (with the right indexes) can do this with outstanding performance. In a NoSQL document store, each row is stored as a separate document inside a collection; to get, for example, the called URL from all records matching a particular IP, all documents will need to be scanned (string-searching on attribute), collated then reduced to the subset of data required. This approach is extremely inefficient, but analysis of server logs is frequently touted as a good application of NoSQL technology to real-world data mining.
There is an argument that issues such as the ORM/relational mismatch help provide the justification for a move to a non-relational platform. Various companies have tried it, and anecdotal evidence highlights various problems not just with the migration of data, but the ongoing benefits of it. For example, Foursquare's public apology following their catastrophic attempt to add a shard to their estate which resulted in an 11-hour outage; and the engaging account from Wireclub of the benefits (and pitfalls) of their migration from SQL Server to MongoDB should sound alarm bells with even the most pro-NoSQL DBA or developer - '...not only the database would crash and exit without providing any useful information but the data would also be left in a corrupted state.' (on MongoDB for Windows). However, the companies in both these examples determined NoSQL to be the data platform most suited to their future development as an organization, and they are learning to live with the features unavailable in more sophisticated relational platforms.

In Summary

Relational systems may have been born in the 1970s, but the feature set offered by the more advanced examples of relational platforms far outstrips the feature sets available in NoSQL implementations. Although NoSQL purports to be new, the concepts have been around for more than 20 years; look up 'network databases' and see the striking similarities to modern-day graph databases. The generation before us tried non-relational systems and the concepts simply didn't work - one can see the same story with new Agile and LEAN processes producing rafts of cheap code that's apt to break, versus the SSADM (Big Design Up Front) methodologies that gave us quality systems like our government defense and our banking infrastructure. NoSQL campaigners now say 'not only SQL' rather than 'no SQL' in belated realization that SQL actually has features to offer; these add-ons, such as Hive for Hadoop, provide a SQL-NoSQL translation layer. But the underlying document store, described by a colleague as a 'massive version of Notepad', is inefficient. This failure to build efficacious foundations of the new Information Age will be the undoing of our data-driven systems unless we tackle the structural problems at the heart of these new products, and rebuild them in an economical way.

Next Steps

For further reading on structured and unstructured data storage technologies and expositions on the arguments above, please consult the following sources and read the extensive opinions available on the web:

 
