Thursday, 31 July 2014

How a little open source project came to dominate big data

It began as a nagging technical problem that needed solving. Now, it’s driving a market that’s expected to be worth $50.2 billion by 2020.

There are countless open source projects with crazy names in the software world today, but the vast majority of them never make it onto enterprises’ collective radar. Hadoop is an exception of pachydermic proportions.

Named after a child’s toy elephant, Hadoop is now powering big data applications at companies such as Yahoo and Facebook; more than half of the Fortune 50 use it, providers say.

The software’s “refreshingly unique approach to data management is transforming how companies store, process, analyze and share big data,” according to Forrester analyst Mike Gualtieri. “Forrester believes that Hadoop will become must-have infrastructure for large enterprises.”

Globally, the Hadoop market was valued at $1.5 billion in 2012; by 2020, it is expected to reach $50.2 billion.

It’s not often a grassroots open source project becomes a de facto standard in industry. So how did it happen?

‘A market that was in desperate need’

“Hadoop was a happy coincidence of a fundamentally differentiated technology, a permissively licensed open source codebase and a market that was in desperate need of a solution for exploding volumes of data,” said RedMonk cofounder and principal analyst Stephen O’Grady. “Its success in that respect is no surprise.”
Created by Doug Cutting and Mike Cafarella, the software—like so many other inventions—was born of necessity. In 2002, the pair were working on an open source search engine called Nutch. “We were making progress and running it on a small cluster, but it was hard to imagine how we’d scale it up to running on thousands of machines the way we suspected Google was,” Cutting said.

Shortly thereafter Google published a series of academic papers on its own Google File System and MapReduce infrastructure systems, and “it was immediately clear that we needed some similar infrastructure for Nutch,” Cafarella said.
“The way Google was approaching things was different and powerful,” Cutting explained. Whereas up to that point “you had to build a special-purpose system for each distributed thing you wanted to do,” Google’s approach instead offered a general-purpose, automated framework for distributed computing. “It took care of the hard part of distributed computing so you could focus just on your application,” Cutting said.

Both Cutting and Cafarella (who are now chief architect at Cloudera and University of Michigan assistant professor of computer science and engineering, respectively) knew they wanted to make a version of their own—not just for Nutch, but for the benefit of others as well—and they knew they wanted to make it open source.
“I don’t enjoy the business aspects,” Cutting said. “I’m a technical guy. I enjoy working on the code, tackling the problems with peers and trying to improve it, not trying to sell it. I’d much rather tell people, ‘It’s kind of OK at this; it’s terrible at that; maybe we can make it better.’ To be able to be brutally honest is really nice—it’s much harder to be that way in a commercial setting.”

But the pair knew that the potential upside of success could be staggering.  “If I was right and it was useful technology that lots of people wanted to use, I’d be able to pay my rent—and without having to risk my shirt on a startup,” Cutting said.
For Cafarella, “Making Nutch open source was part of a desire to see search engine technology outside the control of a few companies, but also a tactical decision that would maximize the likelihood of getting contributions from engineers at big companies. We specifically chose an open source license that made it easy for a company to contribute.”

It was a good decision. “Hadoop would not have become a big success without large investments from Yahoo and other firms,” Cafarella said.

‘How would you compete with open source?’

So Hadoop borrowed an idea from Google, made the concept open source, and both encouraged and got investment from powerhouses like Yahoo. But that wasn’t all that drove its success. Luck—in the form of sheer, unanticipated market demand—also played a key role.

“I knew other people would probably have similar problems, but I had no idea just how many other people,” Cutting said. “I thought it would be mostly people building text search engines. I didn’t see it being used by folks in insurance, banking, oil discovery—all these places where it’s being used today.”

Looking back, “my conjecture is that we were early enough, and that the combination of being first movers and being open source and being a substantial effort kept there from being a lot of competitors early on,” he said. “Mike and I got so far, but it took tens of engineers from Yahoo several more years to make it stable.”
And even if a competitor did manage to catch up, “how would you compete with something open source?” Cutting said. “Competing against open source is a tough game—everybody else is collaborating on it; the cost is zero. It’s easier to join than to fight.”

IBM, Microsoft, and Oracle are among the large companies that chose to collaborate on Hadoop.
Though Cafarella isn’t surprised that Web companies use Hadoop, he is astonished at “how many people now have data management problems that 12 years ago were exceedingly rare,” he said. “Everyone now has the problems that used to belong to just Yahoo and Google.”

Hadoop represents “somewhat of a turning point in the primary drivers of open source software technology,” said Jay Lyman, a senior analyst for enterprise software with 451 Research. Before, open source software such as the Linux operating system was best known for offering a cost-effective alternative to proprietary software like Microsoft’s Windows. “Cost savings and efficiency drove much of the enterprise use,” Lyman said.

With the advent of NoSQL databases and Hadoop, however, “we saw innovation among the primary drivers of adoption and use,” Lyman said. “When it comes to NoSQL or Hadoop technology, there is not really a proprietary alternative.”
Hadoop’s success has come as a pleasant surprise to its creators. “I didn’t expect an open source project would ever take over an industry like this,” Cutting said. “I’m overjoyed.”

And it’s still on a roll. “Hadoop is now much bigger than the original components,” Cafarella said. “It’s an entire stack of tools, and the stack keeps growing. Individual components might have some competition—mainly MapReduce—but I don’t see any strong alternative to the overall Hadoop ecosystem.”

The project’s adaptability “argues for its continued success,” RedMonk’s O’Grady said. “Hadoop today is a very different, and more versatile, project than it was even a year or two ago.”

But there’s plenty of work to be done. Looking ahead, Cutting—with the support of Cloudera—has begun to focus on the policy needed to accommodate big data technology.

“Now that we have this technology and so much digitization of just about every aspect of commerce and government and we have these tools to process all this digital data, we need to make sure we’re using it in ways we think are in the interests of society,” he said. “In many ways, the policy needs to catch up with the technology.

“One way or other, we are going to end up with laws. We want them to be the right ones.”

 

Introduction to the Hadoop Ecosystem

When Hadoop 1.0.0 was released by Apache in 2011, comprising mainly HDFS and MapReduce, it soon became clear that Hadoop was not simply another application or service, but a platform around which an entire ecosystem of capabilities could be built. Since then, dozens of self-standing software projects have sprung into being around Hadoop, each addressing a variety of problem spaces and meeting different needs.
Many of these projects were begun by the same people or companies who were the major developers and early users of Hadoop; others were initiated by commercial Hadoop distributors. The majority of these projects now share a home with Hadoop at the Apache Software Foundation, which supports open-source software development and encourages the development of the communities surrounding these projects.
The following sections are meant to give the reader a brief introduction to the world of Hadoop and the core related software projects. There are countless commercial Hadoop-integrated products focused on making Hadoop more usable and layman-accessible, but the projects described here were chosen because they provide core functionality and speed improvements in Hadoop.
The so-called "Hadoop ecosystem" is, as befits an ecosystem, complex, evolving, and not easily parcelled into neat categories. Simply keeping track of all the project names may seem like a task of its own, but this pales in comparison to the task of tracking the functional and architectural differences between projects. These projects are not meant to all be used together, as parts of a single organism; some may even be seeking to solve the same problem in different ways. What unites them is that they each seek to tap into the scalability and power of Hadoop, particularly the HDFS component of Hadoop.

HDFS

The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines, rather than requiring a single machine with disk capacity at least as large as the total size of the files. HDFS is designed to be fault-tolerant thanks to the replication and distribution of data. When a file is loaded into HDFS, it is split into "blocks" of data, each of which is replicated and stored across the cluster nodes designated for storage, a.k.a. DataNodes.


At the architectural level, HDFS requires a NameNode process to run on one node in the cluster and a DataNode service to run on each "slave" node that will be storing data. When data is loaded into HDFS, it is split into blocks that are replicated and distributed across the DataNodes. The NameNode is responsible for storing and managing the metadata, so that when MapReduce or another execution framework calls for the data, the NameNode informs it where the needed data resides.
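
As a rough sketch of how a client interacts with this architecture, the following Java snippet uses the HDFS FileSystem API to write a small file and read back its block size. The NameNode URI and file path are placeholders, and the code assumes a standard Hadoop client configuration is available on the classpath.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            // Connect to the NameNode (hostname and port are placeholders);
            // the NameNode serves metadata, while block data is streamed
            // directly to and from the DataNodes.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(
                new URI("hdfs://namenode.example.com:8020"), conf);

            Path path = new Path("/data/example.txt");
            try (FSDataOutputStream out = fs.create(path)) {
                out.writeUTF("hello hdfs");   // stored as replicated blocks
            }

            // The block size governs how the file is split across DataNodes.
            System.out.println("block size: " + fs.getFileStatus(path).getBlockSize());
            fs.close();
        }
    }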

[Figure: HDFS Architecture]
One significant drawback to HDFS is that it has a single point of failure (SPOF): the NameNode service. If the NameNode or the server hosting it goes down, HDFS is down for the entire cluster. The Secondary NameNode, which periodically checkpoints the NameNode's metadata, is not a backup NameNode and cannot take over if the NameNode fails.
Currently the most comprehensive solution to this problem comes from MapR, one of the major Hadoop distributors. MapR has developed a "distributed NameNode," where the HDFS metadata is distributed across the cluster in "Containers," which are tracked by the Container Location Database (CLDB).
[Figure: Regular NameNode architecture vs. MapR's distributed NameNode architecture (via MapR)]

The Apache community is also working to address this NameNode SPOF: Hadoop 2.0.2 will include an update to HDFS called HDFS High Availability (HA), which provides the user with "the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby. This allows a fast failover to a new NameNode in the case that a machine crashes, or a graceful administrator-initiated failover for the purpose of planned maintenance." The active NameNode logs all changes to a directory that is also accessible by the standby NameNode, which then uses the log information to update itself.


[Figure: Architecture of the HDFS High Availability framework]

MapReduce

The MapReduce paradigm for parallel processing comprises two sequential steps: map and reduce.
In the map phase, the input is a set of key/value pairs, and the desired function is executed over each key/value pair to generate a set of intermediate key/value pairs.
In the reduce phase, the intermediate key/value pairs are grouped by key and the values are combined according to the reduce code provided by the user; for example, summing. Depending on the operation coded by the user, a reduce phase may not be required at all.

At the cluster level, the MapReduce processes are divided between two services, JobTracker and TaskTracker. JobTracker runs on only one node of the cluster, while TaskTracker runs on every slave node in the cluster. Each MapReduce job is split into a number of tasks, which are assigned to the various TaskTrackers based on which data is stored on each node. JobTracker is responsible for scheduling job runs and managing computational resources across the cluster, and it oversees the progress of each TaskTracker as it completes its individual tasks.
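
To make the map and reduce phases concrete, here is a minimal word-count sketch written against the newer org.apache.hadoop.mapreduce API (Hadoop 2.x). The mapper emits (word, 1) pairs, the reducer sums them, and the driver wires the job together; the class names and the input/output paths taken from the command line are illustrative only.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: emit an intermediate (word, 1) pair for every word in the input split
        public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        ctx.write(word, ONE);
                    }
                }
            }
        }

        // Reduce phase: the framework groups pairs by key; sum the counts for each word
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);   // local pre-aggregation on each node
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }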


[Figure: MapReduce Architecture]

YARN

As Hadoop became more widely adopted and was used on clusters of up to tens of thousands of nodes, it became obvious that MapReduce 1.0 had issues with scalability, memory usage, and synchronization, as well as its own SPOF. In response, YARN (Yet Another Resource Negotiator) was begun as a subproject of the Apache Hadoop Project, on par with other subprojects like HDFS, MapReduce, and Hadoop Common. YARN addresses problems with MapReduce 1.0's architecture, specifically with the JobTracker service.
Essentially, YARN "split[s] up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global Resource Manager (RM) and per-application ApplicationMaster (AM)." (source: Apache) Thus, rather than burdening a single node with handling scheduling and resource management for the entire cluster, YARN now distributes this responsibility across the cluster.


[Figure: YARN Architecture]

MapReduce 2.0

MapReduce 2.0, or MR2, contains the same execution framework as MapReduce 1.0, but it is built on the scheduling/resource management framework of YARN.
YARN, contrary to widespread misconceptions, is not the same as MapReduce 2.0 (MRv2). Rather, YARN is a general framework which can support multiple instances of distributed processing applications, of which MapReduce 2.0 is one.

Hadoop-Related Projects at Apache

With the exception of Chukwa, Drill, and HCatalog (incubator-level projects), all of the Apache projects mentioned here are top-level projects.
This list is not meant to be all-inclusive, but it serves as an introduction to some of the most commonly used projects and illustrates the range of capabilities being developed around Hadoop. Whirr and Crunch, to name just two, are other Hadoop-related Apache projects not described here.

Avro


Avro is a framework for performing remote procedure calls and data serialization. In the context of Hadoop, it can be used to pass data from one program or language to another, e.g. from C to Pig. It is particularly suited for use with scripting languages such as Pig, because data is always stored with its schema in Avro, and therefore the data is self-describing.
Avro can also handle changes in schema, a.k.a. "schema evolution," while still preserving access to the data. For example, different schemas could be used in serialization and deserialization of a given data set.
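
A short Java sketch illustrates the "data travels with its schema" idea: a record is written to an Avro container file that embeds its schema, so any later reader (even one using an evolved schema) can deserialize it. The schema, field names, and output file below are invented for the example.

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroUserWriter {
        public static void main(String[] args) throws Exception {
            // A record schema invented for this example; Avro stores it in the file itself,
            // which is what makes the data self-describing
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
              + "{\"name\":\"name\",\"type\":\"string\"},"
              + "{\"name\":\"clicks\",\"type\":\"long\"}]}");

            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "alice");
            user.put("clicks", 42L);

            // Write a container file with the schema embedded alongside the data
            DataFileWriter<GenericRecord> writer =
                new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
            writer.create(schema, new File("users.avro"));
            writer.append(user);
            writer.close();
        }
    }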

BigTop


BigTop is a project for packaging and testing the Hadoop ecosystem. Much of BigTop's code was initially developed and released as part of Cloudera's CDH distribution, but has since become its own project at Apache.
The current BigTop release (0.5.0) supports a number of Linux distributions and packages Hadoop together with the following projects: Zookeeper, Flume, HBase, Pig, Hive, Sqoop, Oozie, Whirr, Mahout, SolrCloud, Crunch, DataFu and Hue.

Chukwa


Chukwa, currently in incubation, is a data collection and analysis system built on top of HDFS and MapReduce. Tailored for collecting logs and other data from distributed monitoring systems, Chukwa provides a workflow that allows for incremental data collection, processing and storage in Hadoop. It is included in the Apache Hadoop distribution, but as an independent module.

Drill


Drill is an incubation-level project at Apache and is an open-source version of Google's Dremel. Drill is a distributed system for executing interactive analysis over large-scale datasets. Some explicit goals of the Drill project are to support real-time querying of nested data and to scale to clusters of 10,000 nodes or more.
Designed to support nested data, Drill also supports data with (e.g. Avro) or without (e.g. JSON) schemas. Its primary language is an SQL-like language, DrQL, though the Mongo Query Language can also be used.

Flume


Flume is a tool for harvesting, aggregating and moving large amounts of log data in and out of Hadoop. Flume "channels" data between "sources" and "sinks" and its data harvesting can either be scheduled or event-driven. Possible sources for Flume include Avro, files, and system logs, and possible sinks include HDFS and HBase. Flume itself has a query processing engine, so that there is the option to transform each new batch of data before it is shuttled to the intended sink.
Since July 2012, Flume has been released as Flume NG (New Generation), as it differs significantly from its original incarnation, a.k.a. Flume OG (Original Generation).

HBase


Based on Google's Bigtable, HBase "is an open-source, distributed, versioned, column-oriented store" that sits on top of HDFS. HBase is column-based rather than row-based, which enables high-speed execution of operations performed over similar values across massive data sets, e.g. read/write operations that involve all rows but only a small subset of all columns. HBase does not provide its own query or scripting language, but is accessible through Java, Thrift, and REST APIs.
HBase depends on Zookeeper and runs a Zookeeper instance by default.
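
As a small illustration, the sketch below uses the older HTable-based Java client API (circa HBase 0.94) to write and read a single cell. The table name, column family, and row key are hypothetical, and the snippet assumes an hbase-site.xml with the ZooKeeper quorum is on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBasePageviews {
        public static void main(String[] args) throws Exception {
            // Picks up hbase-site.xml (ZooKeeper quorum, etc.) from the classpath
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "pageviews");  // hypothetical table
            try {
                // Write one cell: row "page#1001", column family "stats", qualifier "count"
                Put put = new Put(Bytes.toBytes("page#1001"));
                put.add(Bytes.toBytes("stats"), Bytes.toBytes("count"), Bytes.toBytes(42L));
                table.put(put);

                // Read the same cell back
                Result result = table.get(new Get(Bytes.toBytes("page#1001")));
                long count = Bytes.toLong(
                    result.getValue(Bytes.toBytes("stats"), Bytes.toBytes("count")));
                System.out.println("count = " + count);
            } finally {
                table.close();
            }
        }
    }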

HCatalog


An incubator-level project at Apache, HCatalog is a metadata and table storage management service for HDFS. HCatalog depends on the Hive metastore and exposes it to other services, such as MapReduce and Pig, with plans to expand to HBase, using a common data model. HCatalog's goal is to simplify the user's interaction with HDFS data and to enable data sharing between tools and execution platforms.

Hive


Hive provides a warehouse structure and SQL-like access for data in HDFS and other Hadoop input sources (e.g. Amazon S3). Hive's query language, HiveQL, compiles to MapReduce. It also allows user-defined functions (UDFs). Hive is widely used, and has itself become a "sub-platform" in the Hadoop ecosystem.
Hive's data model provides a structure that is more familiar than raw HDFS to most users. It is based primarily on three related data structures: tables, partitions, and buckets, where tables correspond to HDFS directories and can be divided into partitions, which in turn can be divided into buckets.
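
Hive is commonly accessed through the HiveServer2 JDBC driver; the sketch below runs a HiveQL aggregation from Java. The host, credentials, and the weblogs table are placeholders, and the query is compiled into MapReduce jobs behind the scenes.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveTopPages {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");  // HiveServer2 driver
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver.example.com:10000/default", "analyst", "");
                 Statement stmt = conn.createStatement()) {

                // HiveQL: compiled into one or more MapReduce jobs by Hive
                ResultSet rs = stmt.executeQuery(
                    "SELECT page, COUNT(*) AS hits FROM weblogs "
                  + "GROUP BY page ORDER BY hits DESC LIMIT 10");
                while (rs.next()) {
                    System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
                }
            }
        }
    }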

Mahout


Mahout is a scalable machine-learning and data mining library. There are currently four main groups of algorithms in Mahout:
  • recommendations, a.k.a. collaborative filtering
  • classification, a.k.a. categorization
  • clustering
  • frequent itemset mining, a.k.a. parallel frequent pattern mining
Mahout is not simply a collection of pre-existing algorithms: many machine learning algorithms are intrinsically non-scalable, in that, given the types of operations they perform, they cannot be executed as a set of parallel processes. Algorithms in the Mahout library belong to the subset that can be executed in a distributed fashion and have been written to be executable in MapReduce.
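
As an example of the collaborative filtering group, the sketch below builds a user-based recommender with Mahout's Taste API. The ratings.csv input file (lines of userID,itemID,preference) and the neighborhood size are assumptions made for illustration.

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class UserRecommenderExample {
        public static void main(String[] args) throws Exception {
            // ratings.csv: lines of userID,itemID,preference (a made-up input file)
            DataModel model = new FileDataModel(new File("ratings.csv"));
            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
            Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

            // Top 3 item recommendations for user 42
            List<RecommendedItem> items = recommender.recommend(42, 3);
            for (RecommendedItem item : items) {
                System.out.println(item.getItemID() + " : " + item.getValue());
            }
        }
    }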

Oozie


Oozie is a job coordinator and workflow manager for jobs executed in Hadoop, which can include non-MapReduce jobs. It is integrated with the rest of the Apache Hadoop stack and, according to the Oozie site, it "support[s] several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts)."
An Oozie workflow is a collection of actions and Hadoop jobs arranged in a Directed Acyclic Graph (DAG), a common model for tasks that must run in a particular sequence and are subject to certain constraints.
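
Workflows are usually submitted via the Oozie CLI or its Java client; the sketch below uses the Java client to launch a workflow already deployed to HDFS. The server URL, application path, and properties shown are placeholders, and a real workflow typically needs additional job-specific configuration.

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.WorkflowJob;

    public class OozieSubmitExample {
        public static void main(String[] args) throws Exception {
            // URL of the Oozie server (placeholder host and port)
            OozieClient oozie = new OozieClient("http://oozie.example.com:11000/oozie");

            // Point at a workflow.xml already deployed to HDFS (placeholder path);
            // real workflows usually require further properties (nameNode, jobTracker, ...)
            Properties conf = oozie.createConfiguration();
            conf.setProperty(OozieClient.APP_PATH,
                "hdfs://namenode.example.com:8020/apps/wordcount");
            conf.setProperty("user.name", "etl");

            String jobId = oozie.run(conf);           // submit and start the workflow
            WorkflowJob job = oozie.getJobInfo(jobId);
            System.out.println(jobId + " is " + job.getStatus());
        }
    }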

Pig


Pig is a framework consisting of a high-level scripting language (Pig Latin) and a run-time environment that allows users to execute MapReduce on a Hadoop cluster. Like HiveQL in Hive, Pig Latin is a higher-level language that compiles to MapReduce.
Pig is more flexible than Hive with respect to possible data formats, thanks to its data model. Via the Pig Wiki: "Pig's data model is similar to the relational data model, except that tuples (a.k.a. records or rows) can be nested. For example, you can have a table of tuples, where the third field of each tuple contains a table. In Pig, tables are called bags. Pig also has a 'map' data type, which is useful in representing semistructured data, e.g. JSON or XML."
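
Pig Latin scripts can be run from the Grunt shell or embedded in Java through PigServer; the sketch below registers a small script that groups web-log records by URL. The input path, field names, and output location are placeholders.

    import java.util.Properties;
    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigUrlCounts {
        public static void main(String[] args) throws Exception {
            // MAPREDUCE mode runs on the cluster; ExecType.LOCAL runs in-process for testing
            PigServer pig = new PigServer(ExecType.MAPREDUCE, new Properties());

            // Each Pig Latin statement below compiles down to MapReduce
            pig.registerQuery("logs = LOAD '/data/weblogs' USING PigStorage('\\t') "
                            + "AS (user:chararray, url:chararray);");
            pig.registerQuery("grouped = GROUP logs BY url;");
            pig.registerQuery("counts = FOREACH grouped GENERATE group AS url, COUNT(logs) AS hits;");

            // Write the result back to HDFS (placeholder output path)
            pig.store("counts", "/output/url_counts");
            pig.shutdown();
        }
    }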

Sqoop


Sqoop ("SQL-to-Hadoop") is a tool which transfers data in both directions between relational systems and HDFS or other Hadoop data stores, e.g. Hive or HBase.
According to the Sqoop blog, "You can use Sqoop to import data from external structured datastores into Hadoop Distributed File System or related systems like Hive and HBase. Conversely, Sqoop can be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses."


ZooKeeper


ZooKeeper is a service for maintaining configuration information and naming, and for providing distributed synchronization and group services. As the ZooKeeper wiki summarizes it, "ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers (we call these registers znodes), much like a file system." ZooKeeper is itself a distributed service, with a "leader" server and "follower" servers, and it stores configuration information and other data in memory on the ZooKeeper servers.
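
The sketch below shows the basic znode operations through ZooKeeper's Java client: connect, create a node, and read its data. The ensemble address and znode path are placeholders, and error handling is kept to a minimum.

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkConfigExample {
        public static void main(String[] args) throws Exception {
            // Wait until the session with the ensemble is established
            final CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 5000, new Watcher() {
                public void process(WatchedEvent event) {
                    if (event.getState() == Event.KeeperState.SyncConnected) {
                        connected.countDown();
                    }
                }
            });
            connected.await();

            // znodes form a hierarchical namespace, much like a file system
            String path = "/demo-config";
            if (zk.exists(path, false) == null) {
                zk.create(path, "v1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
                          CreateMode.PERSISTENT);
            }
            byte[] data = zk.getData(path, false, null);
            System.out.println(path + " = " + new String(data));
            zk.close();
        }
    }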

Hadoop-Related Projects Outside Apache

There are also projects outside of Apache that build on or parallel the major Hadoop projects at Apache. Several of interest are described here.

Spark (UC Berkeley)


Spark is a parallel computing framework that can operate over any Hadoop input source: HDFS, HBase, Amazon S3, Avro, etc. Spark is an open-source project from the U.C. Berkeley AMPLab and, in its own words, "was initially developed for two applications where keeping data in memory helps: iterative algorithms, which are common in machine learning, and interactive data mining."
While often compared to MapReduce insofar as it also provides parallel processing over HDFS and other Hadoop input sources, Spark differs in two key ways:
  • Spark holds intermediate results in memory, rather than writing them to disk; this drastically reduces query return time
  • Spark supports more than just map and reduce functions, greatly expanding the set of possible analyses that can be executed over HDFS data
The first feature is the key to running iterative algorithms on Hadoop: rather than reading from HDFS, performing MapReduce, writing the results back to HDFS (i.e. to disk) and repeating for each cycle, Spark reads data from HDFS, performs the computation, and stores the intermediate results in memory as Resilient Distributed Datasets (RDDs). Spark can then run the next set of computations on the results cached in memory, thereby skipping the time-consuming steps of writing the nth round's results to HDFS and reading them back out for the (n+1)th round.
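
A brief sketch of that caching behavior, using Spark's Java API (Spark 1.0+ with Java 8 lambdas): the log file is read from HDFS once, cached as an RDD, and then queried twice without re-reading from disk. The master URL and HDFS path are placeholders.

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SparkLogCounts {
        public static void main(String[] args) {
            // "local" runs Spark in-process; on a cluster this would be the master URL
            JavaSparkContext sc = new JavaSparkContext("local", "log-counts");

            // Read once from HDFS (placeholder path) and cache the RDD in memory,
            // so the two counts below do not re-read the file from disk
            JavaRDD<String> logs =
                sc.textFile("hdfs://namenode.example.com:8020/data/weblogs").cache();

            long errors = logs.filter(line -> line.contains("ERROR")).count();
            long warnings = logs.filter(line -> line.contains("WARN")).count(); // served from cache

            System.out.println("errors=" + errors + ", warnings=" + warnings);
            sc.stop();
        }
    }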

Shark (UC Berkeley)


Shark is essentially "Hive running on Spark." It utilizes the Apache Hive infrastructure, including the Hive metastore and HDFS, but gives users the benefits of Spark (increased processing speed, additional functions besides map and reduce). This way, Shark users can execute HiveQL queries over the same HDFS data sets but receive results in near real time.


Impala (Cloudera)


Released by Cloudera, Impala is an open-source project which, like Apache Drill, was inspired by Google's paper on Dremel; the purpose of both is to facilitate real-time querying of data in HDFS or HBase. Impala uses an SQL-like language that, though similar to HiveQL, is currently more limited than HiveQL. Because Impala relies on the Hive metastore, Hive must be installed on the cluster in order for Impala to work.
The secret behind Impala's speed is that it "circumvents MapReduce to directly access the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs." (Source: Cloudera)
