How a little open source project came to dominate big data
It
began as a nagging technical problem that needed solving. Now, it’s
driving a market that’s expected to be worth $50.2 billion by 2020.
There are countless open source projects with crazy names in the
software world today, but the vast majority of them never make it onto
enterprises’ collective radar.
Hadoop is an exception of pachydermic proportions.
Named after a child’s toy elephant, Hadoop is now powering big data applications
at companies such as Yahoo and Facebook; more than half of the
Fortune 50 use it, providers say.
The software’s “refreshingly unique approach to data management is
transforming how companies store, process, analyze and share big data,”
according to Forrester analyst
Mike Gualtieri. “Forrester believes that Hadoop will become must-have infrastructure for large enterprises.”
Globally, the Hadoop market was valued at $1.5 billion in 2012; by 2020,
it is expected to reach $50.2 billion.
It’s not often a grassroots open source project becomes a de facto standard in industry. So how did it happen?
‘A market that was in desperate need’
“Hadoop was a happy coincidence of a fundamentally differentiated
technology, a permissively licensed open source codebase and a market
that was in desperate need of a solution for exploding volumes of data,”
said
RedMonk cofounder and principal analyst Stephen O’Grady. “Its success in that respect is no surprise.”
Created by Doug Cutting and Mike Cafarella, the software—like so many
other inventions—was born of necessity. In 2002, the pair were working
on an open source search engine called Nutch. “We were making progress
and running it on a small cluster, but it was hard to imagine how we’d
scale it up to running on thousands of machines the way we suspected
Google was,” Cutting said.
Shortly thereafter Google published a series of academic papers on its own Google File
System and MapReduce infrastructure systems, and “it was immediately
clear that we needed some similar infrastructure for Nutch,” Cafarella
said.
“The way Google was approaching things was different and powerful,”
Cutting explained. Until that point, “you had to build a
special-purpose system for each distributed thing you wanted to do,”
whereas Google’s approach offered a general-purpose, automated framework
for distributed computing. “It took care of the hard part of distributed
computing so you could focus just on your application,” Cutting said.
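To illustrate the division of labor Cutting describes, here is a minimal sketch of the canonical word-count job written against the standard Apache Hadoop MapReduce API. It is illustrative only, not drawn from Cutting and Cafarella’s original work: the developer supplies just a map step and a reduce step, and the framework handles splitting the input, scheduling the work across machines, and regrouping the intermediate results.

// Minimal word-count sketch using the standard Hadoop MapReduce API.
// The application code is only the map and reduce logic; distribution,
// fault tolerance and shuffling are handled by the framework.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce step: sum the counts the framework has grouped by word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    // Wire the two steps into a job; input and output paths come from the command line.
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The same code runs unchanged on a laptop or on thousands of machines, which is the “hard part” Cutting says the framework takes care of.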
Both Cutting and Cafarella (now chief architect at Cloudera and assistant
professor of computer science and engineering at the University of Michigan,
respectively)
knew they wanted to make a version of their own—not just for Nutch, but
for the benefit of others as well—and they knew they wanted to make it
open source.
“I don’t enjoy the business aspects,” Cutting said. “I’m a technical
guy. I enjoy working on the code, tackling the problems with peers and
trying to improve it, not trying to sell it. I’d much rather tell
people, ‘It’s kind of OK at this; it’s terrible at that; maybe we can
make it better.’ To be able to be brutally honest is really nice—it’s
much harder to be that way in a commercial setting.”
But the pair knew that the potential upside of success could be
staggering. “If I was right and it was useful technology that lots of
people wanted to use, I’d be able to pay my rent—and without having to
risk my shirt on a startup,” Cutting said.
For Cafarella, “Making Nutch open source was part of a desire to see
search engine technology outside the control of a few companies, but
also a tactical decision that would maximize the likelihood of getting
contributions from engineers at big companies. We specifically chose an
open source license that made it easy for a company to contribute.”
It was a good decision. “Hadoop would not have become a big success
without large investments from Yahoo and other firms,” Cafarella said.
‘How would you compete with open source?’
So Hadoop borrowed an idea from Google, made the concept open source,
and both courted and won investment from powerhouses like Yahoo. But
that wasn’t all that drove its success. Luck—in the form of sheer,
unanticipated market demand—also played a key role.
“I knew other people would probably have similar problems, but I had
no idea just how many other people,” Cutting said. “I thought it would
be mostly people building text search engines. I didn’t see it being
used by folks in insurance, banking, oil discovery—all these places
where it’s being used today.”
Looking back, “my conjecture is that we were early enough, and that
the combination of being first movers and being open source and being a
substantial effort kept there from being a lot of competitors early on,”
he said. “Mike and I got so far, but it took tens of engineers from
Yahoo several more years to make it stable.”
And even if a competitor did manage to catch up, “how would you
compete with something open source?” Cutting said. “Competing against
open source is a tough game—everybody else is collaborating on it; the
cost is zero. It’s easier to join than to fight.”
IBM, Microsoft, and Oracle are among the large companies that chose to collaborate with Hadoop.
Though Cafarella isn’t surprised that Web companies use Hadoop, he is
astonished at “how many people now have data management problems that
12 years ago were exceedingly rare,” he said. “Everyone now has the
problems that used to belong to just Yahoo and Google.”
Hadoop represents “somewhat of a turning point in the primary drivers
of open source software technology,” said Jay Lyman, a senior analyst
for enterprise software with
451 Research. Before,
open source software such as the Linux operating system was best known
for offering a cost-effective alternative to proprietary software like
Microsoft’s Windows. “Cost savings and efficiency drove much of the
enterprise use,” Lyman said.
With the advent of NoSQL databases and Hadoop, however, “we saw
innovation among the primary drivers of adoption and use,” Lyman said.
“When it comes to NoSQL or Hadoop technology, there is not really a
proprietary alternative.”
Hadoop’s success has come as a pleasant surprise to its creators. “I
didn’t expect an open source project would ever take over an industry
like this,” Cutting said. “I’m overjoyed.”
And it’s still on a roll. “Hadoop is now much bigger than the
original components,” Cafarella said. “It’s an entire stack of tools,
and the stack keeps growing. Individual components might have some
competition—mainly MapReduce—but I don’t see any strong alternative to
the overall Hadoop ecosystem.”
The project’s adaptability “argues for its continued success,”
RedMonk’s O’Grady said. “Hadoop today is a very different, and more
versatile, project than it was even a year or two ago.”
But there’s plenty of work to be done. Looking ahead, Cutting—with
the support of Cloudera—has begun to focus on the policy needed to
accommodate big data technology.
“Now that we have this technology and so much digitization of just
about every aspect of commerce and government and we have these tools to
process all this digital data, we need to make sure we’re using it in
ways we think are in the interests of society,” he said. “In many ways,
the policy needs to catch up with the technology.
“One way or other, we are going to end up with laws. We want them to be the right ones.”