Kalyan Hadoop Training in Hyderabad @ ORIEN IT, Ameerpet, 040 65142345 , 9703202345

Tuesday, 21 October 2014

Elasticsearch in Map Reduce

Elasticsearch is a great piece of software, it’s a distributed and fault tolerant indexing engine based on Lucene. It’s based on a very simple peer to peer technology, you just run an Elasticsearch instance on each node belonging to your cluster and voilà you have your powerful distributed indexing and searching engine available.

However, when it’s about to install a distributed platform, there are some intricacies, you have to copy the software on all the nodes, copy configuration files and using some mechanism for starting in one shot all the instances from a central place.

I’m not saying that it’s particular difficult, it’s plenty of tools like Ansible, Puppet, Chef that could make these kind of activities pretty simple. But let’s suppose you could simply run an executable from your console that magically deploys Elasticsearch and run it on all the nodes.

Yes, it’s doable, if you have an Hadoop cluster available, you can force the job trackers to become a sort of remote agents able to run your distributed application. The trick is to pack everything into an Hadoop map/reduce job which is then run on the cluster.

To force the job manager to run exactly one instance of Elasticsearch per node the trick is to implement a fake input format that generates fake splits one per job tracker.

That fake input format will make running an instance of your mapper one per node, then in your mapper you can just embed an instance of Elasticsearch.

An important point is to keep the job manager thinking that your long running Elasticsearch job is doing something, otherwise after a while it’s not receiving any heartbeat from the mappers it kills the job. I’ll show how to do this in the code below.

As usual my examples are in Scala, so let’s start from an sbt file:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

import AssemblyKeys._

assemblySettings

name := "elimr"

version := "1.0"

scalaVersion := "2.10.3"

javacOptions ++= Seq("-source", "1.6", "-target", "1.6")

scalacOptions ++= Seq("-deprecation", "-target:jvm-1.6")

libraryDependencies ++= Seq(

  "org.apache.hadoop" % "hadoop-client" % "2.0.0-cdh4.5.0" % "provided",

  "org.elasticsearch" % "elasticsearch" % "0.90.10"        % "compile",

  "com.github.scopt" %% "scopt"         % "3.2.0"          % "compile"

)

resolvers ++= Seq("sonatype snapshots" at "http://oss.sonatype.org/content/repositories/snapshots",

                  "cloudera" at "https://repository.cloudera.com/content/repositories/releases",

                  "apache"   at "https://repository.apache.org/content/repositories/releases")

mergeStrategy in assembly &lt;&lt;= (mergeStrategy in assembly) { mergeStrategy =&gt; {

    case entry =&gt; {

      val strategy = mergeStrategy(entry)

      if (strategy == MergeStrategy.deduplicate) MergeStrategy.first

      else strategy

    }

  }

}

Don’t forget to install the sbt-assembly plugin that will generate just a big jar containing everything we need for running Elasticsearch without the need to copy anything on the cluster’s nodes.

Then the next piece is the fake input format:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

import collection.JavaConversions._

import org.apache.hadoop.io.Text

import org.apache.hadoop.mapreduce._

import org.apache.hadoop.conf.Configuration

import org.apache.hadoop.mapreduce.lib.input.FileSplit

import org.apache.hadoop.fs.Path

class ElimrInputFormat extends InputFormat[Text, Text] {

  def getSplits(context: JobContext): java.util.List[InputSplit] = {

    val conf: Configuration = context.getConfiguration

    val clusterSize: Int = Integer.parseInt(conf.get("elimr_cluster_slots"))

    (1 to clusterSize).map(i =&gt; new FileSplit(new Path("dummy-split-" + i), 0, 1, null))

  }

  private[elimr] class ElimrRecordReader extends RecordReader[Text, Text] {

    def initialize(split: InputSplit, context: TaskAttemptContext) {

      name = split.asInstanceOf[FileSplit].getPath

    }

    def nextKeyValue: Boolean = {

      if (first) {

        first = false

        return true

      }

      false

    }

    def getCurrentKey: Text = {

      new Text(name.getName)

    }

    def getCurrentValue: Text = {

      new Text("")

    }

    def close(): Unit = {}

    def getProgress: Float = 0.0f

    private[elimr] var name: Path = null

    private[elimr] var first: Boolean = true

  }

  def createRecordReader(split: InputSplit, context: TaskAttemptContext): RecordReader[Text, Text] = new ElimrRecordReader

}

As you can see it does very little, it defines a fake reader and it just creates as many FileSplit instances as the number of job trackers running (lines 10-15).

You need also a fake output format which does nothing but it’s neeeded for making the job configuration happy:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

import org.apache.hadoop.mapreduce._

import org.apache.hadoop.io.Text

class ElimrOutputFormat extends OutputFormat[Text, Text] {

  def getRecordWriter(context: TaskAttemptContext): RecordWriter[Text, Text] = new RecordWriter[Text, Text] {

    def close(context: TaskAttemptContext): Unit = {}

    def write(key: Text, value: Text): Unit = {}

  }

  def getOutputCommitter(context: TaskAttemptContext): OutputCommitter = new OutputCommitter {

    def abortTask(taskContext: TaskAttemptContext): Unit = {}

    def commitTask(taskContext: TaskAttemptContext): Unit = {}

    def setupTask(taskContext: TaskAttemptContext): Unit = {}

    def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false

    def setupJob(jobContext: JobContext): Unit = {}

  }

  def checkOutputSpecs(context: JobContext): Unit = {}

}

Now it’s time to see the mapper:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

import org.apache.hadoop.io.Text

import java.util.{TimerTask, Timer}

import org.apache.hadoop.mapreduce.Mapper

import org.elasticsearch.common.settings.ImmutableSettings

import org.elasticsearch.node.NodeBuilder

import org.apache.hadoop.fs.{Path, FileSystem}

import org.apache.commons.io.IOUtils

import org.elasticsearch.common.settings.loader.YamlSettingsLoader

class ElimrMapper extends Mapper[Any, Text, Text, Text] {

  override def map(key: Any, value: Text, context: Mapper[Any, Text, Text, Text]#Context): Unit = {

    val timer: Timer = new Timer(true)

    timer.scheduleAtFixedRate(new TimerTask {

      def run() {

        context.progress()

      }

    }, 0, 8 * 60 * 1000)

    val clusterName = context.getConfiguration.get("cluster.name")

    val configFile = context.getConfiguration.get("config.file")

    val settings = if (configFile != null) {

      val fs = FileSystem.get(context.getConfiguration)

      val is = fs.open(new Path(configFile))

      val conf = IOUtils.toString(is, "UTF-8")

      val loader = new YamlSettingsLoader

      val items = loader.load(conf)

      val settings = ImmutableSettings.settingsBuilder().put(items).build()

      settings

    } else {

      val settings = ImmutableSettings.settingsBuilder().build()

      settings

    }

    val node = NodeBuilder.nodeBuilder()

      .loadConfigSettings(true).settings(settings)

      .clusterName(clusterName)

      .data(true).node()

    node.start()

    while (!node.isClosed) {

      Thread.sleep(5000)

    }

  }

}

In the map method you can embed your Elasticsearch instance (or whatever you want to make running on all the slave nodes of your Hadoop cluster), the lines 14-19 schedule a task that every eight minutes calls context.progress() . That method notifies the job manager that the job is still alive doing something.

Finally the driver:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

50

51

52

53

54

55

56

57

58

59

60

61

62

63

64

65

66

67

68

69

70

71

72

73

74

75

76

77

78

79

80

81

82

83

84

85

86

87

88

89

import collection.JavaConversions._

import org.apache.hadoop.conf.Configured

import org.apache.hadoop.util.{ToolRunner, Tool}

import org.apache.hadoop.mapred.{ClusterStatus, JobClient, JobConf}

import org.apache.hadoop.mapreduce.Job

import org.apache.hadoop.io.Text

import java.lang.reflect.Method

import org.slf4j.{LoggerFactory, Logger}

import org.apache.hadoop.fs.{Path, FileSystem, FileUtil}

import org.apache.hadoop.mapreduce.filecache.DistributedCache

object ElimrDriver extends Configured with Tool with App {

  private final val log: Logger = LoggerFactory.getLogger(this.getClass.getName)

  def run(args: Array[String]): Int = {

    case class Config(javaOptions: String = " -Xms256m -Xmx1g -Xss256k -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly",

                      clusterName: String = "elasticsearch",

                      configFile: String = null)

    val parser = new scopt.OptionParser[Config]("elimr") {

      head("elimr", "1.0")

      opt[String]('J', "java-options") valueName "&lt;options&gt;" optional() action {

        (x, c) =&gt;

          c.copy(javaOptions = x)

      }

      opt[String]('n', "cluster-name") valueName "&lt;options&gt;" optional() action {

        (x, c) =&gt;

          c.copy(clusterName = x)

      }

      opt[String]('f', "config-file") valueName "&lt;options&gt;" optional() action {

        (x, c) =&gt;

          c.copy(configFile = x)

      }

    }

    val opt_args = args.slice(1, args.length)

    parser.parse(opt_args, Config()) map {

      config =&gt;

        val conf = getConf

        conf.set("mapreduce.user.classpath.first", "true")

        conf.set("mapreduce.task.classpath.user.precedence", "true")

        conf.set("mapreduce.task.classpath.first", "true")

        conf.set("mapreduce.job.user.classpath.first", "true")

        conf.set("mapred.map.tasks.speculative.execution", "false")

        conf.setInt("mapreduce.map.maxattempts", 1)

        conf.setInt("mapred.map.max.attempts", 1)

        conf.set("cluster.name", config.clusterName)

        if (config.configFile != null) {

          val lfs = FileSystem.getLocal(conf)

          val fs = FileSystem.get(getConf)

          val src = new Path(config.configFile)

          val dst = new Path("/tmp/elasticsearch.yml")

          FileUtil.copy(lfs, src, fs, dst, false, true, getConf)

          conf.set("config.file", "/tmp/elasticsearch.yml")

        }

        val jobConf: JobConf = new JobConf(getConf)

        val client: JobClient = new JobClient(jobConf)

        val cluster: ClusterStatus = client.getClusterStatus

        conf.set("elimr_cluster_slots", Integer.toString(cluster.getTaskTrackers))

        conf.set("mapred.child.java.opts", config.javaOptions)

        val job: Job = Job.getInstance(conf)

        job.setJobName("Elimr")

        job.setNumReduceTasks(0)

        job.setJarByClass(getClass)

        job.setMapperClass(classOf[ElimrMapper])

        job.setOutputKeyClass(classOf[Text])

        job.setOutputValueClass(classOf[Text])

        job.setInputFormatClass(classOf[ElimrInputFormat])

        job.setOutputFormatClass(classOf[ElimrOutputFormat])

        job.submit()

        0

    } getOrElse {

      1

    }

  }

  val exitCode = ToolRunner.run(ElimrDriver, args)

  System.exit(exitCode)

}

The driver does a couple of things, first of all allows passing different options (lines 17-35), for example it’s possible to pass the java options you want to propagate to the task that the job tracker creates embedding Elasticsearch. The driver can also get as an argument an Elasticsearch configuration file that will ultimately accessed by all the embedded Elasticsearch instances running on your cluster.

The lines 40-80 show how to configure and submit a job that using those fake input/output formats will run Elasticsearch on your cluster.

So, after packaging everything in one jar you could run one or more Elasticsearch clusters on top of your Hadoop infrastructure simply using this command:

hadoop jar elimr-assembly-1.0.jar ElimrDriver -n ElasticsearchName -f elasticsearch.yml

Just changing the name (-n parameter) and using different configuration files you could run multiple times that command allowing to run multiple Elasticsearch clusters on the same cluster infrastructure.

That’s all folks.

Harpoon helps in facing your data: HDFS is not a data warehouse.

This is the first in a series of blog posts written to give you a better technical understanding of our product Harpoon.

Based on our practical experiences, we will show what, in our opinion, are the typical problems people face when using Hadoop in the context of business intelligence on top of massive amounts of data. Let’s start.

Hadoop, as you may know, is a complex eco-system consisting of various platforms and tools meant mainly for acquiring and analysing big data sets. The most fundamental component of Hadoop is HDFS (Hadoop Distributed File System). It’s one of the cheapest ways to reliably store vast amounts of data. It’s basically a distributed file system that provides the abstraction of a global file system on top of the local file systems of the nodes that form a Hadoop cluster..

Using a bunch of relatively inexpensive servers connected together to form a cluster, it is possible to store terabytes of data without incurring the typical costs of big storage systems. It implements a data replication mechanism which allows it to deal with the loss of one or more cluster nodes (you cannot imagine how often this happens in a big cluster). Even if it doesn’t provide all the functionalities of a typical storage system, it does excel in streaming data which is the most common use case in data analysis and business intelligence.

Given all those characteristics, HDFS is becoming more and more popular as a sort of “universal” repository for storing data coming from different sources. Tools like Sqoop, Flume help in collecting data from RDBMS or any other source into this big pot called HDFS. At the other end, HDFS is more than good enough, especially taking into the account the cost savings, when it comes to performing complex business intelligence tasks on top of data where full OLTP capabilities aren’t required i.e the tasks that are performed most of the time.

Putting together a simple prototype to collect some data and do something meaningful using the Hadoop platform is pretty easy. Tools like the Cloudera Manager greatly simplify the setting up and the maintenance of a Hadoop cluster with all the services they provide.

So far so good; however, HDFS is just a file system, it doesn’t provide any mechanism for keeping the data you put into it ordered and catalogued. HDFS is not a database, it just provides simple abstractions like directories and files. Everything relies on discipline and best practices enforced by good people. We have observed companies fall in love with Hadoop (understandably, Hadoop is an amazing platform!), and start to use it, consolidating many different kinds of data coming from different data sources. Can you guess what happens after a while? A total mess! Different teams start to organise the data in different ways, there is no idea of data history, format, etc., and that’s not to mention all the security related problems. Frankly speaking, this is pretty normal. What do you expect if you give multiple users free access to a plain file system? Think what happens to a wiki, a tree-based document management system, that doesn’t impose any particular structure. Think how painful it is to keep an enterprise wiki clean and well organised.

This problem is well understood and there are attempts to solve it. Hive, for example, a popular SQL engine that generates Map/Reduce jobs, provides a meta-data repository and organises the data in terms of databases and tables (of course). It’s very nice and allows access to the data even using a low-latency SQL engine like Impala or the fast Hive engine based on Tez. However, this comes with a price, you are forced to use a specific set of tools, the interoperability among the different high-level tools like Pig and Hive is limited and allows you to work only on a subset of data types. A tool like HCatalog adds some meta-data support, but it’s still in its infancy and doesn’t fully solve the aforementioned interoperability problems.

During the design of Harpoon we took into account all these problems and we implemented a set of features specifically meant to mitigate them:

Simple data organization model: Harpoon imposes a strict way to organise your data by offering a simple to understand model very similar to Hive’s one: data is organised in databases as a collection of tables. A table contains a list of records where each field has a name and a type. This organisation imposes a very simple layout in HDFS: a directory per database and a directory per table.
Powerful meta-data repository: All the data schemas are collected and stored in the meta-data repository of Harpoon. Harpoon keeps track of the names and types of all the entities stored inside its repository. It also stores the actual data format of a table e.g Avro, Parquet or something else.
Support for multiple data formats: It is extremely easy to extend Harpoon to support additional data formats. It also keeps track of the lineage of data, security information like read/write rights, data visibility and data ownership.
Homogenization of data types: Harpoon achieves the interoperability among the different tools by enforcing common formats for specific data types like date and timestamps. For example, even if the date data type is not supported by Avro, a date is stored in Avro as a string in a format compatible with Pig, Hive and Impala at the same time. The meta-data repository, then, keeps track of the fact that this specific field is a actually a date and not a string.
Automatic support for data format conversion: The data is always stored in a self describing format, we currently support Avro but we will support Parquet soon. In this way it’s possible to use any tool to access the data stored inside Harpoon without the need to access Harpoon’s meta-data repository. Harpoon also provides an automatic mechanism to convert data from one format to another.

Collectively all those features are meant to mitigate the problem of using the plain HDFS for storing data. At a cost of being forced to use a simpler data organization model, Harpoon automatically takes control of the data layout on the file system, keeps track of the different data formats and ensures that different query tools can access the same data sets transparently, all run under a strict control of a powerful authentication and authorisation system that extends what Hadoop is currently offering.

In the next post I’ll talk a bit more about the different tools Hadoop provides for data analysis and manipulation, their pros and cons and how they are evolving. I’ll show also how Harpoon greatly simplifies the usage of those tools eliminating the interoperability problems among them by providing a unified view of the data.

Stay tuned.

This is the first in a series of blog posts written to give you a better technical understanding of our product Harpoon.

Based on our practical experiences, we will show what, in our opinion, are the typical problems people face when using Hadoop in the context of business intelligence on top of massive amounts of data. Let’s start.

Hadoop, as you may know, is a complex eco-system consisting of various platforms and tools meant mainly for acquiring and analysing big data sets. The most fundamental component of Hadoop is HDFS (Hadoop Distributed File System). It’s one of the cheapest ways to reliably store vast amounts of data. It’s basically a distributed file system that provides the abstraction of a global file system on top of the local file systems of the nodes that form a Hadoop cluster..

Using a bunch of relatively inexpensive servers connected together to form a cluster, it is possible to store terabytes of data without incurring the typical costs of big storage systems. It implements a data replication mechanism which allows it to deal with the loss of one or more cluster nodes (you cannot imagine how often this happens in a big cluster). Even if it doesn’t provide all the functionalities of a typical storage system, it does excel in streaming data which is the most common use case in data analysis and business intelligence.

Given all those characteristics, HDFS is becoming more and more popular as a sort of “universal” repository for storing data coming from different sources. Tools like Sqoop, Flume help in collecting data from RDBMS or any other source into this big pot called HDFS. At the other end, HDFS is more than good enough, especially taking into the account the cost savings, when it comes to performing complex business intelligence tasks on top of data where full OLTP capabilities aren’t required i.e the tasks that are performed most of the time.

Putting together a simple prototype to collect some data and do something meaningful using the Hadoop platform is pretty easy. Tools like the Cloudera Manager greatly simplify the setting up and the maintenance of a Hadoop cluster with all the services they provide.

So far so good; however, HDFS is just a file system, it doesn’t provide any mechanism for keeping the data you put into it ordered and catalogued. HDFS is not a database, it just provides simple abstractions like directories and files. Everything relies on discipline and best practices enforced by good people. We have observed companies fall in love with Hadoop (understandably, Hadoop is an amazing platform!), and start to use it, consolidating many different kinds of data coming from different data sources. Can you guess what happens after a while? A total mess! Different teams start to organise the data in different ways, there is no idea of data history, format, etc., and that’s not to mention all the security related problems. Frankly speaking, this is pretty normal. What do you expect if you give multiple users free access to a plain file system? Think what happens to a wiki, a tree-based document management system, that doesn’t impose any particular structure. Think how painful it is to keep an enterprise wiki clean and well organised.

This problem is well understood and there are attempts to solve it. Hive, for example, a popular SQL engine that generates Map/Reduce jobs, provides a meta-data repository and organises the data in terms of databases and tables (of course). It’s very nice and allows access to the data even using a low-latency SQL engine like Impala or the fast Hive engine based on Tez. However, this comes with a price, you are forced to use a specific set of tools, the interoperability among the different high-level tools like Pig and Hive is limited and allows you to work only on a subset of data types. A tool like HCatalog adds some meta-data support, but it’s still in its infancy and doesn’t fully solve the aforementioned interoperability problems.

During the design of Harpoon we took into account all these problems and we implemented a set of features specifically meant to mitigate them:

Simple data organization model: Harpoon imposes a strict way to organise your data by offering a simple to understand model very similar to Hive’s one: data is organised in databases as a collection of tables. A table contains a list of records where each field has a name and a type. This organisation imposes a very simple layout in HDFS: a directory per database and a directory per table.
Powerful meta-data repository: All the data schemas are collected and stored in the meta-data repository of Harpoon. Harpoon keeps track of the names and types of all the entities stored inside its repository. It also stores the actual data format of a table e.g Avro, Parquet or something else.
Support for multiple data formats: It is extremely easy to extend Harpoon to support additional data formats. It also keeps track of the lineage of data, security information like read/write rights, data visibility and data ownership.
Homogenization of data types: Harpoon achieves the interoperability among the different tools by enforcing common formats for specific data types like date and timestamps. For example, even if the date data type is not supported by Avro, a date is stored in Avro as a string in a format compatible with Pig, Hive and Impala at the same time. The meta-data repository, then, keeps track of the fact that this specific field is a actually a date and not a string.
Automatic support for data format conversion: The data is always stored in a self describing format, we currently support Avro but we will support Parquet soon. In this way it’s possible to use any tool to access the data stored inside Harpoon without the need to access Harpoon’s meta-data repository. Harpoon also provides an automatic mechanism to convert data from one format to another.

Collectively all those features are meant to mitigate the problem of using the plain HDFS for storing data. At a cost of being forced to use a simpler data organization model, Harpoon automatically takes control of the data layout on the file system, keeps track of the different data formats and ensures that different query tools can access the same data sets transparently, all run under a strict control of a powerful authentication and authorisation system that extends what Hadoop is currently offering.

In the next post I’ll talk a bit more about the different tools Hadoop provides for data analysis and manipulation, their pros and cons and how they are evolving. I’ll show also how Harpoon greatly simplifies the usage of those tools eliminating the interoperability problems among them by providing a unified view of the data.

Stay tuned.

Pages

Tuesday, 21 October 2014

Elasticsearch in Map Reduce

Harpoon helps in facing your data: HDFS is not a data warehouse.