Kalyan Hadoop Training in Hyderabad @ ORIEN IT, Ameerpet, 040 65142345 , 9703202345

Elasticsearch is a great piece of software, it’s a distributed and fault tolerant indexing engine based on Lucene. It’s based on a very simple peer to peer technology, you just run an Elasticsearch instance on each node belonging to your cluster and voilà you have your powerful distributed indexing and searching engine available.

However, when it’s about to install a distributed platform, there are some intricacies, you have to copy the software on all the nodes, copy configuration files and using some mechanism for starting in one shot all the instances from a central place.

I’m not saying that it’s particular difficult, it’s plenty of tools like Ansible, Puppet, Chef that could make these kind of activities pretty simple. But let’s suppose you could simply run an executable from your console that magically deploys Elasticsearch and run it on all the nodes.

Yes, it’s doable, if you have an Hadoop cluster available, you can force the job trackers to become a sort of remote agents able to run your distributed application. The trick is to pack everything into an Hadoop map/reduce job which is then run on the cluster.

To force the job manager to run exactly one instance of Elasticsearch per node the trick is to implement a fake input format that generates fake splits one per job tracker.

That fake input format will make running an instance of your mapper one per node, then in your mapper you can just embed an instance of Elasticsearch.

An important point is to keep the job manager thinking that your long running Elasticsearch job is doing something, otherwise after a while it’s not receiving any heartbeat from the mappers it kills the job. I’ll show how to do this in the code below.

As usual my examples are in Scala, so let’s start from an sbt file:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

import AssemblyKeys._

assemblySettings

name := "elimr"

version := "1.0"

scalaVersion := "2.10.3"

javacOptions ++= Seq("-source", "1.6", "-target", "1.6")

scalacOptions ++= Seq("-deprecation", "-target:jvm-1.6")

libraryDependencies ++= Seq(

  "org.apache.hadoop" % "hadoop-client" % "2.0.0-cdh4.5.0" % "provided",

  "org.elasticsearch" % "elasticsearch" % "0.90.10"        % "compile",

  "com.github.scopt" %% "scopt"         % "3.2.0"          % "compile"

)

resolvers ++= Seq("sonatype snapshots" at "http://oss.sonatype.org/content/repositories/snapshots",

                  "cloudera" at "https://repository.cloudera.com/content/repositories/releases",

                  "apache"   at "https://repository.apache.org/content/repositories/releases")

mergeStrategy in assembly &lt;&lt;= (mergeStrategy in assembly) { mergeStrategy =&gt; {

    case entry =&gt; {

      val strategy = mergeStrategy(entry)

      if (strategy == MergeStrategy.deduplicate) MergeStrategy.first

      else strategy

    }

  }

}

Don’t forget to install the sbt-assembly plugin that will generate just a big jar containing everything we need for running Elasticsearch without the need to copy anything on the cluster’s nodes.

Then the next piece is the fake input format:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

import collection.JavaConversions._

import org.apache.hadoop.io.Text

import org.apache.hadoop.mapreduce._

import org.apache.hadoop.conf.Configuration

import org.apache.hadoop.mapreduce.lib.input.FileSplit

import org.apache.hadoop.fs.Path

class ElimrInputFormat extends InputFormat[Text, Text] {

  def getSplits(context: JobContext): java.util.List[InputSplit] = {

    val conf: Configuration = context.getConfiguration

    val clusterSize: Int = Integer.parseInt(conf.get("elimr_cluster_slots"))

    (1 to clusterSize).map(i =&gt; new FileSplit(new Path("dummy-split-" + i), 0, 1, null))

  }

  private[elimr] class ElimrRecordReader extends RecordReader[Text, Text] {

    def initialize(split: InputSplit, context: TaskAttemptContext) {

      name = split.asInstanceOf[FileSplit].getPath

    }

    def nextKeyValue: Boolean = {

      if (first) {

        first = false

        return true

      }

      false

    }

    def getCurrentKey: Text = {

      new Text(name.getName)

    }

    def getCurrentValue: Text = {

      new Text("")

    }

    def close(): Unit = {}

    def getProgress: Float = 0.0f

    private[elimr] var name: Path = null

    private[elimr] var first: Boolean = true

  }

  def createRecordReader(split: InputSplit, context: TaskAttemptContext): RecordReader[Text, Text] = new ElimrRecordReader

}

As you can see it does very little, it defines a fake reader and it just creates as many FileSplit instances as the number of job trackers running (lines 10-15).

You need also a fake output format which does nothing but it’s neeeded for making the job configuration happy:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

import org.apache.hadoop.mapreduce._

import org.apache.hadoop.io.Text

class ElimrOutputFormat extends OutputFormat[Text, Text] {

  def getRecordWriter(context: TaskAttemptContext): RecordWriter[Text, Text] = new RecordWriter[Text, Text] {

    def close(context: TaskAttemptContext): Unit = {}

    def write(key: Text, value: Text): Unit = {}

  }

  def getOutputCommitter(context: TaskAttemptContext): OutputCommitter = new OutputCommitter {

    def abortTask(taskContext: TaskAttemptContext): Unit = {}

    def commitTask(taskContext: TaskAttemptContext): Unit = {}

    def setupTask(taskContext: TaskAttemptContext): Unit = {}

    def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false

    def setupJob(jobContext: JobContext): Unit = {}

  }

  def checkOutputSpecs(context: JobContext): Unit = {}

}

Now it’s time to see the mapper:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

import org.apache.hadoop.io.Text

import java.util.{TimerTask, Timer}

import org.apache.hadoop.mapreduce.Mapper

import org.elasticsearch.common.settings.ImmutableSettings

import org.elasticsearch.node.NodeBuilder

import org.apache.hadoop.fs.{Path, FileSystem}

import org.apache.commons.io.IOUtils

import org.elasticsearch.common.settings.loader.YamlSettingsLoader

class ElimrMapper extends Mapper[Any, Text, Text, Text] {

  override def map(key: Any, value: Text, context: Mapper[Any, Text, Text, Text]#Context): Unit = {

    val timer: Timer = new Timer(true)

    timer.scheduleAtFixedRate(new TimerTask {

      def run() {

        context.progress()

      }

    }, 0, 8 * 60 * 1000)

    val clusterName = context.getConfiguration.get("cluster.name")

    val configFile = context.getConfiguration.get("config.file")

    val settings = if (configFile != null) {

      val fs = FileSystem.get(context.getConfiguration)

      val is = fs.open(new Path(configFile))

      val conf = IOUtils.toString(is, "UTF-8")

      val loader = new YamlSettingsLoader

      val items = loader.load(conf)

      val settings = ImmutableSettings.settingsBuilder().put(items).build()

      settings

    } else {

      val settings = ImmutableSettings.settingsBuilder().build()

      settings

    }

    val node = NodeBuilder.nodeBuilder()

      .loadConfigSettings(true).settings(settings)

      .clusterName(clusterName)

      .data(true).node()

    node.start()

    while (!node.isClosed) {

      Thread.sleep(5000)

    }

  }

}

In the map method you can embed your Elasticsearch instance (or whatever you want to make running on all the slave nodes of your Hadoop cluster), the lines 14-19 schedule a task that every eight minutes calls context.progress() . That method notifies the job manager that the job is still alive doing something.

Finally the driver:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

50

51

52

53

54

55

56

57

58

59

60

61

62

63

64

65

66

67

68

69

70

71

72

73

74

75

76

77

78

79

80

81

82

83

84

85

86

87

88

89

import collection.JavaConversions._

import org.apache.hadoop.conf.Configured

import org.apache.hadoop.util.{ToolRunner, Tool}

import org.apache.hadoop.mapred.{ClusterStatus, JobClient, JobConf}

import org.apache.hadoop.mapreduce.Job

import org.apache.hadoop.io.Text

import java.lang.reflect.Method

import org.slf4j.{LoggerFactory, Logger}

import org.apache.hadoop.fs.{Path, FileSystem, FileUtil}

import org.apache.hadoop.mapreduce.filecache.DistributedCache

object ElimrDriver extends Configured with Tool with App {

  private final val log: Logger = LoggerFactory.getLogger(this.getClass.getName)

  def run(args: Array[String]): Int = {

    case class Config(javaOptions: String = " -Xms256m -Xmx1g -Xss256k -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly",

                      clusterName: String = "elasticsearch",

                      configFile: String = null)

    val parser = new scopt.OptionParser[Config]("elimr") {

      head("elimr", "1.0")

      opt[String]('J', "java-options") valueName "&lt;options&gt;" optional() action {

        (x, c) =&gt;

          c.copy(javaOptions = x)

      }

      opt[String]('n', "cluster-name") valueName "&lt;options&gt;" optional() action {

        (x, c) =&gt;

          c.copy(clusterName = x)

      }

      opt[String]('f', "config-file") valueName "&lt;options&gt;" optional() action {

        (x, c) =&gt;

          c.copy(configFile = x)

      }

    }

    val opt_args = args.slice(1, args.length)

    parser.parse(opt_args, Config()) map {

      config =&gt;

        val conf = getConf

        conf.set("mapreduce.user.classpath.first", "true")

        conf.set("mapreduce.task.classpath.user.precedence", "true")

        conf.set("mapreduce.task.classpath.first", "true")

        conf.set("mapreduce.job.user.classpath.first", "true")

        conf.set("mapred.map.tasks.speculative.execution", "false")

        conf.setInt("mapreduce.map.maxattempts", 1)

        conf.setInt("mapred.map.max.attempts", 1)

        conf.set("cluster.name", config.clusterName)

        if (config.configFile != null) {

          val lfs = FileSystem.getLocal(conf)

          val fs = FileSystem.get(getConf)

          val src = new Path(config.configFile)

          val dst = new Path("/tmp/elasticsearch.yml")

          FileUtil.copy(lfs, src, fs, dst, false, true, getConf)

          conf.set("config.file", "/tmp/elasticsearch.yml")

        }

        val jobConf: JobConf = new JobConf(getConf)

        val client: JobClient = new JobClient(jobConf)

        val cluster: ClusterStatus = client.getClusterStatus

        conf.set("elimr_cluster_slots", Integer.toString(cluster.getTaskTrackers))

        conf.set("mapred.child.java.opts", config.javaOptions)

        val job: Job = Job.getInstance(conf)

        job.setJobName("Elimr")

        job.setNumReduceTasks(0)

        job.setJarByClass(getClass)

        job.setMapperClass(classOf[ElimrMapper])

        job.setOutputKeyClass(classOf[Text])

        job.setOutputValueClass(classOf[Text])

        job.setInputFormatClass(classOf[ElimrInputFormat])

        job.setOutputFormatClass(classOf[ElimrOutputFormat])

        job.submit()

        0

    } getOrElse {

      1

    }

  }

  val exitCode = ToolRunner.run(ElimrDriver, args)

  System.exit(exitCode)

}

The driver does a couple of things, first of all allows passing different options (lines 17-35), for example it’s possible to pass the java options you want to propagate to the task that the job tracker creates embedding Elasticsearch. The driver can also get as an argument an Elasticsearch configuration file that will ultimately accessed by all the embedded Elasticsearch instances running on your cluster.

The lines 40-80 show how to configure and submit a job that using those fake input/output formats will run Elasticsearch on your cluster.

So, after packaging everything in one jar you could run one or more Elasticsearch clusters on top of your Hadoop infrastructure simply using this command:

hadoop jar elimr-assembly-1.0.jar ElimrDriver -n ElasticsearchName -f elasticsearch.yml

Just changing the name (-n parameter) and using different configuration files you could run multiple times that command allowing to run multiple Elasticsearch clusters on the same cluster infrastructure.

That’s all folks.

Kalyan Hadoop Training in Hyderabad @ ORIEN IT, Ameerpet, 040 65142345 , 9703202345

Pages

Tuesday, 21 October 2014

Elasticsearch in Map Reduce