Quick Start

Prerequisites

Before starting, you will need:

Hadoop 2.2.0 (see the Deployment page for CDH3 and CDH4 configurations)
sbt 0.13.1

In addition to Hadoop, Scoobi uses sbt (version 0.13.1) to simplify building and packaging a project for running on Hadoop.

Directory Structure

Here are the steps to get started on your own project:

$ mkdir my-app
$ cd my-app
$ mkdir -p src/main/scala

First, create a build.sbt file in the my-app directory that declares a dependency on Scoobi:

name := "MyApplication"

version := "1.0"

scalaVersion := "2.10.3"

libraryDependencies += "com.nicta" %% "scoobi" % "0.9.0-SNAPSHOT"

resolvers ++= Seq(Resolver.sonatypeRepo("releases"),
                  Resolver.sonatypeRepo("snapshots"))

Write your code

Now we can write some code. In src/main/scala/WordCount.scala, for instance:

import com.nicta.scoobi.Scoobi._
import Reduction._

object WordCount extends ScoobiApp {
  def run() {
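    // args(0) is the input path; each element of the result is one line of text
    val lines = fromTextFile(args(0))

    // split each line into words, pair each word with 1, then sum the 1s per word
    val counts = lines.mapFlatten(_.split(" "))
      .map(word => (word, 1))
      .groupByKey
      .combine(Sum.int)

    // args(1) is the output path; persist triggers the actual computation
    counts.toTextFile(args(1)).persist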
  }
}
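
To make the shape of this pipeline easier to follow, here is a minimal sketch of the same word-count logic written against plain Scala collections, with no Scoobi or Hadoop involved. The object name LocalWordCount, the helper wordCounts and the sample input are hypothetical and exist only for illustration. The standard-library calls are rough analogues of the Scoobi operations, not the way Scoobi implements them.

```scala
// Plain-Scala sketch of the word count logic (no Scoobi, no Hadoop).
// LocalWordCount and wordCounts are illustrative names, not part of Scoobi.
object LocalWordCount {

  // Split lines into words, pair each word with 1, group the pairs
  // by word, then sum the 1s in each group.
  def wordCounts(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))      // rough analogue of mapFlatten
      .map(word => (word, 1))     // same as the map step above
      .groupBy(_._1)              // rough analogue of groupByKey
      .map { case (word, pairs) => // rough analogue of combine(Sum.int)
        (word, pairs.map(_._2).sum)
      }

  def main(args: Array[String]): Unit = {
    // Hypothetical in-memory input standing in for the text file.
    val sample = Seq("hello world", "hello scoobi")
    wordCounts(sample).toSeq.sortBy(_._1).foreach {
      case (word, count) => println(word + "\t" + count)
    }
  }
}
```

Running the sketch on the sample input counts "hello" twice and "world" and "scoobi" once each. The Scoobi version expresses the same steps, but over a distributed collection that is only evaluated when persist is called.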

Running

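The Scoobi application can now be compiled and run using sbt. The run-main task takes the fully qualified name of the object, so if you declare WordCount inside a package, prefix it accordingly (for example, mypackage.myapp.WordCount):

$ sbt compile
$ sbt "run-main WordCount input-files output"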

Your Hadoop configuration will automatically get picked up, and all relevant JARs will be made available.

If you had any trouble following along, take a look at Word Count for a self-contained example.

User Guide