Before starting, you will need a working Hadoop installation.
In addition to Hadoop, Scoobi uses sbt (version 0.13.1) to simplify building and packaging a project for running on Hadoop.
Here are the steps to get started on your own project:
$ mkdir my-app
$ cd my-app
$ mkdir -p src/main/scala
First, create a build.sbt file that has a dependency on Scoobi:
name := "MyApplication"

version := "1.0"

scalaVersion := "2.10.3"

libraryDependencies += "com.nicta" %% "scoobi" % "0.9.2"

resolvers ++= Seq(Resolver.sonatypeRepo("releases"),
                  Resolver.sonatypeRepo("snapshots"))
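A note on the dependency line: the %% operator asks sbt to append the Scala binary version to the artifact name. With scalaVersion set to 2.10.3, the dependency above should therefore be equivalent to the explicit form sketched below. The _2.10 suffix is inferred from the scalaVersion setting and is an assumption, not something taken from the Scoobi documentation.

```scala
// Explicit form of the same dependency: a single % with the Scala binary
// version written into the artifact name by hand (assumed suffix: _2.10).
libraryDependencies += "com.nicta" % "scoobi_2.10" % "0.9.2"
```

Use one form or the other in build.sbt, not both.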
Now we can write some code. In src/main/scala/WordCount.scala, for instance:
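import com.nicta.scoobi.Scoobi._
import Reduction._

object WordCount extends ScoobiApp {
  def run() {
    // Read the input path (first argument) as a distributed list of lines
    val lines = fromTextFile(args(0))

    // Split lines into words, pair each word with 1, then sum the 1s per word
    val counts = lines.mapFlatten(_.split(" "))
                      .map(word => (word, 1))
                      .groupByKey
                      .combine(Sum.int)

    // Write the counts to the output path (second argument) and run the job
    counts.toTextFile(args(1)).persist
  }
}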
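To see what this pipeline computes without a Hadoop cluster, the same steps can be sketched with ordinary Scala collections. This is only an illustration under the assumption that each Scoobi operation behaves like its closest collections counterpart; LocalWordCount and wordCount are hypothetical names and are not part of Scoobi.

```scala
// Plain-Scala sketch of the word count logic (no Hadoop or Scoobi needed).
// LocalWordCount and wordCount are illustrative names, not Scoobi API.
object LocalWordCount {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))      // counterpart of mapFlatten: one element per word
      .map(word => (word, 1))     // pair each word with a count of 1
      .groupBy(_._1)              // counterpart of groupByKey: gather pairs per word
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // sum the 1s per word

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("hello world", "hello scoobi"))
    // Sort by word so the printed order is deterministic
    counts.toSeq.sortBy(_._1).foreach { case (word, n) => println(s"$word $n") }
    // prints:
    // hello 2
    // scoobi 1
    // world 1
  }
}
```

The Scoobi version differs in that the data lives in a distributed list and nothing runs until persist is called, whereas the collections version evaluates each step immediately in memory.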
The Scoobi application can now be compiled and run using sbt:
$ sbt compile
$ sbt "run-main WordCount input-files output"
Your Hadoop configuration will automatically get picked up, and all relevant JARs will be made available.
If you had any trouble following along, take a look at Word Count for a self-contained example.