com.nicta.scoobi.io

text

package text

Visibility
  1. Public
  2. All

Type Members

  1. class DownloadSink extends DataSink[NullWritable, NullWritable, NullWritable]

    This is a dummy sink just used to collect files downloaded in map tasks

    This is a dummy sink just used to collect files downloaded in map tasks

    The map task must be a parallelDo like this:

    def download = (path: String, InputOutputContext) => { // get the output directory for the current map task val outputDir = FileOutputFormat.getWorkOutputPath(context.context) val outDir = outputDir.toString.replace("file:", "") logger.debug("output dir is "+outDir)

    // download the file // ... }

    val sink = new DownloadSink("target/test", (_:String).startsWith("source"))

    val fileNames: DList[String] = ??? fileNames.parallelDo(download).addSink(sink).persist

    The downloaded files will be collected from the working directory of the map task and go to "target/test" based on their path

  2. class OverwritableTextOutputFormat[K, V] extends TextOutputFormat[K, V]

  3. class PartitionedTextOutputFormat[P, K, V] extends PartitionedOutputFormat[P, K, V]

    This format creates a new text record writer for each different path that's generated by the partition function Each record writer defines a specific OutputCommitter that will define a different work directory for a given key.

    This format creates a new text record writer for each different path that's generated by the partition function Each record writer defines a specific OutputCommitter that will define a different work directory for a given key.

    All the generated paths will be created under temporary dir/sink id in order to collect them more rapidly with just a rename of directories (see OutputChannel)

  4. case class TextFileSink[A](path: String, overwrite: Boolean = false, check: OutputCheck = Sink.defaultOutputCheck, compression: Option[Compression] = None)(implicit evidence$1: Manifest[A]) extends DataSink[NullWritable, A, A] with Product with Serializable

  5. trait TextInput extends AnyRef

    Smart functions for materialising distributed lists by loading text files.

  6. trait TextOutput extends AnyRef

    Smart functions for persisting distributed lists by storing them as text files.

  7. case class TextSource[A](paths: Seq[String], inputFormat: Class[_ <: FileInputFormat[LongWritable, Text]] = classOf[TextInputFormat], inputConverter: InputConverter[LongWritable, Text, A] = TextInput.defaultTextConverter, check: InputCheck = Source.defaultInputCheck)(implicit evidence$1: WireFormat[A]) extends DataSource[LongWritable, Text, A] with Product with Serializable

    Class that abstracts all the common functionality of reading from text files.

Value Members

  1. object TextFileSink extends Serializable

  2. object TextInput extends TextInput

  3. object TextOutput extends TextOutput

Ungrouped