This is a dummy sink just used to collect files downloaded in map tasks
This format creates a new text record writer for each different path that's generated by the partition function Each record writer defines a specific OutputCommitter that will define a different work directory for a given key.
This format creates a new text record writer for each different path that's generated by the partition function Each record writer defines a specific OutputCommitter that will define a different work directory for a given key.
All the generated paths will be created under temporary dir/sink id in order to collect them more rapidly with just a rename of directories (see OutputChannel)
Smart functions for materialising distributed lists by loading text files.
Smart functions for persisting distributed lists by storing them as text files.
Class that abstracts all the common functionality of reading from text files.
This is a dummy sink just used to collect files downloaded in map tasks
The map task must be a parallelDo like this:
def download = (path: String, InputOutputContext) => { // get the output directory for the current map task val outputDir = FileOutputFormat.getWorkOutputPath(context.context) val outDir = outputDir.toString.replace("file:", "") logger.debug("output dir is "+outDir)
// download the file // ... }
val sink = new DownloadSink("target/test", (_:String).startsWith("source"))
val fileNames: DList[String] = ??? fileNames.parallelDo(download).addSink(sink).persist
The downloaded files will be collected from the working directory of the map task and go to "target/test" based on their path