This output channel simply copy values coming from a ParallelDo input (a mapper in an Input channel) to this node sinks and bridgeStore
Abstract trait for both input and output channels
This input channel is a tree of Mappers which are not connected to Gbk nodes
This input channel is a tree of Mappers which are all connected to Gbk nodes
Output channel for a GroupByKey.
An input channel groups mapping operations from a single DataSource, attached to a source node (a Load node, or a GroupByKey node from a previous Mscr for example).
An input channel groups mapping operations from a single DataSource, attached to a source node (a Load node, or a GroupByKey node from a previous Mscr for example).
There are however more data inputs for an InputChannel since the environments of ParallelDos are inputs as well
An InputChannel emits (key, values) of different types classified by an Integer tag, either:
The main functionality of an InputChannel is to map an input key/value to another key/value to be grouped or reduced using the functions of ParallelDos.
There are 2 main types of InputChannels:
They both share some implementation in the MscrInputChannel trait.
An input channel can have no mappers at all. In that case the values from the source node are directly emitted with no transformation.
Two InputChannels are equal if they have the same id.
encapsulation of expected key types for each tag
Simple layering algorithm using the Longest path method to assign nodes to layers.
Simple layering algorithm using the Longest path method to assign nodes to layers.
See here for a good overview: http://www.cs.brown.edu/~rt/gdhandbook/chapters/hierarchical.pdf
In our case the layers have minimum height and possibly big width which is actually good if we run things in parallel
This class represents an MSCR job with a Seq of input channels and a Seq of output channels
This class represents an MSCR job with a Seq of input channels and a Seq of output channels
Each mscr has a unique id, which is used for equality.
A Mscr has both input and output channels where input channels are doing mapping operations while output channels are doing grouping and reducing operations.
In order to optimise the processing a same value can be computed by several mapping functions and be directed to different grouping / reducing functions, each of them identified by a tag.
Common implementation of InputChannel for GbkInputChannel and FloatingInputChannel
Implementation of an OutputChannel for a Mscr
This trait processes the computation graph created out of DLists and creates map-reduce jobs from it.
This trait processes the computation graph created out of DLists and creates map-reduce jobs from it.
The algorithm consists in:
- building layers of independent nodes in the graph - finding the input nodes for the first layer - reaching "output" nodes from the input nodes - building output channels with those nodes - building input channels connecting the output to the input nodes - aggregating input and output channels as Mscr representing a full map reduce job - iterating on any processing node that is not part of a Mscr
An OutputChannel is responsible for emitting key/values grouped by one Gbk or passed through from an InputChannel with no grouping
An OutputChannel is responsible for emitting key/values grouped by one Gbk or passed through from an InputChannel with no grouping
Two OutputChannels are equal if they have the same tag. This tag is the id of the last processing node of the channel
encapsulation of expected value types for each tag
Utility functions to create Mscrs
Utility functions for Output channels
Output channel for a GroupByKey.
It can optionally have a reducer and / or a combiner applied to the grouped key/values.
The possible combinations are
There can not be gbk -> reducer -> combiner because in that case the second combiner is transformed as a parallelDo by the Optimiser