Scoobi includes a growing set of reusable components, ready for use in your application. They are located in com.nicta.scoobi.lib; take a look at the API documentation to get a real sense of how to use them.
Understanding how Grouping works is important before moving on to the Scoobi libraries, because the libraries rely on an implicit Grouping to determine how values are grouped. While the defaults are normally very sane, in some cases they may not be what you want. For instance, if you're doing a full outer join grouped on an Option[T], it is quite likely you would not want Nones to be grouped (and cartesian-product expanded) together. To prevent this, you would specify a Grouping implicit that keeps two Nones from grouping together.
In the future, this will be more explicit, and some convenience instances will be provided for you out of the box.
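The effect such a Grouping implicit achieves can be sketched in plain Scala (no Scoobi dependency; Scoobi's actual Grouping trait has a different, lower-level interface). Here, Some keys group as usual, while each None-keyed record stays in its own singleton group instead of being collected (and cartesian-product expanded) with every other None:

```scala
// Sketch of the desired semantics: group (Option[T], V) pairs so that
// equal Some keys group together, but Nones never group with each other.
// groupOptionKeys is a hypothetical helper, not part of Scoobi's API.
def groupOptionKeys[T, V](pairs: Seq[(Option[T], V)]): Seq[(Option[T], Seq[V])] = {
  val (somes, nones) = pairs.partition(_._1.isDefined)
  // Some keys group as usual...
  val grouped = somes.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }.toSeq
  // ...while each None-keyed pair forms its own singleton group.
  grouped ++ nones.map { case (k, v) => (k, Seq(v)) }
}

val data = Seq((Some(1), "a"), (Some(1), "b"), (None, "x"), (None, "y"))
val groups = groupOptionKeys(data)
// The two None records end up in two separate groups.
```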
Often you have two collections, and would like to collate them somehow. While you might expect the logic to be built around a DList[A], a DList[B] and a predicate (A, B) => Boolean producing a DList[(A, B)], that could not be implemented (remotely) efficiently in this manner, and would be rather cumbersome. So instead our join code is structured around having a DList[(K, A)] and a DList[(K, B)], where the K has to have an Ordering available (a requirement for grouping the keys). From this, you generally end up with a DList[(K, (A, B))].
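The semantics of this keyed equi-join can be modelled in plain Scala with ordinary sequences (a sketch of what the join computes, not of Scoobi's distributed implementation):

```scala
// Plain-Scala model of an equi-join on the common key K:
// Seq[(K, A)] joined with Seq[(K, B)] gives Seq[(K, (A, B))].
def equiJoin[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] =
  for {
    (k, a)  <- left
    (k2, b) <- right
    if k == k2
  } yield (k, (a, b))

val people = Seq((1, "alice"), (2, "bob"))
val scores = Seq((1, 90), (1, 85), (3, 70))
val joined = equiJoin(people, scores)
// joined == Seq((1, ("alice", 90)), (1, ("alice", 85)))
// Key 2 (no score) and key 3 (no person) drop out, as in an inner join.
```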
In addition to a normal (equi)join, you often want to do something for a K when there are no corresponding values in the left (A) or right (B) DList. This forms the basis of left, right, inner and outer joins. See Wikipedia for an almost tutorial-like walkthrough of the different join types.
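A left outer join, for instance, can be modelled in plain Scala as follows (again a semantic sketch only; the Option in the result marks keys with no match on the right):

```scala
// Plain-Scala model of a left outer join: every left record is kept,
// with None where the right side has no matching key.
def leftJoin[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, Option[B]))] =
  left.flatMap { case (k, a) =>
    val matches = right.collect { case (`k`, b) => b }
    if (matches.isEmpty) Seq((k, (a, None)))
    else matches.map(b => (k, (a, Some(b))))
  }

val people = Seq((1, "alice"), (2, "bob"))
val scores = Seq((1, 90))
val out = leftJoin(people, scores)
// out == Seq((1, ("alice", Some(90))), (2, ("bob", None)))
```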
Sometimes when dealing with a DList[(K, (A, B))] where keys repeat, it's easier to work with the form DList[(K, (Iterable[A], Iterable[B]))], where each K is unique. If this is what you need, it's a lot more efficient to use coGroup than a relational join followed by a groupByKey.
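What coGroup computes can be modelled in plain Scala like this (a sketch of the result shape, not Scoobi's implementation): one entry per distinct key, carrying all the left values and all the right values for that key, including keys present on only one side.

```scala
// Plain-Scala model of coGroup:
// Seq[(K, A)] and Seq[(K, B)] become Map[K, (Seq[A], Seq[B])].
def coGroup[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Map[K, (Seq[A], Seq[B])] = {
  val ls = left.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
  val rs = right.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
  (ls.keySet ++ rs.keySet).map { k =>
    k -> (ls.getOrElse(k, Seq.empty), rs.getOrElse(k, Seq.empty))
  }.toMap
}

val grouped = coGroup(Seq((1, "a"), (1, "b"), (2, "c")), Seq((1, 10), (3, 30)))
// grouped(1) == (Seq("a", "b"), Seq(10))
// grouped(3) == (Seq(),         Seq(30))   -- key only on the right
```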
DMatrix[Elem, V] provides you with a distributed matrix, currently backed by a coordinate-oriented DList ( DList[((Elem, Elem), V)] ), which makes it ideal for storing huge sparse matrices. When multiplying by vectors that can fit in memory, it often makes sense to convert it to a RowWise or ColWise matrix and use it from there. There are implicit conversions that can happen, but remember the conversion isn't cheap -- so try to save the result if you're going to keep using it (e.g. when running an iterative algorithm with the same matrix).
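The coordinate (COO) representation and a matrix-by-in-memory-vector multiply can be sketched in plain Scala (indices are Ints here for simplicity; this models the arithmetic, not the distributed execution):

```scala
// Sparse matrix as ((row, col), value) entries -- the shape behind DMatrix --
// multiplied by a vector small enough to hold in memory as a Map.
def matVec(m: Seq[((Int, Int), Double)], v: Map[Int, Double]): Map[Int, Double] =
  m.flatMap { case ((r, c), x) => v.get(c).map(vc => (r, x * vc)) } // per-entry products
   .groupBy(_._1)                                                   // gather by row
   .map { case (r, terms) => r -> terms.map(_._2).sum }             // sum each row

val m = Seq(((0, 0), 2.0), ((0, 1), 1.0), ((1, 1), 3.0))
val v = Map(0 -> 1.0, 1 -> 2.0)
val result = matVec(m, v)
// result == Map(0 -> 4.0, 1 -> 6.0)
```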
DVector[Elem, V] is in the same spirit as DMatrix: it's backed by a coordinate-form DList ( DList[(Elem, V)] ), ideal for huge and sparse vectors.
InMemVector[Elem, V] and InMemDenseVector[V] are for vectors that can fit into memory. They are based around a DObject, which means a copy gets efficiently sent to each mapper. For things like matrix-by-vector operations, this can result in a huge speedup. The obvious catch is that they need to fit into memory.
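The idea can be sketched in plain Scala: a coordinate-form sparse vector (the DVector shape) dotted with a dense in-memory vector (the InMemDenseVector shape). In Scoobi the dense vector would travel in a DObject to each mapper; here it is just a local IndexedSeq.

```scala
// Dot product of a sparse coordinate vector with a dense in-memory vector.
// Only the indices present in the sparse side contribute to the sum.
def dot(sparse: Seq[(Int, Double)], dense: IndexedSeq[Double]): Double =
  sparse.map { case (i, x) => x * dense(i) }.sum

val sparse = Seq((0, 1.5), (2, 2.0))
val dense  = IndexedSeq(2.0, 5.0, 0.5)
val d = dot(sparse, dense)
// d == 4.0  (1.5 * 2.0 + 2.0 * 0.5)
```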