There are several flavors of the fold function. The simplest version of the function takes a sequence (of some sort), an associative function binary function (A,A)=>A and a zero value of type A.
I can’t use such a function to add up the reciprocals of an input sequence of numbers. E.g., given Seq(1.0, 2.0, 3.0, 4.0). I want to compute 1/1.0 + 1/2.0 + 1/3.0 + 1/4.0, even though that does match the fold function according to type signature. Of course if I map the input sequence to compute the inverses before calling fold everyone is happy.
Is there another flavour of fold which has this concept built in? i.e., once an element of the input has been consumed, the computed values can be combined in an associative way.
You could merge both steps explicitly using #foldLeft()…
inp.foldLeft(0.0)((a, c) => a + 1 / c)
…exploit Double's Numeric nature with #map() and #sum()…
inp.map(1 / _).sum
…or pull in Cats and use #foldMap() (requiring a concrete List instead of a Seq and relying on the implicit default Monoid instance for Double implementing addition).
Yes of course foldLeft is a possible choice. The thing I don’t like about foldLeft though it is carves in stone the computation order.
It seems to me that during the Coursera course a couple of years ago, there was some discussion of this problem, but I forget the details. The idea, is that foldLeft is not parallelizable because the computation order is fixed. Whereas, if the given function is associative, then the computation can be parallelized. And they offered a fix for problems where the combination was associative but the sequence consumption was not.
Sorry, I don’t remember the details anymore. I hope that vague (maybe wrong) description still rings a bell with someone.
I found what I was looking for, and it is a bit more complicated than I remembered. It is the aggregate method. Which takes two functions, first like given to foldLeft, but the second a combop which is used to combine intermediate results.
The example given in the documentation is how to add up the integer values a sequence of characters. At initial input, you have to convert the char to int, but when combining results, you simply add integers.
There are some interesting case where the concept of aggregate is more powerful than foldLeft. Although mathematically and semantically they are equivalent, computational they may have different characteristics.
Basically foldLeft makes the hidden assumption that the complexity of the operation is constant over the input. However, in cases where the complexity is in some sense also a function of the position in the input sequence, the aggregate approach might be faster, or less memory intensive. There are also some cases where the aggregate approach can produce less floating-point round-off error.
To me, aggregate is implemented with slightly the wrong abstraction.
I hope someone can correct me if I misunderstand something here.
But aggregate takes two functions
seqop: (B, A) ⇒ B and combop: (B, B) ⇒ B
whereas it should take
unaryop:A=>B and combop: (B, B) ⇒ B
and then should internally compute seqop as
(b:B,a:A)=>combop(b,unaryop(a)).
I believe it makes the assumption that the user’s given combop contains the same code as combop, but doesn’t have anyway to verify that invariant.
Perhaps (I’m guessing) it is (was) defined as it is/was, because unaryop is very often the identity function, and letting the user specify seqop allows a call to the identity function to be optimized away.
You describe something like mapReduce, whereas #aggregate() is more like foldReduce - the “nodes” are already expected to fold single results that only need to be combined once pulled together. I’d rather think of it as another abstraction than the wrong one…
The idea of aggregate is that the seqop is running locally ti each distributed task and then the combop is used to combine those partial results into a final result.
What you described is more like a foldMap that also doesn’t provide any information about how the task is to be distributed.
aggregate is only different and more powerful than foldLeft if it can be parallelized / distributed. Since the stdlib no longer provides those capabilities then the method is no longer necessary.
Being able to distribute a computation in parallel is of course one of the benefits of aggregate or foldMap. And I would not be surprised if that was indeed the motivating example. However, a procedure need not be implemented in a distributed way in order to reap benefits from smarter parenthesis placement in an associative computation.
I’m writing a research paper about this topic. The feedback that everyone has given so far is very useful. Many thanks
Looks like foldMap is part of scalaz and not part of the scala standard library, right? the binary operator is (I’ll guess) coming from the implicit monoid, rather than given to the foldMap function.
aggregate doesn’t really provide you with anything useful (semantically speaking), is actually a leaky abstraction.
That is actually the point of that operation, that you know when you are in non-distributed scenario and when you are. That way you can “cheat” and do “unsafe” things in seqop like using a mutable buffer instead to improve performance. Otherwise it has the same expressive power as fold
foldMap only gives you more expressive power than fold when the combine operation is automatically provided by the type (given a Monoid). It can also be used to provide a higher level framework for parallelization which would be completely transparent (contrary to aggregate with all the trade-offs that implies). But, I do not remember if just the associative laws of Monoid are enough, or if we also need a CommutativeMonoid
Yes it is part of cats(and I would guess also part from scalaz).
The stdlib doesn’t provide it, because we do not have Monoids in the stdlib and foldMap without Monoids is just fold. That is what I meant in my previous comment.
You don’t really need an explicit monoid in code form; You just need an operation and a zero, right? Yes, you can derive the monoid from that, but those are the same things you have to give fold anyway.