There are several flavors of the
fold function. The simplest version takes a sequence (of some sort), an associative binary function
(A, A) => A and a zero value of type A.
I can’t use such a function to add up the reciprocals of an input sequence of numbers. E.g., given
Seq(1.0, 2.0, 3.0, 4.0), I want to compute
1/1.0 + 1/2.0 + 1/3.0 + 1/4.0, even though that operation does match the type signature of
fold. Of course, if I map the input sequence to compute the reciprocals before calling
fold, everyone is happy.
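In snippet form, the map-then-fold version looks like this (names are just from the example above):

```scala
val inp: Seq[Double] = Seq(1.0, 2.0, 3.0, 4.0)
// map to reciprocals first, then fold with plain (associative) addition
val total = inp.map(1.0 / _).fold(0.0)(_ + _) // 25/12 ≈ 2.0833
```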
Is there another flavour of fold which has this concept built in? i.e., once an element of the input has been consumed, the computed values can be combined in an associative way.
You could merge both steps explicitly using
inp.foldLeft(0.0)((a, c) => a + 1 / c)
…or exploit the Numeric nature of Double with
inp.map(1 / _).sum
…or pull in Cats and use
#foldMap() (requiring a concrete
List instead of a
Seq and relying on the default implicit
Monoid instance for
Double, which implements addition).
inp.toList.foldMap(1 / _)
There are probably more options…
Yes, of course
foldLeft is a possible choice. The thing I don’t like about
foldLeft, though, is that it carves the computation order in stone.
It seems to me that during the Coursera course a couple of years ago, there was some discussion of this problem, but I forget the details. The idea is that foldLeft is not parallelizable because the computation order is fixed, whereas, if the given function is associative, the computation can be parallelized. And they offered a fix for problems where the combination was associative but the sequence consumption was not.
Sorry, I don’t remember the details anymore. I hope that vague (maybe wrong) description still rings a bell with someone.
I found what I was looking for, and it is a bit more complicated than I remembered. It is the aggregate method, which takes two functions: the first like the one given to foldLeft, and the second a
combop which is used to combine intermediate results.
The example given in the documentation is how to add up the integer values of a sequence of characters. For each input element you have to convert the Char to an Int, but when combining intermediate results, you simply add Ints.
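That documented example, written out by hand here (since aggregate itself is deprecated in 2.13, this just shows the shape — run sequentially, it reduces to a foldLeft with seqop):

```scala
val chars: Seq[Char] = Seq('a', 'b', 'c')

// seqop: consume one Char into the running Int total
val seqop: (Int, Char) => Int = (acc, c) => acc + c.toInt
// combop: combine two partial Int totals (only relevant across partitions)
val combop: (Int, Int) => Int = _ + _

val total = chars.foldLeft(0)(seqop) // 97 + 98 + 99 = 294
```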
Hmm… As of 2.13, this is deprecated on
IterableOnce (and implemented in terms of foldLeft).
So is the combine-op function even used, in that
foldLeft-only implementation?
Unless you are using a parallel collection, it is still serial and probably runs from left to right.
There are some interesting cases where the concept of aggregate is more powerful than
foldLeft. Although mathematically and semantically they are equivalent, computationally they may have different characteristics.
foldLeft makes the hidden assumption that the complexity of the operation is constant over the input. However, in cases where the complexity is in some sense also a function of the position in the input sequence, the aggregate approach might be faster, or less memory intensive. There are also some cases where the aggregate approach can produce less floating-point round-off error.
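As an illustration of the round-off point, an associative combine lets you sum in a balanced tree shape rather than strictly left to right; this is a hand-rolled sketch, not anything the stdlib does for you:

```scala
// Pairwise (balanced) summation: split, sum the halves, combine.
// Mathematically equal to a left fold, but the tree shape typically
// accumulates less floating-point round-off on long inputs.
def pairwiseSum(xs: Vector[Double]): Double =
  if (xs.length <= 2) xs.sum
  else {
    val (l, r) = xs.splitAt(xs.length / 2)
    pairwiseSum(l) + pairwiseSum(r)
  }

val xs = Vector.fill(1000)(0.1)
val left = xs.foldLeft(0.0)(_ + _) // strictly left-to-right
val tree = pairwiseSum(xs)         // balanced combination
```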
aggregate is implemented with slightly the wrong abstraction.
I hope someone can correct me if I misunderstand something here.
aggregate takes two functions,
seqop: (B, A) ⇒ B and
combop: (B, B) ⇒ B,
whereas it should take
unaryop: A ⇒ B and
combop: (B, B) ⇒ B,
and then should internally compute
seqop(b, a) = combop(b, unaryop(a)).
I believe the current design makes the assumption that the user’s given
seqop behaves like
combop composed with unaryop, but doesn’t have any way to verify that invariant.
Perhaps (I’m guessing) it is (was) defined as it is/was, because
unaryop is very often the identity function, and letting the user specify
seqop allows a call to the identity function to be optimized away.
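A sketch of that alternative shape (aggregate2 and unaryop are made-up names for illustration, and this serial version is just a foldLeft underneath):

```scala
// Hypothetical variant: take unaryop and combop, derive seqop internally,
// so the "seqop is consistent with combop" invariant holds by construction.
def aggregate2[A, B](xs: Seq[A])(z: B)(unaryop: A => B, combop: (B, B) => B): B =
  xs.foldLeft(z)((b, a) => combop(b, unaryop(a)))

// The reciprocal sum from the start of the thread:
aggregate2(Seq(1.0, 2.0, 3.0, 4.0))(0.0)(1.0 / _, _ + _) // 25/12 ≈ 2.0833
```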
…but if it’s implemented via
#foldLeft(), computational differences are rather unlikely.
You describe something like mapReduce, whereas
#aggregate() is more like foldReduce - the “nodes” are already expected to fold single results that only need to be combined once pulled together. I’d rather think of it as another abstraction than the wrong one…
The idea of
aggregate is that the
seqop runs locally in each distributed task, and then the
combop is used to combine those partial results into a final result.
What you described is more like a
foldMap that also doesn’t provide any information about how the task is to be distributed.
Exactly my point. The implementation via
foldLeft obviates the potential benefits, other than syntactic ones.
aggregate is only different from, and more powerful than,
foldLeft if it can be parallelized / distributed. Since the stdlib no longer provides those capabilities, the method is no longer necessary.
Being able to distribute a computation in parallel is of course one of the benefits of
foldMap. And I would not be surprised if that was indeed the motivating example. However, a procedure need not be implemented in a distributed way in order to reap benefits from smarter parenthesis placement in an associative computation.
I’m writing a research paper about this topic. The feedback that everyone has given so far is very useful. Many thanks
foldMap is part of Scalaz and not part of the Scala standard library, right? The binary operator is (I’ll guess) coming from the implicit Monoid, rather than being passed explicitly to the call.
aggregate doesn’t really provide you with anything useful (semantically speaking); it is actually a leaky abstraction.
That is actually the point of that operation: you know when you are in a non-distributed scenario and when you are not. That way you can “cheat” and do “unsafe” things in
seqop, like using a mutable buffer, to improve performance. Otherwise it has the same expressive power as foldLeft.
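A sketch of that “cheat”: each partition folds into its own mutable buffer (safe because only one task ever touches it), and buffers are merged by combop. The two-way split here is done by hand just to make the partition boundary visible:

```scala
import scala.collection.mutable.ArrayBuffer

val xs = (1 to 10).toVector

// seqop: append into a task-local mutable buffer (unsafe to share, fine per task)
val seqop: (ArrayBuffer[Int], Int) => ArrayBuffer[Int] = (buf, x) => buf += x * x
// combop: merge two partial buffers into one
val combop: (ArrayBuffer[Int], ArrayBuffer[Int]) => ArrayBuffer[Int] = _ ++= _

// two "partitions" folded independently, then combined
val (l, r) = xs.splitAt(xs.length / 2)
val result = combop(
  l.foldLeft(ArrayBuffer.empty[Int])(seqop),
  r.foldLeft(ArrayBuffer.empty[Int])(seqop)
)
result.toVector // the squares of 1..10
```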
foldMap only gives you more expressive power than
fold when the combine operation is automatically provided by the type (given a
Monoid). It can also be used to provide a higher-level framework for parallelization which would be completely transparent (contrary to
aggregate, with all the trade-offs that implies). But I do not remember if just the associativity laws of
Monoid are enough, or if we also need a commutative Monoid.
Yes, it is part of Cats (and I would guess also part of Scalaz).
The stdlib doesn’t provide it because we do not have
Monoids in the stdlib, and
foldMap without
Monoids is just
fold. That is what I meant in my previous comment.
You don’t really need an explicit monoid in code form; you just need an operation and a zero, right? Yes, you can derive the monoid from that, but those are the same things you have to give to fold.
That is the point.
The whole point of foldMap is that I just give a function
A => B and it will fold a collection of
As into a single
B, given that
B has a
Monoid.
If I also have to pass the
zero element and the
combine operation, I do not win much with a
foldMap method, since it is just a fold.
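The shape being described, sketched without pulling in Cats (this Monoid and foldMap are hand-rolled stand-ins for the real type class and method):

```scala
// Minimal stand-in for the Cats machinery, just to show where
// the zero and the combine operation come from.
trait Monoid[B] {
  def empty: B
  def combine(x: B, y: B): B
}

implicit val doubleAddition: Monoid[Double] = new Monoid[Double] {
  def empty: Double = 0.0
  def combine(x: Double, y: Double): Double = x + y
}

// The caller only supplies A => B; zero and combine come from the implicit.
def foldMap[A, B](xs: List[A])(f: A => B)(implicit M: Monoid[B]): B =
  xs.foldLeft(M.empty)((b, a) => M.combine(b, f(a)))

foldMap(List(1.0, 2.0, 3.0, 4.0))(1.0 / _) // 25/12 ≈ 2.0833
```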