Iterate over all the elements in an Iterable

sgaurav22 · December 4, 2018, 12:39pm

Can anyone tell me a good way to iterate over all the elements of the Iterable in this RDD:
rdd_43: org.apache.spark.rdd.RDD[((Int, String, String), Iterable[(Int, Int, Int, Int, Int, Int, Int)])] = ShuffledRDD[203] at groupByKey at <console>:115

I then want to apply the aggregate function sum to each field across the elements of the Iterable.

I grouped the data by the 1st, 2nd and 3rd elements of each tuple in:
rdd_42: org.apache.spark.rdd.RDD[(Int, String, String, Int, Int, Int, Int, Int, Int, Int)] = UnionRDD[201] at union at <console>:113

The final output should be an RDD[(Int, String, String, Int, Int, Int, Int, Int, Int, Int)]

Jasper-M · December 6, 2018, 11:23am

There’s not really a way that doesn’t look horrible (i.e. manually summing every separate element of the tuple), unless you want to use some pretty advanced stuff such as Shapeless.
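For reference, the manual version could look roughly like the sketch below. It is plain Scala with no Spark dependency, and the names `TupleSum`, `Counts`, `addCounts` and `sumAll` are made up for illustration.

```scala
object TupleSum {
  // The seven Int value fields from the question's Iterable.
  type Counts = (Int, Int, Int, Int, Int, Int, Int)

  // Hypothetical helper: adds two value tuples field by field.
  def addCounts(a: Counts, b: Counts): Counts =
    (a._1 + b._1, a._2 + b._2, a._3 + b._3, a._4 + b._4,
     a._5 + b._5, a._6 + b._6, a._7 + b._7)

  // Sums every tuple in one group; an empty group yields all zeros.
  def sumAll(values: Iterable[Counts]): Counts =
    values.foldLeft((0, 0, 0, 0, 0, 0, 0))(addCounts)
}
```

On the grouped RDD from the question, something like `rdd_43.mapValues(TupleSum.sumAll)` should then give one summed tuple per key, which still has to be flattened back into a 10-element tuple with a final `map`.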
First of all, I think you should avoid doing a groupByKey. For an aggregation like this you can use reduceByKey. And you might also want to consider creating some case classes to represent your data instead of raw tuples, especially when you have this many fields.
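A minimal sketch of that suggestion, assuming Spark is on the classpath and starting from the ungrouped `rdd_42`. The case class and field names (`GroupKey`, `Counts`, `c1` to `c7`, `SumByKey`) are placeholders, since the question does not say what the columns mean.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical names; the question does not say what the fields represent.
final case class GroupKey(id: Int, name: String, category: String)

final case class Counts(c1: Int, c2: Int, c3: Int, c4: Int, c5: Int, c6: Int, c7: Int) {
  // Field-by-field sum, used as the reduce function below.
  def +(o: Counts): Counts =
    Counts(c1 + o.c1, c2 + o.c2, c3 + o.c3, c4 + o.c4, c5 + o.c5, c6 + o.c6, c7 + o.c7)
}

object SumByKey {
  // Same shape as rdd_42 and the desired output in the question.
  type Row = (Int, String, String, Int, Int, Int, Int, Int, Int, Int)

  def sumByKey(rows: RDD[Row]): RDD[Row] =
    rows
      // Split each row into (key, values) so it can be reduced per key.
      .map { case (k1, k2, k3, a, b, c, d, e, f, g) =>
        (GroupKey(k1, k2, k3), Counts(a, b, c, d, e, f, g))
      }
      // Sums the values for each key without building an Iterable first.
      .reduceByKey(_ + _)
      // Flatten back into the 10-element tuple the question asks for.
      .map { case (GroupKey(k1, k2, k3), Counts(a, b, c, d, e, f, g)) =>
        (k1, k2, k3, a, b, c, d, e, f, g)
      }
}
```

Usage would be `SumByKey.sumByKey(rdd_42)`. Unlike groupByKey, reduceByKey combines values on each partition before the shuffle, so less data moves across the network.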