My code question, groupBy and reduceByKey stuff

pengyh · January 12, 2022, 12:46pm

The file content:

$ cat latest |head -5
[email protected],20
[email protected],4
[email protected],1
[email protected],2
[email protected],1

I want to get the domain counts from this file.
So I wrote this:

val li = Source.fromFile(file).getLines().toList
li.map( _.split(",")(0).split("@")(1) ).groupBy(x=>x).map{ case(x,y) => (x,y.size) }.toList.sortBy(-_._2)

It does work. The output:

val res17: List[(String, Int)] = List((gmail.com,5076), (redhat.com,172), (apache.org,166), (163.com,114), (hotmail.com,92), (gnu.org,88), (googlegroups.com,78), (freebsd.org,77), (qq.com,68), (google.com,62), (yahoo.com,61), (outlook.com,61), (intel.com,56),...

My questions are:

1. Is there a better way to write this?
2. groupBy(x=>x) works here, but if I replace it with groupBy(_), it doesn’t. Why?
3. Is there a native reduceByKey function in Scala (as in Spark)?

Thanks in advance.

BalmungSan · January 12, 2022, 12:52pm

You may use identity instead of x => x if you find that more readable (I do).
Then groupBy(_) expands to x => groupBy(x) instead of groupBy(x => x), which is what you want; this is, among many other reasons, why I personally dislike using _ for creating shorthands (although I do use them from time to time).
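
To make the expansion concrete, here is a small sketch with made-up domain values (the names domains, awaitingKeyFunction, and grouped are illustrative, not from the thread). A lone _ as the whole argument turns the enclosing call into a function that still expects the key function, so nothing is grouped yet; without an expected type the compiler cannot infer the parameter type, which is typically reported as a missing parameter type.

```scala
// Sample data, invented for illustration.
val domains = List("gmail.com", "apache.org", "gmail.com")

// Roughly what `domains.groupBy(_)` stands for: a function that still
// awaits the key function. Spelled out with an explicit type so it compiles.
val awaitingKeyFunction: (String => String) => Map[String, List[String]] =
  f => domains.groupBy(f)

// What was intended: group each element by itself.
val grouped: Map[String, List[String]] = domains.groupBy(identity)
```

Passing identity to awaitingKeyFunction gives the same map as grouped, which shows that the placeholder form only defers the call.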

For 1 & 3: yes, you may use groupMapReduce if you are on Scala 2.13 (but I doubt you are, given that you mention Spark).
Which would look like this (with li already mapped to the domains): li.groupMapReduce(identity)(_ => 1)(_ + _)
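
A minimal end-to-end sketch of that suggestion, assuming Scala 2.13 or later. The sample lines are invented and only mimic the "address,count" shape of the file; the names lines, counts, and sorted are illustrative.

```scala
// Sample lines in the same "address,count" shape as the file; addresses are invented.
val lines = List("alice@gmail.com,20", "bob@apache.org,4", "carol@gmail.com,1")

// Extract the domain from each line, as in the original post.
val domains = lines.map(_.split(",")(0).split("@")(1))

// Group by the domain itself, map every occurrence to 1,
// and sum the ones per key. Requires Scala 2.13+.
val counts: Map[String, Int] = domains.groupMapReduce(identity)(_ => 1)(_ + _)

// Sort by descending count, as before.
val sorted: List[(String, Int)] = counts.toList.sortBy(-_._2)
```

The third argument list plays the role of the function passed to Spark's reduceByKey: it combines two values that share a key.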

vpatryshev · January 12, 2022, 1:39pm

I’d rather focus on the virtual impossibility of testing this long line of code.
Split it into pieces, introducing variables, so that next time you refactor it, you won’t have to guess what it was about.

val domains = li map ( _.split(",")(0).split("@")(1))
val grouped = domains groupBy identity
val counted = grouped map  { case(x,y) => (x,y.size)  }
val sorted = counted.toList sortBy (-_._2)

Or you could, instead of splitting, have

val Pattern = "[^@]*@([^,]*),.*".r
val domains = li map { case Pattern(domain) => domain }
...
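
One caveat with the pattern version, sketched below with invented sample lines: map with a partial case throws a MatchError on any line the pattern does not match, whereas collect simply skips such lines. Whether silently skipping is acceptable depends on the data, so treat this as one option rather than the recommended one.

```scala
import scala.util.matching.Regex

// Same pattern as above: capture what lies between the first '@' and the first comma.
val Pattern: Regex = "[^@]*@([^,]*),.*".r

// Sample lines, invented for illustration; the last one has no '@'.
val lines = List("alice@gmail.com,20", "bob@apache.org,4", "not-an-address,1")

// collect applies the partial function only where it is defined,
// so the non-matching line is dropped instead of raising a MatchError.
val domains: List[String] = lines.collect { case Pattern(domain) => domain }
```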

ndas1971 · January 12, 2022, 1:45pm

You are right. One addition: Spark 3.2.0 has Scala 2.13 bindings.

pengyh · January 13, 2022, 1:12am

Thanks. Smart info.

sangamon · January 13, 2022, 12:51pm

Note that email addresses may contain multiple ‘@’, e.g. with quoted local part. In production code, you’ll probably also want to guard against malformed input in general.
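
A rough sketch of such a guard, under stated assumptions: the count is always the last comma-separated field, and the domain itself never contains '@'. The helper name domainOf and the sample lines are hypothetical. It takes the text after the last '@' of the address and returns None for lines it cannot interpret.

```scala
// Hypothetical helper: the domain of an "address,count" line, or None if malformed.
def domainOf(line: String): Option[String] = {
  // Split on the LAST comma, since a quoted local part may itself contain commas.
  val comma = line.lastIndexOf(',')
  if (comma < 0) None
  else {
    val address = line.substring(0, comma)
    // Use the LAST '@', since a quoted local part may itself contain '@'.
    val at = address.lastIndexOf('@')
    val domain = address.substring(at + 1)
    if (at < 0 || domain.isEmpty) None else Some(domain)
  }
}

// Sample lines, invented for illustration; only the first two are well-formed.
val lines = List("\"a@b\"@example.com,3", "alice@gmail.com,20", "no-at-sign,1", "alice@,2")

// Malformed lines yield None and are dropped by flatMap.
val domains: List[String] = lines.flatMap(line => domainOf(line))
```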