My code question, groupBy and reduceByKey stuff

The file content:

$ cat latest |head -5
ss@suse.de,20
zz@google.com,4
yy@gmail.com,1
xx@fullstory.com,2
tt@gmail.com,1

I want to get the domain counts from this file.
So I wrote this:

import scala.io.Source

val li = Source.fromFile(file).getLines().toList
li.map( _.split(",")(0).split("@")(1) ).groupBy(x=>x).map{ case(x,y) => (x,y.size) }.toList.sortBy(-_._2)

It works. The output:

val res17: List[(String, Int)] = List((gmail.com,5076), (redhat.com,172), (apache.org,166), (163.com,114), (hotmail.com,92), (gnu.org,88), (googlegroups.com,78), (freebsd.org,77), (qq.com,68), (google.com,62), (yahoo.com,61), (outlook.com,61), (intel.com,56),...

My questions are:

  1. Is there a better way to write this?
  2. groupBy(x=>x) works here, but if I replace it with groupBy(_), it doesn't. Why?
  3. Is there a native reduceByKey function in Scala (as in Spark)?

Thanks in advance.

You may use identity instead of x => x if you find that more readable (I do).
As for groupBy(_): it expands to x => groupBy(x), not to groupBy(x => x), which is what you want. This is one of many reasons why I personally dislike using _ to create shorthands (although I do use them from time to time).
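
To make that expansion concrete, here is a small sketch for the REPL or a Scala script. The sample list and the names groupWith, viaPlaceholder and viaLambda are made up for illustration. Given an explicit expected type, domains.groupBy(_) does compile, but only as a function that is still waiting for the key function:

```scala
// Made-up sample data, not the file from the question.
val domains = List("gmail.com", "suse.de", "gmail.com")

// `domains.groupBy(_)` means `f => domains.groupBy(f)`:
// a function that takes the key function as its argument.
val groupWith: (String => String) => Map[String, List[String]] = domains.groupBy(_)

// Passing `identity` to it gives the same result as groupBy(x => x).
val viaPlaceholder = groupWith(identity)
val viaLambda = domains.groupBy(x => x)
```

Without the type annotation the compiler has no way to infer the parameter type of the expanded function, which is why the bare groupBy(_) is rejected.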

For 1 and 3: yes, you can use groupMapReduce if you are on Scala 2.13 (though I doubt you are, given that you mention Spark).
With li being the list of domains (that is, after the map step), it would look like this: li.groupMapReduce(identity)(_ => 1)(_ + _)
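
For reference, a minimal end-to-end sketch of that approach, assuming Scala 2.13 or later. The sample lines are the five shown in the question. The names lines, domains, counts and sorted are only for illustration, and the tie-break on the domain name is my own addition to make the order deterministic:

```scala
// The five sample lines from the question, in "address,count" form.
val lines = List(
  "ss@suse.de,20",
  "zz@google.com,4",
  "yy@gmail.com,1",
  "xx@fullstory.com,2",
  "tt@gmail.com,1"
)

// Extract the domain from each line, as in the original code.
val domains = lines.map(_.split(",")(0).split("@")(1))

// Group by the domain, map every element to 1, and sum the ones per key.
val counts: Map[String, Int] = domains.groupMapReduce(identity)(_ => 1)(_ + _)

// Sort by descending count, then by domain name to break ties.
val sorted = counts.toList.sortBy { case (domain, n) => (-n, domain) }
```

This does the grouping and the reduction in one pass over the list, which is roughly what reduceByKey gives you in Spark.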


I’d rather focus on the fact that this long line of code is virtually impossible to test.
Split it into pieces by introducing variables, so that the next time you refactor it, you won’t have to guess what it was about.

val domains = li map ( _.split(",")(0).split("@")(1))
val grouped = domains groupBy identity
val counted = grouped map  { case(x,y) => (x,y.size)  }
val sorted = counted.toList sortBy (-_._2)

Or, instead of splitting, you could use a pattern:

val Pattern = "[^@]*@([^,]*),.*".r
val domains = li map { case Pattern(domain) => domain }
...
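
One caveat with the pattern version: map with a partial function throws a MatchError on any line that does not match. A possible variation, sketched here with made-up sample lines, is to use collect, which silently skips such lines:

```scala
val Pattern = "[^@]*@([^,]*),.*".r

// Made-up input; the middle line has no '@' and does not match the pattern.
val lines = List("ss@suse.de,20", "not-an-address", "tt@gmail.com,1")

// `collect` applies the partial function only where it is defined, so the
// non-matching line is dropped instead of raising a MatchError.
val domains = lines.collect { case Pattern(domain) => domain }
```

Whether dropping bad lines silently is acceptable depends on the data; logging or counting them may be the better choice.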

You are right. One addition: Spark 3.2.0 has a Scala 2.13 build.

Thanks, useful info.

Note that email addresses may contain multiple ‘@’, e.g. with quoted local part. In production code, you’ll probably also want to guard against malformed input in general.
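
A rough sketch of what such a guard could look like. The helper domainOf is hypothetical, and it assumes the count is always the last comma-separated field and that the domain is whatever follows the last '@' in the address. It is not a full address validator:

```scala
// Hypothetical helper: returns the domain of an "address,count" line,
// or None if the line does not have a usable address.
def domainOf(line: String): Option[String] = {
  // Assumption: the count is the last field, so cut at the last comma.
  val comma = line.lastIndexOf(',')
  val address = if (comma >= 0) line.substring(0, comma) else line
  // Use the last '@' so that a quoted local part containing '@' is handled.
  val at = address.lastIndexOf('@')
  if (at > 0 && at < address.length - 1) Some(address.substring(at + 1))
  else None
}

// Made-up input: a quoted local part, a line without '@', a normal line,
// and an address with an empty domain.
val lines = List("\"a@b\"@example.com,3", "broken-line,1", "tt@gmail.com,1", "x@,2")

// flatMap keeps only the lines for which a domain was found.
val domains = lines.flatMap(domainOf)
```

Returning Option keeps the decision about malformed lines (skip, log, fail) with the caller.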
