How to use hashmap for counting

pengyh · January 11, 2022, 11:38am

Sorry, I am really a newbie to Scala. Most of the time I use Python or Ruby for my work.

I have a file with one email address per line:

$ cat gh_emails_sorted.txt |head -3
***@t-online.de
***@freenet.de
***@web.de

I want to count how many mail domains there are, so I wrote the code below:

import scala.io.Source
import scala.collection.mutable.HashMap

val filename = "gh_emails_sorted.txt"
var hash = new HashMap()

for (line <- Source.fromFile(filename).getLines) {
  val x = line.stripLineEnd.split("@")
  val dom = x(1)
  if (hash.contains(dom)) hash(dom) += 1 else hash(dom) = 1
}

It can’t work at all.
Yes, I know I have little experience with Scala’s collections.
Can you help me with my issue?

Thanks in advance.
Regards.

LeonardoC · January 11, 2022, 11:53am

Why use a HashMap at all when the “distinct” method can give you this result quickly and efficiently?

BalmungSan · January 11, 2022, 12:39pm

What does “it can’t work at all” mean? Compile error? Runtime exception? Wrong result?

Anyway, I would go ahead and assume the problem is here: var hash = new HashMap(). Try using this instead: var hash = Map.empty[String, Int]
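For reference, here is a minimal sketch of the original loop with the map's key and value types spelled out. It assumes the failure was a compile error: an untyped `new HashMap()` is inferred as `HashMap[Nothing, Nothing]`, so `hash.contains(dom)` is rejected. The name `countWithHashMap` and the sample addresses are made up for illustration.

```scala
import scala.collection.mutable

// Hypothetical helper: counts how often each domain occurs in the given lines.
def countWithHashMap(lines: Iterator[String]): mutable.HashMap[String, Int] = {
  // Key and value types are explicit so that contains/apply/update type-check.
  val hash = mutable.HashMap.empty[String, Int]
  for (line <- lines) {
    val parts = line.trim.split("@")
    // Skip lines that do not split into exactly two parts instead of failing on parts(1).
    if (parts.length == 2) {
      val dom = parts(1)
      // getOrElse covers the first occurrence, so no contains check is needed.
      hash(dom) = hash.getOrElse(dom, 0) + 1
    }
  }
  hash
}

// Made-up sample data standing in for the file contents.
val sample = Iterator("a@web.de", "b@gmx.de", "c@web.de")
val counts = countWithHashMap(sample)
```

Note that an immutable `Map.empty[String, Int]` has no `update` method, so with an immutable map the loop body would need something like `hash = hash.updated(dom, hash.getOrElse(dom, 0) + 1)` instead of `hash(dom) += 1`.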

Also, if you are in 2.13 you can convert the Iterator into a View and use groupMapReduce instead.
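A rough sketch of the groupMapReduce idea, assuming Scala 2.13 or later. To keep it simple it materializes the lines into a List rather than a View, since `Iterator` itself has no `groupMapReduce`. The helper name `countByDomain` and the sample addresses are hypothetical.

```scala
// Hypothetical helper; groupMapReduce needs Scala 2.13 or later.
def countByDomain(lines: Iterator[String]): Map[String, Int] =
  lines
    .toList // materialize first: groupMapReduce is not available on Iterator
    // key = domain, value = 1 per line, reduce = sum of the ones per domain
    .groupMapReduce(_.split("@")(1))(_ => 1)(_ + _)

// Made-up sample data standing in for the file contents.
val domainCounts = countByDomain(Iterator("a@web.de", "b@gmx.de", "c@web.de"))
```

Compared with groupBy followed by map, this avoids building an intermediate list of lines per domain.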

ndas1971 · January 11, 2022, 12:43pm

one way

scala.io.Source.fromFile("gh_emails_sorted.txt").getLines.toList.groupBy( email => email.split("@")(1)).map{ case(k,lstv) => (k, lstv.size)}
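A possible variation on the one-liner above that also closes the file and skips malformed lines. This is only a sketch: it assumes Scala 2.13 or later (for `scala.util.Using`) and that a line without exactly one "@" should be ignored. The helper name `countDomainsInFile` is made up.

```scala
import scala.io.Source
import scala.util.Using

// Hypothetical helper: reads the file, counts domains, and closes the file afterwards.
def countDomainsInFile(path: String): Map[String, Int] =
  Using.resource(Source.fromFile(path)) { source =>
    source.getLines()
      .map(_.trim.split("@"))
      .collect { case Array(_, domain) => domain } // drops lines that do not split into two parts
      .toList                                      // read everything before the file is closed
      .groupBy(identity)
      .map { case (domain, occurrences) => (domain, occurrences.size) }
  }
```

Calling `countDomainsInFile("gh_emails_sorted.txt")` would then return a `Map` from domain to count.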

spamegg1 · January 11, 2022, 3:52pm

Hello pengyh,
Here is another: suppose I have the text file emails.txt:

[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]

It has 7 emails but 5 unique domains.
Now I have the file count.scala:

import scala.io.Source

val filename = "emails.txt"
val file = Source.fromFile(filename)

val emails =
  for
    line <- file.getLines
  yield
    line.split("@")(1)

emails.distinct.size

I load these into the REPL:

scala> :l count.scala
val filename: String = emails.txt
val file: scala.io.BufferedSource = <iterator>
val emails: Iterator[String] = <iterator>
val res4: Int = 5

Not sure if this is idiomatic Scala or the best/most efficient way to do it, but it’s just another way! Hopefully you find it useful!

BalmungSan · January 11, 2022, 3:58pm

That doesn’t count how many entries there were for each domain.

spamegg1 · January 11, 2022, 5:31pm

@BalmungSan Sorry, was that in response to me? Still new to this posting system.

Did I misunderstand the OP? I guess they said one thing, but tried to do another thing with their code?

BalmungSan · January 11, 2022, 5:47pm

Oh right, my fault actually for not replying to you.

Also, you are right: the problem description says one thing and the code tries to do another. I am not sure which one OP really needs. If they only need to know how many different domains there are, then yes, your solution is pretty good; I personally would use map instead of for, but that is just a style thing.
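The map-based style mentioned here might look like the sketch below. It assumes Scala 2.13 or later, where `Iterator` has `distinct`; the helper name `distinctDomainCount` and the sample addresses are hypothetical.

```scala
// Hypothetical helper: number of different domains in the given lines.
def distinctDomainCount(lines: Iterator[String]): Int =
  lines.map(_.split("@")(1)).distinct.size

// Made-up sample data standing in for the file contents.
val uniqueDomains = distinctDomainCount(Iterator("a@web.de", "b@gmx.de", "c@web.de"))
```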

pengyh · January 12, 2022, 2:06am

Hello everybody,
Thanks for the answers.

In fact, I just want a result similar to this one from Spark (I am fairly familiar with higher-order functions in PySpark):

rdd = sc.textFile("tmp/gh_emails_sorted.txt")
rdd.map(lambda x: (x.split('@')[1], 1)).reduceByKey(lambda x, y: x + y).sortBy(lambda x: x[1], ascending=False).collect()

[('web.de', 60), ('gmx.de', 59), ('t-online.de', 57), …

The result is sorted by each domain’s count, in descending order.
Do you have any idea how I can write a Scala program that does this?

Thank you.

pengyh · January 12, 2022, 6:09am

Hello

I finally resolved this as follows:

scala> val li = Source.fromFile(file).getLines

scala> li.toList.groupBy(x=>x.split("@")(1)).map{case(x,y) => (x,y.size)}.toList.sortWith( (x,y) => x._2 > y._2 )
val res37: List[(String, Int)] = List((web.de,60), (gmx.de,59), (t-online.de,57), (aol.com,18), (freenet.de,8), (hotmail.com,5), (gmx.net,5), …

Thank you, everybody, for the kind help.

odd · January 12, 2022, 8:07am

Here is a variant using Iterable[A].groupMapReduce[K, B](key: A => K)(f: A => B)(reduce: (B, B) => B): Map[K,B] and string extraction:

li.groupMapReduce{case s"$_@$y" => y}(_ => 1)(_ + _).toSeq.sortBy(-_._2)
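One caveat: earlier in the thread `li` was defined as the `Iterator` returned by `getLines`, and `groupMapReduce` is defined on `Iterable`, not on `Iterator`, so the lines need to be materialized first. The sketch below assumes Scala 2.13 or later and uses a hypothetical helper name `rankedDomains`. It also skips lines without an "@" (a bare `case` in `groupMapReduce` would throw a `MatchError` on them) and breaks ties alphabetically, which is an added assumption.

```scala
// Hypothetical helper: domain counts, highest count first.
def rankedDomains(lines: Iterator[String]): Seq[(String, Int)] =
  lines
    .collect { case s"$user@$domain" => domain } // lines without "@" are skipped
    .toList                                      // Iterator has no groupMapReduce
    .groupMapReduce(identity)(_ => 1)(_ + _)
    .toSeq
    .sortBy { case (domain, count) => (-count, domain) } // descending count, then name

// Made-up sample data standing in for the file contents.
val ranked = rankedDomains(Iterator("a@web.de", "b@gmx.de", "c@web.de", "no-at-sign"))
```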

pengyh · January 12, 2022, 9:51am

I finally went with this:

scala> li2.groupBy(_.split("@")(1)).map{case(x,y) => (x,y.size)}.toList.sortBy(-_._2)
val res19: List[(String, Int)] = List((web.de,60), (gmx.de,59), (t-online.de,57)

Thanks buddies.