How to use hashmap for counting

Sorry, I am a real newbie to Scala. Most of the time I use Python or Ruby for my work.

Here I have a file whose content is (each line with an email addr):

$ cat gh_emails_sorted.txt |head -3
***@t-online.de
***@freenet.de
***@web.de

I want to count how many mail domains there are, so I wrote the code below:

import scala.io.Source
import scala.collection.mutable.HashMap

val filename = "gh_emails_sorted.txt"
var hash = new HashMap()

for (line <- Source.fromFile(filename).getLines) {
  val x = line.stripLineEnd.split("@")
  val dom = x(1)
  if (hash.contains(dom)) hash(dom) += 1 else hash(dom) = 1
}

It doesn’t work at all. :)
I know I have little experience with Scala’s collections.
Can you help me with this issue?

Thanks in advance.
Regards.

Why are you using a HashMap, or why do you want to do it with a HashMap, when the “distinct” method can give you this result quickly and efficiently?

What does “it can’t work at all” mean? Compile error? Runtime exception? Wrong result?

Anyway, I would go ahead and assume the problem is here: var hash = new HashMap(). Try using this instead: var hash = Map.empty[String, Int]
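To make the diagnosis above concrete: new HashMap() without type parameters is inferred as HashMap[Nothing, Nothing], so String keys are rejected at compile time. Below is a minimal sketch of the original loop with explicit type parameters. The object and method names are made up for illustration, and lines stands in for Source.fromFile(filename).getLines. Note that the default Map is immutable, so the hash(dom) += 1 syntax needs the mutable variant used here.

```scala
import scala.collection.mutable

object HashMapCount {
  // Counts how often each domain (the part after "@") occurs.
  // `lines` is a stand-in for Source.fromFile(filename).getLines.
  def countDomains(lines: Iterator[String]): mutable.HashMap[String, Int] = {
    // Explicit type parameters: keys are Strings, values are Ints.
    val hash = mutable.HashMap.empty[String, Int]
    for (line <- lines) {
      val dom = line.split("@")(1)
      // Same logic as the original if/else: increment or start at 1.
      if (hash.contains(dom)) hash(dom) += 1 else hash(dom) = 1
    }
    hash
  }
}
```

Calling HashMapCount.countDomains(Source.fromFile(filename).getLines) should then give one entry per domain with its count.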

Also, if you are in 2.13 you can convert the Iterator into a View and use groupMapReduce instead.
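A rough sketch of that suggestion, assuming Scala 2.13 or later. The object and method names are hypothetical, and lines again stands in for the iterator returned by getLines. View.from wraps the iterator so that Iterable methods such as groupMapReduce become available.

```scala
import scala.collection.View

object GroupMapReduceCount {
  // `lines` stands in for Source.fromFile(filename).getLines.
  def countDomains(lines: Iterator[String]): Map[String, Int] =
    View.from(lines) // wrap the Iterator so Iterable methods are available
      // key: the domain; value: 1 per line; reduce: add the ones up
      .groupMapReduce(line => line.split("@")(1))(_ => 1)(_ + _)
}
```

Compared with groupBy followed by size, this avoids building an intermediate list of addresses per domain.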

One way:

scala.io.Source.fromFile("gh_emails_sorted.txt").getLines.toList.groupBy( email => email.split("@")(1)).map{ case(k,lstv) => (k, lstv.size)}

Hello pengyh,
Here is another: suppose I have the text file emails.txt:

hello@example.com
world@gmail.com
spam@pm.me
egg@web.de
scala@scala-lang.org
martin@gmail.com
pengyh@pm.me

It has 7 emails but 5 unique domains.
Now I have the file count.scala:

import scala.io.Source

val filename = "emails.txt"
val file = Source.fromFile(filename)

val emails =
  for
    line <- file.getLines
  yield
    line.split("@")(1)

emails.distinct.size

I load these into the REPL:

scala> :l count.scala
val filename: String = emails.txt
val file: scala.io.BufferedSource = <iterator>
val emails: Iterator[String] = <iterator>
val res4: Int = 5

Not sure if this is idiomatic Scala or the best/most efficient way to do it, but it’s just another way! Hopefully you find it useful!

That doesn’t count how many entries there were for each domain.

@BalmungSan Sorry, was that in response to me? Still new to this posting system.

Did I misunderstand the OP? I guess they said one thing, but tried to do another thing with their code?

Oh right, my fault actually for not replying to you.

Also, you are right: the problem description says one thing and the code tries to do another. Not sure which one the OP really needs. If they only need to know how many different domains there are, then yeah, your solution is pretty good; I personally would use map instead of for, but that is just a style thing.
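For reference, the map-based style could look roughly like this. It is only a sketch: the object and method names are invented, and lines stands in for file.getLines.

```scala
object DistinctDomains {
  // `lines` stands in for file.getLines.
  // Extract the domain of every address, drop duplicates, count what is left.
  def distinctDomainCount(lines: Iterator[String]): Int =
    lines.map(_.split("@")(1)).distinct.size
}
```

Like the for/yield version, this only answers "how many different domains", not "how many addresses per domain".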

Hello everybody,
Thanks for the answers.

In fact I just want to get a result similar to this one in Spark (I am pretty familiar with higher-order functions in PySpark):

rdd = sc.textFile("tmp/gh_emails_sorted.txt")
rdd.map(lambda x: (x.split('@')[1], 1)).reduceByKey(lambda x, y: x + y).sortBy(lambda x: x[1], ascending=False).collect()

[('web.de', 60), ('gmx.de', 59), ('t-online.de', 57), …

The result is sorted by each domain’s count, in descending order.
Do you have any idea how I can write a Scala program to implement this?

Thank you.

Hello

I finally resolved this as follows:

scala> val li = Source.fromFile(file).getLines

scala> li.toList.groupBy(x=>x.split("@")(1)).map{case(x,y) => (x,y.size)}.toList.sortWith( (x,y) => x._2 > y._2 )
val res37: List[(String, Int)] = List((web.de,60), (gmx.de,59), (t-online.de,57), (aol.com,18), (freenet.de,8), (hotmail.com,5), (gmx.net,5), …
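The same pipeline can also be wrapped in a small function that closes the file when it is done. This is only a sketch, assuming Scala 2.13 or later for scala.util.Using; the object name, method name and path parameter are made up for illustration.

```scala
import scala.io.Source
import scala.util.Using

object DomainRanking {
  // Reads one email address per line and returns (domain, count) pairs,
  // most frequent domain first. `path` is the file to analyse.
  def rank(path: String): List[(String, Int)] =
    // Using.resource closes the Source after the block has finished.
    Using.resource(Source.fromFile(path)) { source =>
      source.getLines()
        .toList
        .groupBy(_.split("@")(1))
        .map { case (domain, emails) => (domain, emails.size) }
        .toList
        .sortBy { case (_, count) => -count }
    }
}
```

The result list is fully built inside the block, so nothing reads from the file after it has been closed.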

Thank you everybody for the kind help.

Here is a variant using Iterable[A].groupMapReduce[K, B](key: A => K)(f: A => B)(reduce: (B, B) => B): Map[K,B] and string extraction:

li.groupMapReduce{case s"$_@$y" => y}(_ => 1)(_ + _).toSeq.sortBy(-_._2)
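One caveat with the pattern-matching variant: a line without "@" would throw a MatchError. A possible way around that, sketched here with hypothetical names, is to filter with collect first. It assumes lines is an in-memory collection such as li.toList, since groupMapReduce is defined on Iterable and not on Iterator.

```scala
object PatternCount {
  // `lines` is assumed to be an in-memory collection, e.g. li.toList.
  def rank(lines: List[String]): Seq[(String, Int)] =
    lines
      // Keep only lines of the form user@domain; anything else is skipped.
      .collect { case s"${user}@${domain}" => domain }
      // Count occurrences of each domain.
      .groupMapReduce(identity)(_ => 1)(_ + _)
      .toSeq
      // Highest count first.
      .sortBy { case (_, count) => -count }
}
```

Whether silently skipping malformed lines is the right behaviour depends on the data; logging or counting them separately may be preferable.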

In the end I used this:

scala> li2.groupBy(_.split("@")(1)).map{case(x,y) => (x,y.size)}.toList.sortBy(-_._2)
val res19: List[(String, Int)] = List((web.de,60), (gmx.de,59), (t-online.de,57)

Thanks buddies.
