Performance question about filtering a big list

Hello members,

I have a stopwords list with about 10k distinct words.
The input is another list with xx million words.

I filter out the stopwords this way:

import scala.io.Source

val li = Source.fromFile("words.txt").getLines()
val stopwords = Source.fromFile("stopwords.txt").getLines().toList
val hash = scala.collection.mutable.Map[String,Int]()

for (x <- li) {
    if ( ! stopwords.contains(x) ) {
      if (hash.contains(x)) hash(x) += 1 else hash(x) = 1
    }
}

val sorted = hash.toList.sortBy(-_._2)
sorted.take(30).foreach {println}

This runs too slowly. The optimization I have in mind is to convert the stopwords list to a Map and use Map.contains(key) for the filtering. Do you have any further suggestions?

BTW, I don’t know how Spark implements its DSL syntax, object.isin(list), but it runs quite fast.

Hello members,

I did a small test on a subset of the data. Yes, the hash-based way is much faster.
The improved code:

import scala.io.Source

val li = Source.fromFile("words.txt").getLines()
val stopwords = Source.fromFile("stopwords.txt").getLines().toList
val hash = scala.collection.mutable.Map[String,Int]()

val hash_sw = stopwords.map((_, 1)).toMap  // dummy value 1; only the keys matter

for (x <- li) {
    if ( ! hash_sw.contains(x) ) {
      if (hash.contains(x)) hash(x) += 1 else hash(x) = 1
    }
}

val sorted = hash.toList.sortBy(-_._2)
sorted.take(30).foreach {println}

The comparison of running speed:

$ time scala countwords-byhash2.sc 
(send,20987)
(message,17516)
(unsubscribe,15541)
(2021,15221)
(list,13017)
(mailing,12402)
(mail,11647)
(file,11133)
(flink,10114)
(email,9919)
(pm,9248)
(group,8865)
(problem,8853)
(code,8659)
(data,8657)
(2020,8398)
(received,8246)
(google,7921)
(discussion,7920)
(jan,7893)
(groups,7730)
(subscribed,7726)
(visit,7716)
(return,7685)
(view,7623)
(time,7541)
(web,7514)
(receiving,7413)
(emails,7213)
(error,7201)

real	0m4.347s
user	0m6.964s
sys	0m0.280s

$ time scala countwords-byhash.sc 
(send,20987)
(message,17516)
(unsubscribe,15541)
(2021,15221)
(list,13017)
(mailing,12402)
(mail,11647)
(file,11133)
(flink,10114)
(email,9919)
(pm,9248)
(group,8865)
(problem,8853)
(code,8659)
(data,8657)
(2020,8398)
(received,8246)
(google,7921)
(discussion,7920)
(jan,7893)
(groups,7730)
(subscribed,7726)
(visit,7716)
(return,7685)
(view,7623)
(time,7541)
(web,7514)
(receiving,7413)
(emails,7213)
(error,7201)

real	0m19.046s
user	0m21.585s
sys	0m0.185s

Do you think there is any better way than this?
I found that Scala follows Perl’s principle: there is more than one way to do it!
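The same counting can also be written in one pass with `groupMapReduce` (Scala 2.13+): group by the word, map each occurrence to 1, and sum. A sketch with a small inline word list standing in for the files:

```scala
// Inline sample data standing in for words.txt / stopwords.txt.
val words     = List("send", "the", "send", "a", "message", "send")
val stopwords = Set("the", "a")

// One pass: key = the word itself, each occurrence maps to 1, reduce by +.
// Set[String] extends String => Boolean, so it works directly as a predicate.
val counts: Map[String, Int] =
  words.filterNot(stopwords).groupMapReduce(identity)(_ => 1)(_ + _)

val sorted = counts.toList.sortBy(-_._2)
sorted.take(30).foreach(println)
```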

Thank you.

You probably want to use a Set instead of a Map – in general, when you only care about the key, not the value, that’s a Set. (Performance-wise, they’re basically the same.)


Use a Set instead of a Map, as Justin said.
Like this:

val stopwords = Source.fromFile("stopwords.txt").getLines().toSet

Also, if the file is that big, you may want to look into fs2 or Akka Streams.
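With that change, the whole loop looks like this (a sketch; small inline data stands in for the `Source.fromFile` calls):

```scala
// Inline stand-ins for Source.fromFile("words.txt") / ("stopwords.txt").
val li        = Iterator("send", "the", "send", "a", "message", "send")
val stopwords = Set("the", "a")   // Set.contains is an O(1) hash lookup

val hash = scala.collection.mutable.Map[String, Int]()

for (x <- li)
  if (!stopwords.contains(x))
    hash(x) = hash.getOrElse(x, 0) + 1   // single lookup-and-update idiom

val sorted = hash.toList.sortBy(-_._2)
sorted.take(30).foreach(println)
```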


Yes, that works. The Set version is even a little faster (I have tested it many times).

Set:
real	0m4.012s
user	0m6.549s
sys	0m0.188s

Map:
real	0m4.344s
user	0m7.061s
sys	0m0.209s

Thank you.