Hello members,
I have a stopwords list with about 10k distinct words.
The input data is another list with tens of millions of words.
I filter out the stopwords this way:
import scala.io.Source

val li = Source.fromFile("words.txt").getLines()
val stopwords = Source.fromFile("stopwords.txt").getLines().toList
val hash = scala.collection.mutable.Map[String, Int]()
for (x <- li) {
  if (!stopwords.contains(x)) {
    if (hash.contains(x)) hash(x) += 1 else hash(x) = 1
  }
}
val sorted = hash.toList.sortBy(-_._2)
sorted.take(30).foreach(println)
This runs too slowly. The optimization I have in mind is to convert the stopwords list into a hash-based structure such as a Map, and use Map.contains(key) for filtering. Do you have any further suggestions?
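To sketch that idea: the main cost above is stopwords.contains(x), which is a linear scan over a 10k-element List for every input word. Putting the stopwords into a Set gives average O(1) membership tests. The helper below is only an illustration (the function name topWords and its parameters are my own, not from any library); it takes iterators so it is easy to test, and you would feed it the same Source.fromFile(...).getLines() inputs as in your snippet.

```scala
import scala.collection.mutable

// Sketch: stopwords as a Set, so each membership test is O(1)
// on average instead of a linear scan over a List.
def topWords(words: Iterator[String],
             stopwords: Set[String],
             n: Int): List[(String, Int)] = {
  val counts = mutable.Map[String, Int]().withDefaultValue(0)
  for (w <- words if !stopwords(w)) counts(w) += 1
  // Sort by descending count and keep the top n.
  counts.toList.sortBy(-_._2).take(n)
}
```

Usage would be something like topWords(Source.fromFile("words.txt").getLines(), Source.fromFile("stopwords.txt").getLines().toSet, 30).foreach(println). The change from List to Set is the whole speedup; the counting loop itself is already linear in the input size.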
BTW, I don't know how Spark implements its DSL syntax object.isin(list), but it runs quite fast.