I have a txt file with one word on each line; the total file size is 33 MB.
I have uploaded the file to this URL in case you want to check it: https://cloudcache.net/data/words.txt.tgz
When I run a Scala script to count the words, I get this error:
$ scala countwords.sc
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.base/java.lang.StringUTF16.compress(StringUTF16.java:160)
at java.base/java.lang.String.<init>(String.java:3214)
at java.base/java.lang.String.<init>(String.java:276)
at java.base/java.io.BufferedReader.readLine(BufferedReader.java:358)
at java.base/java.io.BufferedReader.readLine(BufferedReader.java:392)
at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:73)
at scala.collection.immutable.List.prependedAll(List.scala:155)
at scala.collection.IterableOnceOps.toList(IterableOnce.scala:1251)
at scala.collection.IterableOnceOps.toList$(IterableOnce.scala:1251)
at scala.collection.AbstractIterator.toList(Iterator.scala:1296)
at Main$$anon$1.<init>(countwords.sc:4)
at Main$.main(countwords.sc:1)
at Main.main(countwords.sc)
The script content:
import scala.io.Source
val file = "words.txt"
val li = Source.fromFile(file).getLines().toList
val re = li.groupBy(identity).map { case(x,y) => (x,y.size) }.toList.sortBy(-_._2)
re.foreach { println }
I should be clearer: on line 4, you’re reading the entire file into memory – probably at least a couple of bytes per character, plus a fair amount of overhead. Depending on how you have the JVM configured, it would be pretty easy to completely fill the heap.
The more-typical Scala way to do this would be with a streaming library – fs2 or zio-streams or akka-streams or something – that processes the data as it comes in, rather than reading it all at once and then processing it, keeping the memory requirement much smaller.
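For illustration, here is a rough sketch of that approach with fs2 (my own, untested; it assumes fs2 3.x's fs2-io module and cats-effect 3 on the classpath, and the object name CountWords is invented):

```scala
import cats.effect.{IO, IOApp}
import fs2.io.file.{Files, Path}

object CountWords extends IOApp.Simple {
  def run: IO[Unit] =
    Files[IO]
      .readUtf8Lines(Path("words.txt"))      // emits one line at a time
      .fold(Map.empty[String, Int]) { (m, w) =>
        // only the accumulating counts map is kept in memory,
        // never the whole file
        m.updated(w, m.getOrElse(w, 0) + 1)
      }
      .compile
      .lastOrError
      .flatMap { counts =>
        IO.println(counts.toList.sortBy(-_._2).take(30).mkString("\n"))
      }
}
```

The stream is pulled lazily, so the heap footprint is dominated by the map of distinct words, not by the input size.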
You should remove the .toList, so you’re dealing with Iterator (which processes one item at a time) instead of List (which is always entirely in-memory).
I tried that just now with a 6-gigabyte input file and it ran fine.
The libraries @jducoeur mentions are good libraries, but Iterator is perfectly adequate for this task.
import scala.io.Source
val file = "words.txt"
val li = Source.fromFile(file).getLines()
val hash = scala.collection.mutable.Map[String,Int]()
for (x <- li) {
  if (hash.contains(x)) hash(x) += 1 else hash(x) = 1
}
val sorted = hash.toList.sortBy(-_._2)
sorted.take(30).foreach {println}
But the higher-order functions below don't work:
$ cat countwords.sc
import scala.io.Source
val file = "words.txt"
val li = Source.fromFile(file).getLines()
val re = li.groupBy(identity).map { case(x,y) => (x,y.size) }.toList.sortBy(-_._2)
re.foreach { println }
$ scala countwords.sc
countwords.sc:5: error: value groupBy is not a member of Iterator[String]
did you mean grouped?
val re = li.groupBy(identity).map { case(x,y) => (x,y.size) }.toList.sortBy(-_._2)
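That error is expected: groupBy is a strict operation that has to hold every group in memory at once, so Iterator deliberately doesn't provide it. One way to keep the higher-order style while still consuming the file one line at a time is to fold over the Iterator (a sketch of mine; countWords is an invented name, and the file-reading line is commented out so the snippet runs on its own):

```scala
// Fold over the Iterator: elements are consumed one at a time, so only
// the accumulating counts map lives in memory, never the whole file.
def countWords(words: Iterator[String]): Map[String, Int] =
  words.foldLeft(Map.empty[String, Int]) { (m, w) =>
    m.updated(w, m.getOrElse(w, 0) + 1)
  }

// With the real file this would be:
//   countWords(scala.io.Source.fromFile("words.txt").getLines())
//     .toList.sortBy(-_._2).take(30).foreach(println)
val demo = countWords(Iterator("to", "be", "or", "not", "to", "be"))
println(demo.toList.sortBy(-_._2))
```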
If you have a lot of these kinds of counting problems you could implement your own countBy method (expressed below in Scala 2 style and using mutability on the inside for performance):
implicit class RichIterableOnce[A](val it: IterableOnce[A]) extends AnyVal {
  def countBy[K](f: A => K): Map[K, Int] = {
    val map = collection.mutable.Map.empty[K, Int]
    for (i <- it.iterator) {   // .iterator: IterableOnce itself has no foreach
      map.updateWith(f(i)) {
        case None    => Some(1)
        case Some(n) => Some(n + 1)
      }
    }
    map.to(Map)
  }
}
Then your problem becomes:
li.countBy(identity).toList.sortBy(-_._2)
You can use the method on any collection extending IterableOnce (be aware that the collection might not be iterable again afterwards; this is the case for Iterator):
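A small self-contained illustration of that caveat (the countBy definition is repeated so the snippet runs as a standalone script, and extends AnyVal is dropped because value classes can't be defined inside a script wrapper):

```scala
implicit class RichIterableOnce[A](val it: IterableOnce[A]) {
  def countBy[K](f: A => K): Map[K, Int] = {
    val map = collection.mutable.Map.empty[K, Int]
    for (i <- it.iterator) {
      map.updateWith(f(i)) {
        case None    => Some(1)
        case Some(n) => Some(n + 1)
      }
    }
    map.to(Map)
  }
}

val words = Iterator("to", "be", "or", "not", "to", "be")
println(words.countBy(identity))  // counts each distinct word
println(words.hasNext)            // false: countBy consumed the Iterator

val xs = List(1, 2, 3, 4)
println(xs.countBy(_ % 2 == 0))   // a List is Iterable, so it is
println(xs.sum)                   // still usable afterwards
```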