Run out of memory

dear community,

I have a txt file who has one word on each line, the total file size is 33MB.
I have uploaded the file to this URL for your check:
https://cloudcache.net/data/words.txt.tgz

While I run a scala script to count the words, its get error:

$ scala countwords.sc
Exception in thread “main” java.lang.OutOfMemoryError: Java heap space
at java.base/java.lang.StringUTF16.compress(StringUTF16.java:160)
at java.base/java.lang.String.(String.java:3214)
at java.base/java.lang.String.(String.java:276)
at java.base/java.io.BufferedReader.readLine(BufferedReader.java:358)
at java.base/java.io.BufferedReader.readLine(BufferedReader.java:392)
at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:73)
at scala.collection.immutable.List.prependedAll(List.scala:155)
at scala.collection.IterableOnceOps.toList(IterableOnce.scala:1251)
at scala.collection.IterableOnceOps.toList$(IterableOnce.scala:1251)
at scala.collection.AbstractIterator.toList(Iterator.scala:1296)
at Main$$anon$1.(countwords.sc:4)
at Main$.main(countwords.sc:1)
at Main.main(countwords.sc)

The script content:

import scala.io.Source

val file = "words.txt"
val li = Source.fromFile(file).getLines().toList
val re = li.groupBy(identity).map { case(x,y) => (x,y.size) }.toList.sortBy(-_._2)

re.foreach { println }

Please help. Thank you a lot.

BTW, my host has 4G dedicated ram, and double AMD cores.

It’s so strange. I don’t run any group/sort by below code. It’s still get out of memory.

import scala.io.Source

val file = "words.txt"
val li = Source.fromFile(file).getLines().toList
val hash = scala.collection.mutable.Map[String,Int]()

for (x <- li) {
    if (hash.contains(x)) hash(x) += 1 else hash(x) = 1
}

//val sorted = hash.toList.sortBy(-_._2)
//sorted.take(30).foreach {println}

hash.take(20).foreach { println }

If I run this one liner perl, it will finish the job quickly without any error.

$ perl -nle 'chomp; $hash{$_}++; END {for (sort {$hash{$b} <=> $hash{$a}} keys %hash) {print "$_,$hash{$_}"}}' words.txt |head -20
the,318261
to,247694
a,128015
and,116926
is,109919
i,100410
in,93621
of,89180
on,87029
this,82799
for,81037
you,80874
at,72918
that,72910
it,65811
with,49132
be,48190
from,47701
not,47459
if,44930

So, please help. Thank you.

How big is the file?

I should be clearer: on line 4, you’re reading the entire file into memory – probably at least a couple of bytes per character, plus a fair amount of overhead. Depending on how you have the JVM configured, it would be pretty easy to completely fill the heap.

The more-typical Scala way to do this would be with a streaming library – fs2 or zio-streams or akka-streams or something – that processes the data as it comes in, rather than reading it all at once and then processing it, keeping the memory requirement much smaller.

You should remove the .toList, so you’re dealing with Iterator (which processes one item at a time) instead of List (which is always entirely in-memory).

I tried that just now with a 6-gigabyte input file and it ran fine.

The libraries @jducoeur mentions are good libraries, but Iterator is perfectly adequate for this task.

3 Likes

Thanks a lot @SethTisue

After using Iterator this code can work:

import scala.io.Source

val file = "words.txt"
val li = Source.fromFile(file).getLines()
val hash = scala.collection.mutable.Map[String,Int]()

for (x <- li) {
    if (hash.contains(x)) hash(x) += 1 else hash(x) = 1
}

val sorted = hash.toList.sortBy(-_._2)
sorted.take(30).foreach {println}

But the higher functions below can’t work:

$ cat countwords.sc 
import scala.io.Source

val file = "words.txt"
val li = Source.fromFile(file).getLines()
val re = li.groupBy(identity).map { case(x,y) => (x,y.size) }.toList.sortBy(-_._2)

re.foreach { println }

$ scala countwords.sc 
countwords.sc:5: error: value groupBy is not a member of Iterator[String]
did you mean grouped?
val re = li.groupBy(identity).map { case(x,y) => (x,y.size) }.toList.sortBy(-_._2)

Do you have the further suggestions?

Thanks.

You are right that the groupBy operations don’t exist on iterator, you can overcome this by doing one of the following.

converting to a LazyList then using groupMapReduce (since you want a histogram)

using foldLeft and manually implementing the grouping algorithm.

example scastie: Scastie - An interactive playground for Scala.

2 Likes

If you have a lot of these kind of counting problems you could implement your own countBy method (expressed below in Scala 2 style and using mutability on the inside for performance):

implicit class RichIterableOnce[A](val it: IterableOnce[A]) extends AnyVal {
    def countBy[K](f: A => K): Map[K, Int] = {
      val map = collection.mutable.Map.empty[K, Int]
      for (i <- it) {
        map.updateWith(f(i)) {
          case None => Some(1)
          case Some(n) => Some(n + 1)
        }
      }
      map.to(Map)
    }
  }

Then your problem becomes:

li.countBy(identity).toList.sortBy(-_._2)

You can use the method on any collection extending IterableOnce (be aware that the collection might not be iterable again afterwards; this is the case for Iterator):

List("Cat", "Ape", "Bear", "Ant", "Anaconda", "Camel").countBy(_.charAt(0))
res1: Map[Char, Int] = Map('A' -> 3, 'B' -> 1, 'C' -> 2)

The equivalent in Scala 3 style is left as an exercise for the reader.

1 Like