I have a txt file with one word on each line; the total file size is 33 MB.
I have uploaded the file to this URL in case you want to check it: https://cloudcache.net/data/words.txt.tgz
When I run a Scala script to count the words, I get this error:
$ scala countwords.sc
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.base/java.lang.StringUTF16.compress(StringUTF16.java:160)
at java.base/java.lang.String.<init>(String.java:3214)
at java.base/java.lang.String.<init>(String.java:276)
at java.base/java.io.BufferedReader.readLine(BufferedReader.java:358)
at java.base/java.io.BufferedReader.readLine(BufferedReader.java:392)
at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:73)
at scala.collection.immutable.List.prependedAll(List.scala:155)
at scala.collection.IterableOnceOps.toList(IterableOnce.scala:1251)
at scala.collection.IterableOnceOps.toList$(IterableOnce.scala:1251)
at scala.collection.AbstractIterator.toList(Iterator.scala:1296)
at Main$$anon$1.<init>(countwords.sc:4)
at Main$.main(countwords.sc:1)
at Main.main(countwords.sc)
The script content:
import scala.io.Source
val file = "words.txt"
val li = Source.fromFile(file).getLines().toList
val re = li.groupBy(identity).map { case(x,y) => (x,y.size) }.toList.sortBy(-_._2)
re.foreach { println }
I should be clearer: on line 4, you’re reading the entire file into memory – probably at least a couple of bytes per character, plus a fair amount of overhead. Depending on how you have the JVM configured, it would be pretty easy to completely fill the heap.
The more-typical Scala way to do this would be with a streaming library – fs2 or zio-streams or akka-streams or something – that processes the data as it comes in, rather than reading it all at once and then processing it, keeping the memory requirement much smaller.
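For illustration, here is a rough sketch of that approach with fs2 (my own, untested; it assumes fs2 3.x's fs2-io module and cats-effect 3 on the classpath, and the object name CountWords is invented):

```scala
import cats.effect.{IO, IOApp}
import fs2.io.file.{Files, Path}

object CountWords extends IOApp.Simple {
  def run: IO[Unit] =
    Files[IO]
      .readUtf8Lines(Path("words.txt"))      // emits one line at a time
      .fold(Map.empty[String, Int]) { (m, w) =>
        // only the accumulating counts map is kept in memory,
        // never the whole file
        m.updated(w, m.getOrElse(w, 0) + 1)
      }
      .compile
      .lastOrError
      .flatMap { counts =>
        IO.println(counts.toList.sortBy(-_._2).take(30).mkString("\n"))
      }
}
```

The stream is pulled lazily, so the heap footprint is dominated by the map of distinct words, not by the input size.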
You should remove the .toList, so you’re dealing with Iterator (which processes one item at a time) instead of List (which is always entirely in-memory).
I tried that just now with a 6-gigabyte input file and it ran fine.
The libraries @jducoeur mentions are good libraries, but Iterator is perfectly adequate for this task.
import scala.io.Source
val file = "words.txt"
val li = Source.fromFile(file).getLines()
val hash = scala.collection.mutable.Map[String,Int]()
for (x <- li) {
  if (hash.contains(x)) hash(x) += 1 else hash(x) = 1
}
val sorted = hash.toList.sortBy(-_._2)
sorted.take(30).foreach {println}
But the higher-order functions below don't work:
$ cat countwords.sc
import scala.io.Source
val file = "words.txt"
val li = Source.fromFile(file).getLines()
val re = li.groupBy(identity).map { case(x,y) => (x,y.size) }.toList.sortBy(-_._2)
re.foreach { println }
$ scala countwords.sc
countwords.sc:5: error: value groupBy is not a member of Iterator[String]
did you mean grouped?
val re = li.groupBy(identity).map { case(x,y) => (x,y.size) }.toList.sortBy(-_._2)
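That error is expected: groupBy is a strict operation that has to hold every group in memory at once, so Iterator deliberately doesn't provide it. One way to keep the higher-order style while still consuming the file one line at a time is to fold over the Iterator (a sketch of mine; countWords is an invented name, and the file-reading line is commented out so the snippet runs on its own):

```scala
// Fold over the Iterator: elements are consumed one at a time, so only
// the accumulating counts map lives in memory, never the whole file.
def countWords(words: Iterator[String]): Map[String, Int] =
  words.foldLeft(Map.empty[String, Int]) { (m, w) =>
    m.updated(w, m.getOrElse(w, 0) + 1)
  }

// With the real file this would be:
//   countWords(scala.io.Source.fromFile("words.txt").getLines())
//     .toList.sortBy(-_._2).take(30).foreach(println)
val demo = countWords(Iterator("to", "be", "or", "not", "to", "be"))
println(demo.toList.sortBy(-_._2))
```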
If you have a lot of these kinds of counting problems you could implement your own countBy method (expressed below in Scala 2 style and using mutability on the inside for performance):
implicit class RichIterableOnce[A](val it: IterableOnce[A]) extends AnyVal {
  def countBy[K](f: A => K): Map[K, Int] = {
    val map = collection.mutable.Map.empty[K, Int]
    for (i <- it.iterator) {   // .iterator: IterableOnce itself has no foreach
      map.updateWith(f(i)) {
        case None    => Some(1)
        case Some(n) => Some(n + 1)
      }
    }
    map.to(Map)
  }
}
Then your problem becomes:
li.countBy(identity).toList.sortBy(-_._2)
You can use the method on any collection extending IterableOnce (be aware that the collection might not be iterable again afterwards; this is the case for Iterator):
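A small self-contained illustration of that caveat (the countBy definition is repeated so the snippet runs as a standalone script, and extends AnyVal is dropped because value classes can't be defined inside a script wrapper):

```scala
implicit class RichIterableOnce[A](val it: IterableOnce[A]) {
  def countBy[K](f: A => K): Map[K, Int] = {
    val map = collection.mutable.Map.empty[K, Int]
    for (i <- it.iterator) {
      map.updateWith(f(i)) {
        case None    => Some(1)
        case Some(n) => Some(n + 1)
      }
    }
    map.to(Map)
  }
}

val words = Iterator("to", "be", "or", "not", "to", "be")
println(words.countBy(identity))  // counts each distinct word
println(words.hasNext)            // false: countBy consumed the Iterator

val xs = List(1, 2, 3, 4)
println(xs.countBy(_ % 2 == 0))   // a List is Iterable, so it is
println(xs.sum)                   // still usable afterwards
```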