Reading a large data file

I have an ASCII (text) data file that is around 2 GB in size. I am trying to read it line by line and save a reduced version of it. Something like this:

for (line <- io.Source.fromFile(fileName).getLines) println(line)

Actually, I am reducing the number of lines and the size of each line, but that is irrelevant here. When I run it, I get:

Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit

I assume this is happening because the entire file is getting loaded into memory. Is there a simple way to read and process one line at a time without loading the entire file? Thanks.

I would suggest using some kind of streaming.

My personal choice would be fs2; the main page of its documentation shows a similar use case: reading a file, transforming it, and writing it back out.

However, if you are not familiar with the cats ecosystem and functional programming in general, and don’t want to be, you may look into Akka Streams.

PS: Since getLines returns an Iterator, it surprises me that it produces a memory error, especially one related to an array size. Maybe it is the way you are trying to write the file to disk?

Is it really irrelevant? Do you get that error with the snippet you showed? That would be very surprising, especially since there is no array in sight.


A good option you should look into is https://github.com/lihaoyi/os-lib


Or you can use:

java.nio.file.Files.lines: public static Stream<String> lines(Path path, Charset cs)
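As a rough sketch of how that Java method might be used from Scala 2.13: the object name LinesExample, the method reduceFile, and the keep predicate are made up for illustration, and UTF-8 is assumed as the charset (plain ASCII is a subset of it). Note that Files.lines still builds each line as one String, so it reads lazily only when the file really contains newlines.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._
import scala.util.Using

object LinesExample {
  // Copies `in` to `out` one line at a time, keeping only the lines accepted
  // by `keep`. Returns the number of lines written.
  // Both the line stream and the writer are closed when the block exits.
  def reduceFile(in: Path, out: Path)(keep: String => Boolean): Long =
    Using.resources(
      Files.lines(in, StandardCharsets.UTF_8),
      Files.newBufferedWriter(out, StandardCharsets.UTF_8)
    ) { (lines, writer) =>
      var written = 0L
      // The Java Stream is consumed lazily through its iterator.
      lines.iterator().asScala.filter(keep).foreach { line =>
        writer.write(line)
        writer.newLine()
        written += 1
      }
      written
    }
}
```

Something like LinesExample.reduceFile(inPath, outPath)(_.nonEmpty) would then drop the empty lines while copying.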

Yes, the error does seem to be triggered by the snippet that I showed. I reduced the whole program to just executing the following function called copyFile to eliminate other possibilities.

def copyFile(fileName: String): Unit = { // for testing
  for (line <- io.Source.fromFile(fileName).getLines) println(line)
}

The offending array appears to be in java.lang.AbstractStringBuilder. Here is the full stack trace:

Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:569)
at java.lang.StringBuffer.append(StringBuffer.java:369)
at java.io.BufferedReader.readLine(BufferedReader.java:370)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:73)
at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576)
at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1196)
at trajspec.test.ReduceTerrainDataFile$.copyFile(TerrainMap.scala:208)
at trajspec.test.ReduceTerrainDataFile$.main(TerrainMap.scala:203)
at trajspec.test.ReduceTerrainDataFile.main(TerrainMap.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at scala.reflect.internal.util.ScalaClassLoader.$anonfun$run$2(ScalaClassLoader.scala:105)
at scala.reflect.internal.util.ScalaClassLoader$$Lambda$81/1327763628.apply(Unknown Source)
at scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:40)
at scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:37)
at scala.reflect.internal.util.ScalaClassLoader$URLClassLoader.asContext(ScalaClassLoader.scala:130)
at scala.reflect.internal.util.ScalaClassLoader.run(ScalaClassLoader.scala:105)
at scala.reflect.internal.util.ScalaClassLoader.run$(ScalaClassLoader.scala:97)
at scala.reflect.internal.util.ScalaClassLoader$URLClassLoader.run(ScalaClassLoader.scala:130)
at scala.tools.nsc.CommonRunner.run(ObjectRunner.scala:29)
at scala.tools.nsc.CommonRunner.run$(ObjectRunner.scala:28)
at scala.tools.nsc.ObjectRunner$.run(ObjectRunner.scala:44)
at scala.tools.nsc.CommonRunner.runAndCatch(ObjectRunner.scala:35)
at scala.tools.nsc.CommonRunner.runAndCatch$(ObjectRunner.scala:34)
at scala.tools.nsc.MainGenericRunner.runTarget$1(MainGenericRunner.scala:70)

Can you provide a link to a sample file? Maybe each line is too long? (However, that would mean a line of more than 2,147,483,647 characters, which seems extremely weird.)
Or maybe Source has a memory leak?
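One way to test the long-line theory without hitting the same error is to scan the file through a fixed-size buffer and measure the longest run of characters between newlines. This is only a sketch: the names LineLengthCheck and longestLine are invented here, and a '\r' from Windows line endings is counted as an ordinary character.

```scala
import java.io.Reader

object LineLengthCheck {
  // Returns the length (in chars) of the longest line in `reader`.
  // Only `bufferSize` chars are held in memory at a time, so this works
  // even when a single line is larger than the heap.
  def longestLine(reader: Reader, bufferSize: Int = 8192): Long = {
    val buf = new Array[Char](bufferSize)
    var longest = 0L
    var current = 0L
    var n = reader.read(buf)
    while (n != -1) {
      var i = 0
      while (i < n) {
        if (buf(i) == '\n') {
          longest = math.max(longest, current)
          current = 0L
        } else current += 1
        i += 1
      }
      n = reader.read(buf)
    }
    // Account for a last line that has no trailing newline.
    math.max(longest, current)
  }
}
```

For a real file, the reader could be a java.io.FileReader that is closed afterwards, for example via scala.util.Using.resource.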

Where exactly does println print to?

Ahhhhhh yes. Thank you. Your remark about long lines reminded me of what the problem is.

The original file was a huge matrix of terrain elevation data in Matlab format. I had someone convert it to text so I could parse it in Scala. In the conversion, I had asked for the elements to be comma separated and the rows of the matrix to be marked with a “|” character. That was a big mistake. I should have asked for newlines for easier parsing. So I did a global substitution of newlines for |, and everything was fine. But I did that at home while on holiday break. Then I came back to work today and tried to recreate the matrix – but I forgot to substitute the newlines! So the entire file was one line! Sorry for the wasted time – mine and yours!
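For what it is worth, that substitution of newlines for "|" can also be done in Scala without ever holding the whole file in memory, by copying it through a fixed-size character buffer. A minimal sketch, where RowSeparatorFix and translate are hypothetical names and the 8192-char buffer is an arbitrary choice:

```scala
import java.io.{Reader, Writer}

object RowSeparatorFix {
  // Copies all characters from `in` to `out`, replacing every `from`
  // character with `to`. Memory use is bounded by the buffer size,
  // regardless of how long the input "line" is.
  def translate(in: Reader, out: Writer, from: Char = '|', to: Char = '\n'): Unit = {
    val buf = new Array[Char](8192)
    var n = in.read(buf)
    while (n != -1) {
      var i = 0
      while (i < n) {
        if (buf(i) == from) buf(i) = to
        i += 1
      }
      out.write(buf, 0, n)
      n = in.read(buf)
    }
    out.flush()
  }
}
```

With a FileReader as input and a FileWriter (ideally wrapped in a BufferedWriter) as output, the result is a file that getLines can iterate over row by row.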


No problem!

Anyways, streaming would be good if the file is that big.
Also, instead of doing those text manipulations, you may prefer to read the file as a stream of bytes, decode it into a stream of chars, and process those with some kind of finite state machine.
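A very small version of such a state machine might look like the sketch below. It assumes the format described earlier in the thread (comma-separated values, rows ended by "|" or a newline) and that the values parse as Double; MatrixReader, foreachRow, and the two state names are invented for the example. Empty fields such as "1,,2" are silently skipped, which is a simplification.

```scala
import java.io.Reader

object MatrixReader {
  // The two parser states: between values, or in the middle of one.
  private sealed trait State
  private case object BetweenValues extends State
  private case object InValue extends State

  // Reads characters one at a time and calls `onRow` for each completed row.
  // Only the current row is kept in memory. Pass a BufferedReader for speed.
  def foreachRow(in: Reader, rowSep: Char = '|')(onRow: Vector[Double] => Unit): Unit = {
    val token = new StringBuilder
    val row = Vector.newBuilder[Double]
    var rowHasValues = false
    var state: State = BetweenValues

    // Finishes the value being accumulated, if there is one.
    def endValue(): Unit = if (state == InValue) {
      row += token.toString.toDouble
      token.clear()
      rowHasValues = true
      state = BetweenValues
    }

    // Finishes the current row and hands it to the callback if non-empty.
    def endRow(): Unit = {
      endValue()
      if (rowHasValues) {
        onRow(row.result())
        row.clear()
        rowHasValues = false
      }
    }

    var c = in.read()
    while (c != -1) {
      val ch = c.toChar
      if (ch == rowSep || ch == '\n') endRow()
      else if (ch == ',' || ch.isWhitespace) endValue()
      else { token += ch; state = InValue }
      c = in.read()
    }
    endRow() // flush a final row that has no trailing separator
  }
}
```

Because rows are handed to the callback as they complete, the reduction step (dropping rows or columns) can be done inside onRow without a separate pass over the file.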

Maybe you could allocate a buffer in memory first. Here the getLines method ended up putting the whole file into memory, because the file was a single line.