Fastparse runs out of memory for simple rules

I wrote following parses using fastparse library to get words in pairs. On execution it parses using parsePair properly. But for parseMultiPairs it goes into 100% CPU usage and then exit with OutOfMemoryError.

import fastparse.*, MultiLineWhitespace.*

case class Pair(a: String, b: String)

def parsePair[$: P]: P[Pair] = P(CharIn("a-z").repX.! ~ CharIn("a-z").repX.!).map(Pair(_, _))

def parsePairs[$: P]: P[Seq[Pair]] = P(parsePair.rep ~ End)

println(parse("gnu linux", parsePair(_)))
println(parse("gnu linux", parsePairs(_)))

I thought maybe I am not writing the grammar correctly. So I wrote similar grammar in parsy library of Python as:

from parsy import *

word = regex(r'[a-z]+')
pair = word.sep_by(regex(r' +'), min=2, max=2)
pairs = pair.sep_by(regex(r'[ \n]+'))

print(word.parse('hello'))
print(pair.parse('hello world'))
print(pairs.parse('hello world hey there\ngood bye'))

It works as expected and is able to parse all inputs. What change do I need with fastparse to make it work?

1 Like

Got a reply from Haoyi themself here on Discord. Issue with my fastparse spec was my definition of word as CharIn("a-z").repX which is 0 or more characters, instead of CharIn("a-z").repX(1) which is 1 or more characters.

Following is the complete working code:

object Test extends App:
    import fastparse.*, MultiLineWhitespace.*

    case class Pair(a: String, b: String)

    def parsePair[$: P]: P[Pair] = P(CharIn("a-z").repX(1).! ~ CharIn("a-z").repX(1).!).map(Pair(_, _))

    def parseMultiPairs[$: P]: P[Seq[Pair]] = P(parsePair.rep ~ End)

    println(parse("gnu linux", parsePair(_)))
    println(parse("hello world hey there\ngood bye", parseMultiPairs(_)))
end Test

Side note: using rep with a parser that can consume 0 input is a somewhat common class of bug. The implementation of cats-parse removes this class of bug by separating into two types parsers that can parse 0 or more vs 1 or more, and only the 1 or more type has the .rep method:

A second class of bug this library prevents is exponential blow-up in parsing time due to overuse of backtracking. In cats-parse, backtracking is opt-in vs opt-out. You should really try to absolutely minimize use of backtracking in order to avoid accidentally exponential parsers.

4 Likes

Thanks! That is a crucial piece of info.

1 Like