Parsing combinators

I’ve found myself in need of parsing strings of the form:

r + e + c + u - r - s - i - o - n    t + o - o - e + a - s - y
r + e - c - u - r - s + i + o - n    t - o + o - e + a - s - y

r + e - c + u + r - s - i - o + n    t + o - o + e - a - s + y
r - e + c + u - r + s - i + o - n    t - o + o + e - a - s + y
                                     t - o - o - e + a + s + y

r - e - c - u + r - s - i + o + n    t + o - o - e - a + s - y
                                     t - o + o - e - a + s - y

I decided to try using parsing combinators:

class ParseAll extends RegexParsers:
   override val whiteSpace: Regex = """\h+""".r

   def nl: Parser[Any]  = """\v""".r     // one newline
   def nl2: Parser[Any] = """\v{2,}""".r // two or more newlines

   def all: Parser[Set[Set[Int]]] = repsep(par, nl2) <~ rep(nl) ^^ (_.toSet) // sequence of paragraphs

   def par: Parser[Set[Int]] = rep1sep(line, nl) ^^ (_.toSet.flatten) // sequence of lines

   def line: Parser[Set[Int]] = repNM(1, 2, sum) ^^ (_.toSet) // 1 or 2 sums

   def sum: Parser[Int] = chainl1(letter, plus | minus)

   def plus: Parser[(Int, Int) => Int] = "+" ^^ (_ => (x: Int, y: Int) => x + y)

   def minus: Parser[(Int, Int) => Int] = "-" ^^ (_ => (x: Int, y: Int) => x - y)

   def letter: Parser[Int] = """\p{Alpha}""".r ^^ (_.head.toInt) // single letter

This is concise and does exactly what I need. (The columns in the input are actually meaningful but I plan to write another parser for that.) My question is more of a matter of style and use of features. Is this the right way to handle the fact that blank lines are meaningful in the input? For instance, the <~ rep(nl) in the first rule looks weird to me, but I need to handle potential newlines at the end of the file. Also, is there a more generic way of saying: “I want repetitions to be parsed as sets instead of sequences”?

Thanks in advance for any feedback.

I’d think it’s natural and unavoidable for syntactically relevant whitespace to show up in the grammar. (Occasionally I might even be tempted to move these to a declaration of their own in order to have a speaking name, e.g. def termLines: Parser[Unit] = rep(nl))

As for the Set rep result - just for reference, in cats-parse there’s a mechanism for rep target mapping, so you could do

given[A]: Accumulator0[A, Set[A]] =
  () => Appender.fromBuilder(Set.newBuilder[A])

val line: Parser[Set[Int]] = sum.repAs(1, 2)

…but this doesn’t seem to extend to the repSep variants, and recreating this for SCP would probably be a little excessive, anyway. :slightly_smiling_face: Pragmatically, I’d probably just kick in an extension method for the concrete case, e.g.

extension [A](p: Parser[List[A]])
  def set: Parser[Set[A]] = p.map(_.toSet)

def all: Parser[Set[Set[Int]]] = repsep(par, nl2).set <~ rep(nl)
2 Likes