With-input-from-string

#1

I’m writing a file parser for a particular file format called DIMACS. And I’d like to ask about how to supply data for the test suite. I’m planning two genre of test cases, which I’ll call small and large.

I’d like to write several small test cases, each containing a literal string which represents a small number (5 to 10) of lines file content.

In Common Lisp we have the functions with-input-from-string and its cousin make-string-input-stream, whose argument is a string (often a literal string), which gets opened and you are returned a stream, which acts like a file pointer. The caller may access the characters in this string with the same API as accessing the characters of a file opened for reading. This use particularly useful in test cases, as test cases can have many small examples of file content, directly in the test case code itself, and you can call the file processing functions directly on the open stream.

(define-test "test-case-1"
  (with-input-from-string (stream "c this string contains DIMACS content
c  simple_v3_c2.cnf
c
p cnf 3 2
1 -3 0 2 3 -1 0")
    (assert-true (equal '((1 -3) (2 3 -1))
                        (read-dimacs-file stream)))))

QUESTION 1: How does such a concept translate to Scala? I suppose it could be some concept inherited from the Java library. Or perhaps I need to make the parser read from some object which is an iterator of characters?

It would be great if the object supported peek-ahead-one-char function, so I could examine the next character without consuming it. This would make the parser easier to write, as I could probably just translate the existing code I already have in another programming language, without missing the weird corner cases.

QUESTION 2: What is the correct way to open a file by relative-path relative to the test case directory, or relative to the project directory?

For the large test cases, I’d like to put certain benchmark files, which may be large into the project directory structure and have the test case locate it and open it. I certainly do not want the test suite to make assumptions about the full path to the benchmark files.

#2

Have you considered using a parser combinator library, like scala-parser-combinators or FastParse?

#3

Well, I did consider whether to use some grammar based approach, however for that I’d need an explicit grammar right? No argument from me that such parser approaches are very powerful and flexible.

What I have instead of a grammar, is a human readable description, some examples, and a working implementation in another language.

#4

Have to echo the parser-combinator recommendation – while you don’t have to do things that way, it’s pretty much the norm, and I suspect you’re going to have to go to significantly more effort otherwise.

(And seriously: you can’t write a parser without a grammar. Any parser requires a grammar – the only question is whether you figure it out in advance or puzzle it out by trial and error. Even if you’re figuring it out as you go, it’s easier with parser combinators than writing it by hand.)

And yes, FastParse is typically the tool of choice these days: there are many other options, but it’s the fastest and one of the most heavily-used. While I do use scala-parser-combinators in production, I wouldn’t recommend them today: they’re much slower and fell out of favor quite some time ago.

#5

You could accept java.util.Readerjava.io.Reader, scala.io.Source or similar as input and provide instances via StringReader or Source#fromString() in the test cases.

parser-combinators accepts Reader, FastParser uses Iterator[String] for “streaming” input. Both accept String, as well.

Since the tests can be expected to be run from the (sbt or other build tool) project structure, I don’t think it’s a problem to use paths relative to the project root directory.

Alternatively you could put the files in src/test/resources (assuming mvn/sbt layout) and access them as classpath resources.

#6

Thanks for the suggestion. I’ll put FastParse on my list of things to learn. But whether its more work, taking the learning curve into account, is something I’m interesting in knowing in the end.

BTW, does FastParse handle the original question of how to treat a literal string as the content of a file?

Here is the implementation I have. Just a straightforward parser.

#7

The second part of the question was: If I have benchmark files, how can my test suites find them? Is there an API to the project structure which is available from inside scalatest? I’m happy to put them somewhere in the project directory structure which makes sense, either in the test directory or elsewhere.

#8

For fastparse, if you want to be able to parse some data without loading it all in memory, I think you need a streaming parser: http://www.lihaoyi.com/fastparse/#StreamingParsing
It parses an Iterator[String], so testing with a hard-coded small dataset is as simple as writing Iterator("my first chunk", "my second chunk").

#9

On the JVM, there is a way to access resources which are available on the classpath (alongside the compiled classes): https://docs.oracle.com/javase/8/docs/technotes/guides/lang/resources.html

In a standard SBT project, such files are put into into src/main/resources (for general use) or src/test/resources (for use only in tests).

#11

Should be a Iterator[Char] right, if it mimics a file?

#12

great! works like a charm.

#13

No, fastparse explicitly does not handle char iterators:

Note that fastparse does not parse Iterator[Char] or Iterator[Byte] s for performance reasons: most input sources make data available in chunks, such as network packets or lines from file on the disk. By parsing chunks, FastParse better matches any underlying data source, and itself has better performance parsing larger chunks.

Note that the iterator represents chunks of arbitrary size, not necessarily lines like you get from Files.readAllLines and such.

#14

It’s been quite some time I’ve worked with parser combinators, and I’ve never used FastParse. This discussion made me curious…

import fastparse._
import scala.io._

object FastParseCNFParser {

  import NoWhitespace._

  def parse(source: Source): Either[String, CNF] =
    fastparse.parse(source.mkString, cnf(_)) match {
      case Parsed.Success(res, _) => Right(res)
      case Parsed.Failure(_, _, extra) => Left(extra.trace(true).longAggregateMsg)
    }

  def nl[_ : P]: P[Unit] = P("\r\n" | "\n")
  def ws[_ : P]: P[Unit] = P(nl | CharIn(" \t"))
  def posNumber[_ : P]: P[Int] = P(CharIn("0-9").rep(1)).!.map(_.toInt)
  def number[_ : P]: P[Int] = P("-".!.? ~ posNumber).map { case (s, v) => s.fold(v)(_ => -v) }

  def comment[_ : P]: P[Unit] = P("c" ~ CharPred(!Seq('\r', '\n').contains(_)).rep(0))
  def problem[_ : P]: P[(Int, Int)] = P("p cnf " ~ posNumber ~ " " ~ posNumber)
  def clause[_ : P]: P[Clause] = P(number.filter(_ != 0).rep(1, ws) ~ ws ~ "0")

  def cnf[_ : P]: P[CNF] = {
    val p =
      (comment.rep(1, nl) ~ nl).? ~
      problem ~ nl ~
      clause.rep(1, ws) ~
      nl.rep(0) ~
      End
    p.map { case (_, _, cs) => cs }
  }

}

Probably this can be improved in terms of conciseness/elegance, and maybe there’s still bugs in there - I’ve only tested against one simple example from the “spec”.

I’d definitely prefer to maintain an implementation like this in the long run, compared to a handcrafted parser. Yes, the API adds some mental weight, but the structure of the problem (and the solution) is much easier to grasp, IMHO, and I’d guess that extending/modifying the parser should be more straightforward and less risky than for a RYO variant. YMMV, of course.

#15

Thanks for the implementation. I must say though that it looks daunting.
BTW, after sleeping on the problem for one night, I refactored it the next day to the following, purely functional approach.

val digits = Set('0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '-')
val endOfLine = Set('\r', '\n')
val whiteSpace = Set(' ', '\t') ++ endOfLine

def parseListContent(chars: List[Char], clause: Clause, clauses: CNF): CNF = {
  chars match {
    case 'p' :: _ |
         'c' :: _ =>
      parseListContent(chars.dropWhile(!endOfLine.contains(_)), clause, clauses)
    case c :: _ if digits.contains(c) =>
      val (num, tail) = chars.span { c => digits.contains(c) }
      val number = num.mkString.toInt
      if (number == 0)
        parseListContent(tail, Nil, clause.reverse :: clauses)
      else
        parseListContent(tail, number :: clause, clauses)
    case c :: _ if whiteSpace.contains(c) =>
      parseListContent(chars.dropWhile(whiteSpace.contains(_)), clause, clauses)
    case Nil if clause.nonEmpty =>
      parseListContent(Nil, Nil, clause.reverse :: clauses)
    case Nil =>
      clauses.reverse
  }
}
#16

I agree this kind of API needs getting used to. Still I see benefits in the long run.

  • The structure of the file is easily visible in the cnf declaration: Optional comment lines, a problem line, a sequence of clauses, optional trailing line breaks. From there one can easily drill down.
  • Declarations like nl, ws, number can easily be factored out to a “toolbox” for reuse in other parsers.
  • Assume for example that the requirement comes up to parse a more complete representation of the file instead:
    case class FullCNF(comments: Seq[String], numVars: Int, clauses: Seq[Clause])
    The FastParse implementation can be adapted with a rather straightforward sequence of few localized changes. With the handcrafted parser it’ll be more like shotgun surgery.
  • It’s easier to come up with more helpful error messages in the case of parser failures. (I haven’t paid attention to that in the sample code, but FastParse provides some support there.)
#17

I think the key thing to bear in mind is that the parser combinators approach is generally more scalable than hand-written. For something very simple, the hand-written approach can be pretty clear and easy, as shown by @jimka. But it can easily turn into spaghetti gobbledygook as things get more complex.

So this:

is actually pretty key. The ability to decompose the bits and pieces, test them separately and combine them using predictable combinators, tends to make things more maintainable as the problem grows.

Yes, there’s a bit of syntax involved, but once you get used to that little DSL, it becomes pretty readable, and it’s far more concise than the hand-written version, with concepts like alternatives built right in. (As so often, learning to pronounce the symbols helps – for example, I mentally think of | as “or”, and ~ as “then”.)

And the factoring, with each element broken out into its own named function, makes it much, much easier to find what you’re looking for when you’re working in a non-trivial grammar…

#18

I have what might be a motivating reason to abandon the simple tail recursive approach and embrace the parser approach.

When I did some timing analysis, I discovered that for a very large benchmark example, my simple parser takes about 16 seconds to run, with most of the time being spent in converting the content of the file to a List[Char]. Thereafter the parseListContent function works very fast. I then re-wrote the function to operate on a BufferedIterator[Char], and the time drops to 2.5 seconds. Unfortunately can’t use the came code to parse a List[Char] and BufferedIterator[Char] because BufferedIterator neither supports pattern matching, nor dropWhile and friends.

So here is what the buffered version of the parser looks like now. Considerably uglier than before. Worse is that I have two versions of the function, one for BufferedIterator[Char] and one for List[Char], the latter being used in several test cases which call parseListContent(string.toList). I could of course eliminate the latter and test using parseBufferedContent(string.toIterator.buffered), but perhaps biting the bullet and massaging the code from @sangamon, to fit is quickly become a more elegant alternative.

def parseBufferedContent(it:BufferedIterator[Char], clause: Clause, clauses: CNF): CNF = {
    def dropWhile(p: Char => Boolean): BufferedIterator[Char] = {
      while (it.hasNext && p(it.head)) {
        it.next()
      }
      it
    }

    def takeWhile(p: Char => Boolean): List[Char] = {
      def loop(buf: List[Char]):List[Char] = {
        if (it.hasNext && p(it.head))
          loop(it.next() :: buf)
        else
          buf.reverse
      }
      loop(Nil)
    }

    if (it.hasNext)
      it.head match {
        case 'p' |
             'c' =>
          parseBufferedContent(dropWhile(!endOfLine.contains(_)), clause, clauses)
        case c if digits.contains(c) =>
          val num = takeWhile(digits.contains)
          val number = num.mkString.toInt
          if (number == 0)
            parseBufferedContent(it, Nil, clause :: clauses)
          else
            parseBufferedContent(it, number :: clause, clauses)
        case c if whiteSpace.contains(c) =>
          parseBufferedContent(dropWhile(whiteSpace.contains), clause, clauses)
      }
    else if (clause.nonEmpty)
      parseBufferedContent(it, Nil, clause :: clauses)
    else
      clauses
  }
#19

I’m not convinced that parser-combinators are scalable. I tried FastParse twice and quickly ran into essentially the same problem. I posted on Gitter, but no one seemed to be able to help.

A minimal example is here, that I also posted on Gitter:

import fastparse., NoWhitespace.
object FastParserTest extends App {
def term[:P]: P[Unit] = P(“a”)
def sum[
:P]: P[Unit] = P(expression ~ “+” ~ term)
def expression[:P]: P[Unit] = P(term|sum)
def line[
:P]: P[Unit] = P(Start ~ expression ~ End)
val result: Parsed[Unit] = parse(“a+a”, line(_))
println(result)
}

This one prints:

Parsed.Failure(Position 1:2, found “+a”)

I can’t figure out why it won’t simply backtrack and find the right option.

If I replace P(term|sum) with P(sum|term), it recurses until the stack overflows.

#20

As I said before, I’m not a parser pro, it’s been some time since I’ve used parsers at all, and I’ve never used FastParse before, but I’ll give it a try…

The problem seems to be

  • left recursion in between expression and sum
  • having term first in the “Either” in expression

To make sure that the parser does not only succeed but actually detect the expected structure, I’ve added some result ADT. (It works in the Unit variant, too, though.)

sealed trait Exp
case class Term(v: String) extends Exp
case class Sum(a: Exp, b: Exp) extends Exp

This works, then:

def term[_:P]: P[Exp] = P("a").!.map(Term)
def sum[_:P]: P[Exp] = P(term ~ "+" ~ expression).map((Sum.apply _).tupled)
def expression[_:P]: P[Exp] = P(sum|term)
def line[_:P]: P[Exp] = P(Start ~ expression ~ End)
val result: Parsed[Exp] = parse("a+a+a", line(_))
// Parsed.Success(Sum(Term(a),Sum(Term(a),Sum(Term(a),Term(a)))), 7)

Note that “+” is right-associative now, of course. Making it left-associative will need some more restructuring - I’d have to dive deeper into FastParse and parsers in general to figure that out.

I also wouldn’t jump to conclusions about parser combinators in general from a single FastParse example. Your grammar can be translated directly to parser-combinators, yielding the left-recursive variant. The secret sauce here is the usage of PackratParsers and the ||| operator instead of |.

object SimpleParser extends RegexParsers with PackratParsers {
  lazy val term: PackratParser[Exp] = "a" ^^ { Term }
  lazy val sum: PackratParser[Exp] = expression ~ "+" ~ term ^^ { case a ~ _ ~  b => Sum(a, b) }
  lazy val expression: PackratParser[Exp] = term ||| sum
  lazy val line: PackratParser[Exp] = expression
}

import SimpleParser._

println(parseAll(line, "a+a+a+a"))
// [1.8] parsed: Sum(Sum(Sum(Term(a),Term(a)),Term(a)),Term(a))
#21

Ah, thanks for the response, this was very helpful!

Well, I do want my + to be left-associative. :slight_smile:

Basically, one should know that only some parser-combinators libraries can do left recursion, and FastParse is unfortunately not one of them.

And while I keep hearing a lot of bad things about Scala Parser Combinators, it does include a Packrat Parser that can do left recursion.

So the question is, which popular parser libraries can do left recursion?