Parsing text within two tags

Guido · June 18, 2020, 11:17am

Hi guys. I’ve decided to give parser combinators a try.
So given the following text:

<b>Il profilo alare rappresentato appartiene alla categoria:</b><br>
<img src="q75_small.png"><br>
<ol type="A">
<li>dei piano/convessi</li>
<li>dei concavi/convessi</li>
<li>dei biconvessi asimmetrici</li>
<li>dei biconvessi, simmetrici</li>
</ol>B

I’m trying to parse it according to the following schema:

import scala.util.parsing.combinator._

trait QuestionParser extends JavaTokenParsers {
  def card: Parser[Card] = {
    question ~ rep(choice) ~ answer ^^ {
      case q ~ cs ~ a => Card(q, cs, a)
    }
  }
  def question: Parser[Question] =
    "<b>" ~ text ~ "</b>" ~ opt(image) ^^ {
      case tag1 ~ txt ~ tag2 ~ img => Question(txt, img)
    }

  def choice: Parser[Choice] =
    "<li>" ~ text ~ "</li>" ^^ {
      case tag1 ~ txt ~ tag2 => Choice(txt)
    }

  def answer: Parser[Answer] = "</ol>" ~ answerChar ^^ {
    case tag ~ aC => Answer(aC)
  }
  def answerChar: Parser[String] = raw"[A-E]".r

  def image: Parser[String] = "<img src=" ~> stringLiteral <~ ">"
  def text: Parser[String] = """(.?)""".r
}

and these are the relevant classes:

case class Card(question: Question, choices: List[Choice], ans: Answer)

case class Question(text: String, image: Option[String])

case class Choice(text: String)

case class Answer(char: String) {
   override def toString = char
}

But if I run the code as follows:

object Main extends App with QuestionParser{
   
   val testCard: String = Connector.cards(74).card

   val p = parseAll(image, testCard) match {
      case Success(matched,_) => println(matched)
      case Failure(msg,_) => println(s"FAILURE: $msg")
      case Error(msg,_) => println(s"ERROR: $msg")
    }

   //println(p.ans)
}

I get the following message:

FAILURE: '<img src=' expected but '<' found

that is, the parser doesn’t recognize the tag.
Any idea?
Thank you, have a lovely day.
G.

sangamon · June 18, 2020, 12:04pm

There’s a <br> between the <b/> and <img/> tags…?

(More issues ahead: There’s another <br> after the <img/> tag, and the opening <ol> is not accounted for, either.)
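
For readers landing here, the points above can be put together in a rough sketch. It reuses the case classes from the question and makes a few assumptions that are not in the original post: `text` is taken to mean "everything up to the next tag" (the posted regex `(.?)` matches at most one character), the `<br>` and `<ol type="A">` tags are consumed explicitly, and `CardParser`, `parseCard` and `Demo` are hypothetical names. Note also that the posted snippet calls `parseAll(image, testCard)`; if `testCard` holds the whole card, the `image` parser is applied at the very start of the input, which begins with `<b>`, so the sketch starts from `card` instead.

```scala
import scala.util.parsing.combinator._

case class Question(text: String, image: Option[String])
case class Choice(text: String)
case class Answer(char: String)
case class Card(question: Question, choices: List[Choice], ans: Answer)

// Hypothetical object name; the grammar mirrors the one in the question.
object CardParser extends JavaTokenParsers {
  // Assumption: "text" is any non-empty run of characters before the next tag.
  def text: Parser[String] = """[^<]+""".r ^^ (_.trim)

  def br: Parser[String] = "<br>"

  // stringLiteral returns the value including its quotes, so strip them.
  def image: Parser[String] =
    "<img src=" ~> stringLiteral <~ ">" ^^ (_.stripPrefix("\"").stripSuffix("\""))

  // The <br> after </b> and the <br> after the optional <img> are consumed here.
  def question: Parser[Question] =
    ("<b>" ~> text <~ "</b>" <~ br) ~ opt(image <~ br) ^^ {
      case txt ~ img => Question(txt, img)
    }

  def choice: Parser[Choice] =
    "<li>" ~> text <~ "</li>" ^^ (t => Choice(t))

  def answer: Parser[Answer] = "[A-E]".r ^^ (c => Answer(c))

  // The opening <ol type="A"> and closing </ol> wrap the list of choices.
  def card: Parser[Card] =
    question ~ ("""<ol type="A">""" ~> rep1(choice) <~ "</ol>") ~ answer ^^ {
      case q ~ cs ~ a => Card(q, cs, a)
    }

  // Hypothetical helper: Right(card) on success, Left(message) otherwise.
  def parseCard(input: String): Either[String, Card] =
    parseAll(card, input) match {
      case Success(c, _) => Right(c)
      case ns: NoSuccess => Left(ns.msg)
    }
}

object Demo {
  // The sample text from the question.
  val sample: String =
    """<b>Il profilo alare rappresentato appartiene alla categoria:</b><br>
      |<img src="q75_small.png"><br>
      |<ol type="A">
      |<li>dei piano/convessi</li>
      |<li>dei concavi/convessi</li>
      |<li>dei biconvessi asimmetrici</li>
      |<li>dei biconvessi, simmetrici</li>
      |</ol>B""".stripMargin

  def main(args: Array[String]): Unit =
    println(CardParser.parseCard(sample))
}
```

The sketch relies on `RegexParsers` skipping whitespace before every literal and regex, which is why the newlines between tags need no rule of their own. It is tailored to this exact layout; other attributes or tag orderings would need further rules.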

jducoeur · June 18, 2020, 12:15pm

It’s also worth noting that the traditional Scala parser combinators library is old, slow, and somewhat casually maintained. You might want to look at something more current, such as FastParse.

Guido · June 18, 2020, 1:16pm

I know, but mine is just a one-time-use project for the conversion from one file format to the other, and scala parser-combinators are, in my opinion, better documented, which means I don’t have to invest too much time (at least, this was my initial idea).
Still, thank you very much for your reply: I don’t feel lonely.

curoli · June 18, 2020, 1:25pm

Have you considered using an existing HTML parser?

Guido · June 18, 2020, 1:49pm

Yes, but at this point it is a personal challenge: why can’t I write a parser for text within tags?
Plus, at the end of the html-like structure there’s a character ([A-E]) representing the correct answer and I don’t think I would be able to retrieve it with tools such as Jsoup.

curoli · June 18, 2020, 2:09pm

Sure, whatever you consider best. Since this is a public place where people might come to learn, I wanted to add a reference to that solution, too.

I don’t know jsoup, but usually HTML cleaners don’t discard text, so the last character should still be there, although it may not be obvious where in the DOM it can be found.
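
As an illustration of where the trailing character may end up, here is a minimal sketch that assumes jsoup is on the classpath and Scala 2.13 or later (for `scala.jdk.CollectionConverters`). `JsoupCard` and `Parsed` are hypothetical names. The idea is that a fragment parsed with `Jsoup.parseBodyFragment` hangs everything under `<body>`, so text that sits outside any tag should be reachable as the body's own text.

```scala
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

// Hypothetical names; a sketch, not a tested recipe for every input.
object JsoupCard {
  final case class Parsed(
      question: String,
      image: Option[String],
      choices: List[String],
      answer: String
  )

  def parse(html: String): Parsed = {
    val body = Jsoup.parseBodyFragment(html).body()
    Parsed(
      question = body.select("b").text(),
      // attr returns an empty string when no <img> is present.
      image = Option(body.select("img").attr("src")).filter(_.nonEmpty),
      choices = body.select("li").eachText().asScala.toList,
      // ownText: text that belongs to <body> itself, not to its child elements.
      answer = body.ownText().trim
    )
  }
}
```

If the cards ever contain other loose text outside the tags, `ownText` would pick that up as well, so the answer field would need extra filtering.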

On the other hand, if you know that your data is something HTML-like plus an extra character, you can just chop off the last character and parse the remainder as HTML.
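
The chop-off idea can be sketched with the standard library alone. This assumes, as in the sample above, that the answer is a single letter A to E at the very end of the text; `SplitAnswer` and `split` are hypothetical names.

```scala
// Hypothetical helper: separates the HTML-like part from the trailing answer letter.
object SplitAnswer {
  def split(raw: String): Option[(String, Char)] = {
    val trimmed = raw.trim
    // Only accept the last character if it is a valid answer letter.
    trimmed.lastOption
      .filter(c => c >= 'A' && c <= 'E')
      .map(c => (trimmed.dropRight(1), c))
  }
}
```

The first element of the pair can then be handed to whatever HTML parser is preferred, and `None` signals a card that does not end with an answer letter.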