Parsing text within two tags

Hi guys. I’ve decided to give parser combinators a try.
So given the following text:

<b>Il profilo alare rappresentato appartiene alla categoria:</b><br>
<img src="q75_small.png"><br>
<ol type="A">
<li>dei piano/convessi</li>
<li>dei concavi/convessi</li>
<li>dei biconvessi asimmetrici</li>
<li>dei biconvessi, simmetrici</li>
</ol>B

I’m trying to parse it according to the following schema:

import scala.util.parsing.combinator._

trait QuestionParser extends JavaTokenParsers {
  def card: Parser[Card] = {
    question ~ rep(choice) ~ answer ^^ {
      case q ~ cs ~ a => Card(q, cs, a)
    }
  }
  def question: Parser[Question] =
    "<b>" ~ text ~ "</b>" ~ opt(image) ^^ {
      case tag1 ~ txt ~ tag2 ~ img => Question(txt, img)
    }

  def choice: Parser[Choice] =
    "<li>" ~ text ~ "</li>" ^^ {
      case tag1 ~ txt ~ tag2 => Choice(txt)
    }

  def answer: Parser[Answer] = "</ol>" ~ answerChar ^^ {
    case tag ~ aC => Answer(aC)
  }
  def answerChar: Parser[String] = raw"[A-E]".r

  def image: Parser[String] = "<img src=" ~> stringLiteral <~ ">"
  def text: Parser[String] = """(.?)""".r
}

and these are the relevant classes:

case class Card(question: Question, choices: List[Choice], ans: Answer)

case class Question(text: String, image: Option[String])

case class Choice(text: String)

case class Answer(char: String) {
   override def toString = char
}

But if I run the code as follows:

object Main extends App with QuestionParser{
   
   val testCard: String = Connector.cards(74).card

   val p = parseAll(image, testCard) match {
      case Success(matched,_) => println(matched)
      case Failure(msg,_) => println(s"FAILURE: $msg")
      case Error(msg,_) => println(s"ERROR: $msg")
    }

   //println(p.ans)
}

I get the following message:

FAILURE: '<img src=' expected but '<' found

that is, the parser doesn’t recognize the tag.
Any idea?
Thank you, have a lovely day.
G.

There’s a <br> between the <b/> and <img/> tags…?

(More issues ahead: There’s another <br> after the <img/> tag, and the opening <ol> is not accounted for, either.)
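A minimal sketch of how those fixes might look, assuming the `scala-parser-combinators` library is on the classpath. The names `br` and `olOpen` are made up for illustration, and the `[^<]+` pattern for `text` is an assumption (the original `(.?)` matches at most one character). Note also that the whole card has to be run through `card`; applying `image` alone to the full input fails at the leading `<b>`.

```scala
import scala.util.parsing.combinator._

case class Question(text: String, image: Option[String])
case class Choice(text: String)
case class Answer(char: String)
case class Card(question: Question, choices: List[Choice], ans: Answer)

object CardParser extends JavaTokenParsers {
  // Assumption: text never contains '<', so read everything up to the next tag.
  def text: Parser[String] = """[^<]+""".r ^^ (_.trim)

  // Hypothetical helper for the <br> separators the original grammar skipped.
  def br: Parser[String] = "<br>"

  // stringLiteral keeps the surrounding quotes, so strip them here.
  def image: Parser[String] =
    "<img src=" ~> stringLiteral <~ ">" ^^ (s => s.substring(1, s.length - 1))

  // <b>text</b>, then an optional <br>, then an optional image with its own optional <br>.
  def question: Parser[Question] =
    ("<b>" ~> text <~ "</b>") ~ (opt(br) ~> opt(image <~ opt(br))) ^^ {
      case txt ~ img => Question(txt, img)
    }

  def choice: Parser[Choice] = "<li>" ~> text <~ "</li>" ^^ Choice.apply

  // Hypothetical helper: the opening <ol ...> tag with any attributes.
  def olOpen: Parser[String] = """<ol[^>]*>""".r

  def answerChar: Parser[String] = "[A-E]".r
  def answer: Parser[Answer] = "</ol>" ~> answerChar ^^ Answer.apply

  def card: Parser[Card] =
    question ~ (olOpen ~> rep(choice)) ~ answer ^^ {
      case q ~ cs ~ a => Card(q, cs, a)
    }
}

object Demo {
  // The sample card from the question.
  val sample: String =
    """<b>Il profilo alare rappresentato appartiene alla categoria:</b><br>
      |<img src="q75_small.png"><br>
      |<ol type="A">
      |<li>dei piano/convessi</li>
      |<li>dei concavi/convessi</li>
      |<li>dei biconvessi asimmetrici</li>
      |<li>dei biconvessi, simmetrici</li>
      |</ol>B""".stripMargin

  def main(args: Array[String]): Unit =
    CardParser.parseAll(CardParser.card, sample) match {
      case CardParser.Success(c, _) => println(c)
      case failure                  => println(s"FAILURE: $failure")
    }
}
```

The default whitespace skipping of `RegexParsers` takes care of the line breaks between tags, so only the tags themselves need to be spelled out.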

It’s also worth noting that the traditional Scala parser combinators library is old, slow, and somewhat casually maintained. You might want to look at something more current, such as FastParse.

I know, but mine is just a one-time-use project for the conversion from one file format to the other, and scala parser-combinators are, in my opinion, better documented, which means I don’t have to invest too much time (at least, this was my initial idea).
Still, thank you very much for your reply: I don’t feel lonely. :smiley:

Have you considered using an existing HTML parser?

Yes, but at this point it is a personal challenge: why can’t I write a parser for text within tags? :smiley:
Plus, at the end of the html-like structure there’s a character ([A-E]) representing the correct answer and I don’t think I would be able to retrieve it with tools such as Jsoup.

Sure, whatever you consider best. Since this is a public place where people might come to learn, I wanted to add a reference to that solution, too.

I don’t know jsoup, but HTML cleaners usually don’t discard text, so the last character should still be there, although it may not be obvious where in the DOM it ends up.

On the other hand, if you know that your data is something HTML-like plus an extra character, you can just chop off the last character and parse the remainder as HTML.
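A rough sketch of that chopping step, using only the standard library. `SplitCard` and `split` are hypothetical names, and it assumes the answer letter is always the last non-whitespace character of the card:

```scala
object SplitCard {
  // Returns (htmlPart, answerLetter), or None when the card does not end in A-E.
  def split(raw: String): Option[(String, Char)] = {
    val trimmed = raw.trim
    trimmed.lastOption
      .filter(c => c >= 'A' && c <= 'E')
      .map(c => (trimmed.dropRight(1), c))
  }
}
```

The first element of the pair can then be handed to whatever HTML parser you prefer, while the second is the correct answer.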