Deserialize a multiline, unordered, key-value string containing more than the required information


#1

Hey there,

here comes another beginner's question: I am given a multiline String containing one key-value pair per line. Now I'd like to parse the String and generate a case class instance from it. Assume the following case class:

case class Something(key1: String, key2: String, key3: Boolean)

and a String like:

key2: value2
key1: value1
key4: value4

key3:        no

As can be seen, the String contains some hurdles:

  1. Key-value pairs might not occur in a defined order.
  2. Whitespace and newlines might occur more often than required.
  3. There might be keys inside that can be neglected when the object is constructed.
  4. There may be a need for value conversions, such as key3: yes -> true, false otherwise.

My question is: what is the most beautiful, Scala-like approach to parsing this? I considered pattern matching, but using it in a loop I would need to initially construct an "empty object" whose setters are called during parsing. That is not very Scala-like.

Any ideas?

Simon


#2

Hint: I would probably do this as two separate steps – parse, then build – with an intermediate Map in the middle…


#3

Which would require the map to be mutable?


#4

It doesn’t require the Map to be mutable. You can build a Seq[(String, String)] and call toMap on it. That whole process can be done in a way that is functional and immutable.

What I’m wondering about is if other people would use regular expressions for this. Given the variability of the input, that is my first inclination. Then do a for-yield with a pattern on the regular expression.
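
A minimal sketch of that two-step approach (the regex here is one plausible shape for this input, not the only one):

```scala
import scala.util.matching.Regex

val serialized =
  """
  key2: value2
  key1: value1
  key4: value4

  key3:        no
  """

val pattern: Regex = """\s*([0-9a-zA-Z]+):\s*([0-9a-zA-Z]+)\s*""".r

// Step 1: parse into an immutable Seq of pairs via a for-yield over the matches.
val pairs: Seq[(String, String)] =
  (for (m <- pattern.findAllMatchIn(serialized)) yield (m.group(1), m.group(2))).toSeq

// Step 2: build the Map -- no mutation anywhere.
val parsed: Map[String, String] = pairs.toMap
```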


#5

Not at all – you’re parsing in a line-by-line loop, with each line creating a new immutable Map based on the previous Map and the new information. It’s very rarely necessary to use a mutable Map.
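
For instance, a foldLeft over the lines does exactly that, each step yielding a fresh Map (a sketch, reusing the regex shape from the original post):

```scala
val pattern = """\s*([0-9a-zA-Z]+):\s*([0-9a-zA-Z]+)\s*""".r

val serialized = "key2: value2\nkey1: value1\nkey3:        no"

// Each step builds a new immutable Map from the previous one; nothing is mutated.
val parsed: Map[String, String] =
  serialized.split("\n").foldLeft(Map.empty[String, String]) {
    case (acc, pattern(key, value)) => acc + (key -> value)
    case (acc, _)                   => acc // lines that don't match are skipped
  }
```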


#6

Personally, I tend to just go directly to FastParse. But you’re probably right that that’s overkill in this case, and that doing a line-by-line regex is good enough…


#7

Thanks for the idea. I finally came up with two similar solutions:

  import scala.util.matching.Regex

  def main(args: Array[String]): Unit =
  {
    val pattern: Regex = "\\s*([0-9a-zA-Z]+):\\s*([0-9a-zA-Z]+)\\s*".r
    val serialized =
      """
      key2: value2
      key1: value1
      key4: value4
      key3:        no
      """

    // Attempt 1: iterate over all regex matches in the whole string
    val result = for (pair <- pattern.findAllMatchIn(serialized)) yield (pair.group(1), pair.group(2))

    for ((key, value) <- result.toMap)
      println(s".$key. -> .$value.")

    println()
    println()

    // Attempt 2: split into lines and pattern-match each one
    val result2 = for (line <- serialized.split("\n")) yield line match {
      case pattern(key, value) => (key, value)
      case _ => ("X", "X")
    }

    for ((key, value) <- result2.toMap)
      println(s".$key. -> .$value.")
  }

My point is: I would prefer attempt 2 over attempt 1 because it looks a little more beautiful to me. However:

  1. How could I just skip the case when the pattern doesn’t match (case _)?
  2. Is there some way to avoid calling the split function?
  3. Does split slow down my code?
  4. Can I somehow call toMap somewhere inside my for comprehension? I would like my result to be a Map already.

And my most general question: is this beautiful code from an experienced Scala programmer’s perspective?

Thanks, Simon


#8

You can’t do that with the for-comprehension, but you can use collect on the iterator instead. It takes a partial function (e.g. a pattern match that does not cover all cases) and returns only the values that matched.

For this special case, Scala has a lines method on Strings, which splits a string at \n. It returns an Iterator instead of an Array, which may be faster (it doesn’t store the intermediate result), but the split should be negligible performance-wise anyway.

With the for-comprehension, you would have to wrap the whole comprehension in parentheses to add a .toMap afterwards. But using collect, the whole thing becomes a one-liner:

serialized.lines.collect{ case pattern(key, value) => (key, value) }.toMap
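
And for completeness, the second step from the two-step hint – building the case class from the map – could look something like this. A sketch only: it assumes the three required keys are present and treats only "yes" as true:

```scala
case class Something(key1: String, key2: String, key3: Boolean)

// Hypothetical build step: extra keys (key4) are simply ignored;
// key3 is converted with "yes" -> true, anything else -> false.
def build(kv: Map[String, String]): Something =
  Something(
    key1 = kv("key1"),
    key2 = kv("key2"),
    key3 = kv("key3").equalsIgnoreCase("yes")
  )

val obj = build(Map("key1" -> "value1", "key2" -> "value2",
                    "key3" -> "no", "key4" -> "value4"))
```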

#9

Thank you! This is beautiful :wink:


#10

You can do this with a for-comprehension. One of the features of for-comprehensions (that has recently been debated some on these boards) is that it simply skips anything that isn’t a match. So you can do the following.

val resultMap = (for(pattern(key, value) <- serialized.split("\n")) yield (key, value)).toMap

Any line that doesn’t match the pattern is excluded. Whether you prefer this syntax to the collect is up to you. Using @crater2150’s lines trick works here too. (I hadn’t seen that method in the API for String. I just used it for Source in the past.)

val resultMap = (for(pattern(key, value) <- serialized.lines) yield (key, value)).toMap

#11

Oh, right, I forgot about pattern matching on the left-hand side of for comprehensions.

The lines method is not actually on the String class, but on the StringOps wrapper, which has an implicit conversion in Predef.


#12

(Aside: this is now JDK version dependent — String acquired a built-in lines method on JDK 11. We undeprecated .linesIterator in Scala 2.12.8 to avoid conflicting with the new method.)
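
So on Scala 2.12.8+ and 2.13, the portable spelling of the one-liner would presumably be linesIterator:

```scala
val pattern = """\s*([0-9a-zA-Z]+):\s*([0-9a-zA-Z]+)\s*""".r
val serialized = "key2: value2\nkey1: value1\nkey3:        no"

// linesIterator comes from StringOps, so it doesn't clash with JDK 11's String#lines.
val parsed = serialized.linesIterator.collect { case pattern(k, v) => (k, v) }.toMap
```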


#13

Good to know. I have always used split since I didn’t know about lines. I’ll stick with that approach.