Hi. I am new to Scala too, and I have never used Spark, so I cannot help you on that front. But your problem looks familiar to me. Here is my approach:
1: Our file consists of lines, right?
2: A more semantic view on that file would be: it consists of different “sections”.
3: 1. and 2. => there are two types of lines: those indicating the beginning of a new section (which, in your case, also belong to that section) and those simply belonging to the current section without indicating anything from a high-level point of view (see the small illustration right after this list).
4: A section needs to end in some way. In your case: whenever a new section starts, the old (previous) one ends.
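To make 3. concrete, here is a completely made-up example of what such a file could look like, together with a hypothetical predicate for the “section begins here” lines (your real marker will surely look different):

```scala
// Made-up input: lines starting with "###" play the role of section headers,
// every other line simply belongs to the section that is currently open.
val sample =
  """### section A
    |some data line
    |another data line
    |### section B
    |yet more data""".stripMargin

// Hypothetical predicate: replace with whatever marks a section in your file.
def isSectionStart(line: String): Boolean = line.startsWith("###")
```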
Given that, I would implement something like:
    currentSection = empty string buffer
    while not EOF:
        currentLine = getNextLine()
        if (currentLine indicates beginning of a new section):
            handleCurrentSection()
            startNewSection()
        else: // currentSection is still "incomplete"
            currentSection.append(currentLine)
    // when the file ends, the last section ends; we still need to handle it
    handleCurrentSection()
where `handleCurrentSection()` basically parses the content of `currentSection`, and `startNewSection()` clears `currentSection` (and, in your case, adds `currentLine` to the cleared buffer). As your sections seem to be “homogeneous”, `handleCurrentSection()` always does the same thing: generate a new row in your dataframe using the data currently held in `currentSection`.
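Here is a minimal Scala sketch of that loop, without any Spark. The `isSectionStart` test and the `"###"` marker are pure placeholders (I don't know what your real section headers look like), so swap them for whatever applies to your file:

```scala
import scala.collection.mutable.ListBuffer
import scala.io.Source

// Placeholder predicate: replace with whatever really marks the start of a section.
def isSectionStart(line: String): Boolean = line.startsWith("###")

def splitIntoSections(path: String): List[List[String]] = {
  val sections       = ListBuffer.empty[List[String]]
  val currentSection = ListBuffer.empty[String]

  def handleCurrentSection(): Unit =
    if (currentSection.nonEmpty) sections += currentSection.toList

  def startNewSection(firstLine: String): Unit = {
    currentSection.clear()
    currentSection += firstLine // the header line belongs to the new section
  }

  val source = Source.fromFile(path)
  try {
    for (line <- source.getLines()) {
      if (isSectionStart(line)) {
        handleCurrentSection() // the previous section ends here
        startNewSection(line)
      } else {
        currentSection += line // the current section is still "incomplete"
      }
    }
    handleCurrentSection() // when the file ends, the last section ends
  } finally source.close()

  sections.toList
}
```

Each inner list then holds the lines of one section; turning each of them into a row of your dataframe is a separate step afterwards (which I cannot help you with, not knowing Spark).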
That's it for my approach.
Here is another thought on that: your solution could easily be generalized to handle different types of sections differently. In that case, you end up with the very basic algorithm of interpreters: call your `currentLine` a token and your section an expression, and iterate over all tokens your file consists of; whenever a token completes an expression, evaluate that expression.
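If you ever go down that road, the idiomatic Scala way is a sealed trait plus pattern matching, so that each kind of expression gets its own evaluation rule. The section kinds below are invented purely for illustration:

```scala
// Invented section kinds, just to sketch the generalization: every kind of
// "expression" (i.e. section) is evaluated according to its own rule.
sealed trait Expression
final case class HeaderSection(lines: List[String]) extends Expression
final case class DataSection(lines: List[String])   extends Expression

def evaluate(expr: Expression): Unit = expr match {
  case HeaderSection(lines) => println(s"header section with ${lines.size} lines")
  case DataSection(lines)   => println(s"data section with ${lines.size} lines")
}
```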