Hi. I am new to Scala too, and I have never used Spark, so I cannot help you on that front. But your problem looks familiar to me. Here is my approach:
1: Our file consists of lines, right?
2: A more semantic view on that file would be: it consists of different “sections”.
3: 1. and 2. => there are two types of lines: those indicating the beginning of a new section (which, in your case, also belong to that section) and those simply belonging to the current section without indicating anything from a high-level point of view (see the small illustration right after this list).
4: A section needs to end in some way. In your case: whenever a new section starts, the old (previous) one ends.
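To make 3. concrete, here is a completely made-up example of what such a file could look like, together with a hypothetical predicate for the “section begins here” lines (your real marker will surely look different):

```scala
// Made-up input: lines starting with "###" play the role of section headers,
// every other line simply belongs to the section that is currently open.
val sample =
  """### section A
    |some data line
    |another data line
    |### section B
    |yet more data""".stripMargin

// Hypothetical predicate: replace with whatever marks a section in your file.
def isSectionStart(line: String): Boolean = line.startsWith("###")
```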
Given that, I would implement something like:
    currentSection = empty string buffer
    while not EOF:
        currentLine = getNextLine()
        if (currentLine indicates beginning of a new section):
            handleCurrentSection()
            startNewSection()
        else: // currentSection is still "incomplete"
            currentSection.append(currentLine)
    // when the file ends, the last section ends; we still need to handle it
    handleCurrentSection()
where `handleCurrentSection()` basically parses the content of `currentSection`, and `startNewSection()` clears `currentSection` (and, in your case, adds `currentLine` to the cleared buffer). As your sections seem to be “homogeneous”, `handleCurrentSection()` always does the same thing: generate a new row in your dataframe using the data currently held in `currentSection`.
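Here is a minimal Scala sketch of that loop, without any Spark. The `isSectionStart` test and the `"###"` marker are pure placeholders (I don't know what your real section headers look like), so swap them for whatever applies to your file:

```scala
import scala.collection.mutable.ListBuffer
import scala.io.Source

// Placeholder predicate: replace with whatever really marks the start of a section.
def isSectionStart(line: String): Boolean = line.startsWith("###")

def splitIntoSections(path: String): List[List[String]] = {
  val sections       = ListBuffer.empty[List[String]]
  val currentSection = ListBuffer.empty[String]

  def handleCurrentSection(): Unit =
    if (currentSection.nonEmpty) sections += currentSection.toList

  def startNewSection(firstLine: String): Unit = {
    currentSection.clear()
    currentSection += firstLine // the header line belongs to the new section
  }

  val source = Source.fromFile(path)
  try {
    for (line <- source.getLines()) {
      if (isSectionStart(line)) {
        handleCurrentSection() // the previous section ends here
        startNewSection(line)
      } else {
        currentSection += line // the current section is still "incomplete"
      }
    }
    handleCurrentSection() // when the file ends, the last section ends
  } finally source.close()

  sections.toList
}
```

Each inner list then holds the lines of one section; turning each of them into a row of your dataframe is a separate step afterwards (which I cannot help you with, not knowing Spark).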
That's it for my approach.
Here is another thought on that: your solution could easily be generalized to handle different types of sections differently. In that case, you end up with the very basic algorithm of interpreters: call your `currentLine` a token and your section an expression, and iterate over all tokens your file consists of; whenever a token completes an expression, evaluate that expression.
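If you ever go down that road, the idiomatic Scala way is a sealed trait plus pattern matching, so that each kind of expression gets its own evaluation rule. The section kinds below are invented purely for illustration:

```scala
// Invented section kinds, just to sketch the generalization: every kind of
// "expression" (i.e. section) is evaluated according to its own rule.
sealed trait Expression
final case class HeaderSection(lines: List[String]) extends Expression
final case class DataSection(lines: List[String])   extends Expression

def evaluate(expr: Expression): Unit = expr match {
  case HeaderSection(lines) => println(s"header section with ${lines.size} lines")
  case DataSection(lines)   => println(s"data section with ${lines.size} lines")
}
```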