Parsing a text file and loading it into a DataFrame


#1

I am fairly new to Scala and have run into the following situation.
I have a semi-structured text file that I want to convert to a DataFrame in Spark. I have a schema in mind, shown below, but I am finding it hard to parse the text file and apply the schema.

Here is a sample of my text file:

    "good service"
    Tom Martin (USA) 17th October 2015    
    4    
    Long review..    
    Type Of Traveller	Couple Leisure    
    Cabin Flown	Economy    
    Route	Miami to Chicago    
    Date Flown	September 2015    
    Seat Comfort	12345    
    Cabin Staff Service	12345    
    Ground Service	12345    
    Value For Money	12345    
    Recommended	no

    "not bad"
    M Muller (Canada) 22nd September 2015
    6
    Yet another long review..
    Aircraft	TXT-101
    Type Of Customer	Couple Leisure
    Cabin Flown	FirstClass
    Route	IND to CHI
    Date Flown	September 2015
    Seat Comfort	12345
    Cabin Staff Service	12345
    Food & Beverages	12345
    Inflight Entertainment	12345
    Ground Service	12345
    Value For Money	12345
    Recommended	yes

.
.

The schema and the result I expect are as follows:

    +----------------+------------+--------------+---------------------+---------------+---------------------------+----------+------------------+-------------+--------------+-------------------+----------------+--------------+---------------------+-----------------+------------------------+----------------+---------------------+-----------------+
    | Review_Header  | User_Name  | User_Country |  User_Review_Date   | Overall Score |          Review           | Aircraft | Type of Traveler | Cabin Flown | Route_Source | Route_Destination |   Date Flown   | Seat Comfort | Cabin Staff Service | Food & Beverage | Inflight Entertainment | Ground Service | Wifi & Connectivity | Value for Money |
    +----------------+------------+--------------+---------------------+---------------+---------------------------+----------+------------------+-------------+--------------+-------------------+----------------+--------------+---------------------+-----------------+------------------------+----------------+---------------------+-----------------+
    | "good service" | Tom Martin | USA          | 17th October 2015   |             4 | Long review..             |          | Couple Leisure   | Economy     | Miami        | Chicago           | September 2015 |        12345 |               12345 |                 |                        |          12345 |                     |           12345 |
    | "not bad"      | M Muller   | Canada       | 22nd September 2015 |             6 | Yet another long review.. | TXT-101  | Couple Leisure   | FirstClass  | IND          | CHI               | September 2015 |        12345 |               12345 |           12345 |                  12345 |          12345 |                     |           12345 |
    +----------------+------------+--------------+---------------------+---------------+---------------------------+----------+------------------+-------------+--------------+-------------------+----------------+--------------+---------------------+-----------------+------------------------+----------------+---------------------+-----------------+

As you may notice, in each block of data in the text file, the first four lines map to user-defined columns such as Review_Header, User_Name, User_Country, and User_Review_Date, whereas each of the remaining lines names its own column.


#2
  1. Analyze your input data to determine what types of blocks of data you need to deal with.
  2. Associate each block of data with a class that maps to that particular block.
  3. Process the class for the block.
  4. Repeat.
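
The steps above could be sketched in plain Scala roughly as follows. This is only a sketch under some assumptions: blocks are separated by blank lines, the first four non-empty lines of a block are always header, user line, score, and review text, and the remaining lines are `Key<TAB>Value` pairs. The names `ReviewParser`, `Review`, `parseBlock`, `parseAll`, and `route` are made up for illustration and do not come from any library.

```scala
// Plain Scala, no Spark dependency, so the parsing can be tested on its own.
// All names here (ReviewParser, Review, parseBlock, ...) are hypothetical.
object ReviewParser {

  // One parsed block. The first four lines are positional; the remaining
  // "Key<TAB>Value" lines go into a map because the keys vary per block.
  final case class Review(
      header: String,
      userName: String,
      userCountry: String,
      reviewDate: String,
      overallScore: Int,
      review: String,
      attributes: Map[String, String]
  )

  // Matches a user line such as "Tom Martin (USA) 17th October 2015".
  private val UserLine = """^(.+?)\s*\((.+?)\)\s*(.+)$""".r

  // Parses a single block; returns None if the block does not have the
  // assumed shape (for example, the trailing "." lines in the sample).
  def parseBlock(block: String): Option[Review] = {
    val lines = block.split("\n").map(_.trim).filter(_.nonEmpty).toList
    lines match {
      case header :: UserLine(name, country, date) :: score :: review :: rest =>
        // Assumption: key and value are separated by a single tab.
        val attrs = rest
          .map(_.split("\t", 2))
          .collect { case Array(k, v) => k.trim -> v.trim }
          .toMap
        scala.util.Try(score.toInt).toOption.map { s =>
          Review(header, name, country, date, s, review, attrs)
        }
      case _ => None
    }
  }

  // Splits the whole text on blank lines and keeps the blocks that parse.
  def parseAll(text: String): Seq[Review] =
    text.split("""\n\s*\n""").toSeq.flatMap(b => parseBlock(b).toList)

  // Splits "Miami to Chicago" into (source, destination), if a Route exists.
  def route(r: Review): Option[(String, String)] =
    r.attributes
      .get("Route")
      .map(_.split(" to ", 2))
      .collect { case Array(src, dst) => (src.trim, dst.trim) }
}
```

With a `SparkSession` named `spark` in scope, `import spark.implicits._` followed by `ReviewParser.parseAll(text).toDF()` should give a DataFrame in which the variable keys end up in a single map column; spreading them into the fixed columns of the expected schema, and normalizing variants such as "Type Of Traveller" and "Type Of Customer", would be a further step on top of this.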