How to create a DataFrame from the wholeTextFiles method

kumarraj · December 7, 2017, 4:50pm

I have a text file like the one below (sample.txt):
TIME STAMP1
A1200 EVENT START
EVENT NAME = DOS
EVENT_INS = 1
EVENT_ID = 100
BUFFER = 233355
FORMAT = ATC
LOC = C:/User/data
;
TIME STAMP2
A1201 EVENT START
EVENT NAME = DOS
EVENT_INS = 0
EVENT_ID = 87
BUFFER = 773355
FORMAT = ETC
LOC = C:/User/data
;
This structure repeats many times in the text file.

I tried reading it with wholeTextFiles() and converting the result to an RDD of strings.
But when I convert that to a DataFrame, the entire file ends up in the first row.
I need each key to become its own column, so that col1 contains the EVENT NAME values (DOS, DOS, …), col2 contains the EVENT_INS values (1, 0, …), and so on.
I tried to read the file as follows:
val text1 = sc.wholeTextFiles("/user/files/log1.txt")
val rdd1 = rdd.map(x => x._2.replace("\n", "|*|").split(";").filter(!_.contains("A1200")).mkString(";").replace("|*|", "\n") + ";")
val rdd2 = rdd1.map(_.split("=").map(_.trim).mkString(","))
rdd2.toDF().show(20)

But the resulting table shows all of the text in a single column and a single row.
When I use val text1 = sc.textFile("/user/files/log1.txt") instead, the DataFrame comes out properly.

How can I get a properly structured DataFrame using the sc.wholeTextFiles() method?
Thank you.

Jasper-M · December 7, 2017, 5:32pm

Create a case class to represent your data and then flatMap your RDD[(String, String)] to an RDD[YourCaseClass].

case class Event(EVENT_NAME: String, EVENT_INS: Int, ...)
val events: RDD[Event] = rdd.flatMap{ case (_, file) => file.split(';').map(...) }
events.toDF
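
To make the elided map(...) step more concrete, here is a minimal sketch of the parsing function, written against plain Scala collections so it can be tried without a cluster. It assumes every record ends with ";" and that only EVENT NAME and EVENT_INS are wanted. The field names eventName and eventIns and the helper object EventParsing are made up for illustration, not taken from the original post.

```scala
// Hypothetical model: only two of the fields from the sample file are kept.
case class Event(eventName: String, eventIns: Int)

object EventParsing {
  // Turn one ';'-terminated block into key/value pairs.
  // Lines without '=' (timestamps, "EVENT START") are ignored.
  def keyValues(block: String): Map[String, String] =
    block.split("\n").iterator
      .map(_.trim)
      .filter(_.contains("="))
      .map { line =>
        val parts = line.split("=", 2).map(_.trim)
        parts(0) -> parts(1)
      }
      .toMap

  // Parse the full content of one file into zero or more events.
  // Blocks that lack either field, or whose EVENT_INS is not an integer, are skipped.
  def parseEvents(file: String): Seq[Event] =
    file.split(";").toSeq.flatMap { block =>
      val kv = keyValues(block)
      for {
        name <- kv.get("EVENT NAME")
        ins  <- kv.get("EVENT_INS").flatMap(s => scala.util.Try(s.toInt).toOption)
      } yield Event(name, ins)
    }
}
```

With Spark, the same function could be passed to flatMap, for example rdd.flatMap { case (_, content) => EventParsing.parseEvents(content) }, followed by toDF() once spark.implicits._ is in scope.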

kumarraj · December 7, 2017, 6:06pm

I tried this:
case class Event(EVENT_NAME: String, EVENT_INS: String, EVENT_TYPE: String)
and then tried to map it with:
val dd1 = rdd.map(x => x._2.replace("\n", "|*|").split(";").filter(!_.contains("A1200")).mkString(";").replace("|*|", "\n") + ";")

It shows a "multiple markers" error.
Is it possible to treat the RDD[(String, String)] from wholeTextFiles as key and value, and map the value line by line as strings, as with sc.textFile()?

Jasper-M · December 7, 2017, 6:26pm

That code is the same as before and it’s still not clear to me what it is supposed to achieve.
And I think you specifically don’t want to map the contents of your files line by line, since there seems to be information about a single row of data spread out over multiple lines.

kumarraj · December 7, 2017, 8:07pm

Hi,
I need only the following portion of the text file, from A1200 to “;”.
I need to remove timestamp 2 wherever it occurs in the file…

TIME STAMP1

A1200 EVENT START
EVENT NAME = DOS
EVENT_INS = 1
EVENT_ID = 100
BUFFER = 233355
FORMAT = ATC
LOC = C:/User/data
;
I need to replace each “=” with a comma “,” and save the file.
Finally, I need to create a DataFrame with the columns col1 and col2.
col1 should hold the left side of the “=” sign and col2 the right side,
so that it looks like this:
+---------------+-----------+-----+
|filedata       |col1       |col2 |
+---------------+-----------+-----+
|EVENT NAME,DOS |EVENT NAME |DOS  |
|EVENT NAME,DOS |EVENT NAME |DOS  |
|EVENT NAME,DOS |EVENT NAME |DOS  |
|EVENT NAME,XP  |EVENT NAME |XP   |
|EVENT_INS,1    |EVENT_INS  |1    |
|EVENT_INS,1    |EVENT_INS  |1    |
|EVENT_INS,2    |EVENT_INS  |2    |
|EVENT_INS,1    |EVENT_INS  |1    |
|EVENT_INS,2    |EVENT_INS  |2    |
|EVENT_INS,1    |EVENT_INS  |1    |
|EVENT_INS,4    |EVENT_INS  |4    |
+---------------+-----------+-----+

After this table, I need to align the values like this:
EVENT_NAME,DOS,DOS,DOS,XP
EVENT_INS,1,1,2,1,4
Finally, I need to convert this back into columns, with EVENT_NAME as col1 and EVENT_INS as col2, each holding its values.
Thank you.

Jasper-M · December 7, 2017, 8:48pm

I think you are confusing the result you need to obtain with the process you believe you need to follow to obtain it.

Still, you will probably want to flatMap the RDD you get from wholeTextFiles to an RDD[Record] with case class Record(col1: String, col2: String).
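
As a rough illustration of that suggestion, the sketch below turns the content of one file into Record values using plain Scala collections. It assumes every line containing "=" is a key/value pair and keeps all of them. If only some blocks are wanted, such as those starting with A1200, they would need to be filtered out first. The helper object RecordParsing and its method names are hypothetical.

```scala
case class Record(col1: String, col2: String)

object RecordParsing {
  // One Record per "key = value" line.
  // Timestamp, "EVENT START" and ";" lines contain no '=' and are dropped.
  def parseRecords(file: String): Seq[Record] =
    file.split("\n").toSeq
      .map(_.trim)
      .filter(_.contains("="))
      .map { line =>
        val parts = line.split("=", 2).map(_.trim)
        Record(parts(0), parts(1))
      }

  // Collect the values of each key in order of appearance,
  // which corresponds to the "EVENT_INS,1,1,2,..." layout asked about earlier.
  def valuesByKey(records: Seq[Record]): Map[String, Seq[String]] =
    records.groupBy(_.col1).map { case (key, rs) => key -> rs.map(_.col2) }
}
```

In Spark this would be used as rdd.flatMap { case (_, content) => RecordParsing.parseRecords(content) }, which yields an RDD[Record] that toDF() can turn into a two-column DataFrame.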