Creating a DataFrame from a text file by column position


#1

Example:
Row 1: XXX1234QQQQRRRRRRR$$$$
Row 2: XXX2345GGGGHHHHHHH####
where characters 1-3 are the Serial No (XXX)
and characters 4-7 are the Day (1234 or 2345)


#2

I would start with a Dataset[String] or even an RDD[String] and map it to the form that you want. If you really want a DataFrame, you could map each String to a Row with the data that you want in it. Personally, I’d make a case class for your data and map to that. It might look like this.

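import spark.implicits._  // provides the encoder needed to build a Dataset[Part]

case class Part(serial: String, day: String, ...)

// textFile returns a Dataset[String]; text would return a DataFrame of Rows
val parts = spark.read.textFile(inputFile).map { line =>
  val serial = line.substring(0, 3)  // characters 1-3
  val day = line.substring(3, 7)     // characters 4-7
  ...
  Part(serial, day, ...)
}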

This gives you a Dataset[Part], which you can then work with however you like, referring to the columns by the field names of the case class.
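
To check the slicing on its own, without a Spark session, here is a minimal plain-Scala sketch applied to the two example rows from the question. The names FixedWidthSketch, PartSketch and parse are hypothetical and only for illustration, and the case class carries just the two fields the question describes. The length check is an added assumption: substring throws on lines that are too short, so short lines are turned into None here.

```scala
// Hypothetical names for illustration only: FixedWidthSketch, PartSketch, parse.
object FixedWidthSketch {
  // Only the two fields described in the question; any further fields are left out.
  final case class PartSketch(serial: String, day: String)

  // Assumption: a line shorter than 7 characters is treated as unparseable
  // and yields None, instead of letting substring throw an exception.
  def parse(line: String): Option[PartSketch] =
    if (line.length < 7) None
    else Some(
      PartSketch(
        serial = line.substring(0, 3), // characters 1-3
        day = line.substring(3, 7)     // characters 4-7
      )
    )
}
```

A function like this could be called from inside the map over the lines, which keeps the offsets in one place and makes them easy to test.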


#3

Thanks Mark.
I was thinking the same, but I wondered whether there is another way of doing that.