Data loss while reading a huge file in Spark Scala

import org.apache.spark.sql.RowFactory
import org.apache.spark.sql.functions.monotonically_increasing_id
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import spark.implicits._

val data = spark.read
  .text(filepath)
  .toDF("val")
  .withColumn("id", monotonically_increasing_id())
val count = data.count()
val header = data.where("id==1").collect().map(s => s.getString(0)).apply(0)
val columns = header
  .replace("H||", "")
  .replace("|##|", "")
  .split("\\|\\*\\|")
val structSchema = StructType(columns.map(s => StructField(s, StringType, true)))
val correctData = data.where('id > 1 && 'id < count - 1).select("val")
val dataArr = correctData.rdd.map(s => {
  val a = s.getString(0).replace("\\n", "").replace("\\r", "")
  // drop the "|##|" row terminator and split the line on the "|*|" delimiter
  var b = a.replace("|##|", "").split("\\|\\*\\|")
  // pad short rows so every Row has one value per schema column
  while (b.length < columns.length) b = b :+ ""
  RowFactory.create(b: _*)
})
val finalDF = spark.createDataFrame(dataArr,structSchema)
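
To give an idea of the layout the parsing above assumes (this is a made-up illustration, not real data): a header line that contains "H||" and lists the column names, fields separated by "|*|", every line ending with "|##|", and a last line that I also drop with the id filter. Roughly:

H||col1|*|col2|*|col3|##|
value1|*|value2|*|value3|##|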

This code works fine when it reads a file with up to about 50k rows, but once a file has more rows than that it starts losing data: when it reads a file with 1 million+ rows, the final DataFrame count comes to only about 65k rows. I can't figure out where in this code the problem occurs, or what I need to change so that every row ends up in the final DataFrame. P.S. The largest file this code has to ingest has almost 14 million rows; currently it ingests only about 2 million of them.
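
For reference, here is a minimal sketch of the kind of stage-by-stage check I could add to narrow down where the rows disappear (it only uses the DataFrames/RDD already defined above; the printed labels are just illustrative):

// compare row counts at each stage of the pipeline
val rawCount    = data.count()        // lines read from the file
val bodyCount   = correctData.count() // lines kept after the id-based header/trailer filter
val parsedCount = dataArr.count()     // rows after splitting into columns
val finalCount  = finalDF.count()     // rows in the final DataFrame
println(s"raw=$rawCount body=$bodyCount parsed=$parsedCount final=$finalCount")
println(s"input partitions=${data.rdd.getNumPartitions}")

The idea is simply to see at which step the count first drops on the big files.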