Dataframes, do they keep file line order?

Welsige · August 29, 2019, 8:37pm

I am loading a CSV file into a DataFrame so that I can then create a LinkedHashMap from some of its columns, like so:

import scala.collection.mutable

val vMap = new mutable.LinkedHashMap[String, String]()
sparkSession.read.option( "header", true ).csv( pPath )
      .collect().foreach( t => vMap += ((t(0).toString, t(1).toString)))

It seems to maintain file order when I print it, but is that guaranteed? I read elsewhere that DataFrames do not guarantee line order.

pjfanning · September 3, 2019, 11:25pm

In my experience, the order of the CSV is maintained when it is read. If you apply a transformation to the DataFrame, that order can be lost. DataFrames do support sorting if you are not sure. Before any transformations, you can use zipWithIndex on the DataFrame's underlying RDD (df.rdd.zipWithIndex(); the method is defined on RDDs, not on DataFrames) to add an index column that can be used to re-sort the data afterwards and get back to the original order.
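
A minimal sketch of that indexing step, assuming Spark's Scala API. The object name `IndexedCsv`, the helper `withRowIndex` and the column name `row_idx` are made up for illustration. Since zipWithIndex is an RDD method, the sketch goes through `df.rdd` and rebuilds a DataFrame with one extra column:

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object IndexedCsv {
  // Appends a 0-based position column reflecting the order in which Spark
  // currently holds the rows. "row_idx" is a hypothetical column name.
  def withRowIndex(spark: SparkSession, df: DataFrame, name: String = "row_idx"): DataFrame = {
    // zipWithIndex pairs every row with its position across all partitions.
    val indexed = df.rdd.zipWithIndex().map { case (row, idx) =>
      Row.fromSeq(row.toSeq :+ idx)
    }
    // Extend the original schema with the new Long column.
    val schema = StructType(df.schema.fields :+ StructField(name, LongType, nullable = false))
    spark.createDataFrame(indexed, schema)
  }
}
```

It would be called right after the read, for example `IndexedCsv.withRowIndex(spark, spark.read.option("header", true).csv(pPath))`, and a later `.orderBy("row_idx")` restores the order the rows had at that point.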

Welsige · September 5, 2019, 2:29pm

Thanks, I had that impression too, but since I am mapping the data into collections I wanted certainty about this behavior, not only in the DataFrame but also in the collection I am mapping it into.
I decided to add an index column to the text files that need their ordering assured, so there is no doubt about the processing taking place.
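
Roughly, the index-column approach looks like the sketch below. The column names `idx`, `key` and `value`, and the object `OrderedLookup`, are placeholders and not the real file layout. The sort is the last step before collect(), and LinkedHashMap then iterates in insertion order:

```scala
import scala.collection.mutable
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object OrderedLookup {
  // Reads a CSV with a header row, sorts by the explicit index column and
  // fills an insertion-ordered map on the driver.
  // Column names "idx", "key" and "value" are assumptions for this sketch.
  def load(spark: SparkSession, path: String): mutable.LinkedHashMap[String, String] = {
    val rows = spark.read.option("header", true).csv(path)
      // CSV columns are read as strings unless a schema is given,
      // so cast the index to get a numeric sort rather than a lexical one.
      .select(col("idx").cast("long").as("idx"), col("key"), col("value"))
      .orderBy("idx")
      .collect()

    val result = mutable.LinkedHashMap.empty[String, String]
    rows.foreach(r => result += (r.getString(1) -> r.getString(2)))
    result
  }
}
```

With this, the order of the map depends only on the index values in the file, not on how Spark happened to read or partition it.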