pNode is a column produced by a collect_list aggregation.
var pArray = pArray1.select("PID", "GID", "pNode")
var pMap = pArray.rdd.collect().map(f => {
  var PID = if (f.getAs[Any]("PID") != null) f.getAs("PID").toString() else null
  var GID = if (f.getAs[Any]("GID") != null) f.getAs("GID").toString() else null
  var pNode = f.getAs[Any]("pNode")
  println(PID + "~" + GID + "," + pNode)
  (PID + "~" + GID + "~", pNode)
}).toList.groupBy(_._1).mapValues(_.map(_._2))
pArray is a Spark DataFrame derived from joining 3 DataFrames.
The collect(), which is an action, brings the rows to the driver, and the map function turns them into key-value pairs.
What do the toList and .mapValues calls do here? Could you please explain this program step by step? Sorry if this is a basic question.
I’m not too familiar with Spark, but if I understand your question correctly, it’s mostly about the standard collections in the code.
First, you have the map call:
pArray.rdd.collect().map(f => ...)
Note that collect() returns a plain Scala Array[Row], so from this point on you are working with ordinary Scala collections, not Spark ones. The function given to map returns a tuple of a String and pNode, so the result is an Array[(String, PNode)] (I'll just use PNode as the type of pNode here). You then call toList, which converts that array into a List[(String, PNode)].
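As a quick illustration (using plain Strings as a stand-in for the real pNode values, whose type isn't shown in your code), this is what the array looks like after the map and toList steps:

```scala
// Stand-in data: what collect().map(...) might produce.
// Keys are the "PID~GID~" strings; the second element stands in for pNode.
val mapped: Array[(String, String)] = Array(
  ("1~A~", "n1"),
  ("1~A~", "n2"),
  ("2~B~", "n3")
)
// toList converts the plain Scala Array into an immutable List.
val asList: List[(String, String)] = mapped.toList
```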
On the list, groupBy(_._1) groups the elements by the first element of each tuple (_._1 is shorthand for the function tuple => tuple._1, which selects the first element of the tuple).
The result is a Map[String, List[(String, PNode)]]. The keys are the strings from the tuples, and the values are lists of the tuples with the corresponding string in them.
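With the same stand-in data as above, the grouping step looks like this (note that tuples sharing a key end up in the same list):

```scala
val pairs = List(("1~A~", "n1"), ("1~A~", "n2"), ("2~B~", "n3"))
// Group by the first tuple element (the "PID~GID~" key string).
val grouped: Map[String, List[(String, String)]] = pairs.groupBy(_._1)
// grouped("1~A~") is List(("1~A~", "n1"), ("1~A~", "n2"))
// grouped("2~B~") is List(("2~B~", "n3"))
```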
You can see that the string is now stored redundantly, both as the key and inside the values. The call .mapValues(_.map(_._2)) changes this:
mapValues is similar to map on a list: it applies the given function to every value in the map, leaving the keys untouched.
The values are of type List[(String, PNode)], but we want List[PNode], so we call map to transform the elements of each list.
_._2 is again a function selecting an element of a tuple, this time the PNode.
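Putting it together on the stand-in data, mapValues strips the redundant key string out of each list, leaving only the pNode values (note: on Scala 2.13+, mapValues on a Map is deprecated in favour of .view.mapValues, but the behaviour shown here is the same):

```scala
val grouped = Map(
  "1~A~" -> List(("1~A~", "n1"), ("1~A~", "n2")),
  "2~B~" -> List(("2~B~", "n3"))
)
// Drop the redundant key from each tuple, keeping only the pNode part.
val cleaned = grouped.mapValues(_.map(_._2))
// cleaned("1~A~") is List("n1", "n2")
// cleaned("2~B~") is List("n3")
```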