What do map and mapValues return here?

       pNode is the result of a collect_list
       var pArray = pArray1.select("PID", "GID", "pNode")
       var pMap = pArray.rdd.collect().map(f => {
         // getAs[Any] keeps the compiler from inferring Nothing, which would fail at runtime on toString
         var PID = if (f.getAs[Any]("PID") != null) f.getAs[Any]("PID").toString else null
         var GID = if (f.getAs[Any]("GID") != null) f.getAs[Any]("GID").toString else null

         println(PID + "~" + GID + "," + pNode)

pArray is a Spark DataFrame derived from joining three DataFrames.

As I understand it, collect() is an action, and the map function produces the key-value pairs.
What do toList and .mapValues do here? Could you please explain this program step by step? Sorry, this might be a basic question.

I’m not too familiar with Spark, but if I understand your question correctly, it’s mostly about the standard collections in the code.

First, you have the map call on the array of rows that collect() returns.


The function given to map returns a tuple of a string and pNode. You then call toList, which converts the resulting Array (collect() returns a plain Scala array, not a Spark collection) to a List[(String, PNode)] (I’ll just use PNode as the type of pNode here).
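
To make the map and toList steps concrete, here is a minimal sketch that uses plain Scala collections instead of Spark. SimpleRow, the PNode alias, and the sample values are hypothetical stand-ins (they are not from the original program); the Array plays the role of what collect() hands back.

```scala
object MapToListExample {
  // Hypothetical stand-ins: a simplified row type and a PNode alias (not Spark's Row).
  type PNode = Seq[String]
  final case class SimpleRow(pid: String, gid: String, pNode: PNode)

  // Plays the role of pArray.rdd.collect(): an ordinary Array on the driver.
  val rows: Array[SimpleRow] = Array(
    SimpleRow("1", "A", Seq("n1")),
    SimpleRow("1", "A", Seq("n2")),
    SimpleRow("2", "B", Seq("n3"))
  )

  // map turns each row into a (key, value) tuple; toList converts the Array to a List.
  val pairs: List[(String, PNode)] =
    rows.map(r => (r.pid + "~" + r.gid, r.pNode)).toList
}
```

At this point nothing has been grouped yet: the list still has one tuple per row, and the key "1~A" appears twice.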

On the list, groupBy(_._1) groups the elements by the first element of each tuple (_._1 is shorthand for the function tuple => tuple._1, which selects the first element of the tuple).
The result is a Map[String, List[(String, PNode)]]. The keys are the strings from the tuples, and each value is the list of all tuples that contain that string.
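
The grouping step can be sketched on its own like this. The sample pairs are made up for illustration, and String stands in for the PNode type.

```scala
object GroupByExample {
  // Hypothetical sample data: (key, value) tuples, with String standing in for PNode.
  val pairs: List[(String, String)] =
    List(("1~A", "n1"), ("1~A", "n2"), ("2~B", "n3"))

  // groupBy(_._1) collects all tuples that share the same first element.
  // Note that each value still holds the full tuples, key included.
  val grouped: Map[String, List[(String, String)]] = pairs.groupBy(_._1)
}
```

Here grouped("1~A") contains both ("1~A", "n1") and ("1~A", "n2"), so the key is repeated inside every value.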

You can see that the string is now stored redundantly, both as the key and inside the values. The call .mapValues(_.map(_._2)) changes this:

  • mapValues is similar to map on a list, applying the given function to all values in the map (ignoring the keys)
  • the values are List[(String, PNode)] but we want List[PNode], so we call map to transform the values in the lists.
  • _._2 is again a function selecting an element in the tuple, this time the PNode one.

So in the end you get a Map[String, List[PNode]].
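
Putting the steps together, a rough sketch of the whole chain on plain Scala collections could look like this. The data is invented and String again stands in for PNode. The trailing .toMap is an addition of this sketch: on Scala 2.13 and later, mapValues on a Map is deprecated and returns a lazy view, so .toMap turns it back into an ordinary Map, and on 2.12 it is harmless.

```scala
object FullPipelineExample {
  // Hypothetical sample data: (key, value) tuples, with String standing in for PNode.
  val pairs: List[(String, String)] =
    List(("1~A", "n1"), ("1~A", "n2"), ("2~B", "n3"))

  val result: Map[String, List[String]] =
    pairs
      .groupBy(_._1)            // Map[String, List[(String, String)]]
      .mapValues(_.map(_._2))   // keep only the second tuple element in each list
      .toMap                    // materialize the view (needed on Scala 2.13+)

  // The same transformation written with map instead of mapValues.
  val resultViaMap: Map[String, List[String]] =
    pairs.groupBy(_._1).map { case (key, tuples) => key -> tuples.map(_._2) }
}
```

Both values map "1~A" to List("n1", "n2") and "2~B" to List("n3"). The second form avoids the deprecation warning if that matters for your build.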


That was helpful, thank you.