How to compare the values of 2 columns and get output into an array in Scala?

nbhan · February 16, 2024, 8:32pm

Below is the input dataframe:

val data = Seq( (5416, “il”, “ar”, “17/02/2017”), (5416, “ar”, “us”, “22/02/2017”), (5416, “us”, “au”, “01/03/2017”), (5416, “au”, “sg”, “03/03/2017”), (5416, “sg”, “be”, “10/03/2017”), (5416, “be”, “be”, “13/03/2017”), (5416, “be”, “ca”, “02/04/2017”), (5416, “ca”, “ch”, “16/06/2017”), (5416, “ch”, “cg”, “20/06/2017”), (5416, “cg”, “at”, “14/07/2017”), (5416, “at”, “no”, “23/07/2017”) ).toDF(“passengerId”, “from”, “to”, “date”)

We have to group by the passenger id and get the country codes from “FROM” and “TO” columns into an array. The country code in FROM and TO are in sync with each other. Data consists of passenger travelling from one country to another on different dates

Below is the expected output:

passengerId---------------countries
5416 ------------------------[il,ar,us,au,sg,be,ca,ch,cg,at,no]

I have used the below logic:
data.groupBy(“passengerId”)
.agg(concat(collect_list(col(“from”)),collect_list(col(“to”))).name(“countries”))

It is somehow duplicating the country codes:

[“il”, “ar”, “us”, “au”, “sg”, “be”, “be”, “ca”, “ch”, “cg”, “at”, “no”, “us”, “tk”, “cl”, “pk”, “tj”, “at”, “sg”, “nl”, “tj”]

Quafadas · February 16, 2024, 9:16pm

So… if you don’t need spark, (discard the .toDF part) then the scala std library has you covered pretty nicely. Is this what you would want?

data.groupMapReduce(_._1)(tuple => List((tuple._2, tuple._3)))(_ ++ _).foreach(println)

Which spits out;
(5416,List((il,ar), (ar,us), (us,au), (au,sg), (sg,be), (be,be), (be,ca), (ca,ch), (ch,cg), (cg,at), (at,no)))

Now, I think we can reproduce your confusion, by mapping to a List[String], instead of a List[(String, String)]?

Spot the missing brackets…

data.groupMapReduce(_._1)(tuple => List(tuple._2, tuple._3))(_ ++ _).foreach(println)

(5416,List(il, ar, ar, us, us, au, au, sg, sg, be, be, be, be, ca, ca, ch, ch, cg, cg, at, at, no))

I don’t know spark, but my guess, is that if you can translate the difference between the two examples above to Spark, it may help?