Appending dataframe column in scala spark


#1

Hi,
I tried to merge two dataframes, but I am running into a duplicate-rows problem.
DATAFRAME1 (df1)
+-----+-----+-------+
|Val_1|RES_1|OWNER_1|
+-----+-----+-------+
|val-a| PASS|  OWN-1|
|val-b| PASS|  OWN-2|
|val-c| FAIL|  OWN-2|
+-----+-----+-------+

DATAFRAME-2 (df2)
+-----+-----+-------+
|val_2|RES_2|OWNER_2|
+-----+-----+-------+
|val-d| FAIL|  OWN-3|
|val-e| PASS|  OWN-4|
|val-f| FAIL|  OWN-5|
+-----+-----+-------+

I need the final merged dataframe to look like this:
+-----+-----+-------+-----+-----+-------+
|Val_1|RES_1|OWNER_1|val_2|RES_2|OWNER_2|
+-----+-----+-------+-----+-----+-------+
|val-a| PASS|  OWN-1|val-d| FAIL|  OWN-3|
|val-b| PASS|  OWN-2|val-e| PASS|  OWN-4|
|val-c| FAIL|  OWN-2|val-f| FAIL|  OWN-5|
+-----+-----+-------+-----+-----+-------+

I tried with:
val df3 = df1.join(df2, df1("Val_1") =!= df2.col("val_2"))
df3.show()
But it creates duplicate rows in the merged dataframe.
How can I remove the duplicate rows?
I tried df3.dropDuplicates(), but it did not help.

thank you.


#2

This isn’t really a Scala question, it’s a Spark question. You will probably find useful information on StackOverflow (for example, here is a similar question—but don’t use the accepted answer, it may fail for non-trivial datasets).
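What the thread is really asking for is a positional (row-number) join. One approach that holds up on non-trivial datasets, unlike the `monotonically_increasing_id` trick often suggested on Stack Overflow (its IDs are not consecutive across partitions, so they cannot be used as a shared row number), is to attach an explicit index with `RDD.zipWithIndex` and join on it. A minimal sketch; the `zipByRowIndex` helper name is my own:

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField}

// Pair two DataFrames row-by-row via an explicit index column.
// zipWithIndex assigns consecutive 0-based indices in partition order,
// so both sides get comparable row numbers.
def zipByRowIndex(left: DataFrame, right: DataFrame): DataFrame = {
  val spark = left.sparkSession

  // Append a "row_idx" column holding each row's position.
  def withIndex(df: DataFrame): DataFrame = {
    val schema = df.schema.add(StructField("row_idx", LongType, nullable = false))
    val indexed = df.rdd.zipWithIndex.map { case (row, idx) =>
      Row.fromSeq(row.toSeq :+ idx)
    }
    spark.createDataFrame(indexed, schema)
  }

  // Equi-join on the index, then drop the helper column.
  withIndex(left)
    .join(withIndex(right), "row_idx")
    .drop("row_idx")
}
```

Applied to `df1` and `df2` above, this pairs `val-a` with `val-d` and so on, producing one six-column row per input row instead of a near cross product.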

My question is why does this keep coming up? Where do people get these separate datasets that have no explicit index and are implicitly indexed by row number?


#3

As @jpallas said above, this is a Spark-specific question. If Stack Overflow does not help, you should reach out to the Spark User Mailing List. It is a very active, friendly, and knowledgeable community, and they will most likely answer your question or suggest a better solution.


#4

Thank you.
I solved the problem above by joining the two Spark dataframes and then selecting the columns I needed, in Scala.
Thank you very much for the helpful suggestions.