Appending dataframe column in scala spark


#1

Hi,
I tried to merge two dataframes, but I am running into a duplicate-rows problem.
DATAFRAME1 (df1)
+-----+-----+-------+
|Val_1|RES_1|OWNER_1|
+-----+-----+-------+
|val-a| PASS|  OWN-1|
|val-b| PASS|  OWN-2|
|val-c| FAIL|  OWN-2|
+-----+-----+-------+

DATAFRAME-2 (df2)
+-----+-----+-------+
|val_2|RES_2|OWNER_2|
+-----+-----+-------+
|val-d| FAIL|  OWN-3|
|val-e| PASS|  OWN-4|
|val-f| FAIL|  OWN-5|
+-----+-----+-------+

I need the final merged dataframe to look like this:
+-----+-----+-------+-----+-----+-------+
|Val_1|RES_1|OWNER_1|val_2|RES_2|OWNER_2|
+-----+-----+-------+-----+-----+-------+
|val-a| PASS|  OWN-1|val-d| FAIL|  OWN-3|
|val-b| PASS|  OWN-2|val-e| PASS|  OWN-4|
|val-c| FAIL|  OWN-2|val-f| FAIL|  OWN-5|
+-----+-----+-------+-----+-----+-------+

I tried with:
val df3 = df1.join(df2, df1("Val_1") =!= df2.col("val_2"))
df3.show()
But it creates duplicate rows in the merged dataframe.
How can I remove the duplicate rows?
I tried df3.dropDuplicates(), but it did not help.

thank you.


#2

This isn’t really a Scala question, it’s a Spark question. You will probably find useful information on StackOverflow (for example, here is a similar question—but don’t use the accepted answer, it may fail for non-trivial datasets).
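What the thread is really asking for is a positional (row-number) join. One approach that holds up on non-trivial datasets, unlike the `monotonically_increasing_id` trick often suggested on Stack Overflow (its IDs are not consecutive across partitions, so they cannot be used as a shared row number), is to attach an explicit index with `RDD.zipWithIndex` and join on it. A minimal sketch; the `zipByRowIndex` helper name is my own:

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField}

// Pair two DataFrames row-by-row via an explicit index column.
// zipWithIndex assigns consecutive 0-based indices in partition order,
// so both sides get comparable row numbers.
def zipByRowIndex(left: DataFrame, right: DataFrame): DataFrame = {
  val spark = left.sparkSession

  // Append a "row_idx" column holding each row's position.
  def withIndex(df: DataFrame): DataFrame = {
    val schema = df.schema.add(StructField("row_idx", LongType, nullable = false))
    val indexed = df.rdd.zipWithIndex.map { case (row, idx) =>
      Row.fromSeq(row.toSeq :+ idx)
    }
    spark.createDataFrame(indexed, schema)
  }

  // Equi-join on the index, then drop the helper column.
  withIndex(left)
    .join(withIndex(right), "row_idx")
    .drop("row_idx")
}
```

Applied to `df1` and `df2` above, this pairs `val-a` with `val-d` and so on, producing one six-column row per input row instead of a near cross product.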

My question is why does this keep coming up? Where do people get these separate datasets that have no explicit index and are implicitly indexed by row number?


#3

As @jpallas said above, this is a Spark-specific question. If Stack Overflow does not help, you should reach out to the Spark User Mailing List. It is a very active, friendly, and knowledgeable community, and they will most likely answer your question or suggest a better solution.


#4

Thank you.
I solved the problem above by joining the two Spark dataframes and then selecting the columns I needed, in Scala.
Thank you very much for the helpful suggestions.