SQL queries do not work on two joined DataFrames

I have two DataFrames. The first has 4 columns, the second has 1 column.
I joined these two DataFrames (code at the end of this post).
The first DF was built from three .json files.

The second DF was built from an Array that holds the hashCode of one of the columns of the first DF.

All of this works, as long as I do not use SQL queries.

When I run SQL queries on the “merged” DF, the values in the added column (from the second DF) are all the same. The queries work as expected on the other columns, but not on this one.


val ArrayA = mergeDataFrame.select("USER_id").rdd.map(r => r(0)).collect()
val xyz = ArrayA.map{_.hashCode }
val rdd = sc.parallelize(xyz)
val HashCode = rdd.toDF("HashCode")
val mergeDataFrameHashCode = mergeDataFrame.join(HashCode)

I guess you’re missing a join condition. The way you’re doing the join, you end up with a Cartesian product: every row of the first DataFrame is paired with every row of the HashCode DataFrame.
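To make the Cartesian product concrete, here is a minimal sketch. The three sample values and the local SparkSession are assumptions for illustration, not the asker's data. It uses crossJoin (available from Spark 2.1) to spell out what a join without a condition amounts to; depending on the Spark version and the spark.sql.crossJoin.enabled setting, a bare join(...) may instead be rejected with an AnalysisException.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for the demo; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("cross-join-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-ins for mergeDataFrame and the HashCode DataFrame.
val users  = Seq("a", "b", "c").toDF("USER_id")
val hashes = Seq("a", "b", "c").map(_.hashCode).toDF("HashCode")

// No join condition: every USER_id row is paired with every HashCode row.
val crossed = users.crossJoin(hashes)

// 3 rows x 3 rows = 9 rows, so each USER_id appears next to all three hash values.
val crossedRows = crossed.count()
println(s"rows after cross join: $crossedRows")
```

With the real data the result has (rows of the first DF) x (rows of the HashCode DF) rows, so the hash next to a given USER_id is not necessarily its own.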

But you don’t have to leave the DataFrame API to calculate the hash code:

val mergeDataFrameHashCode = mergeDataFrame.withColumn("HashCode", hash($"USER_id"))

To use the hash function, you may need to import it from org.apache.spark.sql.functions (and be on Spark 2.x or later).
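A fuller sketch of the withColumn approach, including the import and an SQL query over the result. The sample rows, the name column, and the view name "merged" are assumptions for illustration. Note that Spark's hash is a Murmur3-based hash, so its values will generally differ from Scala's hashCode; if the exact JVM hashCode is needed, a UDF wrapping _.hashCode would be one option.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.hash

// Assumption: a local session for the demo; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("hash-column-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the question's mergeDataFrame; only USER_id matters here.
val mergeDataFrame = Seq(
  ("u1", "Alice"),
  ("u2", "Bob"),
  ("u1", "Alice again")
).toDF("USER_id", "name")

// The hash is computed row by row, so no join is involved and the row count is unchanged.
val mergeDataFrameHashCode = mergeDataFrame.withColumn("HashCode", hash($"USER_id"))

// Register a temporary view so the new column can be queried with SQL.
mergeDataFrameHashCode.createOrReplaceTempView("merged")
val distinctHashes = spark.sql("SELECT DISTINCT USER_id, HashCode FROM merged")
distinctHashes.show()
```

Because each HashCode value is derived from the USER_id in the same row, SQL filters and aggregations on the new column stay consistent with the rest of the row.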


Questions about Spark are perhaps better suited to Spark’s mailing list.