Need help converting an array of 3-element tuples into 2-element tuples

Hi, I am trying to convert an array of 3-element tuples:

Example:

val people = sc.parallelize(Array(("Jane", "student", 1000),
("Peter", "doctor", 100000), ("Mary", "doctor", 200000),
("Michael", "student", 1000)))

to val jobOsalary = Array(("student", 1000), ("doctor", 100000), ("doctor", 200000), ("student", 1000))

How can I do this?

Pattern matching is useful here:

scala 2.13.2> val a = Array(("Jane", "student", 1000),
            | ("Peter", "doctor", 100000), ("Mary", "doctor", 200000),
            | ("Michael", "student", 1000))
val a: Array[(String, String, Int)] = Array((Jane,student,1000), (Peter,doctor,100000), (Mary,doctor,200000), (Michael,student,1000))

scala 2.13.2> a.map{case (_, job, salary) => (job, salary)}
val res1: Array[(String, Int)] = Array((student,1000), (doctor,100000), (doctor,200000), (student,1000))

Thanks Seth.

I am trying to reduce the elements in the array a using reduceByKey(), but it shows an error:

val result = a.map{case (_, job, salary) => (job, salary)}.reduceByKey(_+_)
<console>:25: error: value reduceByKey is not a member of Array[(String, Int)]
       val result = a.map{case (_, job, salary) => (job, salary)}.reduceByKey(_+_)
                                                                  ^ 

Can you tell me what to do in this case? How can reduceByKey be used to reduce the set of elements to get output like

Array((doctor,300000), (student,2000))

reduceByKey only exists on an RDD (Spark). Your first example did use Spark, but Seth’s example omitted the parallelize step.

You can simply combine both parts and it should work.
Or, if you want a plain Scala solution, you can use groupMapReduce on 2.13+, or a combination of groupBy, mapValues, and reduce on older versions.
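For example, a minimal sketch of the plain-Scala route (no Spark needed), reusing the same data:

val a = Array(("Jane", "student", 1000), ("Peter", "doctor", 100000),
              ("Mary", "doctor", 200000), ("Michael", "student", 1000))

// 2.13+: group by job, map each tuple to its salary, reduce salaries with +
val totals = a.groupMapReduce(_._2)(_._3)(_ + _)
// totals: Map[String, Int] = Map(student -> 2000, doctor -> 300000)

// Pre-2.13 equivalent: groupBy, then mapValues and reduce
val totalsOld = a.groupBy(_._2).mapValues(_.map(_._3).reduce(_ + _))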

Hi,

I am wondering why case is needed here. I just tried map directly, followed by reduceByKey, and it works. Not sure if I missed something:

scala> val people = sc.parallelize(Array(("Jane", "student", 1000),
     | ("Peter", "doctor", 100000), ("Mary", "doctor", 200000),
     | ("Michael", "student", 1000)))
people: org.apache.spark.rdd.RDD[(String, String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> val res = people.map(v => (v._2, v._3)).reduceByKey
reduceByKey   reduceByKeyLocally

scala> val res = people.map(v => (v._2, v._3)).reduceByKey(_+_).collect()
res: Array[(String, Int)] = Array((student,2000), (doctor,300000))

The case (_, job, salary) pattern is not required; it just improves the readability of the code.
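Both forms are equivalent; the pattern simply names the fields:

people.map { case (_, job, salary) => (job, salary) }  // named fields
people.map(v => (v._2, v._3))                          // positional accessors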


Thanks a lot, guys.

Here is the piece of code I made:

scala> val people = sc.parallelize(Array(("Jane", "student", 1000),
     |                                   ("Peter", "doctor", 100000),
     |                                   ("Mary", "doctor", 200000),
     |                                   ("Michael", "student", 1000)))
people: org.apache.spark.rdd.RDD[(String, String, Int)] = ParallelCollectionRDD[25] at parallelize at <console>:24
scala> 
scala> 
scala> val b =sc.parallelize(a.map{case (_, job, salary) => (job, salary)})
b: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[26] at parallelize at <console>:26
scala> val result =b.reduceByKey(_+_).sortBy(_._1)
result: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[32] at sortBy at <console>:28
scala> result.collect
res12: Array[(String, Int)] = Array((doctor,300000), (student,2000))

And the same using positional accessors instead of the case pattern:
scala> val b =sc.parallelize(a.map(v => (v._2, v._3)))
b: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[26] at parallelize at <console>:26
scala> val result =b.reduceByKey(_+_).sortBy(_._1)
result: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[32] at sortBy at <console>:28
scala> result.collect
res12: Array[(String, Int)] = Array((doctor,300000), (student,2000))

Btw, a user told me to put ``` before and after the code when entering it here. On my keyboard, if I type quotes like ''' it comes out different from ```. How can I type ``` on a normal keyboard? I always have to go back and copy the ``` from the chat. It's very basic, but still helpful.
Thanks

You can just map over people directly and remove the second parallelize.

Finally, collect() should always be called with parentheses.
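Putting those corrections together, something like:

val result = people
  .map { case (_, job, salary) => (job, salary) }
  .reduceByKey(_ + _)
  .sortBy(_._1)

result.collect()
// Array((doctor,300000), (student,2000))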

An additional note: do not use Arrays unless you need to.
In this case, since the collection is only used to create the RDD, you may just use a List or Seq.
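For example:

val people = sc.parallelize(List(("Jane", "student", 1000),
                                 ("Peter", "doctor", 100000),
                                 ("Mary", "doctor", 200000),
                                 ("Michael", "student", 1000)))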

That looks a lot more complicated to me than Seth’s solution. The simplest solution is usually the best solution. Why do you need the additional complexity? Is it because you need parallel processing? If so, you must have a huge list of these tuples. Have you considered just using a parallel Vector?
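For example, a rough sketch with parallel collections (assuming the scala-parallel-collections module, which provides .par on 2.13+):

import scala.collection.parallel.CollectionConverters._

val people = Vector(("Jane", "student", 1000), ("Peter", "doctor", 100000),
                    ("Mary", "doctor", 200000), ("Michael", "student", 1000))

// Same map + group + sum pipeline, but on a local parallel collection
val totals = people.par
  .map { case (_, job, salary) => (job, salary) }
  .groupBy(_._1)
  .map { case (job, pairs) => (job, pairs.map(_._2).sum) }
// e.g. ParMap(doctor -> 300000, student -> 2000)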