Spark-Scala RDD, group by count from array of array

Hi everyone, I’m a complete beginner in Spark/Scala, so please bear with me. I have progressed a bit since I posted my last question, but I am stuck again.

val dataLines = sc.textFile("Data/clientjobs.csv")
val data = dataLines.map(_.split(";"))

val values = data.map(array => array(1))

I have the code above and an array of array like this

val data: Array[Array[String]] = Array(Array("c1", 20)
  , Array("c2", 102)
  , Array("c3", 50)
  , Array("c4", 80)
  , Array("c5", 140)
  , Array("c6", 2036), Array("c7", 568))

I want an output where, instead of an array of arrays, I just get a normal array with the integer values, like this:

Array(20, 102, 50, 80, 140, 2036, 568)

I have mapped the array, but it doesn’t show the values in the terminal.

It shows - MapPartitionsRDD[3] at map at code1.scala:14

This code:

data.map(array => array(1))

appears correct to me and should be giving you an Array[String]. If you wanted an Array[Int], do

data.map(array => array(1).toInt)
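
For example, a minimal plain-Scala sketch (no Spark, hypothetical sample data) of the difference:

// Sample rows shaped like the ones in the question
val rows = Array(Array("c1", "20"), Array("c2", "102"))
rows.map(array => array(1))       // Array("20", "102") -- still Strings
rows.map(array => array(1).toInt) // Array(20, 102)     -- now Ints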

but then this part of your question:

but it doesn’t show me in the terminal
It shows - MapPartitionsRDD[3] at map at code1.scala:14

I don’t know how to help with. Your description of what went wrong isn’t clear to me. I think you might need to include more context from your terminal session in order for someone to help. (Also, I don’t know Spark at all, so if whatever you’re experiencing ends up involving Spark knowledge, perhaps someone else will step in.)

Try calling .collect at the end of it, see if that helps.
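
For example (a minimal sketch, assuming the RDD named data from the original post):

// collect() is an action: it forces the lazy map to actually run
// and returns a local Array to the driver
val values = data.map(array => array(1).toInt).collect()
// values: Array[Int] = Array(20, 102, 50, 80, 140, 2036, 568)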

I think you may have an array of tuples stored in your values object.

You can’t create an array with differing data types - you have a String and an Int, so you would therefore be creating a tuple.
To access elements in a tuple you need to use the ._# notation.

You can read more about it here

So, I think, you may need to have something more like this (._2 picks the second element, the count):
val values = data.map(tuple => tuple._2)
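
For reference, a tiny standalone sketch of the ._# notation (hypothetical values):

val pair = ("c2", 102)   // a (String, Int) tuple
pair._1                  // "c2" -- first element
pair._2                  // 102  -- second element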

Thank you for your guidance, the toInt method works 🙂

Thanks, but I can’t seem to use the tuple; it gives me an error on the ._ notation.

When asking for help with code online, you should never just say “gives me an error”. Show the code and show the complete text of the error. That will make it much easier for people to help you.

Ah ok. Thanks for letting me know.

It’s hard to tell from your example what the data is actually being transformed into.
But if the tuple notation wasn’t working, then it’s likely that when you converted dataLines into an array of arrays, each cell in the CSV was converted to a String.

For example Array(Array("c2", "102")...)
Notice the 102 is a String, not an Int

Which also explains why the .toInt method worked.
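
A quick illustration (a minimal sketch, not your actual pipeline):

// Each line of a text file is read as a String,
// so split(";") can only ever produce Strings
val line = "c2;102"
val cells = line.split(";")   // Array("c2", "102") -- both Strings
val count = cells(1).toInt    // 102 as an Int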

Glad you found a solution! 👍

Noted. I run my code line by line, so when I sent you that one line from the console, that was all the console was showing; that’s why I only sent that one line.

Exactly. I’m so used to Python and other languages and have never touched Spark, Scala, or RDDs, hence the silly questions and silly mistakes. Even after reading materials about a language, actually doing things is a little tough. But each and every one of you helped; even when some of the code didn’t work, I got to know various methods of achieving the same result. Thanks for that.

also asked at https://www.reddit.com/r/scala/comments/ureyqt/sparkscala_rdd_finding_out_range_groups_from_csv/

You should take this course: https://www.coursera.org/learn/scala-spark-big-data
It will save you a huge amount of time, instead of endlessly searching the web, reading documentation, asking questions over and over. (That’s a horrible “learning technique” if it can even be called that.) Instead of tiny bits and pieces you’ll gain serious understanding from the course.

For example, transformations on RDDs like map, filter, etc. are lazy; they are not evaluated until you call an action such as .collect. This is such a basic, fundamental aspect of RDDs that it is taught immediately in the first week of that course. The fact that you didn’t know it stood out to me immediately.
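
A minimal sketch of the distinction (assuming the file from the original post):

// Transformations only build a lineage; nothing is computed here
val mapped = sc.textFile("Data/clientjobs.csv").map(_.split(";"))
println(mapped)   // prints something like: MapPartitionsRDD[3] at map at ...

// An action triggers the actual computation
val counts = mapped.map(array => array(1).toInt).collect()
counts.foreach(println)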
