Trying to use math.max and reduceByKey to get the output

Maninderpreet · May 10, 2020, 5:23am

Hi, I am trying to read a file from HDFS and use math.max and reduceByKey to get the age of the oldest person in each country.

Format of the file:
Index  Column name     Possible values
0      age             continuous (1, 2, 3, 4, …)
1      workclass       Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, etc.
2      fnlwgt          continuous
3      education       Bachelors, Some-college, 11th, HS-grad, Prof-school, etc.
4      education-num   continuous
5      marital-status  Married-civ-spouse, Divorced, Never-married, Separated, etc.
6      occupation      Tech-support, Craft-repair, Other-service, Sales, etc.
7      relationship    Wife, Own-child, Husband, Not-in-family, Other-relative, etc.
8      race            White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
9      sex             Female, Male
10     capital-gain    continuous
11     capital-loss    continuous
12     hours-per-week  continuous
13     native-country  United-States, Cambodia, England, Puerto-Rico, Canada, etc.
14     income          >50K, <=50K

I am extracting native-country, r(13), and age, r(0), from each row and dropping rows whose country contains "?":

val censusLines = sc.textFile("/user/ashhall1616/bdc_data/lab_5/census.txt")
val censusSplit = censusLines.map(_.split(", "))
val countryAge = censusSplit.map(r => (r(13), r(0).toInt)).filter(x => !x._1.contains("?")).distinct( )
val oldestPerCountry = countryAge((math.max(_)).reduceByKey(_+_)
println(oldestPerCountry .count())
oldestPerCountry.collect()

Output:

scala> val censusLines = sc.textFile("/user/ashhall1616/bdc_data/lab_5/census.txt")
censusLines: org.apache.spark.rdd.RDD[String] = /user/ashhall1616/bdc_data/lab_5/census.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> val censusSplit = censusLines.map(_.split(", "))
censusSplit: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:26
scala> val countryAge = censusSplit.map(r => (r(13), r(0).toInt)).filter(x => !x._1.contains("?")).distinct( )
countryAge: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[7] at distinct at <console>:28
scala> val oldestPerCountry = countryAge.map((math.max(_)).reduceByKey(_+_)
     | println(oldestPerCountry.count())
     | oldestPerCountry.collect()
<console>:3: error: ')' expected but '.' found.
oldestPerCountry.collect()

I am expecting the output to be like:

(Japan,61),
(Outlying-US(Guam-USVI-etc),63),
(Taiwan,61),
(Portugal,78),
(Guatemala,66)

Can someone help with it?

BalmungSan · May 10, 2020, 2:46pm

What did you expect this line:

countryAge((math.max(_)).reduceByKey(_+_)

To do?

As far as I remember, there isn’t any apply method on RDDs.
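
To spell out the point about apply: in Scala, writing `x(arg)` is shorthand for `x.apply(arg)`, so `countryAge((math.max(_))...` is read as a call to an apply method on the RDD. A minimal sketch of that rule, using a made-up class (`Ages` is hypothetical and not part of Spark) that can be pasted into a Scala REPL:

```scala
// Hypothetical class, only to illustrate how Scala treats `x(arg)`.
final class Ages(values: Vector[Int]) {
  // Defining apply is what makes `ages(i)` legal.
  def apply(i: Int): Int = values(i)
}

val ages = new Ages(Vector(61, 63, 78))

// These two calls are equivalent: `ages(2)` is sugar for `ages.apply(2)`.
val viaSugar = ages(2)
val viaApply = ages.apply(2)
```

A type without an apply method cannot be called this way, which is why the transformation has to be invoked as a named method on the RDD instead.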

BTW, the compiler error should be clear enough: you are missing one closing parenthesis before the reduceByKey.
But still, fixing that won’t solve the underlying problem.
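
For reference, one possible way to get the oldest age per country is to pass math.max as the reducing function of reduceByKey, rather than mapping with math.max and then summing with `_+_`. A minimal sketch, assuming it runs in spark-shell (where `sc` is predefined, as in the question) and using made-up sample rows in place of the census file:

```scala
// Run in spark-shell, where `sc` is the predefined SparkContext.
// The sample rows are invented for illustration; they are not from census.txt.
val sampleCountryAge = sc.parallelize(Seq(
  ("Japan", 61), ("Japan", 34),
  ("Portugal", 78), ("Portugal", 40),
  ("Taiwan", 61)
))

// reduceByKey merges the values of each key pairwise with the given function,
// so math.max keeps the larger of the two ages at every step.
val oldestPerCountry = sampleCountryAge.reduceByKey((a, b) => math.max(a, b))

// The order of the collected pairs is not guaranteed.
oldestPerCountry.collect().foreach(println)
```

With the RDD from the question, the same call would presumably be `countryAge.reduceByKey((a, b) => math.max(a, b))`. The `distinct()` step should not change the result in that case, since the maximum is unaffected by duplicate pairs.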