Hi, I am trying to read a file from HDFS and use math.max with reduceByKey to get the age of the oldest person from each country.
Format of the file:
Index  Column name     Possible values
0      age             continuous (1, 2, 3, 4, …)
1      workclass       Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, etc.
2      fnlwgt          continuous
3      education       Bachelors, Some-college, 11th, HS-grad, Prof-school, etc.
4      education-num   continuous
5      marital-status  Married-civ-spouse, Divorced, Never-married, Separated, etc.
6      occupation      Tech-support, Craft-repair, Other-service, Sales, etc.
7      relationship    Wife, Own-child, Husband, Not-in-family, Other-relative, etc.
8      race            White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
9      sex             Female, Male
10     capital-gain    continuous
11     capital-loss    continuous
12     hours-per-week  continuous
13     native-country  United-States, Cambodia, England, Puerto-Rico, Canada, etc.
14     income          >50K, <=50K
I am extracting native-country, r(13), and age, r(0), from each row and dropping rows where the country is unknown ("?"):
val censusLines = sc.textFile("/user/ashhall1616/bdc_data/lab_5/census.txt")
val censusSplit = censusLines.map(_.split(", "))
val countryAge = censusSplit.map(r => (r(13), r(0).toInt)).filter(x => !x._1.contains("?")).distinct( )
val oldestPerCountry = countryAge.map((math.max(_)).reduceByKey(_+_)
println(oldestPerCountry.count())
oldestPerCountry.collect()
Output:
scala> val censusLines = sc.textFile("/user/ashhall1616/bdc_data/lab_5/census.txt")
censusLines: org.apache.spark.rdd.RDD[String] = /user/ashhall1616/bdc_data/lab_5/census.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> val censusSplit = censusLines.map(_.split(", "))
censusSplit: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:26
scala> val countryAge = censusSplit.map(r => (r(13), r(0).toInt)).filter(x => !x._1.contains("?")).distinct( )
countryAge: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[7] at distinct at <console>:28
scala> val oldestPerCountry = countryAge.map((math.max(_)).reduceByKey(_+_)
| println(oldestPerCountry.count())
| oldestPerCountry.collect()
<console>:3: error: ')' expected but '.' found.
oldestPerCountry.collect()
I expect the output to look like this:
(Japan,61),
(Outlying-US(Guam-USVI-etc),63),
(Taiwan,61),
(Portugal,78),
(Guatemala,66)
Can someone help with this?
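
For reference, the parse error comes from the unbalanced parentheses in countryAge.map((math.max(_)).reduceByKey(_+_): the call to map( is never closed, so the REPL keeps reading the following lines as part of the same expression. Separately, reduceByKey(_+_) would add up the ages for each country rather than pick the largest one. A minimal sketch of one way to get the oldest age per country is to pass math.max as the reduce function. The object name OldestPerCountry, the helper oldestPerCountry, the local[*] master and the sample rows are assumptions made for illustration only; they are not taken from census.txt.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object OldestPerCountry {
  // reduceByKey merges the values of each key pairwise;
  // keeping the larger of the two ages leaves the maximum per country.
  def oldestPerCountry(countryAge: RDD[(String, Int)]): RDD[(String, Int)] =
    countryAge.reduceByKey((a, b) => math.max(a, b))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for a standalone run.
    // In spark-shell, `sc` already exists and this setup is not needed.
    val spark = SparkSession.builder()
      .appName("oldest-per-country")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical (country, age) pairs standing in for the parsed census rows.
    val countryAge = sc.parallelize(Seq(
      ("Japan", 61), ("Japan", 30),
      ("Taiwan", 19), ("Taiwan", 61),
      ("Portugal", 78), ("Portugal", 45)
    ))

    val result = oldestPerCountry(countryAge)
    println(result.count())
    // Sort only to make the printed order stable; collect() order is not guaranteed.
    result.collect().sortBy(_._1).foreach(println)

    spark.stop()
  }
}
```

In spark-shell, with the countryAge RDD from the question, this reduces to val oldestPerCountry = countryAge.reduceByKey((a, b) => math.max(a, b)). The distinct() call should not be needed for this, since duplicate (country, age) pairs do not change the maximum.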