I am new to Spark/Scala. I have 6 years of experience in Java/J2EE and am now interested in building an application using Spark with Scala.
I am looking for help with the following use case.
- It's a batch application (runs once a month).
- The data is structured, with 25 columns, and the input size is 60 GB.
- Read the 60 GB of input from HDFS (I am not sure whether to read plain HDFS files or retrieve the data using Spark SQL) and group by one of the 25 columns (375 million rows in total, about 22 million groups).
- Treat each group as one unit, apply different transformations to it (the logic will involve around 10 other Scala objects), and write the output back to HDFS.
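Here is a minimal local prototype of the group-then-transform step I have in mind. `Record`, `keyCol`, and the sum aggregation are placeholders for my real 25 columns and transformation objects; I believe the same shape maps onto Spark's Dataset `groupByKey`/`mapGroups`, but please correct me if that is the wrong API for this.

```scala
object GroupPrototype {
  // Placeholder for my real 25-column schema; keyCol is the grouping column.
  case class Record(keyCol: String, amount: Double)

  // One output row per group. In Spark, this body would run inside
  // Dataset.groupByKey(_.keyCol).mapGroups(transformGroup).
  // Summing amounts stands in for my real chain of transformations.
  def transformGroup(key: String, rows: Iterator[Record]): (String, Double) =
    (key, rows.map(_.amount).sum)

  // Group the input by key and transform each group, using plain Scala
  // collections so this runs without a cluster.
  def run(input: Seq[Record]): Map[String, Double] =
    input
      .groupBy(_.keyCol)
      .map { case (key, rs) => transformGroup(key, rs.iterator) }

  def main(args: Array[String]): Unit = {
    val input = Seq(Record("a", 1.0), Record("b", 2.0), Record("a", 3.0))
    // Sort by key so the output order is deterministic.
    println(run(input).toSeq.sortBy(_._1).mkString(", ")) // (a,4.0), (b,2.0)
  }
}
```

My thinking is that if the per-group logic is written as a pure function like `transformGroup`, I can unit-test it locally and then plug it into whichever Spark API (plain RDDs or Spark SQL/Datasets) turns out to be the better fit.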
To be clear, this is a batch-processing job, not a streaming application.
Please share your suggestions on how I can start the development.