Group By on huge data

Hi Team,

I have 22 million accounts (375 millions rows in multiple files in my input). Data in each line is separated by Ctrl A character("/001"). Hive table is created on top of this data with 25 columns (account_number,package,account_type and etc…)

Please suggest the best way to read 375 million rows available in multiple files and group them by account number (as i need to process account by account one after another)

Thanks in advance.


Are you asking how to do it in Spark? Or whether you should use Spark to do it?

If this is in Spark it probably depends. Do you want to group by account number just to have your data grouped by account number? Well, then there’s not much choice but to use groupBy. Do you want to group by account number to do some aggregation? Then map to account_number -> account and use reduceByKey or similar. Or some slight variation on that when you’re using DataSet.


Thanks for your reply.

yes i wanted to do this in spark / scala

My use cases is as follows:

  1. There are 100 rows and they belong to 10 accounts (each account has 10 rows)
  2. I wanted to group 100 rows as 10 where group by account_number i.e account_number mapped to List Account is a pojo to hold each row (I am new to scala and hence using java terminology)
  3. First tuple (i am referring tuple as 1 account as group of 10 rows) will be processed over 10 operators and the output will be written to file, and pick 2nd tuple and 3rd tuple and so on…

can we use groupBy to do grouping on 375 million rows?
I am not doing simple aggregation. I will write business logic in other classes and pass these tuples over them as a stream or connected dataframes. It is a batch processing but wanted to do it in stream.