I have 22 million accounts (375 million rows spread across multiple input files). The fields in each line are separated by the Ctrl-A character ("\001"). A Hive table with 25 columns (account_number, package, account_type, etc.) has been created on top of this data.
Please suggest the best way to read the 375 million rows from these files and group them by account number, as I need to process the accounts one at a time.
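To make the requirement concrete, here is a minimal single-machine sketch of the grouping I have in mind. It is only an illustration: it sorts everything in memory, so it will not scale to the real data. The sample rows, the helper names (`parse`, `group_by_account`), and the assumption that account_number is the first column are all hypothetical.

```python
from itertools import groupby

DELIM = "\x01"  # Ctrl-A; the same byte that Hive writes as '\001'


def parse(line):
    """Split one raw input line into its Ctrl-A separated fields."""
    return line.rstrip("\n").split(DELIM)


def group_by_account(lines, key_index=0):
    """Yield (account_number, rows) pairs, one account at a time.

    Assumes account_number sits at position key_index (hypothetical).
    Rows are sorted first because groupby only merges adjacent rows.
    """
    rows = sorted((parse(l) for l in lines), key=lambda r: r[key_index])
    for account, group in groupby(rows, key=lambda r: r[key_index]):
        yield account, list(group)


# Made-up sample data with 3 of the 25 columns.
sample = [
    "A2\x01gold\x01savings\n",
    "A1\x01basic\x01checking\n",
    "A2\x01silver\x01savings\n",
]

grouped = list(group_by_account(sample))
for account, rows in grouped:
    print(account, len(rows))
```

What I am looking for is the equivalent of this that works at the scale of 375 million rows.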
Thanks in advance.