Trying to clean up data from csv file

scalanoob · March 7, 2021, 1:16am

I’m new to Scala, so bear with me…
By the way, I’m using a web console (spark-shell) provided by the site where I’m learning Scala.

I need to clean up this dataset, where the delimiters and quotation marks are quite messy. So far, this is what I have:

val df = spark.read.options(Map("inferSchema" -> "true", "delimiter" -> ";", "header" -> "true")).csv("Bank.csv")

Here’s the output of the header and first row (apologies for the messiness). I need to split “age” and “job” into separate columns.
+--------------------+------------+-------------+-----------+-----------+-----------+--------+-----------+-------+---------+------------+------------+---------+------------+------------+-------+
|        "age;""job""| ""marital""|""education""|""default""|""balance""|""housing""|""loan""|""contact""|""day""|""month""|""duration""|""campaign""|""pdays""|""previous""|""poutcome""| ""y"""|
+--------------------+------------+-------------+-----------+-----------+-----------+--------+-----------+-------+---------+------------+------------+---------+------------+------------+-------+
|  "58;""management""| ""married""| ""tertiary""|     ""no""|       2143|    ""yes""|  ""no""|""unknown""|      5|  ""may""|         261|           1|       -1|           0| ""unknown""|""no"""|

This is what the file itself looks like (no " at the beginning):
age;"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"

LannyRipple · March 13, 2021, 8:08pm

It seems your post got cut off, but you should be able to call df.write.options(...).csv(...) (or possibly df.write.format("csv") with the same options; check the docs for details) and write the data back out as a cleaner, comma-separated CSV.
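
A rough sketch of that round trip for spark-shell, not a definitive fix. It assumes no field legitimately contains a double quote or a semicolon, so every quote can be stripped before parsing, whatever the exact quoting in the file turns out to be. The output directory name "Bank_clean" is made up for illustration.

```scala
// Sketch for spark-shell, where `spark` and spark.implicits._ are already in scope.
// Assumption: no field contains a literal double quote or semicolon.

// Read the file as plain text and drop every double quote.
val cleanedLines = spark.read.textFile("Bank.csv").map(_.replace("\"", ""))

// Parse the cleaned lines as semicolon-delimited CSV with a header row.
val df = spark.read.options(Map("inferSchema" -> "true", "delimiter" -> ";", "header" -> "true")).csv(cleanedLines)

df.printSchema()
df.show(5)

// Write the result back out as comma-separated CSV with a header row.
// "Bank_clean" is a hypothetical output directory.
df.write.option("header", "true").csv("Bank_clean")
```

If the quotes do carry meaning in some column, the reader's "quote" and "escape" options would be the place to look instead of stripping them.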

Note that if you just want to use Scala, and not Spark, to read your CSV, there are several Java libraries that can help. I’ve used opencsv in the past (http://opencsv.sourceforge.net/), but a Google search will turn up plenty of options.
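
For a file this simple, plain Scala may be enough even without a library. The sketch below is only an illustration under the assumption that fields never contain a literal double quote or semicolon; a real CSV library such as opencsv handles those cases properly. The names BankCsv, parseLine, and parse are made up for the example.

```scala
// Minimal sketch without Spark or a CSV library.
// Assumption: fields never contain a literal double quote or semicolon.
object BankCsv {
  // Drop every double quote, then split on ';'.
  // The -1 limit keeps trailing empty fields instead of discarding them.
  def parseLine(line: String): Vector[String] =
    line.replace("\"", "").split(";", -1).map(_.trim).toVector

  // Treat the first line as the header and pair each row's values with the column names.
  def parse(lines: Seq[String]): Seq[Map[String, String]] =
    lines.toList match {
      case header :: rows =>
        val names = parseLine(header)
        rows.map(row => names.zip(parseLine(row)).toMap)
      case Nil => Nil
    }
}
```

The lines could come from scala.io.Source.fromFile("Bank.csv").getLines().toList, after which BankCsv.parse gives one Map per row keyed by column name (all values are still strings).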