I’m new to Scala, bear with me…
By the way, I’m using the web console (spark-shell) provided by the site where I’m learning Scala.
I need to clean up this dataset, where the delimiters and quotation marks are quite messy. So far, this is what I have:
val df = spark.read.options(Map("inferSchema" -> "true", "delimiter" -> ";", "header" -> "true")).csv("Bank.csv")
Here’s the output of the header and first row (apologies for the messiness). I need to split “age” and “job” into separate columns.
+--------------------+------------+-------------+-----------+-----------+-----------+--------+-----------+-------+---------+------------+------------+---------+------------+------------+-------+
|        "age;""job""| ""marital""|""education""|""default""|""balance""|""housing""|""loan""|""contact""|""day""|""month""|""duration""|""campaign""|""pdays""|""previous""|""poutcome""| ""y"""|
+--------------------+------------+-------------+-----------+-----------+-----------+--------+-----------+-------+---------+------------+------------+---------+------------+------------+-------+
|  "58;""management""| ""married""| ""tertiary""|     ""no""|       2143|    ""yes""|  ""no""|""unknown""|      5|  ""may""|         261|           1|       -1|           0| ""unknown""|""no"""|
This is what the file looks like by itself (no " at the beginning):
age;"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"
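One direction that might work, sketched here in plain Scala so it can be tried without Spark: strip every double quote from each raw line, then split on the semicolon. This assumes that no field contains a literal ";" and that none of the quotes carry meaning, which looks true for the two lines above but is not verified for the whole file. `BankLineCleaner` and `clean` are hypothetical names made up for this illustration, not part of any library.

```scala
// Hypothetical helper (not part of Spark): normalizes one raw line of the file.
object BankLineCleaner {
  // Remove every double-quote character, then split on ';'.
  // The -1 limit keeps trailing empty fields instead of dropping them.
  // Assumption: no field contains ';' or a quote that must be preserved.
  def clean(raw: String): Array[String] =
    raw.replace("\"", "").split(";", -1).map(_.trim)

  def main(args: Array[String]): Unit = {
    // Header as the file shows it, plus a doubly-quoted variant similar to
    // what the DataFrame output suggests Spark is actually reading.
    val plain  = "age;\"job\";\"marital\""
    val nested = "\"age;\"\"job\"\";\"\"marital\"\"\""

    println(clean(plain).mkString(","))   // age,job,marital
    println(clean(nested).mkString(","))  // age,job,marital
  }
}
```

In spark-shell the same idea could probably be applied before parsing, for example by reading the file with `spark.read.textFile("Bank.csv")`, mapping each line through `_.replace("\"", "")`, and passing the resulting `Dataset[String]` to `spark.read.options(...).csv(...)`. That part is untested here and depends on the Spark version the console provides.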