Converting string to date in Spark code using native Scala

Is there an elegant way to convert a string to a date format using native Scala? I am using Scala 2.11.11 with Spark 2.3.2.

Thanks!

convert a string to a date format

by “string”, it’s a safe bet you mean java.lang.String

but when you say “date format”, exactly what type does that refer to…?

By date format, it could be mm/dd/yyyy or mon, dd, yyyy.

Thank you!

Parshu

By date format, it could be mm/dd/yyyy or mon, dd, yyyy

Contained in a String, you mean? Or in some other type? If so, exactly what type?

We’ll have a better chance at helping you if you use very precise language — whenever possible, code rather than English — to characterize what you are trying to do.

If you want to extract the format of a date from a String representing that date, that is not generally possible, due to ambiguity.

For example, “03/04” can mean “March 4”, or “3rd of April” or “March 2004” depending on context.
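
For instance, here is a minimal sketch of that ambiguity using java.time (the pattern strings and the added year are just illustrative):

import java.time.LocalDate
import java.time.format.DateTimeFormatter

val s = "03/04/2020"

// Read as US-style month/day/year: March 4, 2020
LocalDate.parse(s, DateTimeFormatter.ofPattern("MM/dd/yyyy"))  // 2020-03-04

// Read as day/month/year: 3rd of April, 2020
LocalDate.parse(s, DateTimeFormatter.ofPattern("dd/MM/yyyy"))  // 2020-04-03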

Maybe you are working in a specific context where it is possible, but then you’d have to explain it. Why do you need this?

See java.text.SimpleDateFormat, which is easy to use from Scala. With an instance of this class you can both parse a String to a Date object, and format a Date object to a String. I suspect what you may want to do is String => Date => String. That is, parse a String in your RDD/DataFrame to a Date, then format the Date to a canonical String form. There are several more full-featured open source date/time utility packages for Java & Scala, but good old java.text.SimpleDateFormat can get the job done if my understanding is correct.
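
For example, a minimal String => Date => String round trip might look like this (the pattern strings here are placeholders for whatever your data actually uses; note that SimpleDateFormat instances are mutable and not thread-safe):

import java.text.SimpleDateFormat
import java.util.Date

val inFmt  = new SimpleDateFormat("yyyy-MM-dd")   // format of the incoming String
val outFmt = new SimpleDateFormat("MM/dd/yyyy")   // canonical output format

val d: Date = inFmt.parse("2020-04-20")           // String => Date
val pretty: String = outFmt.format(d)             // Date => String, "04/20/2020"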

Best regards,

Brian Maso

If that is what you want to do I would recommend going with the newer and better designed java.time.LocalDate and java.time.format.DateTimeFormatter. Those classes are also immutable and thread-safe. Especially in Spark that’s a desirable property.
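
For example, the same round trip with java.time (again, the patterns are placeholders):

import java.time.LocalDate
import java.time.format.DateTimeFormatter

val inFmt  = DateTimeFormatter.ofPattern("yyyy-MM-dd")
val outFmt = DateTimeFormatter.ofPattern("MM/dd/yyyy")

val date: LocalDate = LocalDate.parse("2020-04-20", inFmt)  // String => LocalDate
val pretty: String  = date.format(outFmt)                   // LocalDate => String, "04/20/2020"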

One caveat: DateTimeFormatter is not Serializable. Easy enough to work around, though.

See this stackoverflow for more info: https://stackoverflow.com/questions/36132451/spark-and-not-serializable-datetimeformatter
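
One common workaround, sketched below, is to ship only the pattern String and rebuild the formatter lazily on each executor (the SerializableFormatter name is just something I made up for illustration):

import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical helper: only the pattern String is serialized with the
// closure; the @transient lazy val is rebuilt on each executor when
// it is first used, so the formatter itself never crosses the wire.
class SerializableFormatter(pattern: String) extends Serializable {
  @transient lazy val formatter: DateTimeFormatter =
    DateTimeFormatter.ofPattern(pattern)
}

val fmt = new SerializableFormatter("yyyyMMdd")
// e.g. rdd.map(s => LocalDate.parse(s, fmt.formatter))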

Thank you, Everyone!!! I am learning so much!
I am reading from a csv with a date like 18000101, i.e. a string in the format yyyyMMdd. Since I am just learning Scala and using it with Spark, I am interested in soaking up as much as I can, so I wanted to convert the date to a more user-friendly format for display. I apologize for not being more precise in my question. Here are the relevant chunks of the code I have now:
import java.time._
import java.time.format.DateTimeFormatter

val filedte = fields(1).toString
val date1 = LocalDate.parse(filedte, DateTimeFormatter.ofPattern("yyyyMMdd"))

for (result <- results.sorted) {
  val day = result._1
  val prcp = result._2
  println(s"$day maximum precipitation: $prcp")
}
The code ran and produced the result. I have yet to format the date in the output, but I do get warnings for the line for (result <- results.sorted), namely that:

  • not enough arguments specified for the method sorted and
  • there is no implicit ordering defined for (java.time.LocalDate, int)

The first error is caused by the second. You can read this discussion on stackoverflow about that.

Thank you! I added the following lines to my code:
type AsComparable[A] = A => Comparable[_ >: A]

implicit def ordered[A: AsComparable]: Ordering[A] = new Ordering[A] {
  def compare(x: A, y: A): Int = x compareTo y
}

Now the warning says that there is an implicit ordering, but it is still looking for an argument…

Now the warning says that there is an implicit ordering, but it is still looking for an argument…

I’ve lost the narrative here.

Exactly what code is producing that warning? (How would I reproduce the problem on my own computer?)

And exactly what does the warning say? (You should never paraphrase error messages; always quote them in full. In paraphrasing, you’re likely to leave out important information that would help others help you.)

Hi Seth,

I’ve attached a copy of the csv file used by the code below.

The code producing the warning is near the bottom of this program (marked with ***):

package com.sundogsoftware.spark

import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.log4j._
import scala.math.max
import java.time._
import java.time.format.DateTimeFormatter

/** Find the maximum temperature by weather station */
object MostPRCPDay {

  def parseLine(line: String) = {
    val fields = line.split(",")
    //val sc1 = new SparkContext()
    //val df = sc1.parallelize(ListString).toDF("filedte")
    // parse string to date
    //val result = df.select(to_date(df("dates"), "yyyyMMdd"))
    val date = LocalDate.parse("2018-05-05")
    val filedte = fields(1).toString
    //val date1 = LocalDate.parse(filedte, DateTimeFormatter.BASIC_ISO_DATE)
    //val date1 = LocalDate.parse(filedte, DateTimeFormatter.ofPattern("mm/dd/yyyy"))
    //val date1 = LocalDate.parse(filedte, DateTimeFormatter.ofPattern("yyyymmdd"))
    val date1 = LocalDate.parse(filedte, DateTimeFormatter.ofPattern("yyyyMMdd"))

    //val dte = to_date($"filedte", "MM/dd/yyyy")

    val entryType = fields(2)
    val prcp = fields(3).toInt
    (date1, entryType, prcp)
  }

  /** Our main function where the action happens */
  def main(args: Array[String]) {

    // Set the log level to only print errors
    Logger.getLogger("org").setLevel(Level.ERROR)

    // Create a SparkContext using every core of the local machine
    val sc = new SparkContext("local[*]", "MostPRCPDay")

    // Read each line of input data
    val lines = sc.textFile("../1800.csv")

    // Convert to (Day, entryType, prcp) tuples
    val parsedLines = lines.map(parseLine)

    // Filter out all but PRCP entries
    val maxPrcps = parsedLines.filter(x => x._2 == "PRCP")

    // Convert to (Day, prcp)
    val dayPrcps = maxPrcps.map(x => (x._1, x._3.toInt))

    // Reduce by Day retaining the maximum prcp found
    val maxPrcpsByDay = dayPrcps.reduceByKey((x, y) => max(x, y))

    // Collect, format, and print the results
    val results = maxPrcpsByDay.collect()

    //*** warnings are produced by the line below: ***

    for (result <- results.sorted) {
      val day = result._1
      day.format(DateTimeFormatter.ofPattern("yyyy.MM.dd"))
      val prcp = result._2
      //val formattedTemp = f"$temp%.2f F"
      println(s"$day maximum precipitation: $prcp")
    }
  }
}

The error messages are:

Thank you,

Parshu

(Attachment 1800.csv is missing)

Here’s a minimal reproduction of the problem:

scala 2.11.12> List(java.time.LocalDate.now).sorted
<console>:12: error: No implicit Ordering defined for java.time.LocalDate.
               List(java.time.LocalDate.now).sorted
                                             ^

You have two choices:

You can provide an implicit Ordering yourself, as in https://stackoverflow.com/questions/38059191/how-make-implicit-ordered-on-java-time-localdate/38059584#38059584

Or, you can use sortBy instead of sorted, and specify a sort order that way.

I added the following lines to the code:
type AsComparable[A] = A => Comparable[_ >: A]

implicit def ordered[A: AsComparable]: Ordering[A] = new Ordering[A] {
  def compare(x: A, y: A): Int = x compareTo y
}

Full program:

package com.sundogsoftware.spark

import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.log4j._
import scala.math.max
import java.time._
import java.time.format.DateTimeFormatter
import scala.math.Ordering.Implicits._

/** Find the maximum temperature by weather station */
object MostPRCPDay {

  def parseLine(line: String) = {
    val fields = line.split(",")
    //val sc1 = new SparkContext()
    //val df = sc1.parallelize(ListString).toDF("filedte")
    // parse string to date
    //val result = df.select(to_date(df("dates"), "yyyyMMdd"))
    val date = LocalDate.parse("2018-05-05")
    val filedte = fields(1).toString
    //val date1 = LocalDate.parse(filedte, DateTimeFormatter.BASIC_ISO_DATE)
    //val date1 = LocalDate.parse(filedte, DateTimeFormatter.ofPattern("mm/dd/yyyy"))
    //val date1 = LocalDate.parse(filedte, DateTimeFormatter.ofPattern("yyyymmdd"))
    val date1 = LocalDate.parse(filedte, DateTimeFormatter.ofPattern("yyyyMMdd"))

    //val dte = to_date($"filedte", "MM/dd/yyyy")

    val entryType = fields(2)
    val prcp = fields(3).toInt
    (date1, entryType, prcp)
  }

  /** Our main function where the action happens */
  def main(args: Array[String]) {

    // Set the log level to only print errors
    Logger.getLogger("org").setLevel(Level.ERROR)

    // Create a SparkContext using every core of the local machine
    val sc = new SparkContext("local[*]", "MostPRCPDay")

    type AsComparable[A] = A => Comparable[_ >: A]

    implicit def ordered[A: AsComparable]: Ordering[A] = new Ordering[A] {
      def compare(x: A, y: A): Int = x compareTo y
    }

    // Read each line of input data
    val lines = sc.textFile("../1800.csv")

    // Convert to (Day, entryType, prcp) tuples
    val parsedLines = lines.map(parseLine)

    // Filter out all but PRCP entries
    val maxPrcps = parsedLines.filter(x => x._2 == "PRCP")

    // Convert to (Day, prcp)
    val dayPrcps = maxPrcps.map(x => (x._1, x._3.toInt))

    // Reduce by Day retaining the maximum prcp found
    val maxPrcpsByDay = dayPrcps.reduceByKey((x, y) => max(x, y))

    // Collect, format, and print the results
    val results = maxPrcpsByDay.collect()

    for (result <- results.sorted) {
      val day = result._1
      day.format(DateTimeFormatter.ofPattern("yyyy.MM.dd"))
      val prcp = result._2
      //val formattedTemp = f"$temp%.2f F"
      println(s"$day maximum precipitation: $prcp")
    }
  }
}

and still get the following warnings:

Please also tell me how I may specify the order in sortBy.

Thanks!

Parshu

works for me:

scala 2.11.12> List(java.time.LocalDate.now).sorted
<console>:12: error: No implicit Ordering defined for java.time.LocalDate.
               List(java.time.LocalDate.now).sorted
                                             ^

scala 2.11.12> type AsComparable[A] = A => Comparable[_ >: A]
defined type alias AsComparable

scala 2.11.12> implicit def ordered[A: AsComparable]: Ordering[A] = new Ordering[A] {
             | def compare(x: A, y: A): Int = x compareTo y
             | }
ordered: [A](implicit evidence$1: AsComparable[A])Ordering[A]

scala 2.11.12> List(java.time.LocalDate.now).sorted
res1: List[java.time.LocalDate] = List(2020-04-20)

Thank you, Seth. My code ran as well, but I wasn’t sure if I can just ignore the warning or if there is a way to avoid it.

Thanks,

Parshu

Are those messages coming from some kind of IDE?

Here’s an example of using sortBy:

scala 2.13.1> List("fox", "quick", "jumped", "the", "brown")
res2: List[String] = List(fox, quick, jumped, the, brown)

scala 2.13.1> res2.sortBy(_.length)
res3: List[String] = List(fox, the, quick, brown, jumped)
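
Applied to the (LocalDate, Int) tuples in your program, that might look something like the following sketch (LocalDate.toEpochDay returns a Long, which already has an implicit Ordering, so no custom Ordering for LocalDate is needed):

// results: Array[(java.time.LocalDate, Int)]
for (result <- results.sortBy(_._1.toEpochDay)) {
  val (day, prcp) = result
  println(s"$day maximum precipitation: $prcp")
}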

Yes - those messages are from Eclipse…