Refactoring and output data

Hello,
I’d like to know if the method readData() which may have Null output could be refactored ?
Also, is it better to a separate readData outside of readSetOfData ?
Thank you very much

def readData(
originalDi: String,
caseID: String,
mFolder: String,
lFolder: String,
name: String)
: (File, DataType1, File, DataType2, String) = {

val mDir: String = originalDir + caseID + “/” + mFolder
val lDir = new File(originalDir + caseID + “/” + lFolder)
val mFile = new File(mDir, name)

if (mFile.exists() & lDir.exists()) {
val lFile = lDir.listFiles().filter(f => f.getName.endsWith(".extension"))(0)
val data1: DataType1 = readData1(mFile).get
val data2: DataType2 = readData2(lmFile).get
(mFile, data1, lFile, data2, caseID)
}
else (null, null, null, null, null)
}

=====
def readSetOfData(
originalDir: String,
mFolder: String,
lFolder: String,
name: String)
: Seq[(File, DataType1, File, DataType2, String)] =
for {
db ← ReadDatabase()
(mFile, data1, lFile, data2, caseID) = readData(originalDir, db.ID, mFolder, lFolder, name)
if data1 != null
} yield (mFile, data1, lFile, data2, caseID)

This code is very difficult to follow. And it appears to also be incomplete as it is referring to types and methods not included.

IOW, it is too much work to try and tease out what the code is, what your intention is, and then how to possibly model it in idiomatic Scala for you.

Please consider reconstructing your question by providing some actually compilable code in Scastie.

Here’s your code copied into Scastie.

Please clean up the code at this link, improve its focus so your question becomes more specific, and then reply with your improved question including the new Scastie link (click “Save” after you are done editing, and the new URL is the link to your improved code example).

That will improve the odds someone would be willing to volunteer their time to help you navigate your difficulties.

2 Likes

Hi.

I think some general advice can be given without looking into too much details here.

First, never use null in Scala (only for Java interop). If your function might not always return a value, encode this in its return type. There are some alternatives here (Try, Option, Either). If you are not interested in why the function failed to compute a value, just use Option:

def readData(originalDi: String, caseID: String, mFolder: String, lFolder: String, name: String)
: Option[(File, DataType1, File, DataType2, String)] = {
  if (...) {
    Some((mFile, data1, lFile, data2, caseID))
  } else {
    None
  }
}

Then, it seems readData1 and readData2 already return an Option, and you’re calling get on their return values. This is a code smell. Never call get on an Option, unless you checked that it is not None.

This would throw an exception if no matching file exists in that directory. In Scala, we try to avoid exceptions if possible (use Array.find instead of Array.filter(...)(0) ). What is your expectation about that file? Do you really want to get any file having a .extension suffix? Note that File.listFiles does not specify the ordering of the resulting array.

In general, avoid checking for the existence of files prior to reading them. Just try to read the file. It will fail if the file does not exist, or if you cannot read it, or have no permission to do so. And even if File.exists returns true, the file could have been removed between the check and your attempt to read it. So there’s no use in checking whether the file exists in the first place.

Throwing in some syntactic sugar (the for comprehension), I would refactor the readData function to:

def readData(originalDi: String, caseID: String, mFolder: String, lFolder: String, name: String)
: Option[(File, DataType1, File, DataType2, String)] = {
  val mDir = new File(originalDir + caseID, mFolder)
  val lDir = new File(originalDir + caseID, lFolder)
  val mFile = new File(mDir, name)

  for {
    files <- Option(lDir.listFiles())
    lFile <- files.find(f => f.getName.endsWith(".extension"))
    data1 <- readData1(mFile)
    data2 <- readData2(lmFile)
  } yield {
    (mFile, data1, lFile, data2, caseID)
  }

(and I would change the name of the function, what is “data” anyways?)

Also, the java.nio.file classes should be preferred over the old java.io.File API.

1 Like

Hi,
Thank you so much for all the advices. I am new to Scala and I tried to do my best but I’d like to improve the quality of my codes: your advices are gold.
For confidentiality reasons, I couldn’t show here the real codes with all the variables. Thank you for just considering the codes as they are.
Best regards

Hi,
I am struggling on how to use Option in a for … yield in order to get a Sequence.
I want to call a function (readSingleData) in another function readSetOfData() where each case is obtained with listOfCases and I need to check wether the file to be read, mFile, exists or not.
I could use a filter (e.g, by adding if mFile.exists() before the yielding), but I want to use None in order to have clean codes (as said previously).
Here is my code:

I still don’t understand where to put the Option when used in Sequence.

As far as I understand, I need to use pattern matching but I don’t know where to use it: is it inside the readSingleData() or not and how ? BTW, I got an error here.
Could you have a look please at my codes and give advices ?
Thank you very much,
Best regards

Hi @Maia ,

please do not post your code as a picture, that makes it hard to copy… Even better, just prepare a minimal example on scastie and paste the link here, so that we can fiddle with it.

Apparently, you steered off your original path quite a bit. For example, from its prototype readData1 is given a File and now always returns a DataType1. It seems the function can no longer fail… ?!

What you are asking is how to combine List and Option. Remember, that the for comprehension is just syntactic sugar for map and flatMap calls:

val xs: List[Int] = List(1, 2, 3)
val ys: List[Int] = List(4, 5, 6)
for {
  x <- xs
  y <- ys
} yield x + y // returns List(5, 6, 7, 6, 7, 8, 7, 8, 9): List[Int]

is equivalent to:

xs.flatMap(x => ys.map(y => x + y))

In your case, ys is an Option[(File, DataType1, String)] and xs is a List[String]. Now, both of these classes have map and also flatMap methods:

// from Option:
final def map[B](f: (A) => B): Option[B]

// from List:
final def flatMap[B](f: (A) => IterableOnce[B]): List[B]

Notice that mapping over an Option returns an Option again.

Notice that the argument type of List’s flatMap does not require the given function f to return a List again, it only requires an IterableOnce[B]! Luckily, an Option just happens to be a IterableOnce[B]:

sealed abstract class Option[+A] extends IterableOnce[A] ...

So, in order to combine, you can just call those functions:

listOfCases.flatMap(idCase =>
  readSingleData(originalDir, idCase, mFolder, name).map(data => data)
)

(Note, you tried to “unpack” the return value from readSingleData and then you yielded a new tuple with the same values again. This is equivalent to calling map with the identity function, what is what I did above.)

Transforming this into the for comprehension again:

for {
  idCase <- listOfCases
  data <- readSingleData(originalDir, idCase, mFolder, name)
} yield data
1 Like

Hi @cbley
Thank you again and sorry for the picture.
Maybe I had to explain the aim of my problem the first time …
In fact, I have a list of records in a directory (a database) and each case, identified by caseID, has several fields with different types. Sometimes, a field or more may miss and I want to identify those cases (either with missing fields or not). Some fields are related to physical data. I created a csv file to record the list of cases with a set of fields. It is similar to a relational database as stated by Oracle here:
https://www.oracle.com/ca-en/database/what-is-a-relational-database/
" A relational database is a type of database that stores and provides access to data points that are related to one another. Relational databases are based on the relational model, an intuitive, straightforward way of representing data in tables. In a relational database, each row in the table is a record with a unique ID called the key. The columns of the table hold attributes of the data, and each record usually has a value for each attribute, making it easy to establish the relationships among data points."

ex:
case1/field1/data1.txt
case1/field2/data2.xyz …

case2/field1/data1.txt
case2/field2/data2.xyz …

Therefore, while using Option, I’d like to identify the cases with None as well and read the cases with Some(type) with a separate function to read some field (e.g. readData1(filename: File) is used to read the file in /field1/data1.txt, where filename (data1.txt) is the name of the file to be read in the database) . I read here that I need to use pattern matching for that:
https://www.cis.upenn.edu/~matuszek/cis554-2011/Pages/scalas-option-type.html

Until now, I had no problem while using Null but I want to refactor my codes so that it is more Scala (I come from Matlab).
I hope it is clearer now (sorry that I didn’t explain it very well the first time).

How could I get the None cases with the solution you provided ?
Does this solution work if there are missing files to be read ?

Thank you very much,
Best regards,

Sorry, you kind of lost me here…

Could you maybe show in pseudo-code what you want to achieve? What should be the output of the function(s)? What should happen if readSingleData returns None?

Maybe you want the readSetOfData function to return a Map[String, Option[(File, DataType1)]] instead so that you can handle missing / invalid cases at the caller’s side?

@cbley
Sorry …
I’d like to get two lists of cases: one list of cases with missing data (ex: missing files), and another one with cases which have all data.
My question is how to use Option (Some and None) in a sequence: I found simple example with pattern matching but I am lost when I want to use in a sequence.
Here is what I want to achieve:

for each case in caseIDsList
currentFilename ← create_case_filename(idCase, folder)
currentDataPair ← readSinglePair(idCase, currentFilename, folder)
if currentFilename_does_not_exist
invalidCaseList ← idCase
if currentFilename_exists
validCaseList ← (idCase, currentDataPair)

Here is another attempt:

    case class DataType1(f: File, a: String)
    def readData1(file: File): DataType1 = ???

    def readSinglePair(caseID: String,  bName: String, mFolder: String)
    : Option[(File, DataType1, String)] = {
      try {
        val mDir: String = caseID + "/" + mFolder
        val mFile = new File(mDir, bName)
        val mData: DataType1 = readData1(mFile)
        Some(mFile, mData, caseID)
      } catch {
        case e: Exception => None
      }

      def readSetOfData(caseIDsList: List[String], bName: String, mFolder: String)
      : Seq[(File, DataType1, String)] =
        for {
          idCase <- caseIDsList
          (mFile, data, caseID) = readSinglePair(idCase, mFolder, bName)
        } yield (mFile, data, caseID)

I got error in readSetOfData() and I don’t know where to put Option and how to use it in those codes.
Your idea of using Map is interesting also, maybe I’m looking to very complicated codes (I am really new to Scala).
Thank you,
Best regards

You get an error, because of this line:

(mFile, data, caseID) = readSinglePair(idCase, mFolder, bName)

Here, readSinglePair returns an Option[(File, DataType1, String)]. But writing it this way, Scala expects the right-hand side of the = to be just the tuple, not wrapped in an Option. Handling errors with Option takes some getting used to, because things that can fail don’t have the same type any longer as things that can’t (which is good. When using nulls, you can’t see from a method’s signature, that it can fail and return null).
If you would want to only keep the good cases, you could replace = with <- here, which would then use the map function on option, like @cbley explained. map only does something, if there is a value in the Option, i.e. if it is a Some, and otherwise just returns None again, which when combined with a for that starts with a list is interpreted as an empty sequence.

But if I understand you correctly, your function should return a second sequence of all ids of failed cases too, correct?

In that case, you’ll probably want an Either instead of Option in the readSetOfData method. An Option is normally used, when there is only one way something can fail or you aren’t interested in how it failed (e.g. if in your readSinglePair you don’t care which exception occurred, you could keep Option).

An Either is similar to an Option, except that its error case can contain a value, it is either a Right (analogous to Some) or a Left (analogous to None, but containing a value). In your case, a Left would contain the id of a failed case.

As you already have a method that returns an Option for a single file, you can use the Option's conversion method toRight:

for {
  idCase <- caseIDsList
} yield readSinglePair(idCase, mFolder, bName).toRight(idCase)

If readSinglePair returns a Some, toRight converts it to a Right with the same contents. Otherwise, it returns a Left containing the given parameter, in this case its id. This for would then return a List[Either[String, (File, DataType1, String)]].

Now, as you want a separat validCaseList and invalidCaseList, you’d have to split this again, but there is a better way. List provides a method, that for each element runs a given function returning an Either and sorts the result into two lists: partitionMap.

With that, the method would look like this:

def readSetOfData(caseIDsList: List[String], bName: String, mFolder: String)
: (List[String], List[(File, DataType1, String)]) =
  caseIDsList.partitionMap(
    idCase => readSinglePair(idCase, mFolder, bName).toRight(idCase)
  )

Note that it now returns a tuple of two lists, one with the failed ids and one with the successful results:
It runs readSinglePair for each id in the input list. If it fails, the toRight converts it to a Left(idCase) and it ends up in the first resulting list, if it works, the toRight keeps the contents of the Option and it ends up in the second list.

1 Like

@crater2150 , @cbley , @chaotic3quilibrium
Wow, I really appreciate all your help.
I am going to try all those ideas. I’ll let you know how it goes.
Thank you again very much !

@crater2150
Something weird: partitionMap is not recognized by IntelliJ IDEA. (I put the image just to highlight the error):
The following codes work on REPL and there is no error on Scastie:
image

What is happening here ?
I have scalaVersion := “2.12.8”

The whole codes with Either and partitionMap:

> case class DataType1(f: File, a: String)
def readData1(file: File): DataType1 = ???

def readSinglePair(caseID: String, bName: String, mFolder: String)
: Either[String, (File, DataType1, String)] = {
  try {
    val mDir: String = caseID + "/" + mFolder
    val mFile = new File(mDir, bName)
    val mData: DataType1 = readData1(mFile)
    Right(mFile, mData, caseID)
  } catch {
    case e: Exception => Left(caseID)
  }
}

def readSetOfData(caseIDsList: List[String], bName: String, mFolder: String)
: (List[String], List[(File, DataType1, String)]) =
caseIDsList.partitionMap(
idCase => readSinglePair(idCase, bName, mFolder).toRight(idCase)
)

General rule: never, ever trust errors from IntelliJ.

IntelliJ is a good IDE, but it includes its own presentation compiler, entirely separate from the real one. It works 99% of the time, but is prone to occasional mistakes, and particularly to false negatives like this, where it reports an error for something that is actually just fine.

When you find yourself scratching your head at an IntelliJ error like this, try compiling with sbt (or whatever your build tool is), which is more authoritative – that will tell you whether you actually have a bug, or whether IntelliJ is just confused.

(This is one of the primary advantages of using Metals plus VSCode or some other editor that is Metals-compatible – those talk directly to the real Scala compiler, so their error messages are more reliable.)

2 Likes

Note that Intellij now uses the compiler for Scala 3 errors but I agree: never trust your IDE’s suggestions/errors if you don’t understand them. Trust your compiler, then your own understanding first.

Note: IDEA’s compiler-based inspection is experimental.

1 Like

Ah, in this case, it isn’t the IDE’s fault (although the others are right about being wary of IntelliJ’s errors). partitionMap was added in 2.13. Your REPL probably uses a newer version than your project. If you use SBT for your project, you can run console from the SBT shell to get a REPL with the same version your project is using (and with access to your project code).

If you have to use 2.12, your code will have to be a bit longer. You could use map and partition, but that would give you two List[Either[...]]. Instead, I would solve this with a fold:

   def readSetOfData(caseIDsList: List[String], bName: String, mFolder: String)
  : (List[String], List[(File, DataType1, String)]) =
    caseIDsList.foldRight(List.empty[String], List.empty[(File, DataType1, String)]){
      case (idCase, (invalid, valid)) =>
        readSinglePair(idCase, mFolder, bName) match {
          case None =>          (idCase :: invalid,           valid)
          case Some(result) =>  (          invalid, result :: valid)
        }
    }

Folding a list combines it elements with an accumulator, in this case a tuple of two lists (one for the valid cases, one for the invalid ones). As first parameter, foldRight takes the start value for the accumulator, we start with empty lists. The function we give it receives an element from the list and the accumulator, and has to return a new value for the accumulator. Here we either append the id to the invalid list, if the reading fails, or append the result to the valid list, passing the other list on unchanged.

1 Like

@jducoeur , @Iltotore , @cbley
Thank you for advice. @cbley is right: the error comes from my built.sbt project (2.12.8): I am working with an API that uses this Scala version and I can’t change it right now as I am afraid that I broke my whole project. My scala version for REPL is 2.13.2 … but I will be careful with my IDE from no on.

1 Like

Hi @cbley
Thank you but I have error on None and Some(result).
https://scastie.scala-lang.org/FeYVfLuqSJqPqe9lDpau3A

@cbley
Sorry … I changed the signature of readSinglePair() that’s why I got the error on None and Some(result). Everything is ok with:
https://scastie.scala-lang.org/JyIr2YHBS4y0sJ05y44xXQ

Thank you a LOT for your help.
I’m reading and studying Functional programming of Alvin Alexander: I still have a lot to learn !

Could you explain this line please ?
Thanks