How to next in for loop

pengyh · February 10, 2022, 3:22am

x match {
  case pattern(a) => next | continue // how do I skip to the next iteration here?
  case pattern(b) => do_with(b)
  case _  => _ // how do I do nothing here?
}

I am reading a big text file. If the regex pattern matches a, I want to skip to the next iteration of the loop.
I only want to capture b.
If the line matches anything else, how do I just ignore it (do nothing)?

Thank you
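
For readers with the same question, here is a minimal sketch, not a definitive answer. Scala has no `next` or `continue` keyword; a `match` arm that evaluates to `()` does nothing and the loop moves on to the next element. The names `SkipExample`, `skipPattern`, `keepPattern` and both regexes are hypothetical, chosen only for illustration.

```scala
object SkipExample {
  // Hypothetical patterns for illustration; they are not from the original post.
  private val skipPattern = """#(.*)""".r      // lines we want to skip
  private val keepPattern = """(\w+)=\d+""".r  // lines whose first group we want

  // Variant 1: match inside the loop; arms that return () do nothing.
  def keysWithMatch(lines: Iterator[String]): List[String] = {
    val out = List.newBuilder[String]
    for (line <- lines) line match {
      case skipPattern(_) => ()        // plays the role of "next"
      case keepPattern(b) => out += b  // capture b
      case _              => ()        // ignore everything else
    }
    out.result()
  }

  // Variant 2: collect keeps only the inputs the partial function handles.
  // Equivalent to variant 1 as long as no line matches both patterns.
  def keysWithCollect(lines: Iterator[String]): List[String] =
    lines.collect { case keepPattern(b) => b }.toList
}
```

Note that a regex used as a pattern in `case` must match the whole line, which matters for the rest of this thread.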

pengyh · February 10, 2022, 4:48am

Hello,

I changed my approach to get the same result, using this code
(msg.txt is several gigabytes in size):

val patt1 = """^[^0-9a-zA-Z\s]"""
val patt2 = """^[a-z0-9]+$"""

val lines = Source.fromFile("msg.txt").getLines().filter(! _.matches(patt1))

for (x <- lines) {
  x.split("""\s+""").map(_.toLowerCase).filter(_.matches(patt2)).filter(_.size < 30).foreach {println}
}

But the result is very different from the Perl script that is used in production:

open HDW,">","words.txt" or die $!;
open HD,"msg.txt" or die $!;

while(<HD>) {
  next if /^[^0-9a-zA-Z\s]/;
  chomp;
  my @words = split/\s+/,$_;
  for my $w (@words) {
    $w=lc($w);
    if ($w=~/^[a-z0-9]+$/ and length($w) < 30){
       print HDW $w,"\n";
    }
  }
}

Can you help me find the problem? Why are the results so different?

BTW, I added some code to the Scala program to handle the UTF-8 issue:

import scala.io.Codec
import java.nio.charset.CodingErrorAction

implicit val codec = Codec("UTF-8")
codec.onMalformedInput(CodingErrorAction.REPLACE)
codec.onUnmappableCharacter(CodingErrorAction.REPLACE)

Thank you

pengyh · February 10, 2022, 5:16am

I found the reason. Changing just one line makes it work:

val patt1 = """^[^0-9a-zA-Z\s].*$"""

So I learned a lesson: Scala’s String.matches (which is Java’s method) must match the whole string, while a Perl match can match just part of it.

Such as this:

scala> val str = "hello word"
val str: String = hello word

scala> str.matches("^hello")
val res0: Boolean = false

For this regex, Scala reports no match.
But Perl reports a match:

$ perl -le '$str ="hello word"; print "true" if $str=~ /^hello/'
true

Thanks !
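
As a small sketch of the alternative to padding the pattern with `.*$`: on the JVM, `Matcher.find` and Scala's `Regex.findFirstIn` search for a match anywhere in the string, which is closer to what Perl's `=~` does. The object name `PartialMatch` is only illustrative.

```scala
import java.util.regex.Pattern

object PartialMatch {
  val str = "hello word"

  // String.matches requires the whole string to match the pattern.
  val full: Boolean = str.matches("^hello")

  // Matcher.find looks for a match anywhere in the input.
  val partialJava: Boolean = Pattern.compile("^hello").matcher(str).find()

  // scala.util.matching.Regex offers the same kind of search via findFirstIn.
  val partialScala: Boolean = "^hello".r.findFirstIn(str).isDefined
}
```

Here `full` is false while both `partialJava` and `partialScala` are true.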

pengyh · February 10, 2022, 5:19am

BTW, I want to share the running times here. Even though the Scala code is compiled to a class, it’s much slower than the Perl script.

$ scalac -Xscript SplitWords words-parse.scala 
$ time scala SplitWords > scala-words.txt 

real	0m36.858s
user	0m25.494s
sys	0m13.449s

$ time perl words-parse.pl 

real	0m12.115s
user	0m11.770s
sys	0m0.184s

cbley · February 10, 2022, 8:03am

That’s probably because String.matches parses and compiles your patterns again for every line in the file, which is very wasteful, whereas Perl treats your regexes as singletons since it has a special syntax for them.

val patt1 = """[^0-9a-zA-Z\s].*""".r
val patt2 = """[a-z0-9]+""".r
val ws = """\s+""".r

val lines = Source.fromFile("msg.txt").getLines()

for {
  line <- lines
  if ! patt1.matches(line)
  word <- ws.split(line)
  if patt2.matches(word.toLowerCase) && word.size < 30
} {
  println(word)
}

pengyh · February 10, 2022, 10:35am

@cbley many thanks for your suggestions.

In my test, your code was an improvement, but not by much.
Please see below:

$ scalac -Xscript Hackwords words-parse3.scala 
$ time scala Hackwords > words3.txt

real	0m29.338s
user	0m17.411s
sys	0m13.278s

$ cat words-parse3.scala 
import scala.io.Source
import scala.io.Codec
import java.nio.charset.CodingErrorAction

implicit val codec = Codec("UTF-8")
codec.onMalformedInput(CodingErrorAction.REPLACE)
codec.onUnmappableCharacter(CodingErrorAction.REPLACE)

val patt1 = """[^0-9a-zA-Z\s].*""".r
val patt2 = """[a-z0-9]+""".r
val ws = """\s+""".r

val lines = Source.fromFile("msg.txt").getLines()

for {
  line <- lines
  if ! patt1.matches(line)
  word <- ws.split(line)
  if patt2.matches(word.toLowerCase) && word.size < 30
} {
  println(word)
}

Any further suggestions?

Thanks!

som-snytt · February 10, 2022, 10:51am

One peculiarity is that the constructor of a Scala object won’t be optimized on the JDK, so don’t run your code as an object constructor. The REPL used to run snippets that way, and it was slow.
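
A minimal sketch of what that advice could look like, assuming Scala 2.13 or later (for `Regex.matches`): the per-line work moves into an ordinary method that `main` calls, instead of sitting in a script body or object initializer. The names `SplitWordsMain` and `extractWords` are hypothetical. Unlike the code posted above, this version emits the lowercased word, as the Perl script does. Whether it closes the gap to Perl has not been measured here.

```scala
import scala.io.{Codec, Source}
import java.nio.charset.CodingErrorAction

object SplitWordsMain {
  // Compiled once and reused for every line.
  private val skipLine = """[^0-9a-zA-Z\s].*""".r
  private val wordOk   = """[a-z0-9]+""".r
  private val ws       = """\s+""".r

  // The hot loop lives in an ordinary method, not in the object's initializer.
  def extractWords(lines: Iterator[String]): Iterator[String] =
    for {
      line <- lines
      if !skipLine.matches(line)            // skip lines starting with a non-alphanumeric, non-space char
      word <- ws.split(line).iterator
      lower = word.toLowerCase
      if lower.length < 30 && wordOk.matches(lower)
    } yield lower

  def main(args: Array[String]): Unit = {
    // Same lenient UTF-8 decoding as in the thread, passed explicitly.
    val codec = Codec("UTF-8")
      .onMalformedInput(CodingErrorAction.REPLACE)
      .onUnmappableCharacter(CodingErrorAction.REPLACE)
    val source = Source.fromFile("msg.txt")(codec)
    try extractWords(source.getLines()).foreach(println)
    finally source.close()
  }
}
```

Compiled with plain `scalac` (without `-Xscript`) and run as `scala SplitWordsMain`, the loop then executes inside regular methods.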