x match {
  case pattern(a) => next | continue // how do I skip to the next iteration of the loop?
  case pattern(b) => do_with(b)
  case _ => _ // how do I do nothing here?
}
I am reading a big text file. If the regex pattern matches a, I want to skip to the next iteration of the loop.
I only want to capture b.
If the line matches anything else, how do I just ignore it (do nothing)?
I have changed my approach to get the same result with the code below
(msg.txt is huge, gigabytes in size):
val patt1 = """^[^0-9a-zA-Z\s]"""
val patt2 = """^[a-z0-9]+$"""
val lines = Source.fromFile("msg.txt").getLines().filter(! _.matches(patt1))
for (x <- lines) {
x.split("""\s+""").map(_.toLowerCase).filter(_.matches(patt2)).filter(_.size < 30).foreach {println}
}
But the result is very different from that of the Perl script used in production:
open HDW,">","words.txt" or die $!;
open HD,"msg.txt" or die $!;
while(<HD>) {
next if /^[^0-9a-zA-Z\s]/;
chomp;
my @words = split/\s+/,$_;
for my $w (@words) {
$w=lc($w);
if ($w=~/^[a-z0-9]+$/ and length($w) < 30){
print HDW $w,"\n";
}
}
}
Can you help point out my problem? Why are the results so different?
BTW, I have added some code in that Scala program to handle the UTF-8 issue:
I also want to share the running times here. Even though the Scala program is compiled to a class file, it is much slower than the Perl script.
$ scalac -Xscript SplitWords words-parse.scala
$ time scala SplitWords > scala-words.txt
real 0m36.858s
user 0m25.494s
sys 0m13.449s
$ time perl words-parse.pl
real 0m12.115s
user 0m11.770s
sys 0m0.184s
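That’s probably because your patterns are parsed and compiled again for every line in the file: String.matches and String.split take the regex as a plain string and compile it on each call, which is very wasteful. Perl has special syntax for regex literals and compiles each of them only once. Compile the patterns once, outside the loop: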
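import scala.io.Source

// Compile each regex once, outside the loop.
val patt1 = """[^0-9a-zA-Z\s].*""".r
val patt2 = """[a-z0-9]+""".r
val ws = """\s+""".r
val lines = Source.fromFile("msg.txt").getLines()
for {
  line <- lines
  if !patt1.matches(line) // Regex.matches (Scala 2.13+) tests the whole line
  word <- ws.split(line)
  lower = word.toLowerCase
  if patt2.matches(lower) && lower.length < 30
} {
  println(lower) // print the lowercased word, as the Perl script does
}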
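The trailing .* in patt1 also bears on why the original Scala output differed from Perl's. String.matches succeeds only when the regex matches the entire string, while Perl's =~ succeeds when it matches anywhere. So line.matches("""^[^0-9a-zA-Z\s]""") is true only for lines that are exactly one character long, and almost no line gets filtered out. A minimal sketch of the difference, using a made-up sample line:

```scala
import java.util.regex.Pattern

// Hypothetical sample line that Perl's `next if /^[^0-9a-zA-Z\s]/` would skip.
val sample = "#header line"

// Compile once and reuse; String.matches would recompile on every call.
val leadingSymbol = Pattern.compile("""^[^0-9a-zA-Z\s]""")

// Whole-string match, like String.matches: false, the line is longer than one character.
val wholeMatch = leadingSymbol.matcher(sample).matches()

// Search semantics, like Perl's =~: true, the pattern matches at the start of the line.
val found = leadingSymbol.matcher(sample).find()

// Appending .* turns the whole-string match into a prefix test.
val prefixMatch = sample.matches("""[^0-9a-zA-Z\s].*""")

println(s"matches=$wholeMatch find=$found prefix=$prefixMatch")
```

Either find() on a precompiled Pattern or a pattern ending in .* should reproduce the Perl filter; which one reads better is a matter of taste.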
One peculiarity: code in the body of a Scala object runs as part of the object's constructor, and the JVM does not optimize that well, so don’t put your work there; put it in a method such as main instead. The REPL used to run snippets that way, and it was slow.
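A minimal sketch of that layout, which also shows one way to get the effect of next and of "do nothing" from the original question. The names keep and process and the UTF-8 encoding are assumptions made for illustration, and Regex.matches needs Scala 2.13 or later:

```scala
import scala.io.Source

object SplitWords {
  // Compiled once, when the object is first used.
  private val Skip = """[^0-9a-zA-Z\s].*""".r
  private val Word = """[a-z0-9]+""".r
  private val Space = """\s+""".r

  // Lowercased words of one line that pass the same checks as the Perl script.
  def keep(line: String): Array[String] =
    Space.split(line).map(_.toLowerCase).filter(w => Word.matches(w) && w.length < 30)

  // Scala has no `next`; a branch that evaluates to () does nothing,
  // and the caller's loop simply moves on to the next line.
  def process(line: String, emit: String => Unit): Unit = line match {
    case Skip() => ()
    case _      => keep(line).foreach(emit)
  }

  // The per-line work happens in a method, not in the object body.
  def main(args: Array[String]): Unit = {
    val source = Source.fromFile("msg.txt", "UTF-8") // encoding is an assumption
    try source.getLines().foreach(process(_, println))
    finally source.close()
  }
}
```

Passing the output function as emit keeps process easy to check without touching the file; main wires it to println.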