How to determine a character class

jimka · June 25, 2019, 11:13am

Is there a set of character class predicates in the standard library? I’m looking for something like the following:

c.isDigit
c.isAlphaNumeric
c.isUpperCase
c.isPunctuation
c.isWhiteSpace
c.isEndOfLine
c.isPrintable

etc.

Without such predicates, I’m putting expressions like the following in my code, which of course works for my needs, but there’s probably a better way. Also, I can detect end of line on UNIX with c == '\n', but that may not work on other operating systems.

val digits = Set('0','1','2','3','4','5','6','7','8','9')

if ( digits.contains(c))
   ...
else
   ...

curoli · June 25, 2019, 12:02pm

Yes, you can do:

scala> '1'.isDigit
res0: Boolean = true

The class that has these methods is, I think, scala.runtime.RichChar.

The problem with end-of-line is that on some platforms, including Windows, it is two characters.
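As a rough illustration of the predicates asked about above: isDigit, isLetterOrDigit, isUpper, isWhitespace and isControl can be called directly on a Char. As far as I know there is no ready-made isPunctuation or isPrintable, so the object CharClasses and its two helpers below are hypothetical names, built on java.lang.Character.getType. Treating "not an ISO control character" as printable is a simplifying assumption.

```scala
object CharClasses {
  // Unicode general categories that count as punctuation.
  // The constants are Bytes on java.lang.Character; getType returns an Int.
  private val punctuationTypes: Set[Int] = Set[Byte](
    Character.CONNECTOR_PUNCTUATION,
    Character.DASH_PUNCTUATION,
    Character.START_PUNCTUATION,
    Character.END_PUNCTUATION,
    Character.INITIAL_QUOTE_PUNCTUATION,
    Character.FINAL_QUOTE_PUNCTUATION,
    Character.OTHER_PUNCTUATION
  ).map(_.toInt)

  // Hypothetical helper: true if c falls in one of the punctuation categories.
  def isPunctuation(c: Char): Boolean =
    punctuationTypes.contains(Character.getType(c))

  // Hypothetical helper: a rough notion of "printable" (assumption).
  def isPrintable(c: Char): Boolean = !Character.isISOControl(c)
}
```

With this in scope, 'a'.isLetterOrDigit, 'A'.isUpper and ' '.isWhitespace are all true, CharClasses.isPunctuation('!') is true, and CharClasses.isPrintable('\n') is false.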

jimka · June 25, 2019, 12:40pm

So for EOL, do I need to do something like this? Set('\n','\r').contains(c)

sangamon · June 25, 2019, 12:48pm

Parser libraries will have builtin support for this, of course, either via regexes or as dedicated predicates/lexers/whatever, or both.

jducoeur · June 25, 2019, 3:01pm

I suspect that for a tiny case like that, it’s actually most efficient (and clearest) to just use a plain old comparison:

(c == '\n' || c == '\r')
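A per-character test like the one above can say that a character takes part in a line break, but it cannot tell that "\r\n" is a single break. The sketch below shows one way to handle both; LineEndings, isEndOfLineChar and splitLines are made-up names, and splitting with a regular expression is just one possible approach.

```scala
object LineEndings {
  // True for either character that can take part in a line break.
  def isEndOfLineChar(c: Char): Boolean = c == '\n' || c == '\r'

  // Splits on "\r\n", a lone '\r', or a lone '\n'.
  // "\r\n" is listed first so it is consumed as one break, not two.
  def splitLines(text: String): List[String] =
    text.split("\r\n|\r|\n").toList

  // The separator the current platform uses when writing text.
  val platformSeparator: String = System.lineSeparator()
}
```

For example, LineEndings.splitLines("a\r\nb\nc") yields List("a", "b", "c") regardless of the platform the code runs on.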

tyohDeveloper · June 25, 2019, 4:14pm

https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html

Side note. Always use eq for integers and characters. Other than 0.0, one should never compare two floating point numbers.

jimka · June 25, 2019, 4:44pm

Are sets containing constants constructed at compile time or run time? Set(1,2,4,8), for example?

martijnhoekstra · June 25, 2019, 5:11pm

Those are runtime.

That’s not generally a problem.

jducoeur · June 25, 2019, 6:36pm

Interesting – why?

tyohDeveloper · June 25, 2019, 8:47pm

@jducoeur
Integers and characters are scalar values. When boxed, == translates to equals, which then will call eq. So it’s an extra method call.

I’m not sure, but an optimizer could avoid boxing entirely. I’m not sure if the Scala compiler does this or not. Commercial CLOS and old Smalltalk compilers avoided boxing if eq was used (and inlined the function). Equals requires a function/method. Generally, any identity checks are cheaper than equality checks. The first thing most languages do is an eq. If that passes, it doesn’t need to do anything more. Eq bypasses that extra, often dynamic method call.

With floating point numbers, few computations produce precisely a number. 0, 1, trunc, round, NaN, sNaN, the various infinities are exact numbers. In purely functional computations, a test harness usually can check for exact results. With the caveat that base 10 doesn’t translate to fp binary codes. Numerical codes seldom check for equality. Except to avoid divide by zero.

Russ · June 26, 2019, 6:03am

If you can never “compare” two floating point numbers, then floating point numbers are almost completely useless. I think you meant “test for equality” rather than “compare,” but even then it is OK to test for equality in some cases. For example, if the numbers are set based on literals or read from text files and are not further processed, then testing for equality may be perfectly safe. As always, common sense overrules arbitrary draconian rules.

jimka · June 26, 2019, 6:54am

Isn’t it better to test for equality when you know the results are exact? For example, if (x == 0.5) is an exact check, and so is if (Set(2.0, 1.0, 0.5, 0.25, 0.125).contains(x)).
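A small sketch of the distinction being discussed here: values that are exactly representable in binary, such as 0.5 or 0.25, compare reliably when they come straight from literals, while arithmetic on values like 0.1 generally does not give an exact result. The object and member names are made up for illustration.

```scala
object ExactFloats {
  // All of these are powers of two, so each literal is stored exactly.
  val powersOfTwo: Set[Double] = Set(2.0, 1.0, 0.5, 0.25, 0.125)

  // 0.5 is exactly representable, so comparing against the literal is exact.
  def isHalf(x: Double): Boolean = x == 0.5

  // 0.25 + 0.25 is computed without rounding error.
  val exactSum: Boolean = 0.25 + 0.25 == 0.5   // true

  // 0.1, 0.2 and 0.3 have no exact binary representation, so rounding
  // in the addition makes this comparison fail.
  val inexactSum: Boolean = 0.1 + 0.2 == 0.3   // false
}
```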

jimka · June 26, 2019, 6:56am

My naive assumption was that a set of literals like Set(1,2,3) was allocated (or perhaps eliminated) at compile time or perhaps load time. So Set(1,2,3).contains(x) was fast and perhaps even compiled to the same code as (x == 1 || x == 2 || x == 3).

tyohDeveloper · June 26, 2019, 7:11am

@russ There is (almost) never a need to check for equality in numerical codes. Ranges yes. Strict equality, no. One uses <, <=, >=, > almost exclusively. There is library code to check whether things are close enough to be considered equal.

Think about a convolution, differential equation, an integral. The physicist, engineer, economist, statistician aren’t looking for an exact result. Computers can’t produce them. Humans can’t either. Pi, e, and the like aren’t representable. A ln or exp function by definition can’t be correct. However, we can calculate the volume of a sphere to enough accuracy and precision to build what we need or analyze things. From subatomic particles to galactic clusters.

== is bad <= is good.
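The "close enough" check mentioned in this post could be sketched roughly as follows. The helper approxEqual and its default tolerances are hypothetical, chosen only for illustration; real numerical libraries may use different definitions and defaults.

```scala
object ApproxEquality {
  // Hypothetical helper: true when a and b differ by no more than an
  // absolute tolerance or a tolerance relative to the larger magnitude,
  // whichever is bigger. The default values are assumptions.
  def approxEqual(a: Double, b: Double,
                  relTol: Double = 1e-9,
                  absTol: Double = 1e-12): Boolean = {
    val scale = math.max(math.abs(a), math.abs(b))
    math.abs(a - b) <= math.max(absTol, relTol * scale)
  }
}
```

With these defaults, ApproxEquality.approxEqual(0.1 + 0.2, 0.3) is true even though 0.1 + 0.2 == 0.3 is false, and any comparison involving NaN is false.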

jimka · June 26, 2019, 11:47am

If we should always use eq with integers and characters, shouldn’t the compiler do that for us?

jducoeur · June 26, 2019, 12:18pm

I believe you’re making an incorrect assumption here – scalars aren’t always boxed in Scala. That’s kind of the point of AnyVal: it corresponds to value types (as opposed to reference types). This has nothing to do with optimization – it’s a central part of the language.

(They do box sometimes, but I don’t see any reason that would happen in a simple situation like this.)

I could be wrong, but AFAIK == between AnyVals doesn’t involve boxing, so it should work just fine…
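To make the point concrete, here is a minimal sketch showing == on value types, both as plain Char and Int and after widening to Any, where the values are boxed. The names are made up. Note that eq is defined on AnyRef, so it is not a method you would normally call on a bare Int or Char.

```scala
object ValueEquality {
  // Plain Char and Int comparisons: == on value types compares the values.
  def sameChar(a: Char, b: Char): Boolean = a == b
  def sameInt(a: Int, b: Int): Boolean = a == b

  // Once values are passed as Any they are boxed, and == still
  // compares by value rather than by reference.
  def sameBoxed(a: Any, b: Any): Boolean = a == b
}
```

Here ValueEquality.sameChar('\n', '\n') and ValueEquality.sameBoxed(1000, 1000) are both true.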

jducoeur · June 26, 2019, 12:28pm

While that seems theoretically possible, I’m pretty sure that the compiler isn’t that smart. That’s a pretty ornate transformation to infer automatically, and remember, Set is a library feature, not a language one.

In practice, I would expect the numbers themselves to be compile-time literals, and it’s a small Set so it ought to be efficient. (The standard library specializes very small Maps and Sets for efficiency: this is unobvious, but buried in the code they’re actually different classes for Sets of 1, 2, 3 and 4 members.) But it’s still going to allocate and populate the Set and call contains on it, so I think the explicit comparison is a good deal faster – although we’re probably talking microseconds here, so it may well be irrelevant…
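If the allocation is a concern, one option is to build the Set once and reuse it, or to avoid the collection altogether. The sketch below shows three equivalent membership tests; the object and method names are hypothetical, and no claim is made here about their relative speed.

```scala
object SmallMembership {
  // Built once, when the object is initialised, instead of on every call.
  private val allowed: Set[Int] = Set(1, 2, 4, 8)

  def viaSet(x: Int): Boolean = allowed.contains(x)

  // Explicit comparisons: no collection is allocated.
  def viaComparison(x: Int): Boolean =
    x == 1 || x == 2 || x == 4 || x == 8

  // Pattern match on literal alternatives, another allocation-free form.
  def viaMatch(x: Int): Boolean = x match {
    case 1 | 2 | 4 | 8 => true
    case _             => false
  }
}
```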

martijnhoekstra · June 26, 2019, 2:36pm

The Scala compiler doesn’t generally know about any libraries, including the standard library (though there are some exceptions).

To the compiler, Set(1, 2, 3) is just a method call, and it doesn’t know that it generates a set of values.

Russ · June 26, 2019, 6:57pm

@tyohDeveloper I am well aware of that. In your earlier post, you said that “Other than 0.0, one should never compare two floating point numbers.” The inequality tests are comparisons and, as you later point out, they are perfectly acceptable.

I am well aware of the potential pitfalls of equality testing on processed floating point numbers. But my point was that equality tests are perfectly appropriate in some cases. In particular, if the numbers were based on literals and have not been processed in floating point operations, then testing for equality can be perfectly appropriate and reasonable.

Here’s a trivial example. Suppose I am keeping an inventory of some commodity that comes in packages of several different sizes. Let’s say I am selling coconut oil in containers of 4 different sizes or quantities. It would be perfectly safe to use an equality test to categorize a particular container based on the quantity of oil it contains.

martijnhoekstra · June 26, 2019, 7:14pm

Maybe it’s time to split this thread. There are three almost entirely different discussions going on in this thread. Can an admin intervene?