Why are some unicodes illegal?

Scala accepts most of the symbols I try as identifiers, e.g.

val Δ = ???
def ⇄() = ???
def √(n: Int) = ???
def ℕ = ???

but there are some similar symbols it rejects:

def ⌊(n: Int) = ??? // illegal character '\u230a'
def … = ??? // illegal character '\u2026'

Why is this?

Can’t answer “why”, but the language spec lists the allowed Unicode categories.
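To see which category a given character falls into, a rough check along these lines may help. It is only a sketch: it uses the JDK's Unicode tables (which vary a little between Java versions), `classify` is a made-up helper name, and it approximates the spec's rule as "letters are identifier letters, Sm and So are operator characters", ignoring ASCII, `$` and `_`.

```scala
// Sketch only: approximates the spec's character classes for non-ASCII
// characters using the JDK's Unicode tables. `classify` is a hypothetical helper.
def classify(c: Char): String =
  val t = Character.getType(c)
  // Unicode letters (plus letter numbers, Nl) can appear in plain identifiers
  if Character.isLetter(c) || t == Character.LETTER_NUMBER then "letter"
  // Sm (math symbol) and So (other symbol) are accepted as operator characters
  else if t == Character.MATH_SYMBOL || t == Character.OTHER_SYMBOL then "operator"
  // Everything else (punctuation, separators, ...) is not usable unquoted
  else "neither"
```

With this, `classify('√')` and `classify('⎠')` come out as "operator", `classify('Δ')` as "letter", and `classify('…')` as "neither", since the ellipsis is in the "other punctuation" category.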


How hard would it be for me to change this in the compiler to allow all unicodes? And would I also need to do fiddling with metals/bloop/bsp? I hope only the compiler would need to change.

It doesn’t make sense to allow all unicode because various things like spaces, punctuation, ASCII art, etc., are extremely confusing as identifiers.

On the other hand, there is some weird stuff there, like being able to use the bottom hook of a parenthesis as an operator:

scala> case class X(i: Int):
     |   def ⎠(x: X): X = X(i*(x.i+1))
     | 
// defined case class X
                                                                                
scala> X(3) ⎠ X(4)
val res61: X = X(15)

So, which categories should be allowed but aren’t, if that is the granularity? Punctuation is generally not an identifier, so the “other punctuation” category from which dot-dot-dot comes is not a great choice; it includes various sorts of commas and apostrophes which would be very confusing.

I think the classes as defined are about right. They’re not perfect, but they’re pretty good. The only thing to do would be to increase the granularity beyond character categories. And that seems like a huge hassle.

Note that you can put basically anything between backticks if you really need it as an identifier.
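As a small sketch of what that looks like in Scala 3 (the extension method and the value names here are just illustrative):

```scala
// `…` (U+2026) is rejected as a plain identifier, but is accepted between backticks.
extension (lo: Int)
  def `…`(hi: Int): Range = lo to hi

// Infix use works, but every call site still needs the backticks.
val oneToTen = 1 `…` 10
val sameRange = 1.`…`(10)
```

The backticks are required at each use, not only at the definition, which is the main cost of this route.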


Note that you can put basically anything between backticks if you really need it as an identifier.

Well the backticks are enough of a deterrence for me to avoid symbols altogether, defeats the motivation. Would be too irritating.

Anyway, I could understand outlawing symbols that are extremely close to built-in symbols like , but there are good and non-confusing use cases for punctuation operators

if n ∋ { 1 … 10 }

as an alternative to writing

if n >= 1 && n <= 10 // or (1 to 10).contains(n)

where Scala complains about … even though there are no similar operators users might confuse it with, and yet it allows ⎠ as an identifier, which is arbitrary.

Do you know how I may tweak the compiler to allow this, for my team’s codebase? Would it be a few simple lines of code to change, or cause other scala micro-services to break, like the linter?


I don’t know about Metals, but here is some useful information about modifying the compiler:
https://docs.scala-lang.org/scala3/reference/changed-features/compiler-plugins.html
(Note that I’m not sure this allows you to change something as early as the parser)

From a more design point of view, I would highly recommend trying to stay away from DSLs and/or symbolic operators as much as possible:

  • Symbols are hard to type and speak about
  • DSLs are harder to write and maintain than regular APIs
  • They are very enticing, potentially making you implement things through them that could be done easier in a different way

For example I would encourage something like:

if n.in(1, 10)

What I meant was cloning the scala3 codebase and modifying the compiler’s source code, like a fork of the language, not just a compiler plugin. I am hoping that the whole apparatus is well-engineered and DRY enough that changing the 1-3 lines of code that ban unicodes will result in the other dependent services like the parser and LSP and Metals also allowing the unicodes.

For example I did experiment with changing the keyword for def to be fn, and it does actually compile code using fn keyword right away as expected, that is a sign of good engineering in the compiler; but Metals still complains and my vscode highlights red (which must be due to duplicate code somewhere that redefines the keyword in Metals rather than deferring to the compiler’s definitions)

We can agree to disagree on the advisability, but just to engage with your recommendation if you were interested in my thoughts:

Symbols are hard to type and speak about

The typing criticism has always been easily solvable; we create workspace snippets for the IDE, so typing elementOf automatically turns into ∈, no difference in effort.

Speaking about them, usually symbols have word counterparts, /: being “fold left”, |> being “pipe”, and Δt being “delta time”, so you can just say those.


For example I would encourage something like:

if n.in(1, 10)

Also curious, would you at least find this appealing?

if n in (1 to 10)

The infix as an improvement over n.in(1, 10).
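For reference, both spellings can be had today without touching the compiler. This is a minimal sketch in Scala 3; the method names and the choice of `Range` as the right-hand side are assumptions for illustration, not an existing library API.

```scala
extension (n: Int)
  // Word form, usable as `n in (1 to 10)` or `n.in(1 to 10)`
  infix def in(r: Range): Boolean = r.contains(n)
  // Symbolic form; ∈ (U+2208) is a math symbol, so it is a legal operator character
  def ∈(r: Range): Boolean = r.contains(n)

val inside = 5 in (1 to 10)
val outside = 11 ∈ (1 to 10)
```

The parentheses around `1 to 10` matter in the symbolic form: `∈` binds tighter than the alphanumeric `to`, so `n ∈ 1 to 10` would be read as `(n ∈ 1) to 10`.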

I don’t think this could work, the reason is the compiler and metals need very different parsers:

  • The compiler’s is built to discard everything it doesn’t need, like white space and maybe comments
  • On the other hand, metals needs to record everything, including what white-space encodes the indentation, how much white space there is at the end of the line, etc

But at this point, why not just write:

if n.elementOf(1, 10)

It will also help with on-boarding new people: no need to learn all the magic words

Personally, I tend to forget those words when I’m trying to explain something, so having real text I can speak out loud is very helpful

Example:

No, not that, the triangle thingy

This ? :points at |>:

No the other one :points at Δ:

Also small nitpick, it should be n ∈ {1, 10}
But I had trouble typing it, and so would you, because there is no workspace snippet on this forum !

I don’t know, I used to love infix, these days I don’t know how I feel about it


But at this point, why not just write:

if n.elementOf(1, 10)

Because it’s harder to parse on the eyes and brain. I’m sure that’s empirically provable once someone finds a way to scientifically test it. I think micro-improvements like this do matter when hundreds of hours are going to be spent staring at it.

It will also help with on-boarding new people: no need to learn all the magic words

For people who are already informed, it could make on-boarding easier… You can come from non-JVM languages and immediately recognize n ∈ 1 … 10 if you have the formal mathematics exposure my team wants you to have.

It’s a question of company culture and the sort of candidates you are selecting for; for a smaller team consisting entirely of driven and passionate programmers who are in it for the love of the craft and obsess over every groove and nuance, the style can increase overall comfort and productivity. Magic words aren’t an issue if everyone on your team is a wizard.

For enterprise Java shops with hundreds of employees and a lower common denominator of developer profile, keeping things as boring and conventional as possible is the more efficient route. It’s like the debate about formatters: standardization limits the highest performers.

“A lot of times there are choices made in programming languages and systems that are made to reinforce the goodness of the median case – if you think about things happening on a bell curve – but that limits and restricts the outliers, both on the top and the bottom.”

I picked Scala over other languages largely because it does not restrict my ability to express and style my thoughts as much as other languages do. The signs I choose to use to represent concepts are part of that process of expression. It’s subtextual.

Personally, I tend to forget those words when I’m trying to explain something, so having real text I can speak out loud is very helpful

Well I really haven’t experienced this problem. The scenario sounds kind of contrived also. Like, you mean instructing someone on how to type the method name in a pair-programming setting? Shouldn’t they already know? Or you’re telling them what line and column to look at in the file? What you’re describing sounds slightly helpful sometimes but still not a big deal.
Anyway, in work meetings you will talk in terms of the word-concept counterparts (“we have this mapping piped into … and the delta of …, for every element of the set”, etc.), and in code you will see and write symbols.

But I had trouble typing it, and so would you, because there is no workspace snippet on this forum !

Heh, I actually typed it by pressing cmd + space, e, elem, to open my emoji picker on macOS and paste it, no problem. Probably still faster than the average person’s time to type out “elementOf”. We really have had solutions for the symbol-typing problem for decades, but most environments don’t ship with them by default (I had to install Raycast on my Mac for this :roll_eyes:)

No, actually, this one’s very empirical.

You may be missing the deep history here. Back 10-15 years ago, some of the Scala community (in particular, the scalaz side of the world) used this sort of symbolic operators heavily, for exactly the same reasons you would like to. (Well, more or less: the focus there was category theory.)

In practice, over the years, it was found to be a real hindrance to communication. Looking stuff up was hard, non-experts didn’t understand it; in general, the Scala community at large mostly came to regard it as having been a mistake, and very consciously moved away from it.

It might work within a small team that is specifically chosen to know that symbology. But we know from hard-won experience that, in the larger world, it tends to be a problem.


the Scala community at large mostly came to regard it as having been a mistake, and very consciously moved away from it.

I would like to see some sort of data on this. I see it said a lot on these forums, but it is said by the same 12 people. Obviously like any forum, there will be an extreme minority contributing the vast majority of content (like wikipedia editors or reddit posters).

For example, in this issue deprecating the /: symbol, a couple of posters expressed upset over it: Deprecate /: and :\ as symbolic aliases for foldLeft and foldRight · Issue #9607 · scala/bug · GitHub

shu on Aug 30, 2019
:frowning: They was so convenient…

som-snytt on Feb 7, 2018
/:frowning:

Did Martin conduct any sort of poll first?

I have no idea since I also was not around back then. Stronger evidence would sway me more.

No offense, and it’s none of my business, but forking the Scala compiler just for some syntax sugar / symbols is quite extreme isn’t it?

If you need such mathematical language, why not try Lean4? You can even use Greek letters, subscripts, sum and product symbols, integration symbols, etc. Its type system / expressiveness is far more powerful than Scala’s (quite literally, everything is a type). The downside is that its ecosystem / libraries are nowhere near Scala’s. But if your use case is special enough, maybe…

forking the Scala compiler just for some syntax sugar / symbols is quite extreme isn’t it?

It’s only extreme if it would be a large effort. If it’s just changing a few lines of code and then using the modified compiler in our codebase, I don’t see why not. If a future update is important enough for us to want to merge, we can re-do the simple modification.

I strongly support the use of infix notation (make sure you use the magic incantation to suppress warnings of things you need to use and can’t mark infix) and symbolic operators when doing so clarifies the logic of difficult problems.

It really can make a substantial difference in how complex of problems you can tackle. a + b * (c - d) is instantly comprehensible whereas add(a, multiply(b, subtract(c, d))) is a huge cognitive burden, comparatively.

Maybe non-experts are put off, but honestly, they’re not going to understand something that experts find challenging enough to need to simplify it with symbols, if it really is the case that you need that.

However, that said, I would make absolutely sure that you can’t survive with the symbols that the Unicode consortium and Scala spec gives proper blessings to. Having to maintain an extra compiler branch, even if it’s a small change, for a feature that you haven’t even developed yet, seems like a strange choice of expenditure of resources in order to be less compatible than usual. You have a lot of symbols, and honestly, you’re likely to be bitten by precedence rules because all the unicode symbols have the same (high) precedence and all the unicode letters have the same (low) precedence as Latin letters.
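To make the precedence point concrete, here is a small sketch with two made-up operators, `⊕` and `⊗` (hypothetical names, defined here as plain addition and multiplication on `Int`). Scala picks precedence from the operator's first character, and every non-ASCII symbol lands in the same top tier, above `*` and `/`.

```scala
// Hypothetical operators: ⊕ behaves like +, ⊗ behaves like *
extension (a: Int)
  def ⊕(b: Int): Int = a + b
  def ⊗(b: Int): Int = a * b

// Same precedence for both, left-associative: parsed as (1 ⊕ 2) ⊗ 3
val leftToRight = 1 ⊕ 2 ⊗ 3
// ⊕ binds tighter than the built-in *: parsed as 2 * (3 ⊕ 1)
val tighterThanTimes = 2 * 3 ⊕ 1
```

So `leftToRight` is 9 rather than the 7 that the usual math convention would suggest, and `tighterThanTimes` is 8 rather than 7. Explicit parentheses are the only way to get the conventional reading.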

Most people, both because the precedence built-ins work and because they’re easier to type, use multi-character glyphs like ++ and :- and so on. Short words are good too. There are non-essential but not-entirely-convention-based reasons to do it the usual way. I’d give that a really good try before deciding that you absolutely must have, say, the three-dot glyph.

For example, I do data processing in Python now and then and got somewhat comfortable with the [1:2, 3:, :-2] syntax. It’s…weird. But you get used to it. But even though it’s possible to get nearly the same thing in Scala (not with colon, but I could use some other symbol), I ended up finding that even though it was a little less compact, to was much easier to understand, and End was a lot better than missing values and negative numbers.

And so I have stuff that reads like xs.visit(3 to End-2) and it is much clearer than the inspiration of for x in x[3:-1]:.

So it’s worth thinking hard about whether there might be a really nice and yet also Scala-compatible syntax that will meet your needs.


I would like to see some sort of data on this. I see it said a lot on these forums, but it is said by the same 12 people.

This is a good point. I think I fit in that set of 12 people, but indeed “I felt the pain first hand” is not a very good argument.

So here are some links that I gathered from a quick (obviously cherry-picked) search for some blog posts and talks circa 2015 (which I think was around the time things started to change).

Scala: The Industrial parts (regarding Scala at Twitter)

Restrict feature set, e.g.,
[…]

  • limit “scalaz-style” programming;
  • no/limited DSLs;

Keeping Scala Simple

I do have issues with the use of symbolic operators in Scala libraries.
[…]
Many other libraries have similarly come to regret heavy use of symbolic operators (recall the infamous dispatch periodic table). Luckily we are seeing declining use of symbols in Scala libraries, and I think the community as a whole has sufficient experience that we can state:

Simple code does not use symbolic operators unless:

  • there are only a few such operators in use; and
  • they are used frequently enough that they are worth the mental load to remember what the symbols mean.

Scala’s 2016 New Year Resolutions

Another example, where I have doubts if not regrets are the /: and :\ operators in scala.collections. They are cute synonyms for folds, and I am still fond of the analogy with falling dominoes they evoke. But in retrospect I think maybe they did give a bad example for others to go overboard with symbolic operators.

I think that dispatch periodic table is a great example of the issues at the time: << turns a request into a POST and <<< turns a request into a PUT… That will just make new people confused, and make the code harder to read out loud.

Fortunately, for common math operators, this is not much of an issue.

Regarding unicode operations, they also have other issues. You might want to look at the reasons for deprecating the unicode arrows.
IIRC, the main issues were:

  • Not being obvious how to type
  • Being hard to manipulate with common search/replace tools
  • Not playing well with the Scala operator precedence rules

I mean I spend most of my time thinking in the text I write, and then I have to speak with someone and I have to translate the symbols to words in real time

For example I tend to forget Σ is “sigma” so I would just say “the sum thingy”
This is even worse with lambda, gamma and theta which I confuse all the time

For the ∈ I agree, for … I was never taught that notation
(and I have formal mathematics exposure)

This has probably been at least attempted; I’m not familiar with the keywords from the literature, so I was not able to find something specifically about this.
But similar research has definitely been done:
https://dl.acm.org/doi/abs/10.1145/2534973

Regardless, I may have over-reacted:

  1. ∈ has a very well-defined meaning, and the usage in your code conforms to it
  2. It is fine for closed-source code maintained by a small group of people to use unusual symbolic operators
  3. However, forking the compiler and tooling to support such syntax seems very ill-advised and will probably nullify any time gain from the syntax

Especially given the precedence issues highlighted above
(While foo + bar * baz / fal is very legible, I would argue foo + ((bar * baz) / fal) starts being less so)

When I talk to people about code, it’s in terms of the high-level concepts of what’s happening, not the literal code. I might say “we take the summation of …”, I would never say “we take the Sigma” or “sum symbol”. The names of the symbols never come up; our IDE snippet for Σ is mapped to “sum”, not “sigma”, etc. But anyway, I do believe you if you say you personally find it a bit confusing, though me and my team do not, so I think the advice “avoid symbols” is on a per-team and per-symbol basis, not a general principle. I think we agree at this point.

But similar research has definitively been done:
https://dl.acm.org/doi/abs/10.1145/2534973

Yeah I’ve seen this before, it’s relevant but not exactly what I’m talking about. I’m not talking about which words or syntaxes are more intuitive, but rather which ones take the eyeballs and visual system in our brain less physical effort to process, before any sort of comprehension or analysis comes into play. Physiological. Example, I’m quite certain that

hello                        world

is slightly harder to visually process (not harder to understand) than

hello world

due to eye saccade distance,
and

((3) + ((4) / (5)))

is harder than

3 + 4 / 5

because there are now 10 extra elements to register in your vision, and () parentheses are such sharp jagged shapes that make everything claustrophobic, your eyes have to jump over them.

This leads to conclusions like the typical C-style if statement requiring parentheses being a bad idea, because it makes negations harder to process, etc.

if (!(n > 0))

the ! is squished against the (, it’s a lot easier to see if we just have

if !(n > 0)

or if you simply had a less vertical, more horizontal sign, like ¬

if ¬(n > 0)

again, easier for the eyes to discern because it is less similarly shaped to (

and one could conclude that curly-brace languages are worse than whitespaced languages (on the eyes). I think that is true.

As far as I know, no research on this sort of visual optimization is being done at all, (except maybe in unrelated fields like typography and graphic design), most programmers have never considered that this could matter in the code they write, and when they hear about it they usually just laugh and dismiss it immediately “who cares”, since they fundamentally don’t view source code as something that deserves to look good, or something that’s primarily meant to be read by humans and secondarily by computers.


Most, possibly, but we spent literally years arguing about pretty much exactly this topic (the readability of braces and parentheses) in this community, as part of the design of Scala 3. Indeed, I think it was the single most contentious subject in the design of the language, with some epic-length arguments in GitHub.
