Recommended Scala/JVM ecosystem learning resources for Pythonista?

Hi everyone,

New user here. I’m hoping you all can recommend learning resources for the Scala / JVM ecosystem to a Pythonista with experience in data analysis & numerical computing, very little experience with Java, and with some exposure to functional programming.

Here’s a lengthier set of details about my background:

I’m well-versed in Python-centric tools for prototyping code and exploring data, i.e., Jupyter notebooks and IPython, and I’m very familiar with Python as a glue language and as a shell-scripting replacement. I have some experience doing functional-like programming in Python; iterators, lambdas, list comprehensions, and passing functions as objects (I’m looking at you, map()) are all very useful in certain circumstances, as is some of the functional-style operation chaining you can do on Pandas dataframes. I also have experience using Python (or Python modules) to do numerical computing, often in a parallel (multi-core) and/or distributed (multi-node) environment.

Unfortunately, this also means I’ve run headlong into some of Python’s limitations; the limits of the GIL on shared memory multithreading have caused me headaches, as has the slow speed of pure Python code on occasion.

So for the last few years, I’ve been on the hunt for a Python alternative or companion that shares many of its strengths but not its major weaknesses. In particular, large ecosystem and low barrier to entry are important to me. As an example of the importance of ecosystem, this study out of UC Berkeley & Princeton found that “existing code, existing expertise, and open source libraries are the dominant drivers of adoption” of a programming language. In other words, having access to lots of high quality libraries is really useful to someone who wants to be productive, as is the ability to find solutions (online) to problems that someone else has inevitably had at some point. As for the desire for a low barrier to entry, I think one of the reasons that Python is so popular is that, as Li Haoyi pointed out in a recent blog post about Scala’s future, Python is unmatched in its ease of getting started. It’s accessible. Developers and non-developers alike can start doing productive stuff pretty quickly in Python. I want that in a language, given that I’m not really a developer!

Over the course of my search, I’ve evaluated a number of languages but had issues with them all: systems programming languages like C++ and Rust have high barriers to entry (I’m not very good at explicitly managing memory, and pointers make my brain hurt) and/or issues with ecosystem (small or fragmented). Cython and Numba are interesting projects in the Python ecosystem but the former requires good knowledge of C and the latter only shows large performance increases in pretty specific cases. Julia is pretty easy to pick up, and has an energetic community, but still has a fairly small number of libraries, and its learning curve becomes much steeper when performance becomes important (also, not being able to produce a stand-alone executable can be problematic). C# and Java have huge ecosystems, but they’re both really verbose and they seem to want users to use an OOP programming style (and C# is still pretty Windows-centric). Kotlin looks rather interesting, but it seems to me like it was maybe designed to be a Java replacement - most of the books and other learning resources for it assume a working knowledge of Java, which I don’t have. It’s also not clear to me how large the Kotlin community is. Golang looks fast and simple, but it seems to me like the domains it plays in don’t really overlap the ones I’m interested in.

That brings me to Scala. It looks interesting. I’ve grabbed a copy of Odersky’s “Programming in Scala,” which I think will teach me what I need to know about the language. I can pick up a copy of “Hands-on Scala Programming” if I want exercises. A book on doing numerical computing or data analysis using Scala would be useful if there’s a good one.

But I think my big struggle is going to be how to take advantage of the JVM’s huge ecosystem, since Scala plays so nicely with it. If I want to educate myself on the ecosystem (what libraries are available, where do I go to find them, etc), what’s the best way to do that?

Thanks!

Howdy and welcome!

While Programming in Scala isn’t bad, it’s not actually the book I tend to point beginners at: it’s a little large, and sometimes more detailed than you care about. The book I’ve been starting to use for new folks recently is Essential Scala, which seems to be working well – it’s a nicely structured course on the topic, with both teaching and exercises, and the students I’m facilitating seem to be enjoying it.

(And regardless of which sources you use, please don’t hesitate to ask questions here.)

Ecosystem is harder, simply because it is almost overwhelmingly large. The Scala library ecosystem itself at least has Scaladex, a searchable index of the available open-source libraries, but that doesn’t touch the vastly larger number of Java ones. I’m not sure if the Java side of things has anything as straightforward to use. (Not to mention all the other languages in the JVM ecosystem that can potentially interoperate.) I will admit that I’ve tended to go via word-of-mouth there.

(Although, in practice, I tend to find Scala wrappers on Scaladex rather than use raw Java libraries anyway – while they interoperate, the idioms are rather different, so you tend to benefit from having an adapter that translates a Java library into friendlier Scala idiom.)

I really like Li Haoyi’s book (https://www.handsonscala.com) for its pragmatism. It emphasizes his amazing suite of tools, but that’s a good thing.

Years ago, I wrote an open-source Jupyter notebook, JustEnoughScalaForSpark (deanwampler/JustEnoughScalaForSpark on GitHub), that’s a crash course on Scala syntax aimed at Spark developers with no Scala experience. It’s a tutorial on the most important features and idioms of Scala that you need to use Spark’s Scala APIs. You might find it complements what you’ve already learned.

My book, Programming Scala, 3rd Edition, is one to consider as you go deeper into Scala features and start applying it to projects. I wrote it with working developers in mind. It’s intended to be comprehensive, but it only touches on some areas, data topics for example. It’s coming out in a month or two. One of the great things about Scala 3 syntax is a new “optional indentation” feature that makes Scala code look a lot more like Python (i.e., almost no {}). It’s controversial, but I’ve grown to really like it and I use it exclusively in the book.

That leads me to a final point. As rich as the Scala ecosystem is, it doesn’t have the breadth of data-centric libraries and tools that Python has, so be aware of that. Spark is a great flagship tool. There are some powerful and interesting Scala libraries that are relevant, like Typelevel Spire for numerics, as well as various other Typelevel projects. Finally, many of the popular ML frameworks, like TensorFlow, have Java APIs that are easy to use from Scala.

Good luck!

Thanks; I took a quick look at it. It was published in 2017 - how up-to-date is that? Looks like the version they’re running (in chapter 1) is 2.11.4, but I don’t know how much has changed between then and now (2.13.5). I’m recalling a time when I looked at a C# + .NET programming book that was rather out of date only a year or two after publication because .NET Core had just been released.

Yes, at first glance, I love that Li’s tools are shameless copies of Python libraries - I will likely use Ammonite as my go-to REPL because it looks so much better than Scala’s default prompt (just like IPython is so much better than Python’s default).

Also, I’ve already pulled down a copy of the free portion of Li’s book.

Do you include “Java ecosystem” or “JVM ecosystem” in your definition of “Scala ecosystem?” I ask because it seems clear that the two former are larger, maybe much larger, than the latter is, if only because of how much longer Java has been around than Scala, and how ubiquitously it’s used.

For what it’s worth, pandas, numpy, and matplotlib/plotly are sufficient for 95% of the analytics I do. I’m careful not to describe myself as a data scientist because I stay away from ML frameworks.

Thanks. Yeah, Java / JVM ecosystem is what I’m hoping to figure out how to wrap my head around. Maybe, once I learn more about Java - Scala interoperability, I’ll stop wishing that. However, as an example of why I think the larger Java ecosystem will be useful, I know I’ll want to find some equivalent to Python’s itertools that I can plug into a personal Scala project that will involve iterating through a large combinatorial space in parallel. A quick dumb search in Scaladex doesn’t bring anything up that looks promising, but Google brings up several potential Java libraries. Of course, I may be getting ahead of myself and will find I can implement permutations, combinations (with and without repetitions), products, and chains using Scala’s standard library, so I don’t need anything external.
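
For what it’s worth, here is a rough sketch of how those itertools operations might map onto the standard collections, assuming Scala 2.13. The object name ItertoolsSketch and the helper combinationsWithReplacement are made up for illustration; the standard library has no built-in for that last one as far as I can tell. One caveat: Scala’s permutations and combinations skip duplicates when the input contains equal elements, whereas itertools treats elements as distinct by position.

```scala
object ItertoolsSketch {
  val items = List(1, 2, 3)

  // itertools.permutations(items): an Iterator over every ordering.
  val perms: List[List[Int]] = items.permutations.toList

  // itertools.combinations(items, 2): an Iterator over 2-element subsets.
  val combos: List[List[Int]] = items.combinations(2).toList

  // itertools.product(items, "ab"): a for-comprehension over both inputs.
  val product: List[(Int, Char)] =
    for { n <- items; c <- List('a', 'b') } yield (n, c)

  // itertools.chain(a, b): concatenate iterators with ++.
  val chained: List[Int] = (items.iterator ++ Iterator(4, 5)).toList

  // itertools.combinations_with_replacement(xs, k): hypothetical helper,
  // not part of the standard library. Each suffix of xs contributes the
  // combinations that start with its first element.
  def combinationsWithReplacement[A](xs: List[A], k: Int): Iterator[List[A]] =
    if (k == 0) Iterator(List.empty[A])
    else xs.tails.flatMap {
      case Nil => Iterator.empty
      case suffix @ (head :: _) =>
        combinationsWithReplacement(suffix, k - 1).map(head :: _)
    }
}
```

Since permutations and combinations return iterators, a large combinatorial space does not have to be materialized in memory; the toList calls above are only there to make the results easy to inspect.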

Good question! I haven’t taken a specific look at that aspect of the book, but it’s probably not a major problem. 2.12 tweaked some details, but mostly added a few subtle new features. 2.13 was mainly a dramatic rewrite of the standard Collection types, but that was more an internal restructuring – the details of the type hierarchy changed, and the implementation details changed a lot, but the user-level types came out looking largely the same as they had been, with some tweaks at the power level. Don’t be astonished if one or two functions work a little differently than the book describes them, but it’s fairly likely that everything described there is largely unchanged.

So my suspicion is that it’s mostly fine for the language itself. The tooling situation has improved dramatically since the 2.11 days (it’s been a major focus over the past few years), so take anything about that with a grain of salt.

And note that Scala 3 is going to be released soon, likely within the next month or so. That is a pretty deep and serious rewrite of the language, so it’s worth thinking about whether you want to focus on Scala 2 (the code that’s out there) or Scala 3 (the new stuff). There’s been a massive effort to make upgrading decently straightforward, and provide compatibility between 2 and 3 (everyone is very conscious of the risks of a major language upgrade), but expect to see both forms in common usage for the next couple of years.

Yeah, I don’t know of any silver bullets there, save the usual “Google is your friend”. Like I said, the main thing to keep in mind is that Java and Scala think a bit differently, so working with Java libraries requires some compromise on Scala best practices. In particular:

  • Scala idiom tries to avoid null whenever possible, but it is extremely common in the Java world. The standard workaround is to wrap all calls to Java functions that may return null in an Option() constructor, which will transform nulls into Scala Options.
  • Scala tends to focus on immutability; Java tends to focus on mutable data types. That mostly is what it is – just keep it in mind as you code.
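
To make both points concrete, here is a minimal sketch assuming Scala 2.13 or later (where scala.jdk.CollectionConverters is available). The object name JavaInteropSketch and the sample keys are invented for the example; the Java classes are the ordinary java.util ones.

```scala
import java.util.{ArrayList => JArrayList, HashMap => JHashMap}
import scala.jdk.CollectionConverters._

object JavaInteropSketch {
  // A plain Java map: get() returns null when the key is missing.
  val javaMap = new JHashMap[String, String]()
  javaMap.put("lang", "scala")

  // Option(...) turns null into None and any other value into Some(value).
  val present: Option[String] = Option(javaMap.get("lang"))
  val missing: Option[String] = Option(javaMap.get("nope"))

  // From here on, no null checks: work with the Option instead.
  val shout: String = missing.map(_.toUpperCase).getOrElse("unknown")

  // A mutable Java list, copied into an immutable Scala List at the boundary.
  val javaList = new JArrayList[Int]()
  javaList.add(1)
  javaList.add(2)
  val frozen: List[Int] = javaList.asScala.toList

  // Later mutation of the Java list does not affect the copied Scala List.
  javaList.add(3)
}
```

The idea is to do the wrapping and copying once, right where the Java call happens, so the rest of the code only sees Option and immutable collections.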

Okay, thanks for the summary. 2.12’s release was…a while ago. And 2.11.4 was released Oct 2014. It surprises me a little that a 2017 book on Scala programming uses a version that’s so old…but then again, I don’t know the Scala community or Scala’s cadence.

Well, hmmm. I don’t want to shelve my learning the language for a month (or several) until Scala 3 comes out, along with accompanying learning materials. But I also wonder what degree of relevance current Scala books will lose when v3 is released. My general preference is to stick with what’s current, especially if learning the old followed by the new means learning things twice.

Yeah, some people complain very loudly about Python’s indentation feature. It doesn’t change Python’s popularity.

Given how close Scala 3 apparently is, I’ll be looking for the release of your updated book. Does anyone have visibility into the timelines for updates of other books?

Thanks for the heads up - I’ll keep your pointers in mind as I learn.

Let’s put it this way: the concepts aren’t changing a huge amount, especially the normal application-level ones. The most conspicuous change is arguably the least important – the optional-braces thing makes it look very different but doesn’t change anything meaningful. The zillion different meanings of the word “implicit” are being replaced by more specialized constructs, but the underlying concepts are largely still the same. And there are a lot of new power features in the language, but they are mostly adding to stuff that was already there.
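
As a rough illustration of the first two points, here is a small sketch that should compile with Scala 3 only. All the names in it (classifyBraces, classifyIndent, Show, intShow, describe) are invented for the example. It shows the same function in braces and indentation style, and a type class instance written with given/using where Scala 2 would have used implicit.

```scala
// Scala 2 style, still accepted by Scala 3: braces delimit blocks.
def classifyBraces(n: Int): String = {
  if (n < 0) {
    "negative"
  } else if (n == 0) {
    "zero"
  } else {
    "positive"
  }
}

// Scala 3 optional-braces style: indentation delimits blocks.
def classifyIndent(n: Int): String =
  if n < 0 then "negative"
  else if n == 0 then "zero"
  else "positive"

// A hypothetical type class, declared with indentation syntax.
trait Show[A]:
  def show(a: A): String

// Scala 2 would declare this instance as an `implicit val` or `implicit object`.
given intShow: Show[Int] with
  def show(a: Int): String = s"Int($a)"

// Scala 2 would write the parameter list as `(implicit s: Show[A])`.
def describe[A](a: A)(using s: Show[A]): String = s.show(a)
```

Both classify functions behave identically, and describe(42) resolves the Show[Int] instance the same way an implicit parameter would have in Scala 2.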

(A few things are going away – most notably the never-officially-released Scala 2 macro system – but that’s not stuff you would deal with while initially learning the language. The new restrictions on implicit conversions are probably the most user-visible change; they exist mainly because implicit conversions have proven to be fairly dangerous if used casually.)

So learning Scala 2 is still reasonably worthwhile if you want to dive in – the important stuff translates fairly easily, and the documentation goes into excellent detail about what’s changing. OTOH, depending on what you want to get out of it, you might want to just dive directly into Scala 3.

Things are preparing to come out, although folks are waiting for the final release before calling them done. For example, you can get the Scala 3 edition of Programming in Scala in preprint. And a bunch of stuff will be released shortly after the official language release – for example, they’ve redone the Coursera courses for Scala 3, and plan on releasing them soon.

So basically, we’re at the “waiting with bated breath” stage in the community – once we get an RC that doesn’t turn up any major bugs, it’ll be released, and a lot of libraries and resources are preparing to release as soon as possible afterwards.

Thanks! I was looking for that. Found the blog post about the roadmap from 2 to 3, but wasn’t sure what the changes are.

Oof. I was on the Artima website for the 4th edition about 2 days ago. Didn’t see the preprint. Guess I didn’t do my HW very well…

Either way, thanks for the pointers!

I just remembered two other libraries you should check out.

Disclaimer: I haven’t had the opportunity to use either of them, so caveat emptor.

Dean

…interesting…very interesting.

From the ACM paper about it last year:

Today, Python is the dominant language for data science with a plethora of machine learning and scientific computing libraries. Scala, on the other hand is the dominant language for big data processing and is widely used across the industry in production systems through platforms like Spark. With machine learning and big data analytics becoming critical components of modern products, developers often find themselves switching frequently between the two. After data scientists experiment with data models in Python, software developers must rewrite these models in Scala for production use. What if we could bridge the gap with a common language for both research and production?

Color me interested. At this point, the biggest question is about the overhead required for interoperability. The authors say it’s low, but I’ll have to read the rest of the paper.

Author of the Gallia Project here. Funny I landed on this post completely by chance, while searching for a Scala.js-like project for Python (I will set up alerts from now on, to be notified).

Interestingly the reason I was searching for such a Python-related project is because time and again I find myself impressed with their tools (seaborn specifically here). To give perspective in the context of Gallia, at some point I was considering the kind of visualisation libraries I could offer support for (the way I “support” mongodb for instance). But the libraries I found in Scala felt over-complicated, or had giant dependency graphs that seemed unwarranted. The Java ones of course felt clunky. And then sure enough, I stumbled upon seaborn: easy to use, to the point, pretty. Now the thing is Python itself is something I dislike generally speaking, for reasons that should be obvious to users of this forum at least. But I think Scala must find a way to better bridge the wonderful libraries that exist in Python world, because their pragmatism is what the Scala ecosystem most crucially misses in my opinion (as illustrated in Gallia’s goals).

Now regarding how Gallia fits in the picture for @phendric, it would probably be most comparable to Python’s pandas. One major difference however would be the fact that Gallia isn’t a “dataframe” library, in the sense that it does not expect the data to be tabular. I’m not a pandas expert, but looking at this SO answer for instance makes me think that pandas might at least feel a bit unnatural when dealing with nested structures. Here’s what I mean: example of handling gene interactions with Gallia, for a given gene (“Genemania” dataset). Gallia also offers a shorthand to help with referencing nested fields as if they were top-level ones: see documentation. Lastly these two examples show what some pandas processing looks like in Gallia:

In terms of other data manipulation libraries on the JVM (excluding big data tools), I’m also aware of these:

  • Scala: Saddle and Frameless
  • Java: Tablesaw
  • Kotlin: Krangl

I have however only played around with them a little, not enough that I could really comment on their merits.

Hope that helps!