Makemore in Scala

All of this new AI whizzbangery made me curious about digging into its fundamentals. I came across this video of Andrej Karpathy doing a 101 on starting to build a language model.

And so I figured, why not give it a go? I broke out scala-cli and started pottering through the video (it’s actually quite good fun, recommended!). The first half of the video sets the scene with a purely probabilistic approach, my questionable translation of it being linked below (it isn’t super readable / runnable atm, so I don’t recommend reading it, as that isn’t the point here)**

Part two then launches into neural nets. I’m following along, and we get to a forward pass; it looks simple enough (if tedious). We compute some measure of how good the network is (a so-called “loss function”), and then BOOM.

loss.backward()

https://pytorch.org/docs/stable/generated/torch.Tensor.backward.html

Mind blown. From what I can tell, torch has quietly built up a DAG of the prior calculations behind loss, and walks back through it, differentiating the loss function with respect to its previous calculations. Aside from setting aside some time to dig out my limited undergrad math textbooks and check I even described that correctly, implementing this looks (to me) hard in Scala? Like, gnarly metaprogramming + gnarly math hard.
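My naive mental model of what that one line must be doing (cribbed from Karpathy’s micrograd, and surely nothing like torch’s real implementation) is something like this sketch, where every operation records its parents and a closure that pushes gradients back through it:

```scala
import scala.collection.mutable

// Hypothetical scalar autodiff node: the forward pass builds the DAG
// as a side effect, and backward() walks it in reverse.
final class Value(
    val data: Double,
    val parents: Seq[Value] = Nil,
    val backprop: Double => Unit = _ => ()
):
  var grad: Double = 0.0

  def +(that: Value): Value =
    Value(data + that.data, Seq(this, that),
      g => { this.grad += g; that.grad += g })

  def *(that: Value): Value =
    Value(data * that.data, Seq(this, that),
      g => { this.grad += that.data * g; that.grad += this.data * g })

  // Differentiate `this` with respect to everything that produced it:
  // visit the DAG in reverse topological order, accumulating gradients.
  def backward(): Unit =
    val topo = mutable.ArrayBuffer.empty[Value]
    val seen = mutable.HashSet.empty[Value]
    def visit(v: Value): Unit =
      if !seen.contains(v) then
        seen += v
        v.parents.foreach(visit)
        topo += v
    visit(this)
    grad = 1.0 // d(loss)/d(loss) = 1
    topo.reverseIterator.foreach(v => v.backprop(v.grad))

@main def demo(): Unit =
  val x = Value(2.0)
  val w = Value(3.0)
  val loss = x * w + w // the forward pass records the DAG as it goes
  loss.backward()
  println(s"dloss/dw = ${w.grad}") // x + 1 = 3.0
```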

Am I misreading the difficulty here? Does anything like it exist? I always wondered what the differentiating (<- hah!) features of PyTorch et al. were. I feel like I might have stumbled on one?

** GitHub - Quafadas/makemore: An autoregressive character-level language model for making more things

1 Like

There is one Java project, GitHub - haifengl/smile: Statistical Machine Intelligence & Learning Engine, which provides a Scala API for some neural networks.

In the meantime, I think GitHub - bytedeco/javacpp: The missing bridge between Java and native C++ does a good job of letting Java access popular C/C++ libraries, including torch. But I cannot use it directly in Scala, though it works from Java.

Interesting – what’s preventing you from using it in Scala? Is it just that the API doesn’t “think” in typical Scala idiom, or is there a bigger blocker?

(I don’t know much about the domain, but it’s unusual to see a Java API that can’t be used from Scala.)

1 Like

Thanks for your interest. I did that experiment a long time ago and have forgotten the details. I remember that in Scala it could not find the compiled .so files, but Java could. I will rerun that experiment later and ask you for some advice then! I really hope to use javacpp in Scala.

1 Like

If you’re talking about symbolic or automatic differentiation (as opposed to backpropagation or neural networks in general), take a look at Spire:

spire/core/src/main/scala/spire/math/Jet.scala at 4a70e3a330737f7afbdf4a12c8ab04c98fbbf375 · typelevel/spire · GitHub.
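Roughly, you seed an input with an infinitesimal part, do ordinary arithmetic, and the derivative falls out alongside the value. An untested sketch from my reading of the Jet API:

```scala
import spire.implicits.*
import spire.math.{Jet, JetDim}

// One infinitesimal dimension: differentiating with respect to one input.
given JetDim = JetDim(1)

// x = 3 + h, where h is the infinitesimal seed for input 0.
val x: Jet[Double] = Jet(3.0) + Jet.h[Double](0)
val y = x * x + x * 2.0 // f(x) = x^2 + 2x

println(y.real)             // f(3)  = 15.0
println(y.infinitesimal(0)) // f'(3) = 8.0
```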

2 Likes

I believe the Jet class you linked is pretty much exactly what I was looking for - it covers the “gnarly maths” part of the original question by expanding into the dual number / differentiation domain.

I will set aside some time to test that hypothesis and see if I get anywhere with it.

I’m unclear whether it deals with the ‘backward’ propagation / DAG part of the problem. I suspect not - hopefully I can find out through my experiments and / or by asking around on Discord.

Thank you for the pointer :pray:

1 Like

You’re right - it doesn’t deal with anything neural-network specific, such as backpropagation.

It’s all nuts-and-bolts stuff - but I suppose the idea is that those nuts and bolts are made of vanadium-alloyed steel rather than just the mild steel rubbish you get from the local DIY chain.

Also see

GitHub - sbrunk/storch: GPU accelerated deep learning and numeric computing for Scala 3

As well as

Thanks for these. It’s kind of scary that basically all roads here lead back to Torch.

In fact, I’ve nerd-sniped myself to an absurd extent with this question, because the maths is interesting and the implementation fascinating, but the ecosystem implications are kind of scary.

From what I can tell, this “reverse-mode AD” quite fundamentally requires building a stateful DAG of the calculation tree.

That means one can’t simply “drop into” Torch briefly, get the result and carry on, because Torch wouldn’t have the DAG. In fact, my only choice (currently) is to re-write everything I’ve done to this point in terms of Torch? And that seems to be a genuine, hard, technical requirement. How scary is that?

It seems to me that a production-grade implementation of this sets up an absurdly wide, deep moat around the remainder of its ecosystem, exactly because it’s a really hard problem. It is truly fascinating.

Spire gets us forward-mode AD essentially for free. I’m considering looking into whether it could be extended to its reverse counterpart. If anyone has thought this through previously - please don’t be shy!
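For what it’s worth, forward mode might even stretch far enough for a toy network: if I’m reading JetDim right, seeding one infinitesimal slot per input yields the whole gradient in a single forward pass. An untested sketch:

```scala
import spire.implicits.*
import spire.math.{Jet, JetDim}

// Two inputs, two infinitesimal dimensions: one forward pass carries
// both partial derivatives along with the value.
given JetDim = JetDim(2)

val x = Jet(2.0) + Jet.h[Double](0) // seed x in slot 0
val y = Jet(5.0) + Jet.h[Double](1) // seed y in slot 1
val f = x * y + y * y               // f(x, y) = xy + y^2

println(f.real)             // f(2, 5) = 35.0
println(f.infinitesimal(0)) // df/dx = y      = 5.0
println(f.infinitesimal(1)) // df/dy = x + 2y = 12.0
```

The catch, as I understand it, is that the cost grows with the number of inputs, which is exactly why reverse mode wins for networks with many weights.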

This is Grug-level advice here, as I don’t work in this area at all, but while we’re waiting for a passing FAANG person to opine…

If you want to hand-roll your own neural network and apply backpropagation, you’ll have to implement some layer and node abstraction and do the convolution and activation function pieces yourself, along with a traversal strategy that does the forward pass for evaluation and the backwards pass for adjusting network weights / activation thresholds. You may want to have some knowledge as to the fan-in from the previous layer into any given node in the next layer, or you may just do the convolution over a giant fan-in across the whole of the previous layer.

It’s not that hard to do, but you have to manage the logic and details of the fan-ins.
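Very roughly, I’d expect each layer to look something like this untested sketch (fully-connected fan-in, biases and activation functions left out for brevity):

```scala
// Hypothetical dense layer: weights(j)(i) connects input i to output j.
final class Dense(val weights: Array[Array[Double]]):
  private var lastInput: Array[Double] = Array.empty

  // Forward pass: out_j = sum_i w_ji * in_i. Remember the input,
  // since the backward pass needs it.
  def forward(in: Array[Double]): Array[Double] =
    lastInput = in
    weights.map(row => row.zip(in).map((w, x) => w * x).sum)

  // Backward pass: given d(loss)/d(out), nudge the weights and
  // return d(loss)/d(in) for the layer below.
  def backward(gradOut: Array[Double], lr: Double): Array[Double] =
    val gradIn = Array.fill(lastInput.length)(0.0)
    for j <- weights.indices; i <- lastInput.indices do
      gradIn(i) += weights(j)(i) * gradOut(j)
      weights(j)(i) -= lr * gradOut(j) * lastInput(i)
    gradIn
```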

(As an aside, if you’re using a genetic algorithm to adjust the weights / thresholds then the backwards pass is irrelevant).

You still have all those nodes and weights hanging around in memory (or in permanent storage) because presumably you want to train your network over time and keep using it for whatever nefarious purposes you have in mind.

If you use any of the PyTorch/LibTorch front ends, you already have abstractions for the layers, and I suspect you can be fast and loose about fan-in; I presume that sparseness in the weights due to zero entries is optimised. In particular, does the DAG prune links that go through zero-weighting?

Anyway, you just relax and let the dynamically constructed DAG guide how the backpropagation pass works. Once that backwards pass is made, I would imagine that the framework code can discard the DAG until a new forward pass is made, and you wouldn’t need it anyway when repeatedly computing forward passes with a pre-trained network.

All speculation on my part so far.

On a more practical level, if you’re mucking around with this, you’ll get results faster with a canned implementation - and then you’ll know for sure if you really do have a memory problem. Build one to throw away…

1 Like

Riiiiiight, so there has definitely been a miscommunication, and it is probably my fault. To be clear, I should have been far more careful about separating out these two statements. I do still stand behind them! But separately.

It seems to me that a production-grade implementation of this sets up an absurdly wide, deep moat around the remainder of its ecosystem, exactly because it’s a really hard problem. It is truly fascinating.

Spire gets us forward-mode AD essentially for free. I’m considering looking into whether it could be extended to its reverse counterpart. If anyone has thought this through previously - please don’t be shy!

What I am not claiming is that I intend to combine them into the activity of writing such a production-grade algorithm. I assume smart people get paid a lot for such things. These two statements are very separate in my head. That separation may not have been terribly clearly delineated by a paragraph break, however. Acknowledged.

My goal is to get to the end of the tutorial :joy:. I wanted to do it in pure JVM Scala and learn something along the way - hence my willingness to have a go at writing something on top of Spire. To be clear, my goal is to train a tiny (I believe 27-neuron, single-layer) network, on a CPU. I have no fear of resource constraints or horrible performance :slight_smile:. I’m actually not even 100% sure I would need reverse-mode AD to do it.
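(In fact, for the single-layer softmax + cross-entropy setup in the video, I believe the gradient has a closed form, d(loss)/d(logits) = probs - oneHot(target), so a hand-derived sketch like the following might be all I need. Hypothetical, untested:)

```scala
// Gradient of cross-entropy loss w.r.t. the logits of a single example,
// using the closed form: probs - oneHot(target).
def gradLogits(logits: Array[Double], target: Int): Array[Double] =
  val maxL  = logits.max // subtract the max for numerical stability
  val exps  = logits.map(l => math.exp(l - maxL))
  val total = exps.sum
  val probs = exps.map(_ / total) // softmax
  probs.zipWithIndex.map((p, i) => if i == target then p - 1.0 else p)
```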

Curiosity and education should not be slaves to necessity… however…

No bother. :grin: I see - it’s about the journey, not the destination.

For what it’s worth, I sometimes take that approach when I want to learn fundamentals by spiking; I recall using OpenImaj as an image processing framework for putting in my own hand-rolled Haar wavelet transforms because I wanted to get a feel for the mathematical theory, rather than just use a canned implementation from OpenCV or whatever.

Have fun and if it turns into Skynet, let it know it’s amongst friends before it does something rash…

1 Like

Just my 2 cents here. If you want to learn about NNs and the basic building blocks, you could do this with a very simple model, say a 2-layer NN used for classification. Examples using Python and numpy can serve you well here.

LLMs are another beast altogether. These are based on transformers. These models are complex and (very) hard to train (well, maybe not as difficult as some LSTMs or GANs). The storch link above (which uses bytedeco/javacpp under the hood) has an example LLM I tried to implement based on the video. The results are not on par with the video, although technically both use PyTorch and the same training data. Getting good results is, I imagine, beyond most institutions.

As for getting interesting results: even implementing a simple CNN autoencoder from scratch for image processing is hard work. In the end you will most probably find that just implementing the data pipeline is a chore. You also have to consider implementing convolution layers, pooling, dropout, different activation functions, skip connections, … you get the idea. Training these models without a GPU is also very time-consuming.

So, to conclude: try to implement a simple 2-layer classification network. That will most likely give you the insight you want. Don’t tackle LLMs unless you want to go down the rabbit hole.

HTHs.