Can Spark help me?

As I explained in an earlier post, I am having problems with excessive computation time. Some problems are taking too long to solve for real time, even with parallel Vector processing (".par"). It occurred to me today that perhaps Spark can help, but I have never used it. I have read a few high-level overviews, but it is not clear to me that it can help me. Is it possible to convert parallel Vector processing into Spark code for improved speed? If so, where can I find a tutorial on it? Thanks.

If it is just a question of simple compute (i.e., you are not leveraging map/flatMap over a dataset), I believe not. Spark is a framework designed for parallel computation on distributed data. The infrastructure, even though you can run it on a single machine, is quite “heavy”. Ultimately you will consume more resources without any noticeable advantage.

If your work is compute heavy, then think about taking advantage of the CPU and GPU. You may consider libraries that can use BLAS and friends for matrix calculations. These are usually accessed via Java JNI libraries; see for example the Java deep learning libraries.
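For example, here is a minimal sketch with Breeze, a Scala numerics library that can delegate dense linear algebra to native BLAS when the netlib natives are on the classpath (whether this helps you depends entirely on whether your inner loop is really matrix math):

import breeze.linalg._

// Sketch only: Breeze dispatches dense matrix operations to native BLAS (gemm)
// when the native netlib libraries are available, and falls back to pure JVM code otherwise.
val a = DenseMatrix.rand(2000, 2000)
val b = DenseMatrix.rand(2000, 2000)
val c = a * b // matrix multiply, potentially running on optimized native code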

You may also be interested in an OpenCL solution - see TornadoVM. Having said this, TornadoVM is based on GraalVM - the GraalVM Enterprise Edition may also help.

Finally, you should profile your code. Many times it is not the compute that is the issue but memory management. If you are communicating between threads, livelocks may also be a problem. Review the code and use imperative programming for the critical parts.

HTHs

2 Likes

Can you provide more details?

What is the nature of your data? What format is it in, and where is it stored?

What kind of processing are you doing?

How does your program interact with the external world? CLI, API (REST or GraphQL), gRPC, email, etc.?

You say that your code is too slow; what does that mean? How much time does it take, and how much would you expect? Where are you running it right now?

1 Like

Thanks for the replies. Let me start by clarifying that the whole point of using Spark, if I understand it correctly, is to bring in more computers to solve the problem at hand. So I assume that running Spark on a single machine would be pointless (except perhaps for testing).

I am working on an automation concept for air traffic control that involves resolving conflicts between trajectories. I have a fast-time simulation that starts with stored default trajectories that need to be deconflicted. I won’t get into the details, but one of the deconfliction methods involves generating a large number of candidate maneuvers and checking them for conflicts with the current traffic. The “current traffic” could typically be around 50 to 60 active flights in the terminal airspace serving a major airport, each of which has already been assigned a conflict-free trajectory.

For example, one resolution method involves delaying takeoff for departures in time steps of 30 seconds up to two or three minutes. Another method involves holding the departure climb at a “temporary altitude” for some period of time to allow other traffic to pass. The temporary altitude could be any altitude from 4,000 to 15,000 feet in increments of 1,000 feet, and the hold time could be anywhere from 2 to 6 minutes in increments of 2 minutes.

Now if we permute all those parameters (takeoff delay, temp altitude, and altitude hold time), we end up with several hundred candidate trajectories, which are then sorted by the resulting “cost” (delay) and tested for conflicts. The first one found to be free of conflicts is then selected. (Actually, it’s a bit more complicated than that, but I don’t want to get into too much detail here.)
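Roughly speaking, the candidate generation amounts to a nested loop over that parameter grid. Something like the following sketch, where makeCandidate is just a stand-in for the code that builds the modified trajectory:

val takeoffDelays = 0 to 180 by 30         // seconds
val tempAltitudes = 4000 to 15000 by 1000  // feet
val holdTimes     = 2 to 6 by 2            // minutes

val grid =
  for
    delay <- takeoffDelays
    alt   <- tempAltitudes
    hold  <- holdTimes
  yield makeCandidate(delay, alt, hold)   // placeholder for the real trajectory builder

val candidates: Vector[Traj] = grid.toVector
// 7 x 12 x 3 = 252 combinations, i.e. "several hundred" candidates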

In any case, the problem is that for some traffic scenarios the full list of several hundred candidate trajectories needs to be tested, and that can take well over a minute, even with parallel Vector processing. That is too long for real time.

Moreover, I would like to be able to generate even more candidates with higher resolution (smaller steps) in takeoff delay and altitude hold time, but that quickly increases the required computation time even more.

And yes, I realize that the algorithm itself could be made more efficient, but I am just trying to get an idea of what Spark can do, if anything, to bring more computational power to bear. Thanks.

1 Like

Starting Spark would take minutes before it begins processing, so I do not think that would help a lot.

Unless you have an always-running Spark cluster that is receiving jobs to be executed; but again, just the parallelization and distribution of the task could cost more than what you gain. Spark is intended for Big Data processing; it is intended to solve problems in hours rather than days, not in seconds rather than minutes.
The Streaming module of Spark may be useful; but it is usually used in contexts where minutes are okay.
A different, and maybe more useful, approach could be an Akka cluster.

However, I doubt any approach would really help; from the way you describe your problem, I do not think it is solvable by using multiple machines. You want to solve your problem in seconds, and network communication between the machines will eat that time before they do anything useful with the data.

OK, thanks. I will look into Akka clusters. I assume that “cluster” means multiple computers. Has anyone out there ever used an Akka cluster to get computation times from minutes down to a few seconds?

Starting a Spark job won’t take “minutes”, and if you have a Spark cluster you can run “interactive” queries using Livy. This will give you quite reasonable response times for smaller jobs (if the Spark session is instantiated and running you can get sub-second response times). Maybe not in the milliseconds, but still in the seconds. Though there is a cluster scheduler involved, so the responses might not be as deterministic as necessary.

That said, Spark won’t magically solve your problem unless it’s parallelizable and you know how to split the workload. If your job is suitable for a map-reduce kind of solution and response times in the multiple seconds are fine, Spark might fit the bill.

Though you would have to factor in operating a Spark cluster and learning Spark well enough to adapt your code to it (if that is even possible). It might be cheaper to just buy a much larger computer and hope that solves the problem with the current code.

And if you still want to go with a cluster solution you have other options, like the mentioned Akka.

Still, if operating a cluster is a viable solution and response times in the seconds are OK, you should definitely evaluate Spark before choosing a path. I wouldn’t rule it out off the cuff.

1 Like

A Spark job, like one of these trial scenarios, would not take more than seconds to start, IF the cluster daemons are already running and there are resources available for the job, so it doesn’t have to wait for scheduling. From the problem description, it appears these calculations would be done frequently, so you would probably have a deployed cluster available to which you submit jobs.

Spark would only be optimal if you can structure the computation as “try all combinations at the same time” (or chunks of them). If the optimal algorithm is highly iterative, say it starts with trial values and iterates to a solution, then Spark is less optimal.

Spark is highly optimized internally. Even if it’s not a perfect fit, you might find it takes you a long way towards your goal. I certainly think it’s worth a try, starting on your workstation/laptop. How does it compare to your current approach?

1 Like

Thanks again for all the replies. The ultimate goal is to get a future version of this research software installed at an air traffic control facility operated by the FAA. That is a long way off, but if it ever happens, they would obviously be able to afford a dedicated cluster of computers. In the meantime, I am just trying to demonstrate the feasibility of the concept, including computational feasibility. I should be able to find a “cluster” of computers within my organization to test and demonstrate the concept.

My trajectory class (called “Traj”) is an immutable case class. The idea is that trajectories are assigned one by one (as each flight takes off or enters the airspace), with each new trajectory being free of conflicts with all previously assigned trajectories. When a list of maneuver candidates is generated, it is sorted by “cost” (the delay resulting from the modification of the original trajectory), then each candidate trajectory is tested for conflicts. If it is free of conflicts, it can be selected and assigned.

In most cases, a conflict-free trajectory is found very quickly, usually within a second or two. However, in some cases, no conflict-free trajectory is found, and then other methods are tried, including trying two maneuver types combined (e.g., a temporary altitude combined with a reroute). In those cases, the entire list of candidates may have to be tested, which can take well over a minute. But each test is independent and can be done in parallel, which is what I am currently doing with parallel Vector processing. In other words, I just add “.par” to the “for” loop for testing the candidates.

If I were to use Spark or Akka, the first step would be to determine for each case whether the resolution can be found quickly, in which case it makes no sense to fire up a cluster.

So what are the advantages and disadvantages of Spark vs. Akka clusters for this problem domain?

They’re really very different approaches, but I tend to think the heart of it is high vs. low level. Spark is a fairly high-level approach, where you describe the distributed computation without worrying too much about the nuts and bolts of the underlying communication. Akka is fairly low-level, giving you a very high degree of control over the details: it lets you distribute the work amongst a number of Actors spread around the cluster, but (typically, at least) it doesn’t even try to pretend that making all of this reliable is easy.

Hope I am not increasing entropy here. Take this with a grain of salt.

With respect to Spark, which I have not used in quite a while, you organize work as a map + reduce job (not referring to streaming). In your case the map would generate the candidate trajectories - these are done concurrently/in parallel. The reduce combines these to decide if the trajectories are viable (I am assuming you cannot do this separately). Note that Spark assumes that the reduce operations are associative - order is not guaranteed (a Monoid?). One issue here is that the map works on data - which does not seem to be your case. How are the candidates generated? Is it cheap? Do you need global state for this?
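For what it’s worth, a minimal sketch of such a job with the RDD API might look like this. It assumes Traj is serializable, the candidates have already been generated on the driver, the current traffic is small enough to broadcast, and isConflictFree is your existing conflict check (all names here are placeholders):

import org.apache.spark.sql.SparkSession

// Sketch only. Assumes `candidates` and `currentTraffic` already exist on the driver
// and that isConflictFree(t, traffic) is the existing conflict check.
val spark = SparkSession.builder().appName("deconflict").getOrCreate()
val sc = spark.sparkContext

val trafficBc = sc.broadcast(currentTraffic)     // currently assigned trajectories
val candidateRdd = sc.parallelize(candidates)    // "map" side: distribute the candidates

val conflictFree = candidateRdd.filter(t => isConflictFree(t, trafficBc.value))

// "reduce" side: keep the lowest-cost conflict-free candidate, if any
val best: Option[Traj] =
  if (conflictFree.isEmpty()) None
  else Some(conflictFree.min()(Ordering.by((t: Traj) => t.cost)))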

In the case of Akka you have to decide how to define and distribute a task. For compute workloads that are embarrassingly parallel, I tend to opt for a “work stealing” style of structure. You have a set of consumers. When these are ready (idle) they take/request a task. They then compute and return the result to the “requester”. You have one or more producers that generate requests for trajectory generation and/or testing. You hand these out to available actors, get the results, collate them, and produce the final outcome. You will have to code all of the messaging and state for the actors. You also have to configure the start-up of the actors. This will take some work.

Having said this, the solutions above increase latency considerably (given N trajectories, what is the big-O cost of collecting and selecting?). I only use these solutions if the compute time is significant - seconds to hours. Again, I don’t know if this is a good fit for you. I have a hunch that it is only viable if you can also parallelize the testing and selection of trajectories (maybe trajectory generation is very cheap).

As for the results you have, do you have any idea how it scales? I would personally stick to threads only at first, test this on a good machine (64 CPUs ~ 128 threads) and see how the workload scales as a function of the number of threads. Measure each step and see where the bottleneck is.
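For example, a trivial timing wrapper is often enough to see where the time goes before reaching for a cluster (generateCandidates and firstConflictFree below are just stand-ins for your own steps):

// Minimal sketch: wrap each step and print how long it took.
def timed[A](label: String)(block: => A): A =
  val t0 = System.nanoTime()
  val result = block
  val ms = (System.nanoTime() - t0) / 1e6
  println(f"$label%-12s $ms%10.1f ms")
  result

// usage (names are placeholders):
// val candidates = timed("generate")(generateCandidates(flight))
// val chosen     = timed("test")(firstConflictFree(candidates))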

HTHs

2 Likes

The construction of the candidate trajectories is relatively cheap compared to testing them for conflicts against all current traffic. The candidate trajectories are sorted by “cost”, and the one with the lowest cost that is free of conflicts is selected.

Once a candidate is selected, no others have to be tested, so I have to be careful about how I test them in parallel. If I have more candidates than cores on my machine (which is often the case, depending on the resolution maneuver type), then I will often end up testing many candidates unnecessarily.

To do this, I start with

val ncores = Runtime.getRuntime().availableProcessors()

to determine how many cores I have available, which comes up as 32 on the machine I normally use. I then reduce that by two to allow two cores to be used for other purposes:

val npar = max(ncores - 2, 1) // number of parallel threads

I then group the candidates in groups of that number and test each group in sequence.

for group <- candidates.grouped(npar) do ...

If npar is too high, there won’t be enough cores for each group, and efficiency will suffer, I assume.
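Put together, what I am doing amounts to roughly the following (isConflictFree stands in for my actual conflict check against the current traffic; I stop as soon as a group yields a conflict-free candidate):

import scala.collection.parallel.CollectionConverters.*  // provides .par on standard collections
import scala.math.max

// Rough sketch of the current approach; Traj is my trajectory case class and
// isConflictFree(t: Traj): Boolean is the existing conflict check.
def firstConflictFree(candidates: Vector[Traj]): Option[Traj] =
  val ncores = Runtime.getRuntime().availableProcessors()
  val npar = max(ncores - 2, 1)            // number of parallel threads
  val sorted = candidates.sortBy(_.cost)   // lowest delay (cost) first
  var result: Option[Traj] = None
  val groups = sorted.grouped(npar)
  while result.isEmpty && groups.hasNext do
    val group = groups.next()
    // test one group in parallel; filter preserves order, so headOption is the
    // lowest-cost conflict-free candidate within this group
    result = group.par.filter(isConflictFree).seq.headOption
  result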

In any case, I would have to map that “for” loop to Spark or Akka clusters.

@Russ So you send off a batch of trajectories for testing. In Akka’s case, one approach would be to have npar “consumer” actors and one “producer”. The producer generates the set of trajectories and sends them off to the consumers. Each consumer can then respond with the result (cost). The producer does the sorting (say using a priority queue) and decides if more trajectories should be tested.

Here you have some choices to make. The simplest solution is to hand off one trajectory per consumer; a rough sketch follows the list below. As soon as the producer has a final decision, no more tasks are generated. At maximum you will have npar - 1 wasted computes (not approximately npar × the number of trajectories). This increases total compute latency but wastes less actual CPU time, depending on the number of “rounds” you go through.

Important points here:

  • The consumers must flag when they are ready for processing (the producer does not take the initiative to select a consumer)
  • You do not generate all trajectories up front (the producer loops)
  • You have other options for structuring the interactions: multiple producers, actors in hierarchies, etc.
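A rough sketch of this producer/consumer shape with Akka Typed might look something like the following. Traj and isConflictFree are placeholders for your existing code, and a real version would also handle the case where a cheaper candidate is still outstanding when the first conflict-free result arrives:

import akka.actor.typed.{ActorRef, Behavior}
import akka.actor.typed.scaladsl.Behaviors

// Placeholders for the existing code:
// final case class Traj(cost: Double, ...)
// def isConflictFree(t: Traj): Boolean = ...

object Worker:
  final case class Test(traj: Traj, replyTo: ActorRef[Producer.Result])

  def apply(): Behavior[Test] =
    Behaviors.receive { (ctx, msg) =>
      // run the (expensive) conflict check and report back, identifying ourselves
      msg.replyTo ! Producer.Result(msg.traj, isConflictFree(msg.traj), ctx.self)
      Behaviors.same
    }

object Producer:
  final case class Result(traj: Traj, free: Boolean, worker: ActorRef[Worker.Test])

  // `candidates` is assumed to be pre-sorted by cost
  def apply(candidates: List[Traj], npar: Int): Behavior[Result] =
    Behaviors.setup { ctx =>
      val workers = (1 to npar).map(i => ctx.spawn(Worker(), s"worker-$i"))
      val (first, rest) = candidates.splitAt(workers.size)
      workers.zip(first).foreach { case (w, t) => w ! Worker.Test(t, ctx.self) }
      running(rest)
    }

  private def running(remaining: List[Traj]): Behavior[Result] =
    Behaviors.receive { (ctx, msg) =>
      if msg.free then
        ctx.log.info(s"selected trajectory with cost ${msg.traj.cost}")
        Behaviors.stopped                              // no further candidates are handed out
      else
        remaining match
          case next :: tail =>
            msg.worker ! Worker.Test(next, ctx.self)   // the idle worker gets the next candidate
            running(tail)
          case Nil =>
            Behaviors.same                             // nothing left; wait for outstanding results
    }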

I still think Spark is not the best solution because you would have to generate all trajectories up front. Also, with Akka you are not forced to send off a whole set (group) of trajectories, whereas cancelling/stopping a trajectory test in Spark will be a problem.

It has also occurred to me that you could use alternative thread-based solutions such as Monix or Cats Effect. They also provide the notion of cancelable tasks. It does, however, require effort to learn and use these libraries.

HTHs.

Does parallel Vector processing (".par") work on a Linux cluster (e.g., a Raspberry Pi cluster)? If not, how hard would it be to make it work? Any ideas? Thanks.

In addition to TornadoVM, there’s also Aparapi – these are two different ways to access computing resources available through OpenCL. I suspect either way you’re going to need to do some serious analysis of the opportunities for parallelism in your specific problem. You really want to find the “embarrassingly parallel” parts of your algorithm and try to target those for running through OpenCL.

Brian Maso

1 Like

Note that a typical home computer’s GPU (available through TornadoVM or Aparapi) has over 1,000 “cores”, while even a recent CPU chip has a couple of orders of magnitude fewer. If the problem involves churning through a great number of similar, CPU-bound computations, an OpenCL-based solution might be just the ticket.

Brian Maso

Interesting. Can a GPU “core” do any general computation that a CPU core can do? And is it possible to set up a system so that parallel Vector processing can make use of those cores with no (or minimal) changes to the original Scala code?

I assume that my HP ZBook engineering workstation does not have a GPU, but I’d still like to explore the possibilities. As I said before, my code is a research prototype that does air traffic conflict detection and resolution, and some traffic scenarios take minutes to resolve one flight even with parallel Vector processing. That is too long for real time. I don’t need a solution now for my fast-time simulation, but I would like to at least know that there is a viable path forward for future use. If I can implement a demo version, that would be great, but it is not absolutely necessary at this time.