Hello,
We want to build a catalog of Spark transformations using the strongly typed Dataset API. Dataset is a generic class whose type parameter T represents the schema of the underlying data.
Users are free to define their own transformations, which we call “Definitions”; a Definition is nothing but Spark SQL code. Definitions have inputs, which are simply other Definitions.
A Definition class would look like this (subject to change per your suggestions, nothing is set in stone):
abstract class Definition {
  def name: String
  def upstreamInputs: Set[Definition]
  def transformation(inputs: List[Dataset[_]]): Dataset[_]
}
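One direction we are considering (not sure it is the right one): give Definition a type parameter for its output schema, keep the inputs untyped at the framework boundary, and let each definition narrow its own inputs, since only it knows their schemas. Below is a minimal self-contained sketch; the Dataset case class is a stand-in for org.apache.spark.sql.Dataset (in real Spark the narrowing would be `.as[T]` with an Encoder rather than a cast), and the User/Order/Revenue schemas are made up for illustration. Note we also switch upstreamInputs from Set to Seq so input positions are stable.

```scala
// Stand-in for org.apache.spark.sql.Dataset[T], to keep the sketch runnable.
final case class Dataset[T](rows: Seq[T])

abstract class Definition[Out] {
  def name: String
  // Seq, not Set: positional inputs need a stable order.
  def upstreamInputs: Seq[Definition[_]]
  // The framework only ever sees Dataset[_]; each definition narrows its
  // own inputs back (one unchecked cast per input, .as[T] in real Spark).
  def transformation(inputs: Seq[Dataset[_]]): Dataset[Out]
}

// Hypothetical schemas for illustration.
final case class User(id: Long, name: String)
final case class Order(userId: Long, amount: Double)
final case class Revenue(userId: Long, total: Double)

object Users extends Definition[User] {
  val name = "users"
  val upstreamInputs: Seq[Definition[_]] = Seq.empty
  def transformation(inputs: Seq[Dataset[_]]): Dataset[User] =
    Dataset(Seq(User(1, "a"), User(2, "b"))) // a source has no inputs
}

object Orders extends Definition[Order] {
  val name = "orders"
  val upstreamInputs: Seq[Definition[_]] = Seq.empty
  def transformation(inputs: Seq[Dataset[_]]): Dataset[Order] =
    Dataset(Seq(Order(1, 10.0), Order(1, 5.0), Order(2, 7.5)))
}

object RevenuePerUser extends Definition[Revenue] {
  val name = "revenue_per_user"
  val upstreamInputs: Seq[Definition[_]] = Seq(Users, Orders)
  def transformation(inputs: Seq[Dataset[_]]): Dataset[Revenue] = {
    val users  = inputs(0).asInstanceOf[Dataset[User]]  // in Spark: .as[User]
    val orders = inputs(1).asInstanceOf[Dataset[Order]] // in Spark: .as[Order]
    val totals = orders.rows.groupBy(_.userId).map { case (u, os) =>
      Revenue(u, os.map(_.amount).sum)
    }
    Dataset(totals.toSeq)
  }
}
```

The downside is that the casts are unchecked, but they are confined to the body of each definition, which is also the only place that knows the input schemas.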
The issue is that we don’t know how to type the transformation method’s signature, since different Definitions can have different numbers of inputs, each with its own schema. We’re not sure how a centralized framework could even call a transformation method that can take so many shapes.
The flow of our framework would look like this:
- we receive a stimulus that the definition named X needs to be executed
- we get its upstreamInputs and resolve them (from external storage), preparing a Dataset[T] for each input, where T is the schema of that input’s definition. With multiple upstreams we’ll end up with a bunch of Dataset[X], Dataset[Y], …
- we call the transformation method, passing all the Datasets we just resolved from external storage
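Sketched end-to-end, the flow above is roughly this (again with a stand-in Dataset, and a hypothetical in-memory Storage object standing in for external storage, keyed by definition name):

```scala
// Stand-in for org.apache.spark.sql.Dataset[T].
final case class Dataset[T](rows: Seq[T])

abstract class Definition[Out] {
  def name: String
  def upstreamInputs: Seq[Definition[_]]
  def transformation(inputs: Seq[Dataset[_]]): Dataset[Out]
}

// Hypothetical storage layer; in reality this would read/write Parquet,
// a table, etc., keyed by definition name.
object Storage {
  private val tables = scala.collection.mutable.Map.empty[String, Dataset[_]]
  def write(name: String, ds: Dataset[_]): Unit = tables(name) = ds
  def read(name: String): Option[Dataset[_]]   = tables.get(name)
}

object Runner {
  // Stimulus: "definition X needs to run". Resolve each upstream from
  // storage (running it first if it is not materialized yet), call the
  // transformation, and write the result back.
  def run(d: Definition[_]): Dataset[_] = {
    val inputs = d.upstreamInputs.map { up =>
      Storage.read(up.name).getOrElse(run(up))
    }
    val out = d.transformation(inputs)
    Storage.write(d.name, out)
    out
  }
}
```

Because transformation only takes Seq[Dataset[_]], the Runner never needs to know any concrete schema, which is what makes it centralizable.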
To summarize:
- We need a catalog of Spark transformations with strong type safety that also represents the relations between them, forming a DAG
- We want to be able to build this DAG so our scheduler can run the definitions in order
- We want a centralized framework that can resolve the inputs from external storage, run the transformation, and write the result back to external storage
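For the scheduler, the DAG is already implicit in upstreamInputs, so a plain post-order depth-first traversal gives a valid run order. A sketch (cycle detection omitted for brevity; the type parameter is also dropped here since the scheduler only needs names and edges):

```scala
abstract class Definition {
  def name: String
  def upstreamInputs: Seq[Definition]
}

object Scheduler {
  // Post-order DFS: every definition is emitted only after all of its
  // upstreams, so running the result left-to-right respects dependencies.
  def topoOrder(target: Definition): Seq[Definition] = {
    val seen = scala.collection.mutable.LinkedHashSet.empty[Definition]
    def visit(d: Definition): Unit =
      if (!seen.contains(d)) {
        d.upstreamInputs.foreach(visit)
        seen += d // insertion order of LinkedHashSet is the topological order
      }
    visit(target)
    seen.toSeq
  }
}
```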
Sorry if I’m not being clear; I don’t have a lot of expertise in generics and their lingo. I think HList could be of help here, or perhaps we should redesign our Definition class in a different way that would allow us to build a generic framework around it.
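A lighter alternative to HLists (which would pull in shapeless) that we have been toying with: give Definition an abstract type member for its input shape, so each subclass gets a fully typed transformation, and the single unchecked cast lives in one framework method instead of in every definition. A sketch, with the same stand-in Dataset and a convention that In is Unit for sources, the dataset itself for one input, and a tuple for two:

```scala
// Stand-in for org.apache.spark.sql.Dataset[T].
final case class Dataset[T](rows: Seq[T])

abstract class Definition {
  type In  // Unit, Dataset[X], or (Dataset[X], Dataset[Y]), per arity
  type Out
  def name: String
  def upstreamInputs: Seq[Definition]
  def transformation(in: In): Dataset[Out]

  // The one place where the framework erases/recovers input types.
  final def runErased(inputs: Seq[Dataset[_]]): Dataset[_] = {
    val in: In = (inputs match {
      case Seq()     => ()
      case Seq(a)    => a
      case Seq(a, b) => (a, b)
      case _         => sys.error("extend for higher arities")
    }).asInstanceOf[In]
    transformation(in)
  }
}
```

The cost is the fixed-arity match (an HList would generalize it), but each definition’s transformation is now fully typed with no casts in user code.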