Hello,
We want to build a catalog of Spark transformations using the strongly typed Dataset API. Dataset is a generic class whose type parameter T represents the schema of the underlying data.
Users are free to define their own transformations, which we call “Definitions”; a Definition is nothing but Spark SQL code. Definitions have inputs, which are simply other Definitions.
A Definition class would look like this (subject to change per your suggestions, nothing is set in stone):
abstract class Definition {
  def name: String
  def upstreamInputs: Set[Definition]
  def transformation(inputs: List[Dataset[_]]): Dataset[_]
}
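One direction we are considering (not sure it is the right one): give Definition a type parameter for its output schema, keep the inputs untyped at the framework boundary, and let each definition narrow its own inputs, since only it knows their schemas. Below is a minimal self-contained sketch; the Dataset case class is a stand-in for org.apache.spark.sql.Dataset (in real Spark the narrowing would be `.as[T]` with an Encoder rather than a cast), and the User/Order/Revenue schemas are made up for illustration. Note we also switch upstreamInputs from Set to Seq so input positions are stable.

```scala
// Stand-in for org.apache.spark.sql.Dataset[T], to keep the sketch runnable.
final case class Dataset[T](rows: Seq[T])

abstract class Definition[Out] {
  def name: String
  // Seq, not Set: positional inputs need a stable order.
  def upstreamInputs: Seq[Definition[_]]
  // The framework only ever sees Dataset[_]; each definition narrows its
  // own inputs back (one unchecked cast per input, .as[T] in real Spark).
  def transformation(inputs: Seq[Dataset[_]]): Dataset[Out]
}

// Hypothetical schemas for illustration.
final case class User(id: Long, name: String)
final case class Order(userId: Long, amount: Double)
final case class Revenue(userId: Long, total: Double)

object Users extends Definition[User] {
  val name = "users"
  val upstreamInputs: Seq[Definition[_]] = Seq.empty
  def transformation(inputs: Seq[Dataset[_]]): Dataset[User] =
    Dataset(Seq(User(1, "a"), User(2, "b"))) // a source has no inputs
}

object Orders extends Definition[Order] {
  val name = "orders"
  val upstreamInputs: Seq[Definition[_]] = Seq.empty
  def transformation(inputs: Seq[Dataset[_]]): Dataset[Order] =
    Dataset(Seq(Order(1, 10.0), Order(1, 5.0), Order(2, 7.5)))
}

object RevenuePerUser extends Definition[Revenue] {
  val name = "revenue_per_user"
  val upstreamInputs: Seq[Definition[_]] = Seq(Users, Orders)
  def transformation(inputs: Seq[Dataset[_]]): Dataset[Revenue] = {
    val users  = inputs(0).asInstanceOf[Dataset[User]]  // in Spark: .as[User]
    val orders = inputs(1).asInstanceOf[Dataset[Order]] // in Spark: .as[Order]
    val totals = orders.rows.groupBy(_.userId).map { case (u, os) =>
      Revenue(u, os.map(_.amount).sum)
    }
    Dataset(totals.toSeq)
  }
}
```

The downside is that the casts are unchecked, but they are confined to the body of each definition, which is also the only place that knows the input schemas.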
The issue is that we don’t know how to type the transformation method’s signature, since different Definitions can have different numbers of inputs, each with its own schema. We’re not sure how a centralized framework could even call a transformation method that can take so many shapes.
The flow of our framework would look like this:
- we receive a stimulus that the definition named X needs to be executed
- we get its upstreamInputs and resolve them (from external storage), preparing a Dataset[T] for each input, where T is the schema of that input’s definition. With multiple upstreams we’ll end up with a bunch of Dataset[X], Dataset[Y], …
- we call the transformation method, passing all the Datasets we just resolved from external storage
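Sketched end-to-end, the flow above is roughly this (again with a stand-in Dataset, and a hypothetical in-memory Storage object standing in for external storage, keyed by definition name):

```scala
// Stand-in for org.apache.spark.sql.Dataset[T].
final case class Dataset[T](rows: Seq[T])

abstract class Definition[Out] {
  def name: String
  def upstreamInputs: Seq[Definition[_]]
  def transformation(inputs: Seq[Dataset[_]]): Dataset[Out]
}

// Hypothetical storage layer; in reality this would read/write Parquet,
// a table, etc., keyed by definition name.
object Storage {
  private val tables = scala.collection.mutable.Map.empty[String, Dataset[_]]
  def write(name: String, ds: Dataset[_]): Unit = tables(name) = ds
  def read(name: String): Option[Dataset[_]]   = tables.get(name)
}

object Runner {
  // Stimulus: "definition X needs to run". Resolve each upstream from
  // storage (running it first if it is not materialized yet), call the
  // transformation, and write the result back.
  def run(d: Definition[_]): Dataset[_] = {
    val inputs = d.upstreamInputs.map { up =>
      Storage.read(up.name).getOrElse(run(up))
    }
    val out = d.transformation(inputs)
    Storage.write(d.name, out)
    out
  }
}
```

Because transformation only takes Seq[Dataset[_]], the Runner never needs to know any concrete schema, which is what makes it centralizable.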
To summarize:
- We need a catalog of Spark transformations with strong type safety that also represents the relations between them, forming a DAG
- We want to be able to build this DAG so our scheduler can run the definitions in order
- We want a centralized framework that can resolve the inputs from external storage, run the transformation, and write the result back to external storage
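For the scheduler, the DAG is already implicit in upstreamInputs, so a plain post-order depth-first traversal gives a valid run order. A sketch (cycle detection omitted for brevity; the type parameter is also dropped here since the scheduler only needs names and edges):

```scala
abstract class Definition {
  def name: String
  def upstreamInputs: Seq[Definition]
}

object Scheduler {
  // Post-order DFS: every definition is emitted only after all of its
  // upstreams, so running the result left-to-right respects dependencies.
  def topoOrder(target: Definition): Seq[Definition] = {
    val seen = scala.collection.mutable.LinkedHashSet.empty[Definition]
    def visit(d: Definition): Unit =
      if (!seen.contains(d)) {
        d.upstreamInputs.foreach(visit)
        seen += d // insertion order of LinkedHashSet is the topological order
      }
    visit(target)
    seen.toSeq
  }
}
```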
Sorry if I’m not being clear; I don’t have a lot of expertise in generics and their lingo. I think HList could be of help here, or perhaps we should redesign our Definition class in a different way that would allow us to build a generic framework around it.
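A lighter alternative to HLists (which would pull in shapeless) that we have been toying with: give Definition an abstract type member for its input shape, so each subclass gets a fully typed transformation, and the single unchecked cast lives in one framework method instead of in every definition. A sketch, with the same stand-in Dataset and a convention that In is Unit for sources, the dataset itself for one input, and a tuple for two:

```scala
// Stand-in for org.apache.spark.sql.Dataset[T].
final case class Dataset[T](rows: Seq[T])

abstract class Definition {
  type In  // Unit, Dataset[X], or (Dataset[X], Dataset[Y]), per arity
  type Out
  def name: String
  def upstreamInputs: Seq[Definition]
  def transformation(in: In): Dataset[Out]

  // The one place where the framework erases/recovers input types.
  final def runErased(inputs: Seq[Dataset[_]]): Dataset[_] = {
    val in: In = (inputs match {
      case Seq()     => ()
      case Seq(a)    => a
      case Seq(a, b) => (a, b)
      case _         => sys.error("extend for higher arities")
    }).asInstanceOf[In]
    transformation(in)
  }
}
```

The cost is the fixed-arity match (an HList would generalize it), but each definition’s transformation is now fully typed with no casts in user code.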