Hi everyone,
This is an announcement for Gallia, a new library for data manipulation that maintains a schema throughout transformations.
Here’s a very basic example of usage on an individual object:
"""{"foo": "hello", "bar": 1, "baz": true, "qux": "world"}"""
.read() // will infer schema if none is provided
.toUpperCase('foo)
.increment('bar)
.remove('qux)
.nest('baz).under('parent)
.flip('parent |> 'baz)
.printJson()
// prints: {"foo": "HELLO", "bar": 2, "parent": { "baz": false }}
Trying to manipulate 'parent |> 'baz as anything other than a boolean results in a type failure at runtime (but before the data is seen):
.square ('parent |> 'baz ~> 'BAZ) // instead of "flip" earlier
// ERROR: TypeMismatch (Boolean, expected Number): 'parent |> 'baz
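To illustrate the general idea of failing "before the data is seen", here is a minimal plain-Scala sketch. It is not Gallia's implementation or API: the names SchemaFirstSketch, FieldType, Step and validate are all hypothetical, and nested paths such as 'parent |> 'baz are left out for brevity. The point is only that a pipeline can be checked against a schema alone, without touching any records.

```scala
object SchemaFirstSketch {
  // Hypothetical, simplified field types (an assumption, not Gallia's model).
  sealed trait FieldType
  case object Bool extends FieldType
  case object Num  extends FieldType
  case object Str  extends FieldType

  // A schema maps field names to their types; no data is involved.
  type Schema = Map[String, FieldType]

  // A step names the operation, the field it targets, and the type it requires.
  final case class Step(op: String, field: String, expected: FieldType)

  // Checks every step against the schema and reports the first mismatch, if any.
  def validate(schema: Schema, steps: List[Step]): Either[String, Unit] =
    steps.collectFirst {
      case Step(op, field, expected) if !schema.get(field).contains(expected) =>
        val actual = schema.get(field).fold("missing")(_.toString)
        s"TypeMismatch in $op ($actual, expected $expected): $field"
    }.toLeft(())
}
```

With a schema of Map("baz" -> Bool), validating a "flip" step that expects Bool succeeds, while a "square" step that expects Num is rejected before any record is read.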
SQL-like processing looks like the following:
"/data/people.jsonl.gz2"
// case class Person(name: String, ...)
.stream[Person]
// INPUT: [{"name": "John", "age": 20, "city": "Toronto"}, {...
/* 1. WHERE */ .filterBy('age).matches(_ < 25)
/* 2. SELECT */ .retain('name, 'age)
/* 3. GROUP BY + COUNT */ .countBy('age)
// OUTPUT: [{"age": 21, "_count": 10}, {"age": 22, ...
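For comparison, the same three steps can be sketched with plain Scala collections. This is not Gallia code: the SqlLikeSketch object and countByAgeUnder25 function are hypothetical, and the Person fields other than name are assumed from the sample input above.

```scala
object SqlLikeSketch {
  // Assumed record shape; the snippet above only spells out the `name` field.
  final case class Person(name: String, age: Int, city: String)

  // Returns a map from age to the number of people of that age, for ages under 25.
  def countByAgeUnder25(people: Seq[Person]): Map[Int, Int] =
    people
      .filter(_.age < 25)                            // 1. WHERE age < 25
      .map(p => (p.name, p.age))                     // 2. SELECT name, age
      .groupBy { case (_, age) => age }              // 3. GROUP BY age
      .map { case (age, rows) => age -> rows.size }  //    + COUNT
}
```

Unlike the schema-aware version, nothing here tracks which fields exist after each step; the tuple in step 2 simply loses its field names.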
It’s also possible to process data at scale by leveraging Spark RDDs.
A much more thorough tour can be found at https://github.com/galliaproject/gallia-core/blob/init/README.md
I would love to hear whether this is an effort worth pursuing!
Anthony (@anthony_cros)