Introducing Gallia: a library for data manipulation

anthony.cros · February 5, 2021, 8:15pm


Hi everyone,

This is an announcement for Gallia, a new library for data manipulation that maintains a schema throughout transformations.

Here’s a very basic example of usage on an individual object:

"""{"foo": "hello", "bar": 1, "baz": true, "qux": "world"}"""
  .read() // will infer schema if none is provided
    .toUpperCase('foo)
    .increment  ('bar)
    .remove     ('qux)
    .nest       ('baz).under('parent)
    .flip       ('parent |> 'baz)
  .printJson()
  // prints: {"foo": "HELLO", "bar": 2, "parent": { "baz": false }}

Trying to manipulate 'parent |> 'baz as anything other than a boolean results in a type failure at runtime (but before the data is seen):

    .square ('parent |> 'baz ~> 'BAZ) // instead of "flip" earlier
    // ERROR: TypeMismatch (Boolean, expected Number): 'parent |> 'baz
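
To illustrate what "at runtime, but before the data is seen" can mean, here is a minimal sketch in plain Scala. It is not Gallia's implementation: `FieldType`, `checkSquare` and the error strings are hypothetical names made up for this illustration. The point is only that a step can be validated against the schema alone, without touching any data.

```scala
object SchemaCheckSketch {
  // Hypothetical miniature type system, for illustration only
  sealed trait FieldType
  case object Bool extends FieldType
  case object Num  extends FieldType
  case object Str  extends FieldType

  // Validates a "square" step using only the schema: no record is read here.
  def checkSquare(schema: Map[String, FieldType], field: String): Either[String, Unit] =
    schema.get(field) match {
      case Some(Num)   => Right(())
      case Some(other) => Left(s"TypeMismatch ($other, expected Num): $field")
      case None        => Left(s"NoSuchField: $field")
    }

  // Schema matching the fields of the JSON object in the first example
  val schema: Map[String, FieldType] =
    Map("foo" -> Str, "bar" -> Num, "baz" -> Bool)
}
```

With this sketch, `SchemaCheckSketch.checkSquare(SchemaCheckSketch.schema, "baz")` yields a `Left` describing the mismatch, while the same call on `"bar"` yields a `Right`.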

SQL-like processing looks like the following:

"/data/people.jsonl.gz2"

  // case class Person(name: String, ...)
  .stream[Person]

  // INPUT: [{"name": "John", "age": 20, "city": "Toronto"}, {...

    /* 1. WHERE            */ .filterBy('age).matches(_ < 25)
    /* 2. SELECT           */ .retain('name, 'age)
    /* 3. GROUP BY + COUNT */ .countBy('age)

  // OUTPUT: [{"age": 21, "_count": 10}, {"age": 22, ...
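
For readers who want a point of comparison, the same three steps can be sketched with the standard Scala collections. This is not Gallia code; the `age` and `city` fields of `Person` are assumptions taken from the sample input row above, and the sample data is invented.

```scala
object SqlLikeSketch {
  // Fields beyond `name` are assumed from the sample input row
  final case class Person(name: String, age: Int, city: String)

  // Invented sample data
  val people: Seq[Person] = Seq(
    Person("John", 20, "Toronto"),
    Person("Jane", 21, "Lyon"),
    Person("Max",  21, "Oslo"),
    Person("Ann",  30, "Rome"))

  val countsByAge: Map[Int, Int] =
    people
      .filter(_.age < 25)                           // 1. WHERE
      .map(p => (p.name, p.age))                    // 2. SELECT
      .groupBy(_._2)                                // 3. GROUP BY age
      .map { case (age, rows) => age -> rows.size } //    + COUNT
}
```

The difference is that here the field names and types are fixed by the case class at compile time, whereas the Gallia version tracks them in a schema that it carries through each step.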

More examples:

It’s also possible to process data at scale by leveraging Spark RDDs.

A much more thorough tour can be found at https://github.com/galliaproject/gallia-core/blob/init/README.md

I would love to hear whether this is an effort worth pursuing!

Anthony (@anthony_cros)

deanwampler · February 5, 2021, 8:51pm

Looks interesting! Have you thought about supporting Scala 2.13, too?

One recommendation: don’t use symbol literals for keys, because they are no longer supported in Scala 3 (you would have to use Symbol("foo")). I would just use strings instead.
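
A small sketch of the two alternatives that compile across Scala 2.12, 2.13 and 3 (the object and value names are only for illustration):

```scala
object SymbolKeys {
  // Explicit constructor instead of the 'foo literal syntax
  val viaConstructor: Symbol = Symbol("foo")

  // A plain string key avoids Symbol altogether
  val viaString: String = "foo"
}
```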

Good luck!

anthony.cros · February 5, 2021, 8:58pm

Hi Dean,

Yes, 2.13 is on the roadmap, but I haven’t tried it yet so I preferred to stay in 2.12-land at first.

It’s a bit unfortunate about Symbols, though Gallia supports String for keys as well: see this section

Thanks a lot for the feedback, it’s greatly appreciated!

anthony.cros · February 15, 2021, 7:41pm

Quick update:

Scala 2.13: I’ve migrated the codebase to Scala 2.13
- I added a few comments on the github commit to describe the experience
- There is one hack left due to source incompatibilities with ArrayDeque[T] and SeqView[T, Seq[_]]
- I think for the time being I’ll create a 2.12 branch to accommodate said hack (open to suggestions!)
- Interestingly the dbNSFP example runs ~5 times faster now, I’m not sure how come yet
Biostars announcement: I made another announcement for Gallia on Biostars, tailored to bioinformatics concerns
- In the process I added a new example: re-processing Clinvar’s VCF file
- and added an example input row and output object for the dbNSFP example, with the permission of the data owner
Upcoming example: The next example will showcase wide transformations to highlight how Spark RDDs can be leveraged in Gallia, when necessary
License: I’m leaning towards using MariaDB’s Business Source License (BSL), with Additional Use Grant terms along the lines of “free unless you can largely afford it”; also see CockroachDB’s interesting take on BSL
Contact: Some people have reached out to me directly, which is great, but don’t hesitate to provide input for others to see!

crater2150 · February 16, 2021, 10:05am

As you are using sbt, and it looks like you need different code for 2.12 and 2.13, you can just have two versions of a file in version-specific source folders. From a short look at your linked commit, it looks like you want different versions of your CrossPackage.scala. So your folder structure would look like this:

src
└── main
    ├── scala
    │   └── Other files...
    ├── scala-2.12
    │   └── CrossPackage.scala
    └── scala-2.13
        └── CrossPackage.scala

sbt will automatically use the files in src/main/scala plus the ones in the folder matching the currently compiling scala version. So you can compile 2.12 and 2.13 in the same branch.
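
A minimal build.sbt sketch of that setup. The version numbers are the ones used in the benchmark further down this thread; adjust them to whatever the project actually targets.

```scala
// build.sbt (sketch)
crossScalaVersions := Seq("2.12.13", "2.13.4")

// With the folder layout above, sbt picks the matching source directory on its own:
//   sbt "++ 2.12.13" compile   -> src/main/scala + src/main/scala-2.12
//   sbt "++ 2.13.4" compile    -> src/main/scala + src/main/scala-2.13
//   sbt +compile               -> both versions, one after the other
```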

anthony.cros · February 16, 2021, 3:12pm

This worked wonders! Thanks for pointing it out.

ekrich · February 16, 2021, 5:54pm

The collections were reworked in 2.13, so maybe that is helping your performance.

anthony.cros · February 16, 2021, 8:02pm

Looks like my benchmark number was off, the gain is “only” x2.2:

$ sbt
[info] welcome to sbt 1.4.7 ...

sbt:gallia-dbnsfp> ++ 2.12.13
sbt:gallia-dbnsfp> runMain galliaexample.dbnsfp.DbNsfpDriver
...
[success] Total time: 217 s (03:37), completed Feb 16, 2021 2:52:58 PM

sbt:gallia-dbnsfp> ++ 2.13.4
sbt:gallia-dbnsfp> runMain galliaexample.dbnsfp.DbNsfpDriver
...
[success] Total time: 98 s (01:38), completed Feb 16, 2021 2:54:50 PM

Still, quite impressive considering I didn’t touch anything pertaining to Iterators, which is what the example uses internally (via this call client-side)

anthony.cros · February 23, 2021, 8:07pm

Quick update:

License: I kicked off the process of adopting BSL, the terms are being worked out

Examples: Added a few more examples to the docs, notably:

Word Count example (the “hello world” of big data), which looks essentially like:

      .split(_line ~> "word").by(" ")
      .flattenBy("word")
      .generate(_count).from("word").using(_ => 1)        
      .sum(_count).by("word")

Reproduced the two example queries from this Medium article I stumbled upon; I found a number of similar articles and I’ll try to reproduce the queries for each
Re-processing rare disease LOVD data; I intend to expand on this example a lot more in the future
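
For comparison, the Word Count steps above (split, flatten, generate a count of 1, sum by word) can be sketched with the standard Scala collections. This is not Gallia code, and the input lines are invented.

```scala
object WordCountSketch {
  // Invented input
  val lines: Seq[String] = Seq("hello world", "hello gallia")

  val counts: Map[String, Int] =
    lines
      .flatMap(_.split(" "))   // split each line and flatten into words
      .map(word => word -> 1)  // generate a count of 1 per word
      .groupBy(_._1)           // group the pairs by word
      .map { case (word, pairs) => word -> pairs.map(_._2).sum } // sum the counts
}
```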

anthony.cros · March 5, 2021, 7:15pm

Another update:

I added five more examples:

Three additional examples reproducing the processing found in various articles pertaining to data manipulation: see articles sub-section.
Two more bioinformatics examples:
- Re-processing the output for SnpEff’s rather convoluted “ANN” value
- Re-processing the GeneMania set of TSV files for Homo Sapiens, see details on their mailing list; resulting data is CC-BY-4.0-licensed (example output document)

Examples just seem to be the best way to showcase where Gallia shines at the moment. I’d definitely be curious to see their counterparts in alternative technologies (libraries/languages); maybe as bounties?

I also expanded a bit on the “poor man” scaling section - basically wrapping GNU sort for wide operations - since it is used by the GeneMania example above.
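
The general idea of delegating a sort to an external `sort` process can be sketched as follows. This is not how Gallia does it internally; `ExternalSortSketch` and `sortFile` are hypothetical names, and the sketch assumes a `sort` executable is available on the PATH.

```scala
import java.nio.file.Path
import scala.sys.process._

object ExternalSortSketch {
  // Sorts the lines of `input` into `output` by shelling out to `sort`;
  // returns the exit code of the process (0 on success).
  def sortFile(input: Path, output: Path): Int =
    Seq("sort", "-o", output.toString, input.toString).!
}
```

Because `sort` spills to temporary files on disk when needed, this approach can handle inputs larger than memory on a single machine, which is what makes it a cheap stand-in for a distributed sort.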

anthony.cros · March 30, 2021, 8:26pm

Another update:

Switched license to BSL:
a. See the FAQ explaining the license in non-legal terms; in a nutshell it’s free if what you do is essential or you are a small entity
b. See the license itself: especially the Additional Use Grant part
Added a full-blown example of leveraging Spark RDDs with Gallia, in combination with EMR: see repo
a. Try it out with the test-run script
b. See driver: GeneManiaSparkDriver.scala
c. See transformations in spark-unaware GeneMania.scala (in parent repo)
Added macros in their own repo, which basically help going to and from case classes/gallia instances; see them in action here.
Next steps will be:
a. Publishing binaries: requires some legal scrutiny due to Apache 2 dependencies
b. Trying out Scala 3/Dotty and starting to adapt the code wherever it’s not too disruptive
c. Trying to present Gallia at conferences/meetups (I’ll come wherever there’s free pizza)

Stay tuned!

anthony.cros · June 21, 2021, 5:35pm

Latest update:

Aptus: the utility library that was bundled with Gallia has now been externalized and open-sourced (APL2): github repo.
Binaries (on maven central):
- Aptus: 2.12, 2.13, 3.0 - libraryDependencies += "io.github.aptusproject" %% "aptus-core" % "0.2.0"
- Gallia: 2.12, 2.13 - libraryDependencies += "io.github.galliaproject" %% "gallia-core" % "0.2.0"
Scala 3: Porting Aptus was easy but porting Gallia will be a challenge:
1. Heavy reliance on WeakTypeTag from scala.reflect.runtime.universe (especially in the gallia.reflect package)
2. the macros repo must be rewritten from scratch
3. Spark must support Scala 3.0 (not an issue for gallia-core however)
4. Enumeratum usage
Tests: I pushed more tests on the dedicated repo: gallia-testing, which can somewhat serve as documentation. Note that I will likely adopt the great utest library, as I did with the (rather bare) aptus’ tests.
Performance: “narrow transformations” now run ~100x faster, though of course it depends on the exact task at hand. While this sounds impressive, it isn’t really: it just wasn’t optimized at all before (being evil and all). The gains mostly come from the following changes:
1. using Array over ListMap for Obj: see Aliases.scala
2. AtomPlan optimizations: see AtomPlan.scala and AtomNodes.scala
3. less reliance on the _Custom atom (which processes too opaquely): see e.g. ActionsZZ and ActionsFor
A good example that showcases the new speed is the dbNSFP example (clone then sbt run). Note that a lot more optimizations are still planned.
Strengths: a recurring theme in feedback I’ve gotten so far is that the docs focus too much on simple cases. Gallia can handle them but it’s not where it shines. Meanwhile complex cases where alternatives would struggle - or so I contend - are buried too deep. I will try to improve on that aspect, but in the meantime a good place to find the more challenging cases is this section.
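
To give an idea of the WeakTypeTag reliance mentioned under the Scala 3 item above, here is a minimal Scala 2 sketch of the mechanism. It is not taken from the gallia.reflect package: `WeakTypeTagSketch` and `describe` are hypothetical names, and the sketch assumes scala-reflect is on the classpath. Scala 3 does not provide `scala.reflect.runtime.universe`, which is why code in this style has to be rewritten rather than recompiled.

```scala
import scala.reflect.runtime.universe._

object WeakTypeTagSketch {
  // The compiler supplies the WeakTypeTag, so the full type of T
  // (including its type arguments) is still available at runtime.
  def describe[T: WeakTypeTag]: String = weakTypeOf[T].toString
}
```

For instance, `WeakTypeTagSketch.describe[List[Int]]` returns a description that still mentions `Int`, even though the element type is erased in the compiled class.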

anthony.cros · December 17, 2021, 7:13pm

I wrote an article on Gallia in TowardsDataScience last week, presenting it from a new angle: https://towardsdatascience.com/gallia-a-library-for-data-transformation-3fafaaa2d8b9

Have a great weekend everybody!

anthony.cros · October 14, 2022, 6:04pm

Hi all,

Gallia 0.4.0 has been released for Scala 2.13 and 2.12.

It comes with quite a few improvements and new features, which are summarized in the CHANGELOG.md. I will also write a second article in Towards Data Science to better highlight some of these changes.

The next iterations will focus on the following aspects: NEXT_RELEASES.md, but I’m open to suggestions.

Stay tuned!

anthony.cros · October 31, 2022, 2:13pm

And the corresponding new post in Towards Data Science:
https://towardsdatascience.com/data-transformations-in-scala-with-gallia-version-0-4-0-is-out-f0b8df3e48f3

Feedback is always welcome (especially on where to go next)!

anthony.cros · May 12, 2023, 2:52pm

I will be presenting Gallia at Scala Days in Seattle on June 7th (3.30PM): Scala Days - Meet the Speaker, Anthony Cros

Hope to see you there!

anthony.cros · September 21, 2023, 2:05pm

Hi All!

A few things:

I changed Gallia’s license to Apache 2 across all the repos under galliaproject (so core, macros, spark, …)
Here’s the video of my presentation at ScalaDays in Seattle, and the corresponding slides+code.
The main feedback I got at the conference was twofold: Scala 3.x support and performance improvements. As a result I will be prioritizing these two aspects for 0.6.x, starting with Scala 3.x support.

As always I welcome feedback!