Introducing Gallia: a library for data manipulation

anthony.cros · June 21, 2021, 5:35pm

Latest update:

Aptus: the utility library that was bundled with Gallia has now been externalized and open-sourced (APL2): github repo.
Binaries (on Maven Central):
- Aptus: 2.12, 2.13, 3.0 - libraryDependencies += "io.github.aptusproject" %% "aptus-core" % "0.2.0"
- Gallia: 2.12, 2.13 - libraryDependencies += "io.github.galliaproject" %% "gallia-core" % "0.2.0"
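
For convenience, here are the two coordinates above combined in a minimal build.sbt. The Scala version shown is an assumption (any 2.12.x or 2.13.x release should do for Gallia). Aptus is listed explicitly for illustration only; the post does not say whether gallia-core already brings it in transitively.

```scala
// build.sbt (minimal sketch)

// Assumption: a 2.13.x release; Gallia 0.2.0 is published for 2.12 and 2.13 only.
ThisBuild / scalaVersion := "2.13.6"

libraryDependencies ++= Seq(
  "io.github.aptusproject"  %% "aptus-core"  % "0.2.0",
  "io.github.galliaproject" %% "gallia-core" % "0.2.0"
)
```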
Scala 3: Porting Aptus was easy but porting Gallia will be a challenge:
1. Heavy reliance on WeakTypeTag from scala.reflect.runtime.universe (especially in the gallia.reflect package)
2. The macros repo must be rewritten from scratch
3. Spark must support Scala 3.0 (not an issue for gallia-core however)
4. Enumeratum usage
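
To illustrate the first point, here is a small Scala 2 sketch of the kind of WeakTypeTag-based runtime reflection that has no direct Scala 3 counterpart. It is not taken from the gallia.reflect package: `Person`, `caseFieldNames` and the enclosing object are hypothetical names, and it assumes scala-reflect is on the classpath.

```scala
// Scala 2 only (2.12/2.13); requires the scala-reflect artifact.
import scala.reflect.runtime.universe.{MethodSymbol, WeakTypeTag, weakTypeOf}

object WeakTypeTagSketch {

  // Hypothetical case class, used only for illustration.
  final case class Person(name: String, age: Int)

  // Lists the case-class field names of T by inspecting its type at runtime.
  // The context bound asks the Scala 2 compiler to materialize a WeakTypeTag[T].
  def caseFieldNames[T: WeakTypeTag]: List[String] =
    weakTypeOf[T].decls
      .collect { case m: MethodSymbol if m.isCaseAccessor => m.name.decodedName.toString }
      .toList
}
```

Code of this shape would need to be re-expressed with Scala 3's own metaprogramming facilities, which is presumably why the port is non-trivial.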
Tests: I pushed more tests to the dedicated repo: gallia-testing, which can somewhat serve as documentation. Note that I will likely adopt the great utest library, as I did with the (rather bare) Aptus tests.
Performance: “narrow transformations” now run ~100x faster, though of course it depends on the exact task at hand. While this sounds impressive, it isn’t really: it just wasn’t optimized at all before (premature optimization being evil and all). The gains mostly come from the following changes:
1. using Array over ListMap for Obj: see Aliases.scala
2. AtomPlan optimizations: see AtomPlan.scala and AtomNodes.scala
3. less reliance on the _Custom atom (which processes too opaquely): see e.g. ActionsZZ and ActionsFor
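
To give an idea of what the first change amounts to, here is a simplified sketch contrasting a ListMap-backed record with an Array-backed one. It is not Gallia's actual code (see Aliases.scala for that): `ListMapObj`, `ArrayObj` and `Entry` are hypothetical names, and values are typed as `Any` purely for brevity. The general idea is that ListMap is a linked structure with linear-time lookups, whereas an array of entries is stored contiguously and can be scanned with a plain loop.

```scala
import scala.collection.immutable.ListMap

object ObjRepresentationSketch {

  type Entry = (String, Any) // hypothetical: field name -> value

  // Before (sketch): insertion-ordered map, one linked node per field.
  final class ListMapObj(val data: ListMap[String, Any]) {
    def get(key: String): Option[Any] = data.get(key)

    def rename(from: String, to: String): ListMapObj =
      new ListMapObj(data.map { case (k, v) => (if (k == from) to else k, v) })
  }

  // After (sketch): a flat array of entries, order preserved by position.
  final class ArrayObj(val data: Array[Entry]) {
    // Linear scan; records are assumed to have few fields.
    def get(key: String): Option[Any] = {
      var i = 0
      while (i < data.length && data(i)._1 != key) i += 1
      if (i < data.length) Some(data(i)._2) else None
    }

    def rename(from: String, to: String): ArrayObj =
      new ArrayObj(data.map { case (k, v) => (if (k == from) to else k, v) })
  }
}
```

Both variants keep fields in insertion order; only the underlying storage differs. How much this matters in practice depends on the workload, as noted above.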
A good example that showcases the new speed is the dbNSFP example (clone then sbt run). Note that a lot more optimizations are still planned.
Strengths: a recurring theme in feedback I’ve gotten so far is that the docs focus too much on simple cases. Gallia can handle them but it’s not where it shines. Meanwhile complex cases where alternatives would struggle - or so I contend - are buried too deep. I will try to improve on that aspect, but in the meantime a good place to find the more challenging cases is this section.