Introducing Gallia: a library for data manipulation


Hi everyone,

This is an announcement for Gallia, a new library for data manipulation that maintains a schema throughout transformations.

Here’s a very basic example of usage on an individual object:

"""{"foo": "hello", "bar": 1, "baz": true, "qux": "world"}"""
  .read() // will infer schema if none is provided
    .toUpperCase('foo)
    .increment  ('bar)
    .remove     ('qux)
    .nest       ('baz).under('parent)
    .flip       ('parent |> 'baz)
  .printJson()
  // prints: {"foo": "HELLO", "bar": 2, "parent": { "baz": false }}

Trying to manipulate 'parent |> 'baz as anything other than a boolean results in a type failure at runtime (but before the data is seen):

    .square ('parent |> 'baz ~> 'BAZ) // instead of "flip" earlier
    // ERROR: TypeMismatch (Boolean, expected Number): 'parent |> 'baz
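To illustrate the idea of failing before any data is read, here is a minimal sketch in plain Scala. It is not Gallia's implementation: FieldType, Schema and checkSquare are hypothetical names invented for this illustration, and the error text only mimics the message above.

```scala
object SchemaFirst {
  // Hypothetical, simplified field types (not Gallia's own model)
  sealed trait FieldType
  case object Bool extends FieldType
  case object Num  extends FieldType
  case object Str  extends FieldType

  // A schema maps a (flattened) field path to its type
  type Schema = Map[String, FieldType]

  // Validates a "square" step against the schema only; no data is touched.
  def checkSquare(schema: Schema, path: String): Either[String, Schema] =
    schema.get(path) match {
      case Some(Num)   => Right(schema) // squaring keeps the field numeric
      case Some(other) => Left(s"TypeMismatch ($other, expected Num): $path")
      case None        => Left(s"NoSuchField: $path")
    }
}
```

Calling SchemaFirst.checkSquare(Map("parent.baz" -> SchemaFirst.Bool), "parent.baz") returns a Left, so a pipeline built this way can be rejected from its schema alone.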

SQL-like processing looks like the following:

"/data/people.jsonl.gz2"

  // case class Person(name: String, ...)
  .stream[Person]

  // INPUT: [{"name": "John", "age": 20, "city": "Toronto"}, {...

    /* 1. WHERE            */ .filterBy('age).matches(_ < 25)
    /* 2. SELECT           */ .retain('name, 'age)
    /* 3. GROUP BY + COUNT */ .countBy('age)

  // OUTPUT: [{"age": 21, "_count": 10}, {"age": 22, ...
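For comparison, the same three steps can be sketched with plain Scala collections. This is only meant to spell out the semantics of the query above; it does not use Gallia, and the Person fields are assumed from the sample input row.

```scala
object SqlLike {
  // Fields assumed from the INPUT sample above
  final case class Person(name: String, age: Int, city: String)

  // Returns age -> number of people of that age, for people under 25
  def countByAge(people: Seq[Person]): Map[Int, Int] =
    people
      .filter(_.age < 25)                            // 1. WHERE
      .map(p => (p.name, p.age))                     // 2. SELECT name, age
      .groupBy { case (_, age) => age }              // 3. GROUP BY age
      .map { case (age, rows) => age -> rows.size }  //    + COUNT
}
```

Unlike the Gallia version, nothing here tracks the shape of the intermediate rows: they are anonymous tuples, and a wrong field in the SELECT step only shows up wherever the tuple is consumed.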

More examples:

It’s also possible to process data at scale by leveraging Spark RDDs.

A much more thorough tour can be found at https://github.com/galliaproject/gallia-core/blob/init/README.md

I would love to hear whether this is an effort worth pursuing!

Anthony (@anthony_cros)


Looks interesting! Have you thought about supporting Scala 2.13, too?

One recommendation: don’t use symbol literals for keys, because they are no longer supported in Scala 3 (you would have to use Symbol("foo")). So, I would just use strings.
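A small sketch of what that looks like in practice. PortableKeys and keyName are hypothetical names used for illustration only; they are not part of Gallia.

```scala
object PortableKeys {
  // Symbol("foo") compiles on Scala 2.12, 2.13 and 3, unlike the 'foo literal
  val symbolKey: Symbol = Symbol("foo")
  val stringKey: String = "foo"

  // Hypothetical overloads: accept either kind of key and return its name
  def keyName(key: Symbol): String = key.name
  def keyName(key: String): String = key
}
```

Both spellings resolve to the same name, so an API that accepts strings loses nothing by dropping the literal syntax.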

Good luck!


Hi Dean,

Yes, 2.13 is on the roadmap, but I haven’t tried it yet, so I preferred to stay in 2.12-land at first.

It’s a bit unfortunate about Symbols, though Gallia supports String for keys as well: see this section

Thanks a lot for the feedback, it’s greatly appreciated!

Quick update:

  • Scala 2.13: I’ve migrated the codebase to Scala 2.13
    • I added a few comments on the github commit to describe the experience
    • There is one hack left due to source incompatibilities with ArrayDeque[T] and SeqView[T, Seq[_]]
    • I think for the time being I’ll create a 2.12 branch to accommodate said hack (open to suggestions!)
    • Interestingly, the dbNSFP example runs ~5 times faster now; I’m not sure why yet
  • Biostar announcement: I made another announcement for Gallia on Biostars, tailored to bioinformatics concerns
    • In the process I added a new example: re-processing Clinvar’s VCF file
    • and added an example input row and output object for the dbNSFP example, with the permission of the data owner
  • Upcoming example: The next example will showcase wide transformations to highlight how Spark RDDs can be leveraged in Gallia, when necessary
  • License: I’m leaning towards using MariaDB’s Business Source License (BSL), with Additional Use Grant terms along the lines of “free unless you can largely afford it”; also see CockroachDB’s interesting take on BSL
  • Contact: Some people have reached out to me directly, which is great, but don’t hesitate to provide input for others to see!

As you are using sbt, and it looks like you need different code for 2.12 and 2.13, you can just have two versions of a file in version-specific source folders. From a short look at your linked commit, it looks like you want different versions of your CrossPackage.scala. So your folder structure would look like this:

src
└── main
    ├── scala
    │   └── Other files...
    ├── scala-2.12
    │   └── CrossPackage.scala
    └── scala-2.13
        └── CrossPackage.scala

sbt will automatically use the files in src/main/scala plus the ones in the folder matching the currently compiling scala version. So you can compile 2.12 and 2.13 in the same branch.
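A minimal build.sbt sketch for this layout could look as follows. The version numbers are illustrative (they match the sbt session quoted later in this thread); adjust them to whatever the project actually targets.

```scala
// build.sbt (sketch; version numbers are illustrative)
scalaVersion       := "2.13.4"                 // default version used by plain `compile`
crossScalaVersions := Seq("2.12.13", "2.13.4") // versions built by `+compile`, `+test`, ...
```

From the sbt shell, `+compile` then builds against every listed version in turn, and `++ 2.12.13` switches the session to a single one; in both cases the matching scala-2.12 or scala-2.13 folder is picked up alongside src/main/scala.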


This worked wonders! Thanks for pointing it out.

The collections were reworked in 2.13, so maybe that is helping your performance.

Looks like my benchmark number was off; the gain is “only” ~2.2x:

$ sbt
[info] welcome to sbt 1.4.7 ...

sbt:gallia-dbnsfp> ++ 2.12.13
sbt:gallia-dbnsfp> runMain galliaexample.dbnsfp.DbNsfpDriver
...
[success] Total time: 217 s (03:37), completed Feb 16, 2021 2:52:58 PM

sbt:gallia-dbnsfp> ++ 2.13.4
sbt:gallia-dbnsfp> runMain galliaexample.dbnsfp.DbNsfpDriver
...
[success] Total time: 98 s (01:38), completed Feb 16, 2021 2:54:50 PM

Still, quite impressive considering I didn’t touch anything pertaining to Iterators, which is what the example uses internally (via this call client-side).

Quick update:

License: I kicked off the process of adopting BSL; the terms are being worked out

Examples: Added a few more examples to the docs, notably:

  • Word Count example (the “hello world” of big data), which looks essentially like:
      .split(_line ~> "word").by(" ")
      .flattenBy("word")
      .generate(_count).from("word").using(_ => 1)
      .sum(_count).by("word")
  • Reproduced the two example queries from this Medium article I stumbled upon; I found a number of similar articles and I’ll try to reproduce the queries for each
  • Re-processing rare disease LOVD data; I intend to expand on this example a lot more in the future
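For readers who want a baseline to compare the Word Count pipeline against, here is a sketch of the same computation with plain Scala collections. It does not use Gallia; the mapping of each line to a Gallia step in the comments is my reading of the snippet above, not an official correspondence.

```scala
object WordCount {
  // Counts occurrences of each space-separated word across all lines
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))   // split(...).by(" ") + flattenBy("word")
      .filter(_.nonEmpty)      // ignore empty tokens caused by repeated spaces
      .map(word => word -> 1)  // generate(_count) ... using(_ => 1)
      .groupBy { case (word, _) => word }
      .map { case (word, pairs) => word -> pairs.map(_._2).sum } // sum(_count).by("word")
}
```

For example, WordCount.count(Seq("a b a", "b c")) groups the five tokens into three words with their totals.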

Another update:

I added five more examples:

  1. Three additional examples reproducing the processing found in various articles pertaining to data manipulation: see articles sub-section.
  2. Two more bioinformatics examples:

Examples just seem to be the best way to showcase where Gallia shines at the moment. I’d definitely be curious to see their counterparts in alternative technologies (libraries/languages); maybe as bounties?

I also expanded a bit on the “poor man” scaling section - basically wrapping GNU sort for wide operations - since it is used by the GeneMania example above.

Another update:

  1. Switched license to BSL:
    a. See the FAQ explaining the license in non-legal terms; in a nutshell, it’s free if what you do is essential or you are a small entity
    b. See the license itself: especially the Additional Use Grant part

  2. Added a full-blown example of leveraging Spark RDDs with Gallia, in combination with EMR: see repo
    a. Try it out by test-running the script
    b. See driver: GeneManiaSparkDriver.scala
    c. See transformations in spark-unaware GeneMania.scala (in parent repo)

  3. Added macros in their own repo, which basically help with going to and from case classes/Gallia instances; see them in action here.

  4. Next steps will be:
    a. Publishing binaries: requires some legal scrutiny due to Apache 2 dependencies
    b. Trying out Scala 3/Dotty and starting to adapt the code wherever it’s not too disruptive
    c. Trying to present Gallia at conferences/meetups (I’ll come wherever there’s free pizza)

Stay tuned!

Latest update:

  1. Aptus: the utility library that was bundled with Gallia has now been externalized and open-sourced (APL2): github repo.

  2. Binaries (on maven central):

    • Aptus: 2.12, 2.13, 3.0 - libraryDependencies += "io.github.aptusproject" %% "aptus-core" % "0.2.0"
    • Gallia: 2.12, 2.13 - libraryDependencies += "io.github.galliaproject" %% "gallia-core" % "0.2.0"
  3. Scala 3: Porting Aptus was easy but porting Gallia will be a challenge:

    1. Heavy reliance on WeakTypeTag from scala.reflect.runtime.universe (especially in the gallia.reflect package)
    2. the macros repo must be rewritten from scratch
    3. Spark must support Scala 3.0 (not an issue for gallia-core however)
    4. Enumeratum usage
  4. Tests: I pushed more tests on the dedicated repo: gallia-testing, which can somewhat serve as documentation. Note that I will likely adopt the great utest library, as I did with the (rather bare) aptus’ tests.

  5. Performance: “narrow transformations” now run ~100 times faster, though of course it depends on the exact task at hand. While this sounds impressive, it isn’t really: it just wasn’t optimized at all before (being evil and all). The gains mostly come from the following changes:

    1. using Array over ListMap for Obj: see Aliases.scala
    2. AtomPlan optimizations: see AtomPlan.scala and AtomNodes.scala
    3. less reliance on the _Custom atom (which processes too opaquely): see e.g. ActionsZZ and ActionsFor

    A good example that showcases the new speed is the dbNSFP example (clone then sbt run). Note that a lot more optimizations are still planned.

  6. Strengths: a recurring theme in feedback I’ve gotten so far is that the docs focus too much on simple cases. Gallia can handle them but it’s not where it shines. Meanwhile complex cases where alternatives would struggle - or so I contend - are buried too deep. I will try to improve on that aspect, but in the meantime a good place to find the more challenging cases is this section.
