Introducing Gallia: a library for data manipulation

Hi everyone,

This is an announcement for Gallia, a new library for data manipulation that maintains a schema throughout transformations.

Here’s a very basic example of usage on an individual object:

"""{"foo": "hello", "bar": 1, "baz": true, "qux": "world"}"""
  .read() // will infer schema if none is provided
    .toUpperCase('foo)
    .increment  ('bar)
    .remove     ('qux)
    .nest       ('baz).under('parent)
    .flip       ('parent |> 'baz)
  .printJson()
  // prints: {"foo": "HELLO", "bar": 2, "parent": { "baz": false }}

Trying to manipulate 'parent |> 'baz as anything other than a boolean results in a type failure at runtime (but before the data is seen):

    .square ('parent |> 'baz ~> 'BAZ) // instead of "flip" earlier
    // ERROR: TypeMismatch (Boolean, expected Number): 'parent |> 'baz
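
To illustrate what "at runtime but before the data is seen" can mean, here is a minimal sketch in plain Scala. It is not Gallia's implementation: the names `SchemaCheckSketch`, `FieldType`, `checkType` and the dotted-path encoding are all hypothetical. It only shows a transformation being checked against a schema, with no data involved.

```scala
// Hypothetical sketch, NOT Gallia's implementation: a transformation is
// validated against a schema alone, before any data is read.
object SchemaCheckSketch {
  sealed trait FieldType
  case object Str  extends FieldType
  case object Num  extends FieldType
  case object Bool extends FieldType

  // Flat schema: field path -> type (nesting is encoded in the path string)
  type Schema = Map[String, FieldType]

  // Succeeds only if `path` exists in the schema and has the expected type
  def checkType(schema: Schema, path: String, expected: FieldType): Either[String, Schema] =
    schema.get(path) match {
      case Some(`expected`) => Right(schema)
      case Some(actual)     => Left(s"TypeMismatch ($actual, expected $expected): $path")
      case None             => Left(s"NoSuchField: $path")
    }

  // "flip" only makes sense on booleans, "square" only on numbers
  def flip  (schema: Schema, path: String): Either[String, Schema] = checkType(schema, path, Bool)
  def square(schema: Schema, path: String): Either[String, Schema] = checkType(schema, path, Num)

  // Schema mirroring the example object after nesting 'baz under 'parent
  val schema: Schema = Map("foo" -> Str, "bar" -> Num, "parent.baz" -> Bool)
}
```

With this sketch, `SchemaCheckSketch.flip(schema, "parent.baz")` yields a `Right`, while `square` on the same path yields a `Left` carrying a type mismatch message, without a single record having been processed.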

SQL-like processing looks like the following:

"/data/people.jsonl.gz2"

  // case class Person(name: String, ...)
  .stream[Person]

  // INPUT: [{"name": "John", "age": 20, "city": "Toronto"}, {...

    /* 1. WHERE            */ .filterBy('age).matches(_ < 25)
    /* 2. SELECT           */ .retain('name, 'age)
    /* 3. GROUP BY + COUNT */ .countBy('age)

  // OUTPUT: [{"age": 21, "_count": 10}, {"age": 22, ...
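
For comparison, the same three steps can be sketched with plain Scala collections. This is only meant to show the semantics of the query; it does not use Gallia, and the `Person` fields beyond `name` are assumed from the sample input.

```scala
// Plain-Scala sketch of the WHERE / SELECT / GROUP BY + COUNT steps above.
// Not Gallia code; the Person fields are assumed from the sample input.
object SqlLikeSketch {
  final case class Person(name: String, age: Int, city: String)

  def countByAge(people: Seq[Person]): Map[Int, Int] =
    people
      .filter(_.age < 25)                           // 1. WHERE age < 25
      .map(p => (p.name, p.age))                    // 2. SELECT name, age
      .groupBy { case (_, age) => age }             // 3. GROUP BY age
      .map { case (age, rows) => age -> rows.size } //    + COUNT
}
```

Unlike the Gallia version, nothing here tracks what the output fields are called or typed as after each step; that bookkeeping is what the schema is for.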

It’s also possible to process data at scale by leveraging Spark RDDs.

A much more thorough tour can be found at https://github.com/galliaproject/gallia-core/blob/init/README.md

I would love to hear whether this is an effort worth pursuing!

Anthony (@anthony_cros)

Looks interesting! Have you thought about supporting Scala 2.13, too?

One recommendation: don’t use symbol literals for keys, because they are no longer supported in Scala 3 (you would have to write Symbol("foo")). I would just use strings.
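
A small sketch of the difference, in plain Scala (the object name `KeySketch` is just for illustration):

```scala
// Symbol literals such as 'foo are deprecated in Scala 2.13 and dropped in Scala 3.
// The two forms below compile on 2.12, 2.13 and 3.
object KeySketch {
  val viaSymbol: Symbol = Symbol("foo") // explicit constructor instead of the 'foo literal
  val viaString: String = "foo"         // plain string key

  // Both carry the same name, so a String key loses no information
  val sameName: Boolean = viaSymbol.name == viaString
}
```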

Good luck!

Hi Dean,

Yes, 2.13 is on the roadmap, but I haven’t tried it yet, so I preferred to stay in 2.12-land at first.

It’s a bit unfortunate about Symbols, though Gallia supports String for keys as well: see this section

Thanks a lot for the feedback, it’s greatly appreciated!

Quick update:

  • Scala 2.13: I’ve migrated the codebase to Scala 2.13
    • I added a few comments on the github commit to describe the experience
    • There is one hack left due to source incompatibilities with ArrayDeque[T] and SeqView[T, Seq[_]]
    • I think for the time being I’ll create a 2.12 branch to accommodate said hack (open to suggestions!)
    • Interestingly, the dbNSFP example runs ~5 times faster now; I’m not sure why yet
  • Biostars announcement: I made another announcement for Gallia on Biostars, tailored to bioinformatics concerns
    • In the process I added a new example: re-processing Clinvar’s VCF file
    • and added an example input row and output object for the dbNSFP example, with the permission of the data owner
  • Upcoming example: The next example will showcase wide transformations to highlight how Spark RDDs can be leveraged in Gallia, when necessary
  • License: I’m leaning towards using MariaDB’s Business Source License (BSL), with Additional Use Grant terms along the lines of “free unless you can largely afford it”; also see CockroachDB’s interesting take on BSL
  • Contact: Some people have reached out to me directly, which is great, but don’t hesitate to provide input for others to see!

As you are using sbt, and it looks like you need different code for 2.12 and 2.13, you can just have two versions of a file in version-specific source folders. From a short look at your linked commit, it looks like you want different versions of your CrossPackage.scala. So your folder structure would look like this:

src
└── main
    ├── scala
    │   └── Other files...
    ├── scala-2.12
    │   └── CrossPackage.scala
    └── scala-2.13
        └── CrossPackage.scala

sbt will automatically use the files in src/main/scala plus the ones in the folder matching the Scala version currently being compiled. So you can compile 2.12 and 2.13 in the same branch.
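
A minimal build.sbt sketch to go with that layout. The exact version numbers are an assumption (they are the ones used elsewhere in this thread); adjust them to whatever the project actually targets.

```scala
// build.sbt (sketch; version numbers are assumptions)
scalaVersion       := "2.13.4"                // default version for plain `compile`
crossScalaVersions := Seq("2.12.13", "2.13.4") // versions built by `+compile`, `+test`, ...
```

With this in place, `sbt +compile` compiles once per listed version, picking up src/main/scala-2.12 or src/main/scala-2.13 accordingly, and `++ 2.12.13` in the sbt shell switches the session to a single version.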

This worked wonders! Thanks for pointing it out.

The collections were reworked in 2.13, so maybe that is helping your performance.

Looks like my benchmark number was off; the gain is “only” 2.2x (217 s vs. 98 s):

$ sbt
[info] welcome to sbt 1.4.7 ...

sbt:gallia-dbnsfp> ++ 2.12.13
sbt:gallia-dbnsfp> runMain galliaexample.dbnsfp.DbNsfpDriver
...
[success] Total time: 217 s (03:37), completed Feb 16, 2021 2:52:58 PM

sbt:gallia-dbnsfp> ++ 2.13.4
sbt:gallia-dbnsfp> runMain galliaexample.dbnsfp.DbNsfpDriver
...
[success] Total time: 98 s (01:38), completed Feb 16, 2021 2:54:50 PM

Still, quite impressive, considering I didn’t touch anything pertaining to Iterators, which is what the example uses internally (via this call client-side).

Quick update:

License: I kicked off the process of adopting BSL; the terms are being worked out

Examples: Added a few more examples to the docs, notably:

  • Word Count example (the “hello world” of big data), which looks essentially like:
      .split(_line ~> "word").by(" ")
      .flattenBy("word")
      .generate(_count).from("word").using(_ => 1)
      .sum(_count).by("word")
  • Reproduced the two example queries from this Medium article I stumbled upon; I found a number of similar articles and I’ll try to reproduce the queries for each
  • Re-processing rare disease LOVD data; I intend to expand on this example a lot more in the future
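
For readers less familiar with the word count idiom, the four steps listed above correspond roughly to the following plain-Scala sketch. It does not use Gallia, and the name `WordCountSketch` is just for illustration.

```scala
// Plain-Scala sketch of the word count steps; not Gallia code.
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))              // split each line by " ", then flatten
      .map(word => word -> 1)             // generate a count of 1 per word
      .groupBy { case (word, _) => word } // group the pairs by word
      .map { case (word, pairs) => word -> pairs.map(_._2).sum } // sum counts by word
}
```

The difference in the Gallia version is that each step also updates the schema, so the result is known to carry a string "word" field and a numeric count field before any line is read.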