@aljazerzen aljazerzen commented Oct 8, 2025

This PR implements the Schema data structure in Rust.

It:

  • implements the data structure in gel-schema crate in gel-rust repo,
  • adds a PyO3 wrapper crate that provides a Schema class,
  • adds a Python class RustSchema that proxies method calls to the PyO3 class,
  • replaces usages of FlatSchema with RustSchema.

Challenges:

  • the edb.schema.Schema abstract class has a pretty large interface. In "Minimize FlatSchema interface" (#9016), I tried to push as many abstract methods as possible from the schema implementations up into Schema itself,
  • data that we put in the schema can be of many different types (Python primitives, uuid, None, ObjectSet, ObjectList, Expressions, Version, ...), each of which needs a Rust-native representation,
  • all data we put in the schema is stored there in the so-called "reduced representation" (a tuple of data that can be pickled). This means that the PyO3 class must produce precisely this repr, so that it can be schema_restore-d later by the getter methods,
  • conversion between the "Python reduced repr" and the "Rust repr" is currently slow. That's because:
    • I'm importing classes just to check whether a value is an instance of them,
    • some data needs to be copied (strings, uuids, lists, ...).
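
The cost of importing a class on every conversion can be avoided by resolving it once and reusing the result. The following is only a minimal sketch of that idea: it uses `std::sync::OnceLock` from the standard library as a stand-in for a PyO3 cell, and `ClassHandle`, `expensive_import`, `object_set_class` and the module name are hypothetical names invented for illustration, not code from this PR.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how often the expensive lookup runs (for demonstration only).
static LOOKUPS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for a handle to an imported Python class.
#[derive(Debug, PartialEq, Eq)]
pub struct ClassHandle {
    pub qualified_name: &'static str,
}

/// Hypothetical stand-in for the costly step, such as importing a module
/// and fetching a class object from it.
fn expensive_import() -> ClassHandle {
    LOOKUPS.fetch_add(1, Ordering::SeqCst);
    ClassHandle {
        qualified_name: "hypothetical_module.ObjectSet",
    }
}

/// Returns the cached handle; the import runs at most once per process.
pub fn object_set_class() -> &'static ClassHandle {
    static CELL: OnceLock<ClassHandle> = OnceLock::new();
    CELL.get_or_init(expensive_import)
}

/// Number of times the expensive lookup has actually been executed.
pub fn lookup_count() -> usize {
    LOOKUPS.load(Ordering::SeqCst)
}
```

Every call to `object_set_class()` after the first returns the same reference without repeating the lookup, so an instance check only pays for the import once.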

I've managed to get it to the point where it compiles the std lib, bootstraps, and works on all queries that I've tried. Let's see how the test suite does.

Plan:

  • Cache imports using static PyOnceLock,
  • Reducing values is no longer needed, because serialization is now done with serde and bincode. This means that we should remove the "reduced repr" and make PyO3 class consume and produce the "normal" Python repr of values. I suspect that this will provide a huge speed-up.
  • ObjectList, ObjectSet, ObjectDict and ObjectIndex are currently copied on each access. This means that if we want to, for example, look up a Pointer of an ObjectType, we call get_pointers(), which copies all pointers from the schema into a new ObjectIndex instance, just to pick a single Uuid out of it. This is so wrong that it feels unethical and immoral. To improve this, I want to store these values in the schema in an Rc, so that when we retrieve a field value, we just clone the Rc and not the value itself. This means that each of these object containers would need its own PyO3 wrapper class. A bit of work, but a huge potential speed-up.
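
The difference between copying a container on each access and sharing it through an `Rc` can be sketched as follows. This is an illustration under assumptions, not the gel-schema implementation: the `ObjectIndex` and `ObjectType` structs here are simplified hypothetical stand-ins, `Id` replaces a real UUID type, and both getter names are invented.

```rust
use std::collections::BTreeMap;
use std::rc::Rc;

/// Hypothetical stand-in for a UUID.
pub type Id = u128;

/// Hypothetical container mapping pointer names to object ids.
#[derive(Debug, Clone, PartialEq)]
pub struct ObjectIndex {
    pub by_name: BTreeMap<String, Id>,
}

/// Hypothetical object type that keeps its pointers behind an Rc.
pub struct ObjectType {
    pointers: Rc<ObjectIndex>,
}

impl ObjectType {
    pub fn new(entries: &[(&str, Id)]) -> Self {
        let by_name = entries
            .iter()
            .map(|(name, id)| (name.to_string(), *id))
            .collect();
        ObjectType {
            pointers: Rc::new(ObjectIndex { by_name }),
        }
    }

    /// Copying getter: clones every entry of the index on each call.
    pub fn get_pointers_copied(&self) -> ObjectIndex {
        (*self.pointers).clone()
    }

    /// Sharing getter: only increments the reference count.
    pub fn get_pointers_shared(&self) -> Rc<ObjectIndex> {
        Rc::clone(&self.pointers)
    }
}
```

With the sharing getter, looking up a single entry costs one reference-count increment plus the map lookup, regardless of how many pointers the index holds. The copying getter does work proportional to the size of the index on every call.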

Current benchmark: sometimes this is 3x slower than master

[image: time]

This reverts commit ed80851e1e90c40e9c4a44cc5008347112deb9d5.
I'm not convinced this is worth doing on a large scale:
- .get_fully_qualified is much more efficient, so it should be used when possible,
- it is also much less convenient,
- it can mostly be used only in places that are not hot paths of our compiler
This reverts commit 5e4ad5adda3ee557abc9b5ea699679d6d3fa2218.
This reverts commit 0474076.
@aljazerzen
Contributor Author

Latest benchmarks:

[image: time]

... are great! This impl is only ~5% slower than the current Python one.

The major improvement came from enabling the release build (lol), and another significant one came from using im::HashMap instead of im::OrdMap. I thought I needed ordering in maps, but apparently I don't. Great.
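
To illustrate what giving up ordering means, here is a small sketch that uses the standard library's `BTreeMap` and `HashMap` as stand-ins for the ordered and hashed map types. This is an assumption made to keep the example dependency-free: the std maps are not persistent like the `im` ones, and the helper names are hypothetical. Point lookups behave identically on both; only iteration order differs.

```rust
use std::collections::{BTreeMap, HashMap};

/// Builds an ordered map; iteration visits keys in ascending order.
pub fn build_ordered(entries: &[(u32, &str)]) -> BTreeMap<u32, String> {
    entries
        .iter()
        .map(|(id, name)| (*id, name.to_string()))
        .collect()
}

/// Builds a hash map; iteration order is unspecified.
pub fn build_hashed(entries: &[(u32, &str)]) -> HashMap<u32, String> {
    entries
        .iter()
        .map(|(id, name)| (*id, name.to_string()))
        .collect()
}

/// Returns the keys of the ordered map in iteration order.
pub fn ordered_keys(map: &BTreeMap<u32, String>) -> Vec<u32> {
    map.keys().copied().collect()
}
```

If the code only ever fetches entries by key, as in `map.get(&id)`, nothing depends on the sorted iteration that the ordered map maintains, so the hashed variant is a drop-in replacement.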

There are still some optimizations left that I can implement, so I plan to get this to run faster than the Python impl on master. They require more work, though, which might take some time.

@aljazerzen
Contributor Author

One concerning data point is still the compile_migration_01 benchmark; I have to investigate that.

Base automatically changed from refactor-schema to master October 27, 2025 06:51