Ladislav Sulak edited this page Oct 30, 2025 · 3 revisions

Support for Spark 3.x

The library is currently built against Spark 2.4.x. Spark 3.x, the next major version, has been out for some time, and version 4.x is on the horizon. The library should be updated to support Spark 3.x, and ideally 4.x as well.

Incorporate ErrorHandling

The ErrorHandling capability within spark-commons (or in its own library, if it gets extracted) is useful and generic enough to be incorporated into spark-data-standardization. That would let users handle errors during the standardization process as they require, instead of being forced into the single current behavior.
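To illustrate what a user-selectable error behavior could look like, here is a minimal plain-Python sketch. It does not use Spark and does not reflect the actual ErrorHandling API of spark-commons. The function name `standardize_column`, the `on_error` modes (`"collect"`, `"drop"`, `"fail"`), and the `errors` key are all hypothetical names chosen for this example.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]

# Hypothetical modes, not taken from spark-commons.
_MODES = ("collect", "drop", "fail")


def standardize_column(
    rows: List[Row],
    column: str,
    cast: Callable[[Any], Any],
    on_error: str = "collect",
) -> List[Row]:
    """Cast one column of every row, handling failures as the caller asks."""
    if on_error not in _MODES:
        raise ValueError(f"unknown on_error mode: {on_error!r}")

    result: List[Row] = []
    for row in rows:
        new_row = dict(row)  # shallow copy, the input rows are left untouched
        try:
            new_row[column] = cast(row[column])
        except (ValueError, TypeError) as exc:
            if on_error == "fail":
                raise  # stop the whole run on the first bad value
            if on_error == "drop":
                continue  # silently skip the offending row
            # "collect": null the value and record the error next to the data
            new_row[column] = None
            new_row["errors"] = list(row.get("errors", [])) + [f"{column}: {exc}"]
        result.append(new_row)
    return result
```

For example, `standardize_column([{"n": "1"}, {"n": "x"}], "n", int)` keeps both rows and attaches an error entry to the second one, while `on_error="drop"` returns only the first row and `on_error="fail"` raises the underlying `ValueError`.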

Fixing some rare bugs

There are some known bugs, particularly in corner cases of date/timestamp parsing, that deserve to be fixed.

Missing types support

Some types are not really supported by spark-data-standardization, namely IntervalType and NullType. Full support for them would enhance the versatility of the library.

BinaryType could also be supported. Conversion from StringType could include encoding metadata that accepts URL encoding, Base64, or uuencode.
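A string-to-binary conversion driven by encoding metadata could be sketched as follows. This is a plain-Python illustration using only the standard library, not the library's implementation. The function name `decode_to_binary`, the metadata values (`"base64"`, `"url"`, `"uuencode"`, `"none"`), and the choice of UTF-8 when no encoding is given are assumptions made for this example.

```python
import base64
import binascii
from urllib.parse import unquote_to_bytes


def decode_to_binary(value: str, encoding: str = "none") -> bytes:
    """Turn a string into bytes according to a (hypothetical) encoding tag."""
    enc = encoding.lower()
    if enc == "base64":
        # validate=True rejects characters outside the Base64 alphabet
        return base64.b64decode(value, validate=True)
    if enc == "url":
        # percent-escapes such as %FF become the corresponding raw bytes
        return unquote_to_bytes(value)
    if enc == "uuencode":
        # decodes a single uuencoded line (no begin/end framing)
        return binascii.a2b_uu(value)
    if enc == "none":
        # assumption: without an encoding tag, take the UTF-8 bytes as-is
        return value.encode("utf-8")
    raise ValueError(f"unsupported encoding: {encoding!r}")
```

Malformed Base64 or uuencoded input raises `binascii.Error`, which is a subclass of `ValueError`, so a caller can treat every failure of this function as a single error type.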

The complex types support can also be improved.

Custom unsystematic transformations done via plugins and UDFs

Some unsystematic and Absa-specific transformations are currently implemented directly within the library. Moving them into plugins and UDFs would make the open-source library more generic and extensible.