DRILL-8474: Add Daffodil Format Plugin to Drill #2989

cgivre · 2025-05-08T23:11:38Z

DRILL-8474: Add Daffodil Format Plugin to Drill

Description

This PR replaces #2836, which is closed. That PR is left in place to retain its history and comments, while its numerous debug-related commits are squashed together into this PR. This PR also replaces #2909.

Documentation

New format-daffodil module created

The schemaFileURI still uses absolute paths, which is cheating: it would not work in a true distributed Drill environment.

We have yet to work out how Drill can provide access to DFDL schemas in XML form so that their include/import references can be resolved.

The input data stream is, however, being accessed in the proper Drill manner. Gunzip happened automatically. Nice.

Tests show this works for data as complex as nested repeating sub-records.

These DFDL types are supported:

int
long
short
byte
boolean
double
float (does not work; see bug DAFFODIL-2367)
hexBinary
string

Registering Daffodil Schemata

The Daffodil schema management system allows users to:

Register Daffodil schema JARs dynamically without restarting Drill
Distribute schema JARs across all Drillbits in a cluster
Unregister schemas when no longer needed
Version control schema registrations

Architecture

Key Components

RemoteDaffodilSchemaRegistry: Main registry managing schema JARs across the cluster
- Persistent storage for registered schemas
- Transient locks for concurrent access control
- Version-controlled updates with retry logic
DaffodilSchemaProvider: Lifecycle manager for schema registry
- Initialized during Drillbit startup
- Provides access to RemoteDaffodilSchemaRegistry
CreateDaffodilSchemaHandler: Handles schema registration SQL commands
DropDaffodilSchemaHandler: Handles schema unregistration SQL commands
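
The "version-controlled updates with retry logic" item can be pictured as an optimistic compare-and-set loop. The sketch below is only an in-memory illustration of that general technique, not the code in this PR: the class `VersionedRegistrySketch`, its `Snapshot` type, and the `maxRetries` parameter are all hypothetical names, and the real registry presumably works against Drill's persistent store rather than an `AtomicReference`.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical in-memory stand-in for a version-controlled schema registry. */
final class VersionedRegistrySketch {

  /** Immutable snapshot: a version counter plus the registered JAR names. */
  static final class Snapshot {
    final int version;
    final Set<String> jars;

    Snapshot(int version, Set<String> jars) {
      this.version = version;
      this.jars = Collections.unmodifiableSet(jars);
    }
  }

  private final AtomicReference<Snapshot> current =
      new AtomicReference<>(new Snapshot(0, new HashSet<String>()));

  /**
   * Adds a JAR name and bumps the version. Returns false if the name is
   * already registered. If another writer replaced the snapshot between
   * the read and the write, the update is retried up to maxRetries times.
   */
  boolean register(String jarName, int maxRetries) {
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      Snapshot seen = current.get();
      if (seen.jars.contains(jarName)) {
        return false; // duplicate registration
      }
      Set<String> updated = new HashSet<>(seen.jars);
      updated.add(jarName);
      Snapshot next = new Snapshot(seen.version + 1, updated);
      if (current.compareAndSet(seen, next)) {
        return true; // nobody changed the registry in between
      }
      // Lost the race: loop and re-read the latest snapshot.
    }
    throw new IllegalStateException(
        "Gave up registering " + jarName + " after " + maxRetries + " retries");
  }

  int version() {
    return current.get().version;
  }

  boolean contains(String jarName) {
    return current.get().jars.contains(jarName);
  }
}
```

The point of the version check is that a writer never overwrites a registry state it has not seen; it either applies its change on top of the latest state or retries.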

File System Layout

The system uses three directories for managing schema JARs:

Staging: /path/to/base/staging/
- Users copy schema JAR files here before registration
- Temporary location before validation and registration
Registry: /path/to/base/registry/
- Permanent storage for registered schema JARs
- Accessible by all Drillbits in the cluster
Tmp: /path/to/base/tmp/
- Temporary backup during registration process
- Used for rollback on registration failure
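
As a small illustration of the layout above, the following Java sketch derives the three directories from one base path. The class name `DaffodilDirsSketch` is hypothetical, and it uses local `java.nio.file` paths as a stand-in; the PR itself may resolve these locations through Drill's configured filesystem instead.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: derives staging, registry and tmp from one base directory. */
final class DaffodilDirsSketch {
  final Path staging;
  final Path registry;
  final Path tmp;

  private DaffodilDirsSketch(Path base) {
    this.staging = base.resolve("staging");   // users drop JARs here
    this.registry = base.resolve("registry"); // registered JARs live here
    this.tmp = base.resolve("tmp");           // backup during registration
  }

  static DaffodilDirsSketch under(String base) {
    return new DaffodilDirsSketch(Paths.get(base));
  }

  public static void main(String[] args) {
    DaffodilDirsSketch dirs = under("/opt/drill/daffodil");
    System.out.println(dirs.staging);
    System.out.println(dirs.registry);
    System.out.println(dirs.tmp);
  }
}
```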

Configuration

Configure Daffodil schema directories in drill-override.conf:

drill.exec.daffodil {
  directory {
    # Base directory for all Daffodil schema directories
    base: "/opt/drill/daffodil"

    # Optional: specific filesystem (defaults to default FS)
    # fs: "hdfs://namenode:8020"

    # Optional: root directory (defaults to user home)
    # root: "/user/drill"

    # Staging directory for uploading schema JARs
    staging: ${drill.exec.daffodil.directory.base}"/staging"

    # Registry directory for registered schemas
    registry: ${drill.exec.daffodil.directory.base}"/registry"

    # Temporary directory for backup during registration
    tmp: ${drill.exec.daffodil.directory.base}"/tmp"
  }
}

Usage

Registering a Daffodil Schema

Copy the schema JAR to the staging directory:

# Copy your Daffodil schema JAR to the configured staging directory
cp my-schema.jar /opt/drill/daffodil/staging/

Register the schema using SQL:

CREATE DAFFODIL SCHEMA USING JAR 'my-schema.jar';

Success Response:

+------+----------------------------------------------------------+
| ok   | summary                                                  |
+------+----------------------------------------------------------+
| true | Daffodil schema jar my-schema.jar has been registered    |
|      | successfully.                                            |
+------+----------------------------------------------------------+
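
One plausible reading of the registration step, sketched in Java on the local filesystem: copy the staged JAR to tmp, fill the registry from that stable copy, and undo the registry copy if anything fails. This is an assumption-laden sketch, not the handler in this PR. The class `RegistrationFlowSketch` and the helper `newDemoBase` are hypothetical, and removing the JAR from staging after success is assumed rather than documented above. Only the two error strings are taken from this description.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical local-filesystem sketch of staging -> tmp -> registry. */
final class RegistrationFlowSketch {

  static void register(Path staging, Path tmp, Path registry, String jarName) {
    Path staged = staging.resolve(jarName);
    Path backup = tmp.resolve(jarName);
    Path target = registry.resolve(jarName);

    if (!Files.exists(staged)) {
      throw new IllegalStateException("File " + staged + " does not exist on file system");
    }
    if (Files.exists(target)) {
      throw new IllegalStateException("Jar with " + jarName + " name has been already registered");
    }
    try {
      // Take a stable copy first, then fill the registry from that copy.
      Files.copy(staged, backup, StandardCopyOption.REPLACE_EXISTING);
      Files.copy(backup, target);
      Files.delete(staged); // assumption: staging is emptied on success
    } catch (IOException e) {
      // Roll back: make sure no half-registered JAR stays in the registry.
      try {
        Files.deleteIfExists(target);
      } catch (IOException suppressed) {
        e.addSuppressed(suppressed);
      }
      throw new UncheckedIOException("Registration of " + jarName + " failed", e);
    } finally {
      try {
        Files.deleteIfExists(backup);
      } catch (IOException ignored) {
        // best-effort cleanup of the temporary copy
      }
    }
  }

  /** Demo helper: creates base/{staging,registry,tmp} and stages a placeholder file. */
  static Path newDemoBase(String jarName) {
    try {
      Path base = Files.createTempDirectory("daffodil-demo");
      for (String dir : new String[] {"staging", "registry", "tmp"}) {
        Files.createDirectory(base.resolve(dir));
      }
      Files.write(base.resolve("staging").resolve(jarName), new byte[] {1, 2, 3});
      return base;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

In a cluster the same steps would run against a shared filesystem, which is why the registry directory has to be reachable from every Drillbit.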

Unregistering a Daffodil Schema

DROP DAFFODIL SCHEMA USING JAR 'my-schema.jar';

Success Response:

+------+------------------------------------------------------------+
| ok   | summary                                                    |
+------+------------------------------------------------------------+
| true | Daffodil schema jar my-schema.jar has been unregistered    |
|      | successfully.                                              |
+------+------------------------------------------------------------+

Error Handling

Common Errors

JAR not found in staging:

File /opt/drill/daffodil/staging/my-schema.jar does not exist on file system

Solution: Ensure the JAR file is copied to the staging directory

Duplicate schema registration:

Jar with my-schema.jar name has been already registered

Solution: Use DROP to unregister the existing schema first, or use a different JAR name

Schema not registered:

Jar my-schema.jar is not registered in remote registry

Solution: Verify the schema was previously registered using CREATE DAFFODIL SCHEMA

Concurrent access:

Jar with my-schema.jar name is used. Action: REGISTRATION

Solution: Wait for the current operation to complete
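
The concurrent-access error can be understood as a per-JAR lock that records which action currently holds the name. The Java sketch below shows that idea with an in-process map; it is an illustration under assumptions, not this PR's implementation. The class `JarActionLocksSketch` is hypothetical, the `UNREGISTRATION` action name is assumed (only `REGISTRATION` appears in the message above), and a real cluster would need a lock that all Drillbits can see.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical in-process stand-in for the transient per-JAR locks. */
final class JarActionLocksSketch {

  // REGISTRATION appears in the documented error; UNREGISTRATION is assumed.
  enum Action { REGISTRATION, UNREGISTRATION }

  private final ConcurrentMap<String, Action> inProgress = new ConcurrentHashMap<>();

  /** Claims the JAR name for an action, or fails if another action holds it. */
  void acquire(String jarName, Action action) {
    Action holder = inProgress.putIfAbsent(jarName, action);
    if (holder != null) {
      throw new IllegalStateException(
          "Jar with " + jarName + " name is used. Action: " + holder);
    }
  }

  /** Releases the JAR name once the action has finished or failed. */
  void release(String jarName) {
    inProgress.remove(jarName);
  }
}
```

Callers would typically pair `acquire` with `release` in a `try`/`finally` so that a failed registration does not leave the name locked.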

Development and Testing

Running Tests

# Run all Daffodil schema tests
mvn test -pl exec/java-exec -Dtest=TestDaffodilSchemaHandlers

# Run RemoteDaffodilSchemaRegistry tests
mvn test -pl exec/java-exec -Dtest=TestRemoteDaffodilSchemaRegistry

Test Coverage

Basic syntax validation
JAR registration and unregistration
Duplicate detection
Concurrent access handling
Error scenarios (missing JAR, not registered, etc.)
File system operations

Testing

See tests under src/test in the new daffodil contrib module.

cgivre · 2025-05-09T23:00:26Z

@mbeckerle I'm working on the logic to add queries similar to the Dynamic UDF capabilities which would allow a user to import the DFDL files. That will be a separate PR once this is merged.

mbeckerle · 2025-11-02T17:11:20Z

Glad to see this is moving forward. I did as much as I could without detailed understanding of Drill. Once this is done and usable I can perhaps help create demos of data and queries of it that are non-toy to show what it can do.

cgivre · 2025-11-02T23:09:24Z

@mbeckerle I think this is done and workable. Users can now execute queries like

CREATE DAFFODIL SCHEMA xxx USING JAR yyyy

and that schema file will be copied from a staging directory to all drillbits and should then be accessible for queries. Can you give it a try?

mbeckerle · 2025-11-04T14:12:48Z

I will try this later this week. I will also spread the word about this to others.

mbeckerle · 2025-11-04T14:21:17Z

Is there any other documentation/how-to other than this one liner about schemas? Like where is the staging directory? Is there a hello-world type SQL query against a DFDL data file containing say, just a single string?

cgivre · 2025-11-04T18:47:59Z

Take a look here: exec/java-exec/src/main/java/org/apache/drill/exec/schema/daffodil/README.md

mbeckerle · 2025-11-05T16:10:31Z

I guess I don't understand how this connects to queries. This looks like some very nice registry behavior that handles the distributed nature of Drill w.r.t moving a schema jar around and making it available on the classpath for Daffodil to use in every drill bit.

But how then does a query access things in the jar? Is there some sort of path/access mechanism to load things from the jar? I.e., the jar ends up on the class path, and then normal Java loading i.e., getResource() calls, are used to get stuff out of the jar?

I guess I'm looking for the piece that puts this registry together with a query that uses it.

cgivre · 2025-11-05T18:05:46Z

I added a new parameter to Daffodil reader: schemaFile. If that is defined, Drill will look in the persistent storage for a schema file.

So the workflow would be:

Register the schema:

CREATE DAFFODIL SCHEMA USING JAR 'schema.jar'

That schema will now be propagated to all Drillbits and is ready to use.

Query data:

SELECT * FROM table(dfs.`data/data06Int.dat`
        (type => 'daffodil',
        validationMode => 'true',
        schemaFile => 'schema.jar',
        rootName => 'row',
        rootNamespace => null))

In theory, Drill should handle all the file management. The schemaURI variable functions as before.
The query language uses the word JAR but the schema files can be anything supported by Daffodil.
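
A rough Java sketch of how the two parameters might be reconciled when a reader looks for its schema. Everything here is an assumption for illustration: the class `SchemaLocatorSketch`, the `registryDir` argument, and in particular the rule that `schemaFile` takes precedence when both are set are not confirmed by the PR.

```java
import java.net.URI;
import java.nio.file.Paths;

/** Hypothetical resolution of a schema location from reader parameters. */
final class SchemaLocatorSketch {

  /**
   * Returns the location to load the schema from.
   * Assumption: schemaFile (a registered name) wins over schemaURI.
   */
  static URI resolve(String schemaFile, String schemaURI, String registryDir) {
    if (schemaFile != null && !schemaFile.isEmpty()) {
      // Registered schemas are looked up by name in the registry directory.
      return Paths.get(registryDir).resolve(schemaFile).toUri();
    }
    if (schemaURI != null && !schemaURI.isEmpty()) {
      // Fall back to the explicit URI, which behaves as before.
      return URI.create(schemaURI);
    }
    throw new IllegalArgumentException("Either schemaFile or schemaURI must be set");
  }
}
```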

mbeckerle · 2025-11-05T21:41:03Z

OK, if I specify an actual JAR file containing some compiled Java code, will that be put onto the Java classpath in the Drillbits?

The issue I'm seeing is that schemas are normally pre-compiled into a ".bin" file which is fast to load, but in addition to this file, the schema may have a dependency on certain Daffodil plug in code, which is compiled java in jar files. This dependency can be on multiple different jar files. All these dependency jar files need to be on the classpath.

The daffodil plugins are of 3 kinds. UDFs, "layers" (which compute checksums or decompress zip files, etc. ), and charset definitions. All are dynamically loaded into the JVM when the DFDL schema requests them. They are found using the

All these different jar files need to be on the Java classpath so that their metadata allows dynamic loading.

So while a simple DFDL schema might be contained in one jar file, in general there can be a dependency on multiple jar files which must be placed onto the Java classpath in a specific order. The schema may be needed in source form also for validation of data.

As a case in point, on github there are DFDL schema projects named:

envelope-payload
tcpMessage
mil-std-2045
PCAP
ethernetIP

These are separate component DFDL schemas that are assembled to form an assembly schema by way of schema composition.
The only jar file that needs to be on the classpath is the one from ethernetIP, since that defines a layer algorithm for computing IPv4 checksums.

The DFDL schema that combines all these components can be pre-compiled into an envelope-payload.bin file.

So in this case I need the ".bin" file to be distributed across the cluster and loaded by Daffodil in each Drillbit, and I need ethernetIP.jar to be distributed across the cluster and placed on the classpath of each Drillbit's Java process.
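
To make the classpath concern concrete, here is a hedged Java sketch of loading resources from JARs that were not on the JVM classpath at startup, searched in a caller-chosen order. It only demonstrates standard `URLClassLoader` resource lookup. It does not reproduce how Daffodil discovers its plugins, and the class `JarResourceSketch`, the helper `demoJar`, and the resource names are hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

/** Hypothetical sketch: resource lookup across JARs in a fixed order. */
final class JarResourceSketch {

  /** Builds a class loader that searches the given JARs in list order. */
  static URLClassLoader loaderFor(List<Path> jarsInOrder) {
    URL[] urls = new URL[jarsInOrder.size()];
    try {
      for (int i = 0; i < urls.length; i++) {
        urls[i] = jarsInOrder.get(i).toUri().toURL();
      }
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException(e);
    }
    return new URLClassLoader(urls, JarResourceSketch.class.getClassLoader());
  }

  /** Returns true if the named resource is visible through the given JARs. */
  static boolean canSee(List<Path> jarsInOrder, String resourceName) {
    try (URLClassLoader loader = loaderFor(jarsInOrder)) {
      return loader.getResource(resourceName) != null;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Demo helper: writes a one-entry JAR to a temp file. */
  static Path demoJar(String entryName, String content) {
    try {
      Path jar = Files.createTempFile("demo-schema", ".jar");
      try (OutputStream out = Files.newOutputStream(jar);
           JarOutputStream jos = new JarOutputStream(out)) {
        jos.putNextEntry(new JarEntry(entryName));
        jos.write(content.getBytes(StandardCharsets.UTF_8));
        jos.closeEntry();
      }
      return jar;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

A loader like this would let registered JARs be picked up without restarting the JVM, but whether Daffodil's dynamic loading would consult it depends on which class loader Daffodil uses, which this sketch does not address.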

cgivre · 2025-11-07T15:48:34Z

@mbeckerle
See inline.

The issue I'm seeing is that schemas are normally pre-compiled into a ".bin" file which is fast to load, but in addition to this file, the schema may have a dependency on certain Daffodil plug in code, which is compiled java in jar files. This dependency can be on multiple different jar files. All these dependency jar files need to be on the classpath.

The daffodil plugins are of 3 kinds. UDFs, "layers" (which compute checksums or decompress zip files, etc. ), and charset definitions. All are dynamically loaded into the JVM when the DFDL schema requests them. They are found using the

All these different jar files need to be on the Java classpath so that their metadata allows dynamic loading.

In the current implementation, any file that the user registers will be copied into the Daffodil Schema directory. Would it be sufficient if the user added that directory to the classpath? I'm not sure if this would be a security issue or not.

So while a simple DFDL schema might be contained in one jar file, in general there can be a dependency on multiple jar files which must be placed onto the Java classpath in a specific order. The schema may be needed in source form also for validation of data.

As a case in point, on github there are DFDL schema projects named:

envelope-payload

tcpMessage

mil-std-2045

PCAP

ethernetIP

These are separate component DFDL schemas that are assembled to form an assembly schema by way of schema composition. The only jar file that needs to be on the classpath is the one from ethernetIP, since that defines a layer algorithm for computing IPv4 checksums.

The DFDL schema that combines all these components can be pre-compiled into an envelope-payload.bin file.

If all of this can be combined into one file, that would be the easiest route. Then a user could simply run a CREATE DAFFODIL SCHEMA query and that file would be copied to the schema directory, where it can be accessed in Drill queries.

So in this case I need the ".bin" file to be distributed across the cluster and loaded by Daffodil in each Drillbit, and I need ethernetIP.jar to be distributed across the cluster and placed on the classpath of each Drillbit's Java process.

If the classpath solution won't work, what would you suggest? Alternatively, we could simply require that the user add the JAR manually to the class path of all Drill nodes.

Requires Daffodil version 3.7.0 or higher.

New format-daffodil module created

Still uses absolute paths for the schemaFileURI. (which is cheating. Wouldn't work in a true distributed drill environment.)

We have yet to work out how to enable Drill to provide access for DFDL schemas in XML form with include/import to be resolved.

The input data stream is, however, being accessed in the proper Drill manner. Gunzip happened automatically. Nice.

Note: Fix boxed Boolean vs. boolean problem. Don't use boxed primitives in Format config objects.

Test show this works for data as complex as having nested repeating sub-records.

These DFDL types are supported:

- int
- long
- short
- byte
- boolean
- double
- float (does not work. Bug DAFFODIL-2367)
- hexBinary
- string

apache#2835

cgivre marked this pull request as draft May 8, 2025 23:11

cgivre self-assigned this May 8, 2025

cgivre added the enhancement and new-format labels May 8, 2025

cgivre requested a review from jnturton May 9, 2025 15:06

cgivre marked this pull request as ready for review May 9, 2025 15:06

cgivre changed the title ~~DRILL-8474: Add Daffodil Format Plugin to Drill~~ DRILL-8474: Add Daffodil Format Plugin to Drill: Phase 1 May 9, 2025

mbeckerle reviewed May 9, 2025

pom.xml (outdated)

cgivre force-pushed the dfdl_phase_1 branch from f5de2ff to f0c59c1 Compare July 6, 2025 14:01

cgivre force-pushed the dfdl_phase_1 branch from f0c59c1 to 1982e65 Compare July 13, 2025 19:52

cgivre force-pushed the dfdl_phase_1 branch from 1982e65 to 673a656 Compare August 11, 2025 04:32

cgivre force-pushed the dfdl_phase_1 branch from 7fc5bc7 to a9fd902 Compare November 2, 2025 14:44

cgivre changed the title ~~DRILL-8474: Add Daffodil Format Plugin to Drill: Phase 1~~ DRILL-8474: Add Daffodil Format Plugin to Drill Nov 2, 2025

mbeckerle and others added 7 commits November 9, 2025 10:14

Changes to fix checkstyle errors

f0ff092

Update to Daffodil 3.8.0

5aa7d44

Update pom and fixed unit tests

89984fe

Update pom.xml

518f587

Update pom.xml

8dbf92d

Added Exec Config Options

d75657b

cgivre and others added 11 commits November 9, 2025 10:14

Working on SQL

0197061

WIP

89e9e28

WIP

6246aea

Update pom.xml

fb56852

WIP

0854cc8

Added SQL

0ae251f

Fix checkstyle

b25f473

Add license header

e64cd21

Fixed unit tests

26bd828

Bump Daffodil to 3.11.0

d23b786

Added additional unit tests

893590e

cgivre force-pushed the dfdl_phase_1 branch from 289c341 to 893590e Compare November 9, 2025 15:14
