DRILL-8474: Add Daffodil Format Plugin to Drill #2989
@mbeckerle I'm working on the logic to add queries, similar to the Dynamic UDF capabilities, which would allow a user to import the DFDL files. That will be a separate PR once this is merged.
Glad to see this is moving forward. I did as much as I could without a detailed understanding of Drill. Once this is done and usable, I can perhaps help create non-toy demos of data and queries to show what it can do.
@mbeckerle I think this is done and workable. Users can now execute queries like CREATE DAFFODIL SCHEMA xxx USING JAR yyyy, and that schema file will be copied from a staging directory to all Drillbits and should then be accessible for queries. Can you give it a try?
I will try this later this week. I will also spread the word about this to others. |
Is there any other documentation or how-to beyond this one-liner about schemas? For example, where is the staging directory? Is there a hello-world type SQL query against a DFDL data file containing, say, just a single string?
Take a look here: exec/java-exec/src/main/java/org/apache/drill/exec/schema/daffodil/README.md
I guess I don't understand how this connects to queries. This looks like some very nice registry behavior that handles the distributed nature of Drill w.r.t. moving a schema jar around and making it available on the classpath for Daffodil to use in every drill bit. But how then does a query access things in the jar? Is there some sort of path/access mechanism to load things from the jar? That is, does the jar end up on the classpath, with normal Java loading (getResource() calls) then used to get things out of the jar? I guess I'm looking for the piece that puts this registry together with a query that uses it.
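To make the question concrete, here is a minimal sketch of the "normal Java loading" path being asked about: a jar is placed on a class loader's search path and an entry is read back by name with getResourceAsStream(). This only illustrates the JDK mechanism. The class JarResourceSketch, the entry name, and the temp-directory jar are all hypothetical, and nothing here is taken from the Drill or Daffodil code in this PR.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

/** Hypothetical helper for illustration; not part of Drill or Daffodil. */
final class JarResourceSketch {

  /** Writes a jar with a single text entry; it stands in for a schema jar. */
  static Path writeJar(Path jar, String entryName, String content) {
    try (OutputStream out = Files.newOutputStream(jar);
         JarOutputStream jos = new JarOutputStream(out)) {
      jos.putNextEntry(new JarEntry(entryName));
      jos.write(content.getBytes(StandardCharsets.UTF_8));
      jos.closeEntry();
      return jar;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Reads a named resource through a class loader whose search path is the given jar. */
  static String readResource(Path jar, String resourceName) {
    try (URLClassLoader loader =
             new URLClassLoader(new URL[] {jar.toUri().toURL()}, null);
         InputStream in = loader.getResourceAsStream(resourceName)) {
      if (in == null) {
        throw new IllegalArgumentException("Resource not found: " + resourceName);
      }
      return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Builds a throwaway jar in a temp directory and reads its one entry back. */
  static String demo() {
    try {
      Path dir = Files.createTempDirectory("schema-jar-sketch");
      Path jar = writeJar(dir.resolve("schema.jar"), "schemas/hello.dfdl.xsd", "<schema/>");
      return readResource(jar, "schemas/hello.dfdl.xsd");
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Whether the reader in this PR resolves schemaFile this way, or by a direct file path into the registry directory, is exactly the open question in this comment.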
I added a new parameter to the Daffodil reader. So the workflow would be:

    CREATE DAFFODIL SCHEMA USING JAR 'schema.jar'

That schema will now be propagated to all Drillbits and is ready to use.

    SELECT * FROM table(dfs.`data/data06Int.dat`
      (type => 'daffodil',
       validationMode => 'true',
       schemaFile => 'schema.jar',
       rootName => 'row',
       rootNamespace => null))

In theory, Drill should handle all the file management. The …
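For anyone scripting this workflow, the query text can be assembled from its parameters. The sketch below is a hypothetical helper (DaffodilQuerySketch is not part of Drill) that reproduces the table-function call shown in this comment; the parameter names are the ones used in the query above, and the quoting rule (doubling single quotes) is an assumption about what the SQL parser expects.

```java
/** Hypothetical helper for illustration; not part of Drill. */
final class DaffodilQuerySketch {

  /** Renders a SQL string literal, or the bare keyword null when the value is absent. */
  static String literal(String value) {
    return value == null ? "null" : "'" + value.replace("'", "''") + "'";
  }

  /**
   * Builds the SELECT shown in the comment above. The path is inserted as-is,
   * so this sketch assumes it contains no backtick characters.
   */
  static String selectAll(String path, boolean validate, String schemaFile,
                          String rootName, String rootNamespace) {
    return "SELECT * FROM table(dfs.`" + path + "` (type => 'daffodil', "
        + "validationMode => " + literal(String.valueOf(validate)) + ", "
        + "schemaFile => " + literal(schemaFile) + ", "
        + "rootName => " + literal(rootName) + ", "
        + "rootNamespace => " + literal(rootNamespace) + "))";
  }
}
```

Calling selectAll("data/data06Int.dat", true, "schema.jar", "row", null) yields the same statement as above on a single line.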
OK, if I specify an actual jar file containing some compiled Java code, will that be put onto the Java classpath in the drill bits?

The issue I'm seeing is that schemas are normally pre-compiled into a ".bin" file, which is fast to load, but in addition to this file, the schema may have a dependency on certain Daffodil plugin code, which is compiled Java in jar files. This dependency can be on multiple different jar files. All these dependency jar files need to be on the classpath.

The Daffodil plugins are of 3 kinds: UDFs, "layers" (which compute checksums or decompress zip files, etc.), and charset definitions. All are dynamically loaded into the JVM when the DFDL schema requests them. They are found using the … All these different jar files need to be on the Java classpath so that their metadata allows dynamic loading.

So while a simple DFDL schema might be contained in one jar file, in general there can be a dependency on multiple jar files, which must be placed onto the Java classpath in a specific order. The schema may also be needed in source form for validation of data.

As a case in point, on GitHub there are DFDL schema projects named:

These are separate component DFDL schemas that are assembled to form an assembly schema by way of schema composition. The DFDL schema that combines all these components can be pre-compiled into an envelope-payload.bin file. So in this case I need this ".bin" file to be distributed across the cluster and loaded by Daffodil in each drill bit, and I need the ethernetIP.jar file to be distributed across the Drill cluster and placed on the classpath of the drill bit Java process.
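The point about ordering can be seen with plain JDK class loading. In this sketch two jars carry an entry with the same name, and which one is found depends only on the order in which the jars are listed, because URLClassLoader searches its URLs in the order given. The class ClasspathOrderSketch, the jar names a.jar and b.jar, and the entry name are hypothetical; this is not a description of how Daffodil locates its plugins.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

/** Hypothetical helper for illustration; not part of Drill or Daffodil. */
final class ClasspathOrderSketch {
  /** Entry name shared by both jars, so the lookup is ambiguous on purpose. */
  static final String RESOURCE = "plugin/name.txt";

  /** Writes a jar whose only entry is RESOURCE with the given content. */
  static Path writeJar(Path jar, String content) {
    try (OutputStream out = Files.newOutputStream(jar);
         JarOutputStream jos = new JarOutputStream(out)) {
      jos.putNextEntry(new JarEntry(RESOURCE));
      jos.write(content.getBytes(StandardCharsets.UTF_8));
      jos.closeEntry();
      return jar;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Returns the content of RESOURCE as seen through the jars in the given order. */
  static String firstMatch(List<Path> jarsInOrder) {
    try {
      URL[] urls = new URL[jarsInOrder.size()];
      for (int i = 0; i < urls.length; i++) {
        urls[i] = jarsInOrder.get(i).toUri().toURL();
      }
      try (URLClassLoader loader = new URLClassLoader(urls, null);
           InputStream in = loader.getResourceAsStream(RESOURCE)) {
        if (in == null) {
          throw new IllegalStateException("Resource not found: " + RESOURCE);
        }
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Looks the resource up with the jars listed as (a, b) and then as (b, a). */
  static List<String> demo() {
    try {
      Path dir = Files.createTempDirectory("classpath-order-sketch");
      Path a = writeJar(dir.resolve("a.jar"), "from-a");
      Path b = writeJar(dir.resolve("b.jar"), "from-b");
      return List.of(firstMatch(List.of(a, b)), firstMatch(List.of(b, a)));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

If the registry directory were simply added to the classpath as a whole, this ordering would not be under the user's control, which is one reason a multi-jar schema may need more than a single registered file.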
@mbeckerle
In the current implementation, any file that the user registers will be copied into the Daffodil Schema directory. Would it be sufficient if the user added that directory to the classpath? I'm not sure if this would be a security issue or not.
If all of this can be combined into one file, that would be the easiest route. Then a user could simply do a
If the classpath solution won't work, what would you suggest? Alternatively, we could simply require that the user add the JAR manually to the classpath of all Drill nodes.
Requires Daffodil version 3.7.0 or higher.
New format-daffodil module created
Still uses absolute paths for the schemaFileURI. (which is cheating. Wouldn't work in a true distributed drill environment.)
We have yet to work out how to enable Drill to provide access for DFDL schemas in XML form with include/import to be resolved.
The input data stream is, however, being accessed in the proper Drill manner. Gunzip happened automatically. Nice.
Note: Fix boxed Boolean vs. boolean problem. Don't use boxed primitives in Format config objects.
Tests show this works for data as complex as having nested repeating sub-records.
These DFDL types are supported:
- int
- long
- short
- byte
- boolean
- double
- float (does not work. Bug DAFFODIL-2367)
- hexBinary
- string
apache#2835
DRILL-8474: Add Daffodil Format Plugin to Drill
Description
This PR replaces #2836, which is closed. That PR was kept to retain history/comments while its numerous debug-related commits were squashed together into this PR. This PR also replaces #2909.
Documentation
New format-daffodil module created
Still uses absolute paths for the schemaFileURI. (which is cheating. Wouldn't work in a true distributed drill environment.)
We have yet to work out how to enable Drill to provide access for DFDL schemas in XML form with include/import to be resolved.
The input data stream is, however, being accessed in the proper Drill manner. Gunzip happened automatically. Nice.
Tests show this works for data as complex as having nested repeating sub-records.
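These DFDL types are supported:
- int
- long
- short
- byte
- boolean
- double
- float (does not work. Bug DAFFODIL-2367)
- hexBinary
- string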
Registering Daffodil Schemata
The Daffodil schema management system allows users to:
Architecture
Key Components
RemoteDaffodilSchemaRegistry: Main registry managing schema JARs across the cluster
DaffodilSchemaProvider: Lifecycle manager for schema registry
CreateDaffodilSchemaHandler: Handles schema registration SQL commands
DropDaffodilSchemaHandler: Handles schema unregistration SQL commands
File System Layout
The system uses three directories for managing schema JARs:
Staging: /path/to/base/staging/
Registry: /path/to/base/registry/
Tmp: /path/to/base/tmp/
Configuration
Configure Daffodil schema directories in drill-override.conf:
Usage
Registering a Daffodil Schema
# Copy your Daffodil schema JAR to the configured staging directory
cp my-schema.jar /opt/drill/daffodil/staging/
CREATE DAFFODIL SCHEMA USING JAR 'my-schema.jar';
Success Response:
Unregistering a Daffodil Schema
DROP DAFFODIL SCHEMA USING JAR 'my-schema.jar';
Success Response:
Error Handling
Common Errors
JAR not found in staging:
Solution: Ensure the JAR file is copied to the staging directory
Duplicate schema registration:
Solution: Use DROP to unregister the existing schema first, or use a different JAR name
Schema not registered:
Solution: Verify the schema was previously registered using CREATE DAFFODIL SCHEMA
Concurrent access:
Solution: Wait for the current operation to complete
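The four error conditions above can be pictured with a small local-filesystem stand-in for the registry. This is a sketch under stated assumptions, not the RemoteDaffodilSchemaRegistry implementation: the class SchemaRegistrySketch, its method names, the exception types, the message texts, and the use of the tmp directory as a scratch area before the final move are all hypothetical, and a single in-process lock stands in for whatever cluster-wide coordination the real registry uses.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical local stand-in for the schema registry; illustration only. */
final class SchemaRegistrySketch {
  private final Path staging;
  private final Path registry;
  private final Path tmp;
  private final ReentrantLock lock = new ReentrantLock();

  SchemaRegistrySketch(Path base) {
    this.staging = mkdirs(base.resolve("staging"));
    this.registry = mkdirs(base.resolve("registry"));
    this.tmp = mkdirs(base.resolve("tmp"));
  }

  /** Creates a registry rooted in a fresh temp directory. */
  static SchemaRegistrySketch inTempDir() {
    try {
      return new SchemaRegistrySketch(Files.createTempDirectory("daffodil-sketch"));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  private static Path mkdirs(Path dir) {
    try {
      return Files.createDirectories(dir);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Demo helper: drops an empty placeholder file into the staging directory. */
  void stage(String jarName) {
    try {
      Files.write(staging.resolve(jarName), new byte[0]);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  boolean isRegistered(String jarName) {
    return Files.exists(registry.resolve(jarName));
  }

  /** Copies a staged jar into the registry, rejecting missing and duplicate jars. */
  void register(String jarName) {
    acquire();
    try {
      Path source = staging.resolve(jarName);
      Path target = registry.resolve(jarName);
      if (!Files.exists(source)) {
        throw new IllegalArgumentException("JAR not found in staging: " + jarName);
      }
      if (Files.exists(target)) {
        throw new IllegalStateException("Duplicate schema registration: " + jarName);
      }
      // Assumption: copy into tmp first so a partial copy never appears in the registry.
      Path scratch = tmp.resolve(jarName);
      Files.copy(source, scratch, StandardCopyOption.REPLACE_EXISTING);
      Files.move(scratch, target);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } finally {
      lock.unlock();
    }
  }

  /** Removes a jar from the registry, rejecting jars that were never registered. */
  void unregister(String jarName) {
    acquire();
    try {
      Path target = registry.resolve(jarName);
      if (!Files.exists(target)) {
        throw new IllegalStateException("Schema not registered: " + jarName);
      }
      Files.delete(target);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } finally {
      lock.unlock();
    }
  }

  /** Fails fast instead of waiting when another thread holds the lock. */
  private void acquire() {
    if (!lock.tryLock()) {
      throw new IllegalStateException("Concurrent access: another operation is in progress");
    }
  }
}
```

The sketch fails fast on contention, which matches the "wait for the current operation to complete" advice: the caller retries once the other operation has finished.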
Development and Testing
Running Tests
Test Coverage
Testing
See tests under src/test in the new daffodil contrib module.