# Neo4j Unity Catalog Connector

A single shaded (fat) JAR that bundles the Neo4j JDBC driver, the SQL-to-Cypher translator, and the Spark subquery cleaner for use with Databricks Unity Catalog federated queries.

Instead of downloading and uploading two separate JARs (`neo4j-jdbc-full-bundle` + `neo4j-jdbc-translator-sparkcleaner`), users upload this single JAR to a UC Volume and reference one path in their connection configuration.

## Prerequisites

- Java 17+

## Build

```bash
cd neo4j-unity-catalog-connector
./mvnw clean verify
```

The shaded JAR is produced at:

```
target/neo4j-unity-catalog-connector-1.0.0-SNAPSHOT.jar
```

## Run Tests

Tests verify that the bundled translators are discoverable via SPI, the Spark subquery cleaner handles Databricks/Spark query patterns, and the JDBC driver class is loadable.

```bash
./mvnw test
```

## Test in Databricks

1. Build the JAR (see above).

2. Upload to a Unity Catalog Volume:

   ```python
   # In a Databricks notebook
   dbutils.fs.cp(
       "file:/path/to/neo4j-unity-catalog-connector-1.0.0-SNAPSHOT.jar",
       "/Volumes/<catalog>/<schema>/jars/neo4j-unity-catalog-connector-1.0.0-SNAPSHOT.jar"
   )
   ```

3. Create a JDBC connection referencing the single JAR:

   ```sql
   CREATE CONNECTION neo4j_connection TYPE JDBC
   ENVIRONMENT (
     java_dependencies '["/Volumes/<catalog>/<schema>/jars/neo4j-unity-catalog-connector-1.0.0-SNAPSHOT.jar"]',
     safespark_memory '800m'
   )
   OPTIONS (
     host '<neo4j-host>',
     port '7687',
     user '<username>',
     password '<password>',
     jdbc_driver 'org.neo4j.jdbc.Neo4jDriver',
     jdbc_url 'jdbc:neo4j://<neo4j-host>:7687?database=neo4j&enableSQLTranslation=true'
   )
   ```

4. Run a federated query:

   ```sql
   SELECT * FROM IDENTIFIER(neo4j_connection.`/`) LIMIT 10;
   ```

## What's Inside

The shaded JAR bundles:

| Dependency | Purpose |
|---|---|
| `neo4j-jdbc` | Core JDBC driver for Neo4j |
| `neo4j-jdbc-translator-impl` | SQL-to-Cypher translation engine |
| `neo4j-jdbc-translator-sparkcleaner` | Cleans Spark subquery wrapping (`SPARK_GEN_SUBQ_0 WHERE 1=0`) |

All transitive dependencies (Jackson, Netty, jOOQ, Bolt protocol, Cypher DSL, Reactive Streams) are relocated under `org.neo4j.jdbc.internal.shaded.*` to avoid classpath conflicts with the Databricks runtime.

## Design

### Problem

Connecting Neo4j to Databricks Unity Catalog previously required users to download and upload **two separate JARs** to a Unity Catalog Volume:

1. `neo4j-jdbc-full-bundle-6.x.x.jar` — the main JDBC driver with SQL-to-Cypher translation
2. `neo4j-jdbc-translator-sparkcleaner-6.x.x.jar` — handles Spark's subquery wrapping (`SPARK_GEN_SUBQ_0 WHERE 1=0`)

Both had to be referenced individually in the `java_dependencies` array when creating a UC JDBC connection. This meant two manual downloads from Maven Central, two uploads to a Volume, two paths to manage, and two version numbers to keep in sync. If a user forgot the sparkcleaner JAR or used mismatched versions, the connection failed with confusing, hard-to-diagnose errors.

### Precedent: The AWS Glue Project

The `neo4j-aws-glue` project already does exactly this for AWS Glue. It is a small Maven project that:

- Depends on `neo4j-jdbc`, `neo4j-jdbc-translator-impl`, and `neo4j-jdbc-translator-sparkcleaner`
- Adds its own custom translator (`AwsGlueTranslator`) that rewrites `WHERE 1=0` to `LIMIT 1` for Glue's schema probing behavior
- Uses `maven-shade-plugin` to merge everything into a single JAR with relocated packages (Jackson, Netty, jOOQ, Bolt, Cypher DSL, etc.) under `org.neo4j.jdbc.internal.shaded.*` to avoid classpath conflicts
- Registers the custom translator via Java SPI (`META-INF/services/org.neo4j.jdbc.translator.spi.TranslatorFactory`)
- Produces a self-contained JAR that users drop into AWS Glue with zero additional setup

This project follows the same pattern.
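
For intuition, the Glue-specific rewrite described above can be sketched in a few lines. This is an illustration only: the real `AwsGlueTranslator` is Java, implements the translator SPI, and does not use a regex; the function name here is hypothetical.

```python
import re

def rewrite_glue_schema_probe(sql: str) -> str:
    """Illustrative sketch of the idea behind AwsGlueTranslator:
    turn Glue's schema-probing `WHERE 1=0` suffix into `LIMIT 1`."""
    return re.sub(r"(?i)\bWHERE\s+1\s*=\s*0\s*$", "LIMIT 1", sql.strip())

print(rewrite_glue_schema_probe("SELECT * FROM Movie WHERE 1 = 0"))
# SELECT * FROM Movie LIMIT 1
```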

### Custom Databricks Translator

The AWS Glue project has a custom `AwsGlueTranslator` because AWS Glue sends its own `WHERE 1=0` pattern for schema probing that differs from Spark's. Databricks uses standard Spark through SafeSpark, so the existing `neo4j-jdbc-translator-sparkcleaner` handles the subquery wrapping without any additional custom translator.
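
The wrapping the spark cleaner removes has the shape `SELECT * FROM (<inner query>) SPARK_GEN_SUBQ_0 WHERE 1=0`. A minimal regex-based sketch of the unwrapping idea (illustrative only — the bundled cleaner is the real implementation and is not regex-based):

```python
import re

# Matches Spark's generated wrapper around the inner query.
# Illustration only; not the actual sparkcleaner logic.
SPARK_WRAPPER = re.compile(
    r"(?is)^\s*SELECT\s+\*\s+FROM\s+\((.*)\)\s+SPARK_GEN_SUBQ_\d+\s+WHERE\s+1\s*=\s*0\s*$"
)

def unwrap_spark_subquery(sql: str) -> str:
    m = SPARK_WRAPPER.match(sql)
    return m.group(1).strip() if m else sql  # pass through anything unwrapped

print(unwrap_spark_subquery(
    "SELECT * FROM (SELECT p.name FROM Person p) SPARK_GEN_SUBQ_0 WHERE 1=0"
))
# SELECT p.name FROM Person p
```

The pass-through branch mirrors the cleaner's behavior of leaving non-wrapped statements untouched.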

If testing reveals Databricks-specific SQL patterns that the existing translators don't handle (for example, SafeSpark may introduce its own query wrapping beyond what standard Spark does), a custom `DatabricksTranslator` can be added later following the same SPI pattern. The project structure accommodates this possibility even if the current version ships without one.

### SPI Service Registration

Unlike the AWS Glue project, there is no custom translator factory to register via SPI. The bundled `neo4j-jdbc-translator-sparkcleaner` and `neo4j-jdbc-translator-impl` JARs each include their own `META-INF/services/org.neo4j.jdbc.translator.spi.TranslatorFactory` files. The `ServicesResourceTransformer` in the maven-shade-plugin automatically merges these SPI registrations into the shaded JAR, so no custom services file is needed. If a `DatabricksTranslator` is added later, its factory would be registered via a new services file at that point.
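
After shading, the merged service descriptor contains one provider line per bundled factory. The class simple names below come from the build verification later in this document; the package prefixes are assumptions for illustration:

```
# META-INF/services/org.neo4j.jdbc.translator.spi.TranslatorFactory
# (merged by ServicesResourceTransformer; package names illustrative)
org.neo4j.jdbc.translator.impl.SqlToCypherTranslatorFactory
org.neo4j.jdbc.translator.sparkcleaner.SparkSubqueryCleaningTranslatorFactory
```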

### User-Agent Identification

The project includes a `META-INF/neo4j-jdbc-user-agent.txt` file containing:

```
neo4j-unity-catalog-connector/${project.version}
```

This string is sent by the Neo4j JDBC driver to the Neo4j server with every connection. The `${project.version}` placeholder is substituted by Maven at build time (via `<filtering>true</filtering>` in the pom.xml). This lets Neo4j (especially Aura) distinguish connections coming from the Databricks UC connector from those made by the plain JDBC driver or the Glue connector — useful for support, usage analytics, and debugging.
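
The substitution uses standard Maven resource filtering; a sketch of the relevant pom stanza (not necessarily the project's exact pom):

```xml
<build>
  <resources>
    <resource>
      <directory>src/main/resources</directory>
      <!-- replaces ${project.version} in neo4j-jdbc-user-agent.txt at build time -->
      <filtering>true</filtering>
    </resource>
  </resources>
</build>
```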

### Package Relocation

All bundled dependencies are relocated to avoid conflicts with whatever JARs are already on the Databricks SafeSpark sandbox classpath. The relocation scheme from the AWS Glue project (`org.neo4j.jdbc.internal.shaded.*`) is reused as-is since it was designed by the Neo4j Connectors team for exactly this purpose.
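
A relocation rule plus the SPI merge transformer in `maven-shade-plugin` looks roughly like this, assuming the pom mirrors the AWS Glue project's shade configuration (patterns abbreviated, one shown as an example):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <!-- one rule per bundled dependency family -->
      <relocation>
        <pattern>com.fasterxml.jackson</pattern>
        <shadedPattern>org.neo4j.jdbc.internal.shaded.com.fasterxml.jackson</shadedPattern>
      </relocation>
      <!-- ...similar rules for Netty, jOOQ, Bolt, Cypher DSL, Reactive Streams -->
    </relocations>
    <transformers>
      <!-- merges META-INF/services files from all bundled JARs -->
      <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
    </transformers>
  </configuration>
</plugin>
```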

### Impact on the User Experience

**Before (two JARs):**

```sql
CREATE CONNECTION neo4j_connection TYPE JDBC
ENVIRONMENT (
  java_dependencies '[
    "/Volumes/catalog/schema/jars/neo4j-jdbc-full-bundle-6.10.5.jar",
    "/Volumes/catalog/schema/jars/neo4j-jdbc-translator-sparkcleaner-6.10.5.jar"
  ]'
)
OPTIONS (...)
```

**After (one JAR):**

```sql
CREATE CONNECTION neo4j_connection TYPE JDBC
ENVIRONMENT (
  java_dependencies '["/Volumes/catalog/schema/jars/neo4j-unity-catalog-connector-1.0.0.jar"]'
)
OPTIONS (...)
```

### Decisions

1. **Repo location:** Subdirectory within `neo4j-uc-integration` (`neo4j-unity-catalog-connector/`).

2. **Artifact naming:** `neo4j-unity-catalog-connector` (groupId: `org.neo4j`, artifactId: `neo4j-unity-catalog-connector`).

3. **Version alignment:** Independent versioning (starting at `1.0.0-SNAPSHOT`), with the upstream `neo4j-jdbc` dependency version pinned separately (initially `6.10.5`).

4. **Custom translator:** Not needed initially. The existing `sparkcleaner` translator handles Databricks/Spark subquery wrapping. If testing reveals Databricks-specific SQL patterns, a `DatabricksTranslator` can be added following the `AwsGlueTranslator` SPI pattern.
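
Decision 3 above (independent versioning with a separately pinned driver) translates into the pom roughly as follows; the property name is illustrative:

```xml
<properties>
  <!-- upstream driver version, pinned independently of this project's version -->
  <neo4j-jdbc.version>6.10.5</neo4j-jdbc.version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.neo4j</groupId>
    <artifactId>neo4j-jdbc</artifactId>
    <version>${neo4j-jdbc.version}</version>
  </dependency>
</dependencies>
```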

### Implementation Progress

#### Phase 1: Create the Maven Project — COMPLETE

Built and verified locally. The `neo4j-unity-catalog-connector/` subdirectory contains:

```
neo4j-unity-catalog-connector/
├── .mvn/wrapper/
│   ├── maven-wrapper.jar
│   └── maven-wrapper.properties
├── src/
│   ├── main/resources/META-INF/
│   │   └── neo4j-jdbc-user-agent.txt
│   └── test/java/org/neo4j/uc/
│       └── BundledTranslatorsTest.java
├── mvnw
├── mvnw.cmd
├── pom.xml
└── README.md
```

**Build verification:**

- `./mvnw clean verify` succeeds (6 tests pass)
- Produces `neo4j-unity-catalog-connector-1.0.0-SNAPSHOT.jar` (11 MB)
- User-agent in JAR: `neo4j-unity-catalog-connector/1.0.0-SNAPSHOT`
- SPI services merged: `SqlToCypherTranslatorFactory` + `SparkSubqueryCleaningTranslatorFactory`
- 5952 classes relocated under `org.neo4j.jdbc.internal.shaded.*`

**Unit tests (`BundledTranslatorsTest`):**

- SPI discovery: verifies both `SqlToCypherTranslatorFactory` and `SparkSubqueryCleaningTranslatorFactory` are found via `ServiceLoader`
- Factory creation: verifies all discovered factories produce non-null `Translator` instances
- Pipeline integration: verifies the full translator pipeline (spark cleaner + SQL-to-Cypher) processes Spark-wrapped queries without error and removes `SPARK_GEN_SUBQ` wrapping
- Spark cleaner pass-through: verifies the cleaner handles plain Cypher without throwing
- JDBC driver loading: verifies `org.neo4j.jdbc.Neo4jDriver` is on the classpath

**CI/CD workflows added:**

- `.github/workflows/connector-build.yml` — builds on push to `main` and PRs, scoped to `neo4j-unity-catalog-connector/` path changes
- `.github/workflows/connector-release.yml` — publishes a GitHub Release on `connector-*` tags

#### Phase 2: Validate with Databricks — NOT STARTED

#### Phase 3: Update Documentation — NOT STARTED

#### Phase 4: CI/CD and Release — PARTIAL

- GitHub Actions workflows created (build + release)
- Dependabot configuration not yet added
- Maven Central publishing decision deferred