[Feature][Zeta] Add dynamic metadata provider #10838
chl-wxp wants to merge 5 commits into apache:dev
Conversation
DanielLeens left a comment
Thanks for the contribution. I reviewed this PR from the full diff and traced the real runtime path from job parsing to the new REST layer.
What problem this PR is trying to solve
- User pain: contributors want datasource connection properties to be managed once on the engine side instead of being repeated in every job config.
- Fix approach: this PR adds a Hazelcast-backed `DynamicMetadataProvider`, exposes REST CRUD endpoints for datasources, and injects datasource properties into connector configs through `metadata_datasource_id`.
- One-line summary: this is trying to turn datasource configuration into a dynamically managed engine-side registry.
Example:
- Before this PR, a job still needed to inline its JDBC/Kafka/etc. datasource properties in the job config.
- After this PR, the user can register a datasource once and let the parser inject it dynamically when `metadata_datasource_id` is present.
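The register-once / inject-dynamically idea can be sketched with a plain map standing in for the Hazelcast IMap. `DatasourceRegistry` and `resolve` are hypothetical names, and the precedence rule (explicit job-config keys win over registry values) is my assumption, not something the PR confirms:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a registry keyed by metadata_datasource_id,
// with explicit job-config keys taking precedence over registry values.
class DatasourceRegistry {
    private final Map<String, Map<String, Object>> store = new HashMap<>();

    void register(String datasourceId, Map<String, Object> properties) {
        store.put(datasourceId, new HashMap<>(properties));
    }

    /** Merge registered datasource properties into a connector config. */
    Map<String, Object> resolve(String datasourceId, Map<String, Object> jobConfig) {
        Map<String, Object> merged = new HashMap<>(store.getOrDefault(datasourceId, Map.of()));
        merged.putAll(jobConfig); // assumed precedence: job config wins on conflicts
        return merged;
    }
}
```

With this shape, a job config only needs the keys it wants to override; everything else comes from the registry entry.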
Core logic review
Exact change
The main path is:
- `MultipleTableJobConfigParser.java:857-867` decides whether datasource resolution should happen.
- `MetadataProviderManager.java:180-223` asks the provider for a datasource map.
- `MetadataProviderManager.java:298-355` merges that map back into the connector config.
- `DynamicMetadataProvider.java:109-156` loads datasource properties from a Hazelcast IMap.
- `JettyService.java:220-257` exposes `/metadata/datasource` and `/metadata/datasources`.
- `DynamicMetadataDataSourceService.java:56-138` creates/reads/lists/updates/deletes the datasource objects.
The main correctness problem is that the new serialized datasource object stores every property value as a string:
```java
for (Map.Entry<String, Object> entry : properties.entrySet()) {
    out.writeString(entry.getKey());
    out.writeString(entry.getValue().toString());
}

String value = in.readString();
properties.put(key, value);
```
And the REST layer returns the same properties directly:
```java
datasource.getProperties().forEach(
        (key, value) -> {
            propertiesJson.add(key, value != null ? value.toString() : "");
        });
```
Key findings
- The normal metadata-enabled job parsing path definitely hits this PR. This is not dormant code.
- The new REST endpoints do not just manage harmless metadata. They can store and return real connector properties, which naturally includes passwords, tokens, and access keys.
- `DynamicMetadataDataSource.java:79-103` currently flattens all values to `String`, so typed connector options lose their original types on the main path.
- This PR changes both `seatunnel-engine` and `seatunnel-api`, but the PR diff contains no `docs/` files and no `src/test/java` files.
- The current CI failure is PR-caused, not infrastructure noise.
Full execution path
Job parsing main path
-> MultipleTableJobConfigParser.handleDataSource() [857-867]
-> metadata.enabled=true and metadata_datasource_id exists
-> MetadataProviderManager.resolveDataSourceConfigs(...)
Dynamic datasource resolution
-> MetadataProviderManager.resolveConnectorConfig() [180-223]
-> provider.datasourceMap(connectorIdentifier, datasourceId)
-> DynamicMetadataProvider.datasourceMap() [109-156]
-> datasourceIMap.get(metaDataDatasourceId)
-> validate connectorType
-> return datasource.getProperties()
-> MetadataProviderManager.mergeConfig() [298-355]
-> merge datasource values back into connector config
-> ConfigFactory.parseMap(mergedMap)
REST registration / lookup path
-> JettyService.createJettyServer() [168-174]
-> BasicAuthFilter is only enabled when the user explicitly turns it on
-> JettyService [220-257]
-> expose /metadata/datasource and /metadata/datasources
-> DynamicMetadataDataSourceServlet.doGet/doPost/doPut/doDelete() [44-112]
-> call DynamicMetadataDataSourceService
-> DynamicMetadataDataSourceService.create/get/list(...) [56-138]
-> store connectorType + properties in Hazelcast IMap
-> DynamicMetadataDataSource.writeData/readData() [79-103]
-> coerce every property value to String during serialization
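The coercion at the bottom of that path can be reproduced in isolation. This is a minimal standalone sketch of the lossy round trip, not the PR's actual `writeData`/`readData` code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of the lossy round trip: every value is written via
// toString() and read back as a plain String, mirroring what the PR's
// serializer does to a Map<String, Object> of connector properties.
class StringCoercionDemo {
    static Map<String, String> roundTrip(Map<String, Object> properties) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : properties.entrySet()) {
            out.put(entry.getKey(), entry.getValue().toString()); // type information lost here
        }
        return out;
    }
}
```

An `Integer` port, a `Boolean` flag, or a `List` of hosts all come back as strings, which is exactly what the connector type contract below cannot tolerate.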
Compatibility impact
Compatibility judgment: fully compatible for existing jobs, because old jobs do not hit this path unless they explicitly enable dynamic metadata.
- API: additive.
- Configs: additive.
- Defaults: no old default is changed.
- Protocol / serialization: a new Hazelcast-serialized type is introduced.
- Historical behavior: existing jobs remain unchanged unless they opt into the new feature.
The current blockers are therefore not about old-job compatibility. They are about the safety and correctness of the new feature itself.
Performance / side effects
- The extra IMap lookup and config merge are acceptable from a performance perspective.
- The more important side effects are security and config correctness, not CPU cost.
- Datasource configs will also live in the cluster map and be replicated across the Hazelcast cluster.
Error handling and logs
I did not find a major swallowed-exception bug, but I found three notable issues.
Problem 1: the new REST API exposes datasource secrets on the default HTTP management surface, and returned values are not redacted
- Location: `seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/JettyService.java:168`
- Why this is a problem: `JettyService.java:220-257` exposes `/metadata/datasource` and `/metadata/datasources`, `DynamicMetadataDataSourceServlet.java:44-112` provides GET/POST/PUT/DELETE, and `DynamicMetadataDataSourceService.java:103-138,226-241` returns the full properties map. The only protection in this PR is the optional global basic auth filter at `JettyService.java:168-174`. In the default HTTP management deployment, this effectively adds a plaintext secret read/write API.
- Risk: this is a high-severity security issue. Anyone who can reach the engine HTTP port may be able to read or overwrite real datasource credentials.
- Best fix: recommended option A is to put these endpoints behind mandatory authentication and redact sensitive fields in GET/LIST responses. Option B, if that authentication boundary is not ready yet, is to avoid exposing plaintext read endpoints in this PR.
- Severity: High
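One shape the redaction half of option A could take, as a sketch only: `DatasourceRedactor`, `SENSITIVE_KEYS`, and the particular key list are hypothetical, not an existing project convention.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: mask well-known secret keys before a GET/LIST
// response is rendered, instead of returning raw property values.
class DatasourceRedactor {
    private static final Set<String> SENSITIVE_KEYS =
            Set.of("password", "token", "secret", "access-key", "secret-key");

    static Map<String, String> redact(Map<String, String> properties) {
        Map<String, String> safe = new LinkedHashMap<>();
        properties.forEach((key, value) ->
                safe.put(key, SENSITIVE_KEYS.contains(key.toLowerCase()) ? "******" : value));
        return safe;
    }
}
```

Redaction alone does not replace authentication; it only limits what a read endpoint can leak.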
Problem 2: the Hazelcast serialization path coerces every datasource property to String, which breaks the connector type contract on the main path
- Location: `seatunnel-engine/seatunnel-engine-core/src/main/java/org/apache/seatunnel/engine/core/job/DynamicMetadataDataSource.java:79`
- Why this is a problem: `DynamicMetadataDataSource.writeData()/readData()` at 79-103 serializes all values as strings, and `updateProperties()` even narrows updates to `Map<String, String>`. But the main runtime path later feeds this map back into `MetadataProviderManager.mergeConfig()` (`MetadataProviderManager.java:298-355`). So integers, booleans, lists, nested objects, and other typed connector values are all at risk of being silently turned into strings before the connector sees them.
- Risk: this is a main-path correctness issue. It can cause validation failures or, worse, incorrect runtime behavior with the wrong config types.
- Best fix: recommended option A is to preserve real value types in the serialization protocol. Option B, if the feature is intentionally string-only for now, is to make that contract explicit in validation and docs instead of exposing `Map<String, Object>`.
- Severity: High
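A type-preserving variant of option A could tag each value before writing it. The sketch below uses hypothetical helper names and encodes to tagged strings only to keep it self-contained; a real fix would carry the tag (or use Hazelcast's generic object serialization) inside `writeData`/`readData`, and would need to cover more types than shown here.

```java
// Sketch of a tag-based scheme: write a one-character type tag with each
// value so the reader can restore Integer/Boolean/String. Hypothetical
// helpers, not the PR's serialization API; extendable to longs, lists,
// and nested maps.
class TypedPropertyCodec {
    static String encode(Object value) {
        if (value instanceof Integer) return "I" + value;
        if (value instanceof Boolean) return "B" + value;
        return "S" + value;
    }

    static Object decode(String encoded) {
        char tag = encoded.charAt(0);
        String payload = encoded.substring(1);
        switch (tag) {
            case 'I':
                return Integer.valueOf(payload);
            case 'B':
                return Boolean.valueOf(payload);
            default:
                return payload;
        }
    }
}
```

The point is that the reader gets enough information to reconstruct the original type, so `mergeConfig()` downstream sees the values the user actually registered.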
Problem 3: this is a new user-visible engine/API capability, but the PR ships without docs or regression tests, and the current Build is already red due to an example resource issue
- Location: `seatunnel-examples/seatunnel-engine-examples/src/main/resources/examples/fake_to_console.conf:1`
- Why this is a problem: `gh pr diff 10838 --repo apache/seatunnel --name-only` shows 16 changed files and no `docs/` or `src/test/java` paths. In addition, `gh pr checks 10838 --repo apache/seatunnel` shows `Build` failing, and the failed log points to the example file above missing a valid ASF license header.
- Risk: for a new `seatunnel-engine`/`seatunnel-api` feature, this means the usage contract is undocumented, the main path is untested, and the PR does not even pass the basic static quality gate yet.
- Best fix: fix the license-header failure first, add at least one regression test for the register -> resolve -> inject main path, and add both English and Chinese docs describing the feature boundary, auth expectations, and supported property types.
- Severity: Medium
Code quality
Most of the Java code follows the usual project style, but the missing license header in the new example is already enough to keep CI red.
Test coverage
Local verification I actually ran:
- `gh pr diff 10838 --repo apache/seatunnel --name-only`: passed, confirmed the PR changes 16 files and includes no docs or test files.
- `git show seatunnel-review-10838:.../MultipleTableJobConfigParser.java | nl -ba | sed -n '857,867p'`: passed, confirmed the main job parser path now resolves dynamic datasources.
- `git show seatunnel-review-10838:.../DynamicMetadataProvider.java | nl -ba | sed -n '109,156p'`: passed, confirmed the provider reads datasource configs from Hazelcast IMap.
- `git show seatunnel-review-10838:.../DynamicMetadataDataSource.java | nl -ba | sed -n '79,103p'`: passed, confirmed all property values are currently serialized as strings.
- `gh pr checks 10838 --repo apache/seatunnel`: failed, current `Build` is red.
- `gh run view 25092004758 --repo chl-wxp/seatunnel-laowang --log-failed`: passed, confirmed the failure is `Run / License header`, and the invalid file is `seatunnel-examples/seatunnel-engine-examples/src/main/resources/examples/fake_to_console.conf`.
For this scope, the test/documentation gap is too large right now.
Documentation
This is a user-visible feature, but the PR contains no docs/en or docs/zh update at all. At minimum, it needs documentation for:
- how to enable the `dynamic` metadata provider;
- how the REST datasource API should be used;
- what the authentication expectation is;
- what property types are officially supported;
- how this feature relates to other metadata/schema flows.
Architecture assessment
- Elegance: the product direction is interesting and potentially valuable.
- Maintainability: the current security boundary and type contract are not explicit enough.
- Extensibility: if the two high-severity issues are fixed, this could evolve into a reusable datasource registry.
- Historical compatibility: existing jobs are safe by default, but the new feature is not stable enough yet.
Issue summary
| No. | Issue | Location | Severity |
|---|---|---|---|
| 1 | The new REST API exposes datasource secrets on the default HTTP management surface and returns them unredacted | `JettyService.java:168` | High |
| 2 | The serialization path coerces every datasource property to String and breaks typed connector configs | `DynamicMetadataDataSource.java:79` | High |
| 3 | The feature ships without docs/tests, and the current Build is already red because of an example resource issue | `fake_to_console.conf:1` | Medium |
Merge conclusion
Conclusion: merge after fixes
- Blocking items
- Problem 1: define a real security boundary for secrets before exposing this datasource registry on the engine HTTP surface.
- Problem 2: preserve property types correctly instead of flattening everything to strings.
- Problem 3: get the current Build back to green and add the minimum documentation / regression coverage for a new engine/API feature.
- Suggested follow-up
- No extra non-blocking item beyond the blockers above. The current priority is to make the feature safe and correctly specified.
Overall assessment:
I like the direction and I can see the motivation clearly, but at the latest head this is not merge-ready yet. My preference would be:
- Option A, recommended: design this as a protected config-reference service with mandatory auth/redaction rules and a type-safe serialization contract.
- Option B: if a smaller first version is needed, narrow the feature boundary aggressively instead of exposing a general-purpose secret-bearing datasource registry immediately.
Purpose of this pull request
Closes #10816.
Add a built-in `dynamic` MetadataProvider for the Zeta runtime, so datasource connection properties can be registered through the REST API and referenced in job configs by `metadata_datasource_id`.

What is changed

- `DynamicMetadataProvider` for resolving datasource properties from Zeta runtime storage (`IMap`).
- New REST endpoints:
  - `POST /metadata/datasource`
  - `GET /metadata/datasource/{metadataDatasourceId}`
  - `GET /metadata/datasources`
  - `PUT /metadata/datasource/{metadataDatasourceId}`
  - `DELETE /metadata/datasource/{metadataDatasourceId}`
- `metadata_datasource_id` in job config to avoid repeating datasource connection properties.

Example
Register datasource once:
```json
{
  "metadataDatasourceId": "mysql_001",
  "connectorType": "Jdbc",
  "properties": {
    "url": "jdbc:mysql://127.0.0.1:3306/test",
    "user": "root",
    "password": "******",
    "driver": "com.mysql.cj.jdbc.Driver"
  }
}
```
Then reference it in job config:
Does this PR introduce any user-facing change?
Yes. Users can register datasource metadata through the Zeta REST API and reference it in job configs by `metadata_datasource_id`.
How was this patch tested?