Conversation

TypeScript types have been updated based on the JSON schema changes in the PR.
🛡️ TRIVY SCAN RESULT 🛡️
Target:
Vulnerabilities (4)
| Package | Vulnerability ID | Severity | Installed Version | Fixed Version |
|---|---|---|---|---|
| libpam-modules | CVE-2025-6020 | 🚨 HIGH | 1.5.2-6+deb12u1 | 1.5.2-6+deb12u2 |
| libpam-modules-bin | CVE-2025-6020 | 🚨 HIGH | 1.5.2-6+deb12u1 | 1.5.2-6+deb12u2 |
| libpam-runtime | CVE-2025-6020 | 🚨 HIGH | 1.5.2-6+deb12u1 | 1.5.2-6+deb12u2 |
| libpam0g | CVE-2025-6020 | 🚨 HIGH | 1.5.2-6+deb12u1 | 1.5.2-6+deb12u2 |
🛡️ TRIVY SCAN RESULT 🛡️
Target: Java
Vulnerabilities (33)
| Package | Vulnerability ID | Severity | Installed Version | Fixed Version |
|---|---|---|---|---|
| com.fasterxml.jackson.core:jackson-core | CVE-2025-52999 | 🚨 HIGH | 2.12.7 | 2.15.0 |
| com.fasterxml.jackson.core:jackson-core | CVE-2025-52999 | 🚨 HIGH | 2.13.4 | 2.15.0 |
| com.fasterxml.jackson.core:jackson-databind | CVE-2022-42003 | 🚨 HIGH | 2.12.7 | 2.12.7.1, 2.13.4.2 |
| com.fasterxml.jackson.core:jackson-databind | CVE-2022-42004 | 🚨 HIGH | 2.12.7 | 2.12.7.1, 2.13.4 |
| com.google.code.gson:gson | CVE-2022-25647 | 🚨 HIGH | 2.2.4 | 2.8.9 |
| com.google.protobuf:protobuf-java | CVE-2021-22569 | 🚨 HIGH | 3.3.0 | 3.16.1, 3.18.2, 3.19.2 |
| com.google.protobuf:protobuf-java | CVE-2022-3509 | 🚨 HIGH | 3.3.0 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2022-3510 | 🚨 HIGH | 3.3.0 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2024-7254 | 🚨 HIGH | 3.3.0 | 3.25.5, 4.27.5, 4.28.2 |
| com.google.protobuf:protobuf-java | CVE-2021-22569 | 🚨 HIGH | 3.7.1 | 3.16.1, 3.18.2, 3.19.2 |
| com.google.protobuf:protobuf-java | CVE-2022-3509 | 🚨 HIGH | 3.7.1 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2022-3510 | 🚨 HIGH | 3.7.1 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2024-7254 | 🚨 HIGH | 3.7.1 | 3.25.5, 4.27.5, 4.28.2 |
| com.nimbusds:nimbus-jose-jwt | CVE-2023-52428 | 🚨 HIGH | 9.8.1 | 9.37.2 |
| com.squareup.okhttp3:okhttp | CVE-2021-0341 | 🚨 HIGH | 3.12.12 | 4.9.2 |
| commons-beanutils:commons-beanutils | CVE-2025-48734 | 🚨 HIGH | 1.9.4 | 1.11.0 |
| commons-io:commons-io | CVE-2024-47554 | 🚨 HIGH | 2.8.0 | 2.14.0 |
| dnsjava:dnsjava | CVE-2024-25638 | 🚨 HIGH | 2.1.7 | 3.6.0 |
| io.netty:netty-codec-http2 | CVE-2025-55163 | 🚨 HIGH | 4.1.96.Final | 4.2.4.Final, 4.1.124.Final |
| io.netty:netty-codec-http2 | GHSA-xpw8-rcwv-8f8p | 🚨 HIGH | 4.1.96.Final | 4.1.100.Final |
| io.netty:netty-handler | CVE-2025-24970 | 🚨 HIGH | 4.1.96.Final | 4.1.118.Final |
| net.minidev:json-smart | CVE-2021-31684 | 🚨 HIGH | 1.3.2 | 1.3.3, 2.4.4 |
| net.minidev:json-smart | CVE-2023-1370 | 🚨 HIGH | 1.3.2 | 2.4.9 |
| org.apache.avro:avro | CVE-2024-47561 | 🔥 CRITICAL | 1.7.7 | 1.11.4 |
| org.apache.avro:avro | CVE-2023-39410 | 🚨 HIGH | 1.7.7 | 1.11.3 |
| org.apache.derby:derby | CVE-2022-46337 | 🔥 CRITICAL | 10.14.2.0 | 10.14.3, 10.15.2.1, 10.16.1.2, 10.17.1.0 |
| org.apache.ivy:ivy | CVE-2022-46751 | 🚨 HIGH | 2.5.1 | 2.5.2 |
| org.apache.mesos:mesos | CVE-2018-1330 | 🚨 HIGH | 1.4.3 | 1.6.0 |
| org.apache.thrift:libthrift | CVE-2019-0205 | 🚨 HIGH | 0.12.0 | 0.13.0 |
| org.apache.thrift:libthrift | CVE-2020-13949 | 🚨 HIGH | 0.12.0 | 0.14.0 |
| org.apache.zookeeper:zookeeper | CVE-2023-44981 | 🔥 CRITICAL | 3.6.3 | 3.7.2, 3.8.3, 3.9.1 |
| org.eclipse.jetty:jetty-server | CVE-2024-13009 | 🚨 HIGH | 9.4.56.v20240826 | 9.4.57.v20241219 |
| org.lz4:lz4-java | CVE-2025-12183 | 🚨 HIGH | 1.8.0 | 1.8.1 |
🛡️ TRIVY SCAN RESULT 🛡️
Target: Node.js
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: Python
Vulnerabilities (19)
| Package | Vulnerability ID | Severity | Installed Version | Fixed Version |
|---|---|---|---|---|
| Werkzeug | CVE-2024-34069 | 🚨 HIGH | 2.2.3 | 3.0.3 |
| aiohttp | CVE-2025-69223 | 🚨 HIGH | 3.12.12 | 3.13.3 |
| aiohttp | CVE-2025-69223 | 🚨 HIGH | 3.13.2 | 3.13.3 |
| apache-airflow | CVE-2025-68438 | 🚨 HIGH | 3.1.5 | 3.1.6 |
| apache-airflow | CVE-2025-68675 | 🚨 HIGH | 3.1.5 | 3.1.6 |
| azure-core | CVE-2026-21226 | 🚨 HIGH | 1.37.0 | 1.38.0 |
| cryptography | CVE-2026-26007 | 🚨 HIGH | 42.0.8 | 46.0.5 |
| jaraco.context | CVE-2026-23949 | 🚨 HIGH | 5.3.0 | 6.1.0 |
| jaraco.context | CVE-2026-23949 | 🚨 HIGH | 6.0.1 | 6.1.0 |
| protobuf | CVE-2026-0994 | 🚨 HIGH | 4.25.8 | 6.33.5, 5.29.6 |
| pyasn1 | CVE-2026-23490 | 🚨 HIGH | 0.6.1 | 0.6.2 |
| python-multipart | CVE-2026-24486 | 🚨 HIGH | 0.0.20 | 0.0.22 |
| ray | CVE-2025-62593 | 🔥 CRITICAL | 2.47.1 | 2.52.0 |
| starlette | CVE-2025-62727 | 🚨 HIGH | 0.48.0 | 0.49.1 |
| urllib3 | CVE-2025-66418 | 🚨 HIGH | 1.26.20 | 2.6.0 |
| urllib3 | CVE-2025-66471 | 🚨 HIGH | 1.26.20 | 2.6.0 |
| urllib3 | CVE-2026-21441 | 🚨 HIGH | 1.26.20 | 2.6.3 |
| wheel | CVE-2026-24049 | 🚨 HIGH | 0.45.1 | 0.46.2 |
| wheel | CVE-2026-24049 | 🚨 HIGH | 0.45.1 | 0.46.2 |
🛡️ TRIVY SCAN RESULT 🛡️
Target: usr/bin/docker
Vulnerabilities (4)
| Package | Vulnerability ID | Severity | Installed Version | Fixed Version |
|---|---|---|---|---|
| stdlib | CVE-2025-68121 | 🔥 CRITICAL | v1.25.5 | 1.24.13, 1.25.7, 1.26.0-rc.3 |
| stdlib | CVE-2025-61726 | 🚨 HIGH | v1.25.5 | 1.24.12, 1.25.6 |
| stdlib | CVE-2025-61728 | 🚨 HIGH | v1.25.5 | 1.24.12, 1.25.6 |
| stdlib | CVE-2025-61730 | 🚨 HIGH | v1.25.5 | 1.24.12, 1.25.6 |
🛡️ TRIVY SCAN RESULT 🛡️
Target: /etc/ssl/private/ssl-cert-snakeoil.key
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: /home/airflow/openmetadata-airflow-apis/openmetadata_managed_apis.egg-info/PKG-INFO
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target:
Vulnerabilities (25)
| Package | Vulnerability ID | Severity | Installed Version | Fixed Version |
|---|---|---|---|---|
| linux-libc-dev | CVE-2024-46786 | 🚨 HIGH | 6.1.159-1 | 6.1.162-1 |
| linux-libc-dev | CVE-2025-21946 | 🚨 HIGH | 6.1.159-1 | 6.1.162-1 |
| linux-libc-dev | CVE-2025-22022 | 🚨 HIGH | 6.1.159-1 | 6.1.162-1 |
| linux-libc-dev | CVE-2025-22083 | 🚨 HIGH | 6.1.159-1 | 6.1.162-1 |
| linux-libc-dev | CVE-2025-22107 | 🚨 HIGH | 6.1.159-1 | 6.1.162-1 |
| linux-libc-dev | CVE-2025-22121 | 🚨 HIGH | 6.1.159-1 | 6.1.162-1 |
| linux-libc-dev | CVE-2025-37926 | 🚨 HIGH | 6.1.159-1 | 6.1.162-1 |
| linux-libc-dev | CVE-2025-38022 | 🚨 HIGH | 6.1.159-1 | 6.1.162-1 |
| linux-libc-dev | CVE-2025-38129 | 🚨 HIGH | 6.1.159-1 | 6.1.162-1 |
| linux-libc-dev | CVE-2025-38361 | 🚨 HIGH | 6.1.159-1 | 6.1.162-1 |
| linux-libc-dev | CVE-2025-38718 | 🚨 HIGH | 6.1.159-1 | 6.1.162-1 |
| linux-libc-dev | CVE-2025-39871 | 🚨 HIGH | 6.1.159-1 | 6.1.162-1 |
| linux-libc-dev | CVE-2025-68340 | 🚨 HIGH | 6.1.159-1 | 6.1.162-1 |
| linux-libc-dev | CVE-2025-68349 | 🚨 HIGH | 6.1.159-1 | 6.1.162-1 |
| linux-libc-dev | CVE-2025-68800 | 🚨 HIGH | 6.1.159-1 | 6.1.162-1 |
| linux-libc-dev | CVE-2025-71085 | 🚨 HIGH | 6.1.159-1 | 6.1.162-1 |
| linux-libc-dev | CVE-2025-71116 | 🚨 HIGH | 6.1.159-1 | 6.1.162-1 |
| linux-libc-dev | CVE-2026-22984 | 🚨 HIGH | 6.1.159-1 | 6.1.162-1 |
| linux-libc-dev | CVE-2026-22990 | 🚨 HIGH | 6.1.159-1 | 6.1.162-1 |
| linux-libc-dev | CVE-2026-23001 | 🚨 HIGH | 6.1.159-1 | 6.1.162-1 |
| linux-libc-dev | CVE-2026-23010 | 🚨 HIGH | 6.1.159-1 | 6.1.162-1 |
| linux-libc-dev | CVE-2026-23054 | 🚨 HIGH | 6.1.159-1 | 6.1.162-1 |
| linux-libc-dev | CVE-2026-23074 | 🚨 HIGH | 6.1.159-1 | 6.1.162-1 |
| linux-libc-dev | CVE-2026-23084 | 🚨 HIGH | 6.1.159-1 | 6.1.162-1 |
| linux-libc-dev | CVE-2026-23097 | 🚨 HIGH | 6.1.159-1 | 6.1.162-1 |
🛡️ TRIVY SCAN RESULT 🛡️
Target: Java
Vulnerabilities (33)
| Package | Vulnerability ID | Severity | Installed Version | Fixed Version |
|---|---|---|---|---|
| com.fasterxml.jackson.core:jackson-core | CVE-2025-52999 | 🚨 HIGH | 2.12.7 | 2.15.0 |
| com.fasterxml.jackson.core:jackson-core | CVE-2025-52999 | 🚨 HIGH | 2.13.4 | 2.15.0 |
| com.fasterxml.jackson.core:jackson-databind | CVE-2022-42003 | 🚨 HIGH | 2.12.7 | 2.12.7.1, 2.13.4.2 |
| com.fasterxml.jackson.core:jackson-databind | CVE-2022-42004 | 🚨 HIGH | 2.12.7 | 2.12.7.1, 2.13.4 |
| com.google.code.gson:gson | CVE-2022-25647 | 🚨 HIGH | 2.2.4 | 2.8.9 |
| com.google.protobuf:protobuf-java | CVE-2021-22569 | 🚨 HIGH | 3.3.0 | 3.16.1, 3.18.2, 3.19.2 |
| com.google.protobuf:protobuf-java | CVE-2022-3509 | 🚨 HIGH | 3.3.0 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2022-3510 | 🚨 HIGH | 3.3.0 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2024-7254 | 🚨 HIGH | 3.3.0 | 3.25.5, 4.27.5, 4.28.2 |
| com.google.protobuf:protobuf-java | CVE-2021-22569 | 🚨 HIGH | 3.7.1 | 3.16.1, 3.18.2, 3.19.2 |
| com.google.protobuf:protobuf-java | CVE-2022-3509 | 🚨 HIGH | 3.7.1 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2022-3510 | 🚨 HIGH | 3.7.1 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2024-7254 | 🚨 HIGH | 3.7.1 | 3.25.5, 4.27.5, 4.28.2 |
| com.nimbusds:nimbus-jose-jwt | CVE-2023-52428 | 🚨 HIGH | 9.8.1 | 9.37.2 |
| com.squareup.okhttp3:okhttp | CVE-2021-0341 | 🚨 HIGH | 3.12.12 | 4.9.2 |
| commons-beanutils:commons-beanutils | CVE-2025-48734 | 🚨 HIGH | 1.9.4 | 1.11.0 |
| commons-io:commons-io | CVE-2024-47554 | 🚨 HIGH | 2.8.0 | 2.14.0 |
| dnsjava:dnsjava | CVE-2024-25638 | 🚨 HIGH | 2.1.7 | 3.6.0 |
| io.netty:netty-codec-http2 | CVE-2025-55163 | 🚨 HIGH | 4.1.96.Final | 4.2.4.Final, 4.1.124.Final |
| io.netty:netty-codec-http2 | GHSA-xpw8-rcwv-8f8p | 🚨 HIGH | 4.1.96.Final | 4.1.100.Final |
| io.netty:netty-handler | CVE-2025-24970 | 🚨 HIGH | 4.1.96.Final | 4.1.118.Final |
| net.minidev:json-smart | CVE-2021-31684 | 🚨 HIGH | 1.3.2 | 1.3.3, 2.4.4 |
| net.minidev:json-smart | CVE-2023-1370 | 🚨 HIGH | 1.3.2 | 2.4.9 |
| org.apache.avro:avro | CVE-2024-47561 | 🔥 CRITICAL | 1.7.7 | 1.11.4 |
| org.apache.avro:avro | CVE-2023-39410 | 🚨 HIGH | 1.7.7 | 1.11.3 |
| org.apache.derby:derby | CVE-2022-46337 | 🔥 CRITICAL | 10.14.2.0 | 10.14.3, 10.15.2.1, 10.16.1.2, 10.17.1.0 |
| org.apache.ivy:ivy | CVE-2022-46751 | 🚨 HIGH | 2.5.1 | 2.5.2 |
| org.apache.mesos:mesos | CVE-2018-1330 | 🚨 HIGH | 1.4.3 | 1.6.0 |
| org.apache.thrift:libthrift | CVE-2019-0205 | 🚨 HIGH | 0.12.0 | 0.13.0 |
| org.apache.thrift:libthrift | CVE-2020-13949 | 🚨 HIGH | 0.12.0 | 0.14.0 |
| org.apache.zookeeper:zookeeper | CVE-2023-44981 | 🔥 CRITICAL | 3.6.3 | 3.7.2, 3.8.3, 3.9.1 |
| org.eclipse.jetty:jetty-server | CVE-2024-13009 | 🚨 HIGH | 9.4.56.v20240826 | 9.4.57.v20241219 |
| org.lz4:lz4-java | CVE-2025-12183 | 🚨 HIGH | 1.8.0 | 1.8.1 |
🛡️ TRIVY SCAN RESULT 🛡️
Target: Node.js
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: Python
Vulnerabilities (9)
| Package | Vulnerability ID | Severity | Installed Version | Fixed Version |
|---|---|---|---|---|
| apache-airflow | CVE-2025-68438 | 🚨 HIGH | 3.1.5 | 3.1.6 |
| apache-airflow | CVE-2025-68675 | 🚨 HIGH | 3.1.5 | 3.1.6 |
| cryptography | CVE-2026-26007 | 🚨 HIGH | 42.0.8 | 46.0.5 |
| jaraco.context | CVE-2026-23949 | 🚨 HIGH | 6.0.1 | 6.1.0 |
| starlette | CVE-2025-62727 | 🚨 HIGH | 0.48.0 | 0.49.1 |
| urllib3 | CVE-2025-66418 | 🚨 HIGH | 1.26.20 | 2.6.0 |
| urllib3 | CVE-2025-66471 | 🚨 HIGH | 1.26.20 | 2.6.0 |
| urllib3 | CVE-2026-21441 | 🚨 HIGH | 1.26.20 | 2.6.3 |
| wheel | CVE-2026-24049 | 🚨 HIGH | 0.45.1 | 0.46.2 |
🛡️ TRIVY SCAN RESULT 🛡️
Target: /etc/ssl/private/ssl-cert-snakeoil.key
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: /ingestion/pipelines/extended_sample_data.yaml
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: /ingestion/pipelines/lineage.yaml
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: /ingestion/pipelines/sample_data.json
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: /ingestion/pipelines/sample_data.yaml
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: /ingestion/pipelines/sample_data_aut.yaml
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: /ingestion/pipelines/sample_usage.json
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: /ingestion/pipelines/sample_usage.yaml
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: /ingestion/pipelines/sample_usage_aut.yaml
No Vulnerabilities Found
```python
recognizer_registry = RecognizerRegistry(
    recognizers=recognizers, supported_languages=supported_languages
)
effective_nlp = nlp_engine if nlp_engine is not None else self._nlp_engine
```
⚠️ Bug: nlp_engine=None sentinel can't disable NLP engine in "any" mode
The build_analyzer_with method uses None as the default for nlp_engine, intending it as a sentinel meaning "use self._nlp_engine":
`effective_nlp = nlp_engine if nlp_engine is not None else self._nlp_engine`

In the "any" mode path (`_analyze_with`, line 164), the code calls `self.build_analyzer_with(lang_recognizers, nlp_engine=None)` intending to pass no NLP engine to AnalyzerEngine. However, because None is the sentinel for "use default", effective_nlp falls back to self._nlp_engine.
This means in "any" mode, the AnalyzerEngine is still built with the NLP engine that was loaded with lang_code="any" (from load_nlp_engine(classification_language=ClassificationLanguage.any) in tag_processor.py). When analyzer.analyze(value, language="en") is called, Presidio will try to use the NLP engine whose supported language is "any" — not "en" — which will likely cause a language mismatch error at runtime.
The tests don't catch this because mock_nlp_engine = MagicMock() silently accepts any call. The tests only verify that build_analyzer_with is called with nlp_engine=None, not what AnalyzerEngine actually receives.
Fix: Use a sentinel object instead of None to distinguish "no argument given" from "explicitly no NLP engine":

```python
_UNSET = object()

def build_analyzer_with(
    self,
    recognizers: list[EntityRecognizer],
    nlp_engine: Optional[NlpEngine] = _UNSET,
) -> AnalyzerEngine:
    ...
    effective_nlp = self._nlp_engine if nlp_engine is _UNSET else nlp_engine
    ...
```

This allows `build_analyzer_with(recs, nlp_engine=None)` to actually pass None to AnalyzerEngine.
Suggested fix:

```python
_UNSET_NLP = object()

def build_analyzer_with(
    self,
    recognizers: list[EntityRecognizer],
    nlp_engine: Optional[NlpEngine] = _UNSET_NLP,
) -> AnalyzerEngine:
    supported_languages = [rec.supported_language for rec in recognizers]
    recognizer_registry = RecognizerRegistry(
        recognizers=recognizers, supported_languages=supported_languages
    )
    effective_nlp = self._nlp_engine if nlp_engine is _UNSET_NLP else nlp_engine
    return AnalyzerEngine(
        registry=recognizer_registry,
        nlp_engine=effective_nlp,
        supported_languages=supported_languages,
    )
```
This has been changed so that we always have an NLP recognizer. The previous implementation would have broken if a single spaCy model was found in the recognizer set.
Thanks for the clarification! I understand the design decision now - maintaining an NLP recognizer prevents issues when spaCy models are present in the recognizer set.
Details
Your approach ensures that even in "any" mode, if there are recognizers that expect an NLP engine, they'll have one available. This prevents runtime errors that would occur if pattern-only recognizers were mixed with NLP-dependent recognizers and no NLP engine was provided.
The current implementation with nlp_engine if nlp_engine is not None else self._nlp_engine at tag_analyzer.py:128 achieves this by always providing an NLP engine (either the explicitly passed one or the instance default), which is the intended behavior based on your explanation.
```python
tag_analyzer.build_analyzer_with = tracking_build
tag_analyzer.analyze_content([test_value])

assert all(e is None for e in captured_nlp_engines)
```
⚠️ Bug: Tests verify wrong thing — same assertion for "any" and specific modes
test_any_language_passes_nlp_engine_as_none (line 232) and test_specific_language_uses_nlp_engine (line 256) both assert the exact same thing: captured_nlp_engines[0] is None. The tests only verify what argument is passed to build_analyzer_with, not what AnalyzerEngine actually receives.
Since both modes call build_analyzer_with with nlp_engine=None (the "any" path explicitly, the specific path via the default), and build_analyzer_with treats None as "use default self._nlp_engine" for both, these tests can't distinguish between the two modes. The "any" mode test passes but doesn't verify the actual intended behavior (no NLP engine).
Additionally, mock_nlp_engine = MagicMock() means no real NLP engine behavior is exercised, masking the lang-code mismatch that would occur in production.
Suggestion: After fixing the sentinel issue in build_analyzer_with, update the test to verify the actual AnalyzerEngine was built without an NLP engine — e.g., by mocking AnalyzerEngine or inspecting what it receives. At minimum, test_specific_language_uses_nlp_engine should assert captured_nlp_engines[0] is mock_nlp_engine (not None) to properly differentiate the two modes.
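The way a bare `MagicMock` masks interface mismatches can be demonstrated on its own. `NlpEngine` here is a stand-in class, not the real Presidio one; the point is that an autospec'd mock rejects calls the real object could not accept:

```python
from unittest.mock import MagicMock, create_autospec

# Stand-in for the real engine class; its analyze() takes no `language` kwarg.
class NlpEngine:
    def analyze(self, text: str) -> list:
        return []

# A bare MagicMock silently accepts any call, so a bad call site goes unnoticed.
loose = MagicMock()
loose.analyze("value", language="any")  # no error, even though the signature is wrong

# An autospec'd mock validates calls against the real signature.
strict = create_autospec(NlpEngine, instance=True)
try:
    strict.analyze("value", language="any")
    raised = False
except TypeError:
    raised = True
print(raised)  # True: the spec'd mock rejects the unexpected keyword
```

Using `create_autospec` (or `spec=`) in these tests would surface the language-mismatch call path instead of hiding it.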
```python
LANGUAGE_MODEL_MAPPING = defaultdict(
    lambda: SPACY_MULTILANG_MODEL,
    {
        ClassificationLanguage.any: SPACY_EN_MODEL,
```
⚠️ Bug: Mapping any to English model contradicts PR description
The PR description states: "The multilingual spaCy model (xx_ent_wiki_sm) is mapped to ClassificationLanguage.any in LANGUAGE_MODEL_MAPPING". However, the code maps ClassificationLanguage.any to SPACY_EN_MODEL (en_core_web_md), not to SPACY_MULTILANG_MODEL (xx_ent_wiki_sm).
When load_nlp_engine(classification_language=ClassificationLanguage.any) is called from tag_processor.py, it creates SpacyNlpEngine(models=[{"lang_code": "any", "model_name": "en_core_web_md"}]). This has two issues:
- Wrong model: The English model is loaded instead of the multilingual model, contradicting the stated design.
- Invalid lang_code: `"any"` is not a valid ISO 639-1 code. `load_nlp_engine` sets `supported_language = classification_language.value`, which is `"any"`. When spaCy/Presidio tries to use this NLP engine with a real language code like `"en"` or `"fr"`, the mismatch may cause errors.
Since "any" mode uses nlp_engine=None in the analyzer (or at least intends to — see related sentinel bug), the NLP engine loaded here is only used as a fallback. But given the sentinel bug, it IS actually used, making this mapping impactful.
Suggestion: If "any" mode truly shouldn't use an NLP engine, consider skipping NLP engine loading entirely for any in the caller. Otherwise, map to the multilingual model as the PR description states.
Suggested fix:

```python
        ClassificationLanguage.any: SPACY_MULTILANG_MODEL,
```
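The defaultdict-with-factory behavior that the suggested mapping relies on can be sketched independently of the real constants. The enum members and model-name strings below mirror the ones discussed but are stand-ins, not the project's actual definitions:

```python
from collections import defaultdict
from enum import Enum

# Assumed stand-ins for the spaCy model constants discussed above.
SPACY_MULTILANG_MODEL = "xx_ent_wiki_sm"
SPACY_EN_MODEL = "en_core_web_md"

class ClassificationLanguage(str, Enum):
    any = "any"
    en = "en"
    fr = "fr"

# Explicit entries win; any key NOT present falls back to the factory,
# i.e. the multilingual model. Under the suggested fix, `any` has no
# English override, so it resolves to the multilingual model too.
LANGUAGE_MODEL_MAPPING = defaultdict(
    lambda: SPACY_MULTILANG_MODEL,
    {ClassificationLanguage.en: SPACY_EN_MODEL},
)

print(LANGUAGE_MODEL_MAPPING[ClassificationLanguage.en])   # en_core_web_md
print(LANGUAGE_MODEL_MAPPING[ClassificationLanguage.fr])   # xx_ent_wiki_sm (factory)
print(LANGUAGE_MODEL_MAPPING[ClassificationLanguage.any])  # xx_ent_wiki_sm (factory)
```

This is why the reviewed bug matters: an explicit `ClassificationLanguage.any: SPACY_EN_MODEL` entry shadows the multilingual factory default.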
bdee77d to ff887a6
🔍 CI failure analysis for c0fe9db: Four test suites failed: 2 Java integration suites with infrastructure errors (99.95% pass rates), 1 Python suite with backend connectivity failures (98% pass rate), and 1 Maven SonarCloud suite with 7 DataProduct test failures (99.91% pass rate). All failures are pre-existing infrastructure/backend service issues unrelated to the PII classification or Athena profiling changes.

CI Status After Latest Push

1-2. Java Integration Tests: 8 errors total
3. Python 3.11 Integration Tests: backend connectivity failures
4. Maven SonarCloud CI (job 63756220752): 7 DataProduct test failures

Maven Test Failures

DataProductResourceTest: 6 confirmed failures, identical to previous runs.

Root Cause

Maven SonarCloud test failures stem from pre-existing DataProduct bugs (documented in previous analysis).

Other Test Suite Failures

Java Integration Tests: infrastructure issues (OpenSearch, pipelines, workflows).

Why This Is Unrelated to PR Changes

This PR touches the PII classification and Athena profiling code paths; the failing suites (DataProduct resources, Java integration infrastructure, Python backend connectivity) do not exercise those paths.

Pattern Analysis

Comparison with previous Maven runs shows the same DataProduct failures recurring, predating this PR.

Details

Test Execution Summary

- Maven SonarCloud CI (job 63756220752): 7 DataProduct test failures, 99.91% pass rate
- Java Integration Tests: 8 errors total, 99.95% pass rates
- Python 3.11 Integration Tests: backend connectivity failures, 98% pass rate

Why Failures Are Not PR-Related

Maven, Java integration, and Python failures all trace to pre-existing infrastructure or backend service issues.

Recommendation

- For this PR: the failures are unrelated and should not block the change.
- Maven SonarCloud (DataProduct tests): track the pre-existing DataProduct bugs separately.
- Java Integration Tests: investigate infrastructure (OpenSearch, pipelines, workflows).
- Python Integration Tests: investigate backend connectivity.

Conclusion: all failures are pre-existing infrastructure/backend issues unrelated to this PR.

Code Review
This is not related to my PR.
Failed to cherry-pick changes to the 1.11.10 branch.
* Update classification languages to support `any`
* Run analyzer for different languages
* Update generated TypeScript types
* Apply comments from Gitar

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Pere Miquel Brull <peremiquelbrull@gmail.com>



Summary

- Adds `"any"` as a valid value for `classificationLanguage` in the auto-classification pipeline configuration
- When `any` is selected, all recognizers are included regardless of their configured `supportedLanguage`, rather than being filtered to a single language
- Analysis is dispatched per language so that calls stay within Presidio's `analyze()` contract
- The multilingual spaCy model (`xx_ent_wiki_sm`) is mapped to `ClassificationLanguage.any` in `LANGUAGE_MODEL_MAPPING`

Changes

- `classificationLanguages.json`: added `"any"` to the enum (first position); updated description
- `databaseServiceAutoClassificationPipeline.json`: updated `classificationLanguage` description
- `classificationLanguages.py`: added `any = 'any'` to the `ClassificationLanguage` enum
- `constants.py`: added the `ClassificationLanguage.any → xx_ent_wiki_sm` mapping
- `tag_analyzer.py`: `get_recognizers_by()` returns all recognizers when `any`; new `_analyze_with()` with per-language dispatch; `build_analyzer_with()` accepts an optional `nlp_engine` override
- `test_tag_analyzer_any_language.py`: new tests

Test plan

- Tests in `test_tag_analyzer_any_language.py` pass
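The per-language dispatch described in the changes can be sketched abstractly. `Recognizer` and `group_by_language` below are illustrative stand-ins, not the Presidio or OpenMetadata classes; the sketch only shows the grouping step that precedes building one analyzer per language:

```python
from collections import defaultdict

# Stand-in for a Presidio EntityRecognizer: each recognizer declares the
# single language it supports.
class Recognizer:
    def __init__(self, name: str, supported_language: str):
        self.name = name
        self.supported_language = supported_language

def group_by_language(recognizers):
    """Group recognizers by supported language, one bucket per language.

    In "any" mode, every bucket would then get its own analyzer and be
    invoked with its own language code, honoring analyze()'s contract.
    """
    groups = defaultdict(list)
    for rec in recognizers:
        groups[rec.supported_language].append(rec)
    return dict(groups)

recs = [
    Recognizer("email", "en"),
    Recognizer("iban", "en"),
    Recognizer("nif", "es"),
]
groups = group_by_language(recs)
print(sorted(groups))     # ['en', 'es']
print(len(groups["en"]))  # 2
```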