
Fix #25757: CSV Parsing When Escape Character #25778

Merged
ulixius9 merged 1 commit into main from issue-25757
Feb 10, 2026

Conversation

@keshavmohta09
Member

@keshavmohta09 keshavmohta09 commented Feb 9, 2026

Fixes #25757

I fixed a CSV parsing issue related to escape characters by handling backslash-escaped quotes through the escapechar parameter of pandas' read_csv method.
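
For context, a simplified sketch of the updated call (hypothetical surrounding code; the real reader in dsv.py also passes the separator, chunksize, storage options, and compression through):

import pandas as pd

# Simplified, hypothetical sketch of the change; only the two new keyword
# arguments are what this PR adds.
def read_dsv(key, separator=","):
    return pd.read_csv(
        key,
        sep=separator,
        encoding_errors="ignore",
        escapechar="\\",   # treat backslash as the escape character so \" inside quoted fields parses
        engine="python",   # the PR also switches parsing to the Python engine
    )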


Summary by Gitar

  • Fixed CSV parsing:
    • Added escapechar="\\" and engine="python" to pd.read_csv() in dsv.py:143-144 to handle backslash-escaped quotes (\") in CSV fields
  • New test coverage:
  • Added 3 test methods in test_dsv_reader.py validating standard escaping, mixed \" and "" styles, and edge cases with newlines (see the sketch below)

This will update automatically on new commits.
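
A minimal illustration of the kind of case those tests cover (hypothetical data, not the actual test file):

import io
import pandas as pd

def test_backslash_escaped_quote_parses():
    # A quoted field that escapes its inner quotes with a backslash instead of doubling them.
    csv_text = 'name,comment\nalice,"said \\"hi\\", then left"\n'
    df = pd.read_csv(io.StringIO(csv_text), escapechar="\\", engine="python")
    assert df["comment"][0] == 'said "hi", then left'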


storage_options=storage_options,
compression=compression,
encoding_errors="ignore",
escapechar="\\",

⚠️ Bug: Global escapechar="\\" may silently corrupt data with backslashes

Setting escapechar="\\" globally for all CSV/TSV parsing means that any file containing literal backslashes in quoted fields will have those backslashes silently consumed as escape characters.

Example: A CSV file with "C:\Users\data.txt" (single backslashes, common in Windows paths) would be parsed as C:Usersdata.txt — the backslashes are silently stripped because \U and \d are interpreted as escape sequences producing just U and d.
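
A minimal reproduction of that behavior (hypothetical data):

import io
import pandas as pd

# The raw file contains single backslashes: "C:\Users\data.txt"
csv_text = 'path\n"C:\\Users\\data.txt"\n'
df = pd.read_csv(io.StringIO(csv_text), escapechar="\\", engine="python")
print(df["path"][0])   # C:Usersdata.txt  (backslashes silently consumed as escapes)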

The PR's own test uses doubled backslashes (C:\\Users\\data.txt in the file) to work around this, but real-world CSV exports from Windows tools typically contain single backslashes.

This is a global behavioral change that affects every CSV/TSV file processed through this reader, not just files that actually use \" escaping. Files that previously parsed correctly with literal backslashes will now silently produce corrupted data.

Suggested fix: Instead of applying escapechar globally, consider one of:

  1. A retry-based approach: first try parsing without escapechar, and if parsing fails (e.g., ParserError), retry with escapechar="\\" and engine="python" (see the sketch after this list).
  2. Make escapechar configurable via the reader's constructor or connection config, so users can opt-in when they know their files use backslash escaping.
  3. Pre-process the file to detect whether \" escaping is present before choosing the parsing strategy.
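
A minimal sketch of the retry-based option, assuming the input can be re-read (e.g., a path or key rather than an already-consumed stream); read_csv_with_fallback and read_kwargs are illustrative names, not the reader's actual API:

import pandas as pd
from pandas.errors import ParserError

def read_csv_with_fallback(path_or_key, **read_kwargs):
    try:
        # Fast path: default C engine, standard "" quote doubling.
        return pd.read_csv(path_or_key, **read_kwargs)
    except ParserError:
        # Likely a file using \" escaping inside quoted fields; retry with an
        # explicit escape character and the Python engine.
        return pd.read_csv(path_or_key, escapechar="\\", engine="python", **read_kwargs)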


compression=compression,
encoding_errors="ignore",
escapechar="\\",
engine="python",

⚠️ Performance: Forced engine="python" degrades CSV parsing performance globally

Switching from the default C engine to the Python engine for all CSV/TSV parsing introduces a significant performance regression. The pandas Python engine is typically 5-20x slower than the C engine for CSV parsing.

This change affects every CSV and TSV file processed by OpenMetadata's ingestion framework (local, S3, GCS, Azure), including files that don't use backslash escaping at all. For large data lake files or high-volume ingestion pipelines, this could meaningfully increase processing time.

Suggested fix: As mentioned in the data-correctness finding, a retry-based or detection-based approach would avoid imposing the Python engine penalty on files that don't need it. Alternatively, if the Python engine must be used, this trade-off should be explicitly documented and benchmarked against representative workloads.
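
A rough sketch of the detection-based alternative mentioned in that finding (hypothetical helper; the 64 KB sample size is an arbitrary illustration):

def choose_read_kwargs(raw_text: str, sample_bytes: int = 65536) -> dict:
    # Heuristic: only pay the escapechar/Python-engine cost when backslash-escaped
    # quotes actually appear in a leading sample of the file.
    sample = raw_text[:sample_bytes]
    if '\\"' in sample:
        return {"escapechar": "\\", "engine": "python"}
    return {}  # default: fast C engine with standard "" quote doubling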


@gitar-bot

gitar-bot bot commented Feb 9, 2026

🔍 CI failure analysis for 78340be: Consistent pre-existing CI failures confirmed across 5+ runs: (1) Flaky Playwright test, (2) Deterministic pytest fixture scope bug. All CSV parsing tests pass in every run.

Summary

Consistent pre-existing test infrastructure bugs confirmed across all CI runs and Python versions, unrelated to CSV parsing changes.

Failure 1: Playwright E2E Test

Flaky test in CustomizeWidgets.spec.ts:316:1 › KPI Widget @sample-data: browser crash and element-visibility timeout. Not related to the Python CSV parsing change.

Failure 2: Python Integration Tests - CONFIRMED ACROSS ALL RUNS

Issue

7 test setup errors in test_auto_classification_workflow appearing consistently across:

  • All job runs analyzed: 63019525466, 63019526156, 63019527046, 63019527062, 63019524530 (latest)
  • Both Python 3.10 and 3.11 versions
  • Same exact error in every run - confirming this is a pre-existing deterministic bug

Root Cause

Pytest fixture scope configuration bug in test infrastructure:

ScopeMismatch: You tried to access the function scoped fixture caplog 
with a module scoped request object

Location: tests/integration/trino/test_classifier.py:67

Module-scoped run_classifier fixture incorrectly references function-scoped caplog fixture.
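
A minimal reconstruction of that kind of mismatch (hypothetical fixture body; the real fixture lives in tests/integration/trino/test_classifier.py):

import pytest

# A module-scoped fixture requesting the function-scoped built-in caplog fixture:
# pytest refuses this combination and raises ScopeMismatch during setup.
@pytest.fixture(scope="module")
def run_classifier(caplog):
    ...

# One possible fix is to drop scope="module" (falling back to function scope) so
# the dependency on caplog becomes legal, or to stop requesting caplog here.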

Consistent Test Results

Across all runs (including latest 63019524530):

  • ✅ Unit tests: 3763 passed (including new CSV parsing tests)
  • ✅ Integration tests: 530 passed
  • 7 errors: All from Trino classifier fixture scope bug
  • All CSV parsing tests pass in every single run

Additional Infrastructure Issues

  • Cassandra connection refused
  • Trino client failures
  • AWS/S3 metrics test issues

Relationship to PR

None. This PR only modifies:

  • ingestion/src/metadata/readers/dataframe/dsv.py (CSV reader)
  • ingestion/tests/unit/readers/test_dsv_reader.py (unit tests)

Failures are in unrelated Trino integration test fixtures.

Conclusion

Both failures are pre-existing issues confirmed across ALL CI runs and Python versions. CSV parsing functionality works correctly in every scenario. These CI failures should not block this PR.

Recommendation: This is the 5th consecutive run showing the same pre-existing bug. The PR is ready to merge despite these unrelated infrastructure issues.

Code Review: ⚠️ Changes requested (0 resolved / 2 findings)

The fix for backslash-escaped quotes in CSV parsing applies escapechar and Python engine globally, which risks silently corrupting data containing literal backslashes and degrading parsing performance for all files. Consider a more targeted approach (retry-based or configurable).

⚠️ Bug: Global escapechar="\\" may silently corrupt data with backslashes

📄 ingestion/src/metadata/readers/dataframe/dsv.py:143

Setting escapechar="\\" globally for all CSV/TSV parsing means that any file containing literal backslashes in quoted fields will have those backslashes silently consumed as escape characters.

Example: A CSV file with "C:\Users\data.txt" (single backslashes, common in Windows paths) would be parsed as C:Usersdata.txt — the backslashes are silently stripped because \U and \d are interpreted as escape sequences producing just U and d.

The PR's own test uses doubled backslashes (C:\\Users\\data.txt in the file) to work around this, but real-world CSV exports from Windows tools typically contain single backslashes.

This is a global behavioral change that affects every CSV/TSV file processed through this reader, not just files that actually use \" escaping. Files that previously parsed correctly with literal backslashes will now silently produce corrupted data.

Suggested fix: Instead of applying escapechar globally, consider one of:

  1. A retry-based approach: first try parsing without escapechar, and if parsing fails (e.g., ParserError), retry with escapechar="\\" and engine="python".
  2. Make escapechar configurable via the reader's constructor or connection config, so users can opt-in when they know their files use backslash escaping.
  3. Pre-process the file to detect whether \" escaping is present before choosing the parsing strategy.

⚠️ Performance: Forced engine="python" degrades CSV parsing performance globally

📄 ingestion/src/metadata/readers/dataframe/dsv.py:144

Switching from the default C engine to the Python engine for all CSV/TSV parsing introduces a significant performance regression. The pandas Python engine is typically 5-20x slower than the C engine for CSV parsing.

This change affects every CSV and TSV file processed by OpenMetadata's ingestion framework (local, S3, GCS, Azure), including files that don't use backslash escaping at all. For large data lake files or high-volume ingestion pipelines, this could meaningfully increase processing time.

Suggested fix: As mentioned in the data-correctness finding, a retry-based or detection-based approach would avoid imposing the Python engine penalty on files that don't need it. Alternatively, if the Python engine must be used, this trade-off should be explicitly documented and benchmarked against representative workloads.


@github-actions
Contributor

github-actions bot commented Feb 9, 2026

🛡️ TRIVY SCAN RESULT 🛡️

Target: openmetadata-ingestion-base-slim:trivy (debian 12.13)

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Java

Vulnerabilities (33)

Package Vulnerability ID Severity Installed Version Fixed Version
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.12.7 2.15.0
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.13.4 2.15.0
com.fasterxml.jackson.core:jackson-databind CVE-2022-42003 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4.2
com.fasterxml.jackson.core:jackson-databind CVE-2022-42004 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4
com.google.code.gson:gson CVE-2022-25647 🚨 HIGH 2.2.4 2.8.9
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.3.0 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.3.0 3.25.5, 4.27.5, 4.28.2
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.7.1 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.7.1 3.25.5, 4.27.5, 4.28.2
com.nimbusds:nimbus-jose-jwt CVE-2023-52428 🚨 HIGH 9.8.1 9.37.2
com.squareup.okhttp3:okhttp CVE-2021-0341 🚨 HIGH 3.12.12 4.9.2
commons-beanutils:commons-beanutils CVE-2025-48734 🚨 HIGH 1.9.4 1.11.0
commons-io:commons-io CVE-2024-47554 🚨 HIGH 2.8.0 2.14.0
dnsjava:dnsjava CVE-2024-25638 🚨 HIGH 2.1.7 3.6.0
io.netty:netty-codec-http2 CVE-2025-55163 🚨 HIGH 4.1.96.Final 4.2.4.Final, 4.1.124.Final
io.netty:netty-codec-http2 GHSA-xpw8-rcwv-8f8p 🚨 HIGH 4.1.96.Final 4.1.100.Final
io.netty:netty-handler CVE-2025-24970 🚨 HIGH 4.1.96.Final 4.1.118.Final
net.minidev:json-smart CVE-2021-31684 🚨 HIGH 1.3.2 1.3.3, 2.4.4
net.minidev:json-smart CVE-2023-1370 🚨 HIGH 1.3.2 2.4.9
org.apache.avro:avro CVE-2024-47561 🔥 CRITICAL 1.7.7 1.11.4
org.apache.avro:avro CVE-2023-39410 🚨 HIGH 1.7.7 1.11.3
org.apache.derby:derby CVE-2022-46337 🔥 CRITICAL 10.14.2.0 10.14.3, 10.15.2.1, 10.16.1.2, 10.17.1.0
org.apache.ivy:ivy CVE-2022-46751 🚨 HIGH 2.5.1 2.5.2
org.apache.mesos:mesos CVE-2018-1330 🚨 HIGH 1.4.3 1.6.0
org.apache.thrift:libthrift CVE-2019-0205 🚨 HIGH 0.12.0 0.13.0
org.apache.thrift:libthrift CVE-2020-13949 🚨 HIGH 0.12.0 0.14.0
org.apache.zookeeper:zookeeper CVE-2023-44981 🔥 CRITICAL 3.6.3 3.7.2, 3.8.3, 3.9.1
org.eclipse.jetty:jetty-server CVE-2024-13009 🚨 HIGH 9.4.56.v20240826 9.4.57.v20241219
org.lz4:lz4-java CVE-2025-12183 🚨 HIGH 1.8.0 1.8.1

🛡️ TRIVY SCAN RESULT 🛡️

Target: Node.js

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Python

Vulnerabilities (8)

Package Vulnerability ID Severity Installed Version Fixed Version
apache-airflow CVE-2025-68438 🚨 HIGH 3.1.5 3.1.6
apache-airflow CVE-2025-68675 🚨 HIGH 3.1.5 3.1.6
jaraco.context CVE-2026-23949 🚨 HIGH 6.0.1 6.1.0
starlette CVE-2025-62727 🚨 HIGH 0.48.0 0.49.1
urllib3 CVE-2025-66418 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2025-66471 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2026-21441 🚨 HIGH 1.26.20 2.6.3
wheel CVE-2026-24049 🚨 HIGH 0.45.1 0.46.2

🛡️ TRIVY SCAN RESULT 🛡️

Target: /etc/ssl/private/ssl-cert-snakeoil.key

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/extended_sample_data.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/lineage.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data.json

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data_aut.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage.json

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage_aut.yaml

No Vulnerabilities Found

@github-actions
Contributor

github-actions bot commented Feb 9, 2026

🛡️ TRIVY SCAN RESULT 🛡️

Target: openmetadata-ingestion:trivy (debian 12.12)

Vulnerabilities (4)

Package Vulnerability ID Severity Installed Version Fixed Version
libpam-modules CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2
libpam-modules-bin CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2
libpam-runtime CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2
libpam0g CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2

🛡️ TRIVY SCAN RESULT 🛡️

Target: Java

Vulnerabilities (33)

Package Vulnerability ID Severity Installed Version Fixed Version
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.12.7 2.15.0
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.13.4 2.15.0
com.fasterxml.jackson.core:jackson-databind CVE-2022-42003 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4.2
com.fasterxml.jackson.core:jackson-databind CVE-2022-42004 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4
com.google.code.gson:gson CVE-2022-25647 🚨 HIGH 2.2.4 2.8.9
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.3.0 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.3.0 3.25.5, 4.27.5, 4.28.2
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.7.1 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.7.1 3.25.5, 4.27.5, 4.28.2
com.nimbusds:nimbus-jose-jwt CVE-2023-52428 🚨 HIGH 9.8.1 9.37.2
com.squareup.okhttp3:okhttp CVE-2021-0341 🚨 HIGH 3.12.12 4.9.2
commons-beanutils:commons-beanutils CVE-2025-48734 🚨 HIGH 1.9.4 1.11.0
commons-io:commons-io CVE-2024-47554 🚨 HIGH 2.8.0 2.14.0
dnsjava:dnsjava CVE-2024-25638 🚨 HIGH 2.1.7 3.6.0
io.netty:netty-codec-http2 CVE-2025-55163 🚨 HIGH 4.1.96.Final 4.2.4.Final, 4.1.124.Final
io.netty:netty-codec-http2 GHSA-xpw8-rcwv-8f8p 🚨 HIGH 4.1.96.Final 4.1.100.Final
io.netty:netty-handler CVE-2025-24970 🚨 HIGH 4.1.96.Final 4.1.118.Final
net.minidev:json-smart CVE-2021-31684 🚨 HIGH 1.3.2 1.3.3, 2.4.4
net.minidev:json-smart CVE-2023-1370 🚨 HIGH 1.3.2 2.4.9
org.apache.avro:avro CVE-2024-47561 🔥 CRITICAL 1.7.7 1.11.4
org.apache.avro:avro CVE-2023-39410 🚨 HIGH 1.7.7 1.11.3
org.apache.derby:derby CVE-2022-46337 🔥 CRITICAL 10.14.2.0 10.14.3, 10.15.2.1, 10.16.1.2, 10.17.1.0
org.apache.ivy:ivy CVE-2022-46751 🚨 HIGH 2.5.1 2.5.2
org.apache.mesos:mesos CVE-2018-1330 🚨 HIGH 1.4.3 1.6.0
org.apache.thrift:libthrift CVE-2019-0205 🚨 HIGH 0.12.0 0.13.0
org.apache.thrift:libthrift CVE-2020-13949 🚨 HIGH 0.12.0 0.14.0
org.apache.zookeeper:zookeeper CVE-2023-44981 🔥 CRITICAL 3.6.3 3.7.2, 3.8.3, 3.9.1
org.eclipse.jetty:jetty-server CVE-2024-13009 🚨 HIGH 9.4.56.v20240826 9.4.57.v20241219
org.lz4:lz4-java CVE-2025-12183 🚨 HIGH 1.8.0 1.8.1

🛡️ TRIVY SCAN RESULT 🛡️

Target: Node.js

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Python

Vulnerabilities (18)

Package Vulnerability ID Severity Installed Version Fixed Version
Werkzeug CVE-2024-34069 🚨 HIGH 2.2.3 3.0.3
aiohttp CVE-2025-69223 🚨 HIGH 3.12.12 3.13.3
aiohttp CVE-2025-69223 🚨 HIGH 3.13.2 3.13.3
apache-airflow CVE-2025-68438 🚨 HIGH 3.1.5 3.1.6
apache-airflow CVE-2025-68675 🚨 HIGH 3.1.5 3.1.6
azure-core CVE-2026-21226 🚨 HIGH 1.37.0 1.38.0
jaraco.context CVE-2026-23949 🚨 HIGH 5.3.0 6.1.0
jaraco.context CVE-2026-23949 🚨 HIGH 6.0.1 6.1.0
protobuf CVE-2026-0994 🚨 HIGH 4.25.8 6.33.5, 5.29.6
pyasn1 CVE-2026-23490 🚨 HIGH 0.6.1 0.6.2
python-multipart CVE-2026-24486 🚨 HIGH 0.0.20 0.0.22
ray CVE-2025-62593 🔥 CRITICAL 2.47.1 2.52.0
starlette CVE-2025-62727 🚨 HIGH 0.48.0 0.49.1
urllib3 CVE-2025-66418 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2025-66471 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2026-21441 🚨 HIGH 1.26.20 2.6.3
wheel CVE-2026-24049 🚨 HIGH 0.45.1 0.46.2
wheel CVE-2026-24049 🚨 HIGH 0.45.1 0.46.2

🛡️ TRIVY SCAN RESULT 🛡️

Target: /etc/ssl/private/ssl-cert-snakeoil.key

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /home/airflow/openmetadata-airflow-apis/openmetadata_managed_apis.egg-info/PKG-INFO

No Vulnerabilities Found

@ulixius9 ulixius9 merged commit b79fbeb into main Feb 10, 2026
28 of 47 checks passed
@ulixius9 ulixius9 deleted the issue-25757 branch February 10, 2026 12:17
@keshavmohta09
Member Author

Changes have been cherry-picked to the 1.11.9 branch (ref)


Labels

Ingestion, safe to test (Add this label to run secure GitHub workflows on PRs)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CSV Parsing Error When Escape Character Is Followed by a Comma

2 participants