
Conversation

@esteban (Collaborator) commented Oct 28, 2025

PySpark and PyDeequ have been required dependencies for S3, ABS, and Unity Catalog
sources, even when profiling is disabled. This creates unnecessary installation
overhead (~500MB) and potential dependency conflicts for users who don't need
profiling capabilities.

**PySpark Detection Framework**
- Added `pyspark_utils.py` with centralized availability detection (see the sketch below)
- Graceful fallback when PySpark/PyDeequ are unavailable
- Clear error messages guiding users to install dependencies when needed

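As a rough illustration of the centralized detection described above, a helper along these lines can gate profiling code paths. The names below are illustrative, not the exact `pyspark_utils.py` API:

```python
# Illustrative sketch of centralized PySpark/PyDeequ detection (names are
# hypothetical; see pyspark_utils.py for the actual implementation).
import importlib.util

# True only when both optional profiling dependencies are importable.
PYSPARK_AVAILABLE = (
    importlib.util.find_spec("pyspark") is not None
    and importlib.util.find_spec("pydeequ") is not None
)


def require_pyspark(feature: str = "profiling") -> None:
    """Fail fast with an actionable message when a profiling path runs without PySpark."""
    if not PYSPARK_AVAILABLE:
        raise ImportError(
            f"{feature} requires PySpark and PyDeequ. "
            "Install them with: pip install 'acryl-datahub[data-lake-profiling]'"
        )
```
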
**Modular Installation Options**
- S3/ABS/GCS sources now work without PySpark when profiling is disabled
- New `data-lake-profiling` extra for modular PySpark installation
- Convenience extras: `s3-profiling`, `gcs-profiling`, `abs-profiling`
- Unity Catalog gracefully falls back to sqlglot when PySpark is unavailable (fallback pattern sketched below)

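The graceful fallback mentioned above follows roughly the pattern below (a hypothetical helper, not the actual Unity Catalog source code): prefer a Spark-backed path when PySpark is importable, otherwise parse SQL with sqlglot.

```python
# Hypothetical illustration of the "fall back to sqlglot" pattern; the real
# Unity Catalog source is more involved than this.
import sqlglot
from sqlglot import exp

try:
    import pyspark  # noqa: F401  # imported only to detect availability
    PYSPARK_AVAILABLE = True
except ImportError:
    PYSPARK_AVAILABLE = False


def referenced_tables(sql: str) -> list[str]:
    """Collect table names from a query using sqlglot, which needs no JVM or Spark."""
    # A Spark-backed analysis could be preferred when PYSPARK_AVAILABLE is True;
    # the sqlglot path below is what slim installs rely on.
    statement = sqlglot.parse_one(sql, dialect="databricks")
    return sorted({t.sql(dialect="databricks") for t in statement.find_all(exp.Table)})


print(referenced_tables("SELECT o.id FROM main.sales.orders o JOIN customers c ON o.cid = c.id"))
```
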
**Config Validation**
- Added validators to S3/ABS configs that check PySpark availability at config time (see the sketch below)
- Validates profiling dependencies before attempting to use them
- Provides actionable error messages with installation instructions

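A hedged sketch of what such a config-time check can look like with pydantic-style validators. The class and field names are illustrative, not the real S3/ABS config classes, and note that a later update in this PR removes this validation again in favor of runtime errors:

```python
# Illustrative config-time check; not the actual DataHub S3/ABS config code.
import importlib.util

from pydantic import BaseModel, validator


class ProfilingConfig(BaseModel):
    enabled: bool = False


class DataLakeSourceConfig(BaseModel):
    profiling: ProfilingConfig = ProfilingConfig()

    @validator("profiling")
    def _profiling_requires_pyspark(cls, value: ProfilingConfig) -> ProfilingConfig:
        if value.enabled and importlib.util.find_spec("pyspark") is None:
            raise ValueError(
                "profiling.enabled=true requires PySpark/PyDeequ; "
                "install them with: pip install 'acryl-datahub[data-lake-profiling]'"
            )
        return value
```
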
**Installation Examples**
```bash
pip install 'acryl-datahub[s3]'

pip install 'acryl-datahub[s3,data-lake-profiling]'

pip install 'acryl-datahub[s3-profiling]'
```

**Dependencies**
- PySpark ~=3.5.6 (in `data-lake-profiling` extra)
- PyDeequ >=1.1.0 (data quality validation)

**Benefits**
- Reduced footprint: Base installs ~500MB smaller without PySpark
- Faster installs: No PySpark compilation for non-profiling users
- Better errors: Clear messages when profiling needs PySpark
- Flexibility: Users choose profiling support level
- Backward compatible: Existing installations continue working
  • Backward compatible: Existing installations continue working

**Testing**
- Added 46+ unit tests validating optional PySpark functionality
- Tests cover availability detection, config validation, and graceful fallbacks
- All existing tests continue to pass

See docs/PYSPARK.md for a detailed installation and usage guide.

@github-actions bot added the `ingestion` (PR or Issue related to the ingestion of metadata) and `docs` (Issues and Improvements to docs) labels on Oct 28, 2025.

@datahub-cyborg bot added the `needs-review` label (PRs that need review from a maintainer) on Oct 28, 2025.
@datahub-cyborg bot added the `pending-submitter-response` label (reviewed, awaiting a response from the submitter) and removed the `needs-review` label on Oct 28, 2025.

Flips the implementation to maintain backward compatibility while providing
lightweight installation options. S3, GCS, and ABS sources now include PySpark
by default, with new -slim variants for PySpark-less installations.

**Changes:**

1. **Setup.py - Default PySpark inclusion:**
   - `s3`, `gcs`, `abs` extras now include `data-lake-profiling` by default
   - New `s3-slim`, `gcs-slim`, `abs-slim` extras without PySpark (see the sketch after this list)
   - Ensures no breaking changes for existing users
   - Naming aligns with Docker image conventions (slim/full)

2. **Config validation removed:**
   - Removed PySpark dependency validation from S3/ABS config
   - Profiling failures now occur at runtime (not config time)
   - Maintains pre-PR behavior for backward compatibility

3. **Documentation updated:**
   - Updated PYSPARK.md to reflect new installation approach
   - Standard installation: pip install 'acryl-datahub[s3]' (with PySpark)
   - Lightweight installation: pip install 'acryl-datahub[s3-slim]' (no PySpark)
   - Added migration path note for future DataHub 2.0
   - Explained benefits for DataHub Actions with -slim variants

4. **Tests updated:**
   - Removed tests expecting validation failures without PySpark
   - Added tests confirming config accepts profiling without validation
   - All tests pass with new behavior

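A hypothetical excerpt of how the extras could be wired in setup.py under this scheme (dependency sets are abbreviated and this is not the verbatim DataHub setup.py):

```python
# Hypothetical setup.py excerpt; pins match the PR description, the rest is abbreviated.
data_lake_profiling = {
    "pyspark~=3.5.6",
    "pydeequ>=1.1.0",
}

s3_base = {"boto3"}  # plus the other non-Spark S3 dependencies (elided here)

extras_require = {
    # Default extras keep PySpark so existing installs see no change.
    "s3": sorted(s3_base | data_lake_profiling),
    # New -slim variants omit PySpark for lightweight installs.
    "s3-slim": sorted(s3_base),
    # Standalone extra for adding profiling to any data lake source.
    "data-lake-profiling": sorted(data_lake_profiling),
}
```
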
**Rationale:**

This approach provides:
- **Backward compatibility**: Existing users see no changes
- **Migration path**: Users can opt into -slim variants now
- **Future flexibility**: DataHub 2.0 can flip defaults to -slim
- **No breaking changes**: Maintains pre-PR functionality
- **Naming consistency**: Aligns with Docker slim/full convention

**Installation examples:**

```bash
pip install 'acryl-datahub[s3]'
pip install 'acryl-datahub[gcs]'
pip install 'acryl-datahub[abs]'

pip install 'acryl-datahub[s3-slim]'
pip install 'acryl-datahub[gcs-slim]'
pip install 'acryl-datahub[abs-slim]'
```
@datahub-cyborg bot added the `needs-review` label and removed the `pending-submitter-response` label on Oct 29, 2025.

…yment

Introduces slim and locked Docker image variants for both
datahub-ingestion and datahub-actions, targeting environments with different
PySpark requirements and security constraints.

**Image Variants**:

1. **Full (default)**: With PySpark, network enabled
   - Includes PySpark for data profiling
   - Can install packages from PyPI at runtime
   - Backward compatible with existing deployments

2. **Slim**: Without PySpark, network enabled
   - Excludes PySpark (~500MB smaller)
   - Uses s3-slim, gcs-slim, abs-slim for data lake sources
   - Can still install packages from PyPI if needed

3. **Locked** (NEW): Without PySpark, network BLOCKED
   - Excludes PySpark
   - Blocks ALL network access to PyPI/UV indexes
   - datahub-actions: ONLY bundled venvs, no main ingestion install
   - Most secure/restrictive variant for production

**Additional Changes**:

**1. pyspark_utils.py**: Fixed module-level exports
   - Added SparkSession, DataFrame, AnalysisRunBuilder, PandasDataFrame as None
   - These can now be imported even when PySpark is unavailable (pattern sketched below)
   - Prevents ImportError in s3-slim installations

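Roughly, the export fix follows the pattern below (illustrative only; the exact import locations and names in `pyspark_utils.py` may differ):

```python
# Illustrative module-level exports with None placeholders; exact imports in
# pyspark_utils.py may differ.
try:
    import pandas
    from pydeequ.analyzers import AnalysisRunBuilder
    from pyspark.sql import DataFrame, SparkSession

    PandasDataFrame = pandas.DataFrame
    PYSPARK_AVAILABLE = True
except ImportError:
    # Placeholders keep "from ... import SparkSession" working in s3-slim
    # installs; callers must check PYSPARK_AVAILABLE before using them.
    SparkSession = None
    DataFrame = None
    AnalysisRunBuilder = None
    PandasDataFrame = None
    PYSPARK_AVAILABLE = False
```
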
**2. setup.py**: Moved cachetools to s3_base
   - operation_config.py uses cachetools unconditionally
   - Now available in s3-slim without requiring data_lake_profiling

**3. build_bundled_venvs_unified.py**: Added slim_mode support
   - BUNDLED_VENV_SLIM_MODE flag controls which package extras are installed (see the sketch below)
   - When true: installs s3-slim, gcs-slim, abs-slim (no PySpark)
   - When false: installs s3, gcs, abs (with PySpark)
   - Venv is named {plugin}-bundled (e.g., s3-bundled) for executor compatibility

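A sketch of the slim_mode switch: the `BUNDLED_VENV_SLIM_MODE` flag comes from the change above, while the helper functions below are illustrative and not the script's real API.

```python
# Illustrative sketch of slim_mode handling in build_bundled_venvs_unified.py;
# helper names are hypothetical.
import os

SLIM_MODE = os.environ.get("BUNDLED_VENV_SLIM_MODE", "false").lower() == "true"
PLUGINS = ["s3", "gcs", "abs"]


def requirement_for(plugin: str) -> str:
    # Slim mode installs the -slim extra (no PySpark); full mode keeps the default extra.
    extra = f"{plugin}-slim" if SLIM_MODE else plugin
    return f"acryl-datahub[{extra}]"


def venv_name(plugin: str) -> str:
    # The venv name stays "<plugin>-bundled" in both modes (e.g. s3-bundled),
    # so the executor resolves it the same way regardless of variant.
    return f"{plugin}-bundled"


for plugin in PLUGINS:
    print(f"{venv_name(plugin)} -> {requirement_for(plugin)}")
```
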
**4. datahub-actions/Dockerfile**: Three variant structure
   - bundled-venvs-full: s3 with PySpark
   - bundled-venvs-slim: s3-slim without PySpark
   - bundled-venvs-locked: s3-slim without PySpark
   - final-full: Has PySpark, network enabled, full install
   - final-slim: No PySpark, network enabled, slim install
   - final-locked: No PySpark, network BLOCKED, NO main install (bundled venvs only)

**5. datahub-ingestion/Dockerfile**: Added locked stage
   - install-full: All sources with PySpark
   - install-slim: Selected sources with s3-slim (no PySpark)
   - install-locked: Minimal sources with s3-slim, network BLOCKED

**6. build.gradle**: Updated variants and defaults
   - defaultVariant: "full" (restored to original)
   - Variants: full (no suffix), slim (-slim), locked (-locked)
   - Build args properly set for all variants

**Network Blocking in Locked Variant**:
```dockerfile
ENV UV_INDEX_URL=http://127.0.0.1:1/simple
ENV PIP_INDEX_URL=http://127.0.0.1:1/simple
```
This prevents all PyPI downloads at runtime while still allowing packages cached during the build.

**Bundled Venv Naming**:
- Venv named `s3-bundled` (not `s3-slim-bundled`)
- Recipe uses `type: s3` (standard plugin name)
- Executor finds `s3-bundled` venv automatically
- Slim/locked: venv uses s3-slim package internally (no PySpark)
- Full: venv uses s3 package (with PySpark)

**Testing**:
✅ Full variant: PySpark installed, network enabled
✅ Slim variant: PySpark NOT installed, network enabled, s3-bundled venv works
✅ Integration tests: 12 tests validate s3-slim functionality

**Build Commands**:
```bash
./gradlew :datahub-actions:docker
./gradlew :docker:datahub-ingestion:docker

./gradlew :datahub-actions:docker -PdockerTarget=slim
./gradlew :docker:datahub-ingestion:docker -PdockerTarget=slim

./gradlew :datahub-actions:docker -PdockerTarget=locked
./gradlew :docker:datahub-ingestion:docker -PdockerTarget=locked

./gradlew :datahub-actions:docker -PmatrixBuild=true
./gradlew :docker:datahub-ingestion:docker -PmatrixBuild=true
```

**Recipe Format** (works with all variants):
```yaml
source:
  type: s3  # uses the existing "s3" source type
  config:
    path_specs:
      - include: "s3://bucket/*.csv"
    profiling:
      enabled: false  # Required for slim/locked
```
codecov bot commented Oct 30, 2025

Bundle Report

Changes will increase total bundle size by 9.26kB (0.03%) ⬆️. This is within the configured threshold ✅

Detailed changes

| Bundle name | Size | Change |
| --- | --- | --- |
| datahub-react-web-esm | 28.58MB | 9.26kB (0.03%) ⬆️ |

Affected Assets, Files, and Routes (bundle: datahub-react-web-esm):

| Asset Name | Size Change | Total Size | Change (%) |
| --- | --- | --- | --- |
| assets/index-*.js | 10.73kB | 18.95MB | 0.06% |
| assets/theme_v2.config-*.js (Deleted) | -1.47kB | 0 bytes | -100.0% 🗑️ |

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs Issues and Improvements to docs ingestion PR or Issue related to the ingestion of metadata needs-review Label for PRs that need review from a maintainer. publish-docker

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants