
Conversation

@esteban (Collaborator) commented Oct 28, 2025

PySpark and PyDeequ have been required dependencies for S3, ABS, and Unity Catalog
sources, even when profiling is disabled. This creates unnecessary installation
overhead (~500MB) and potential dependency conflicts for users who don't need
profiling capabilities.

**PySpark Detection Framework**
- Added `pyspark_utils.py` with centralized availability detection (see the sketch below)
- Graceful fallback when PySpark/PyDeequ are unavailable
- Clear error messages guiding users to install dependencies when needed

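As a rough illustration of the centralized detection described above, a helper along these lines can gate profiling code paths. The names below are illustrative, not the exact `pyspark_utils.py` API:

```python
# Illustrative sketch of centralized PySpark/PyDeequ detection (names are
# hypothetical; see pyspark_utils.py for the actual implementation).
import importlib.util

# True only when both optional profiling dependencies are importable.
PYSPARK_AVAILABLE = (
    importlib.util.find_spec("pyspark") is not None
    and importlib.util.find_spec("pydeequ") is not None
)


def require_pyspark(feature: str = "profiling") -> None:
    """Fail fast with an actionable message when a profiling path runs without PySpark."""
    if not PYSPARK_AVAILABLE:
        raise ImportError(
            f"{feature} requires PySpark and PyDeequ. "
            "Install them with: pip install 'acryl-datahub[data-lake-profiling]'"
        )
```
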
**Modular Installation Options**
- S3/ABS/GCS sources now work without PySpark when profiling is disabled
- New `data-lake-profiling` extra for modular PySpark installation
- Convenience extras: `s3-profiling`, `gcs-profiling`, `abs-profiling`
- Unity Catalog gracefully falls back to sqlglot when PySpark is unavailable (fallback pattern sketched below)

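The graceful fallback mentioned above follows roughly the pattern below (a hypothetical helper, not the actual Unity Catalog source code): prefer a Spark-backed path when PySpark is importable, otherwise parse SQL with sqlglot.

```python
# Hypothetical illustration of the "fall back to sqlglot" pattern; the real
# Unity Catalog source is more involved than this.
import sqlglot
from sqlglot import exp

try:
    import pyspark  # noqa: F401  # imported only to detect availability
    PYSPARK_AVAILABLE = True
except ImportError:
    PYSPARK_AVAILABLE = False


def referenced_tables(sql: str) -> list[str]:
    """Collect table names from a query using sqlglot, which needs no JVM or Spark."""
    # A Spark-backed analysis could be preferred when PYSPARK_AVAILABLE is True;
    # the sqlglot path below is what slim installs rely on.
    statement = sqlglot.parse_one(sql, dialect="databricks")
    return sorted({t.sql(dialect="databricks") for t in statement.find_all(exp.Table)})


print(referenced_tables("SELECT o.id FROM main.sales.orders o JOIN customers c ON o.cid = c.id"))
```
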
**Config Validation**
- Added validators to S3/ABS configs that check PySpark availability at config time (see the sketch below)
- Validates profiling dependencies before attempting to use them
- Provides actionable error messages with installation instructions

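A hedged sketch of what such a config-time check can look like with pydantic-style validators. The class and field names are illustrative, not the real S3/ABS config classes, and note that a later update in this PR removes this validation again in favor of runtime errors:

```python
# Illustrative config-time check; not the actual DataHub S3/ABS config code.
import importlib.util

from pydantic import BaseModel, validator


class ProfilingConfig(BaseModel):
    enabled: bool = False


class DataLakeSourceConfig(BaseModel):
    profiling: ProfilingConfig = ProfilingConfig()

    @validator("profiling")
    def _profiling_requires_pyspark(cls, value: ProfilingConfig) -> ProfilingConfig:
        if value.enabled and importlib.util.find_spec("pyspark") is None:
            raise ValueError(
                "profiling.enabled=true requires PySpark/PyDeequ; "
                "install them with: pip install 'acryl-datahub[data-lake-profiling]'"
            )
        return value
```
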
**Installation Examples**
```bash
pip install 'acryl-datahub[s3]'

pip install 'acryl-datahub[s3,data-lake-profiling]'

pip install 'acryl-datahub[s3-profiling]'
```

**Dependencies**
- PySpark ~=3.5.6 (in `data-lake-profiling` extra)
- PyDeequ >=1.1.0 (data quality validation)

**Benefits**
- Reduced footprint: Base installs ~500MB smaller without PySpark
- Faster installs: No PySpark compilation for non-profiling users
- Better errors: Clear messages when profiling needs PySpark
- Flexibility: Users choose profiling support level
- Backward compatible: Existing installations continue working
  • Backward compatible: Existing installations continue working

**Testing**
- Added 46+ unit tests validating optional PySpark functionality
- Tests cover availability detection, config validation, and graceful fallbacks
- All existing tests continue to pass

See docs/PYSPARK.md for a detailed installation and usage guide.

@github-actions bot added the `ingestion` (PR or Issue related to the ingestion of metadata) and `docs` (Issues and Improvements to docs) labels on Oct 28, 2025.

@datahub-cyborg bot added the `needs-review` label (PRs that need review from a maintainer) on Oct 28, 2025.
@datahub-cyborg bot added the `pending-submitter-response` label (reviewed, awaiting a response from the submitter) and removed the `needs-review` label on Oct 28, 2025.

Flips the implementation to maintain backward compatibility while providing
lightweight installation options. S3, GCS, and ABS sources now include PySpark
by default, with new -slim variants for PySpark-less installations.

**Changes:**

1. **Setup.py - Default PySpark inclusion:**
   - `s3`, `gcs`, `abs` extras now include `data-lake-profiling` by default
   - New `s3-slim`, `gcs-slim`, `abs-slim` extras without PySpark (see the sketch after this list)
   - Ensures no breaking changes for existing users
   - Naming aligns with Docker image conventions (slim/full)

2. **Config validation removed:**
   - Removed PySpark dependency validation from S3/ABS config
   - Profiling failures now occur at runtime (not config time)
   - Maintains pre-PR behavior for backward compatibility

3. **Documentation updated:**
   - Updated PYSPARK.md to reflect new installation approach
   - Standard installation: pip install 'acryl-datahub[s3]' (with PySpark)
   - Lightweight installation: pip install 'acryl-datahub[s3-slim]' (no PySpark)
   - Added migration path note for future DataHub 2.0
   - Explained benefits for DataHub Actions with -slim variants

4. **Tests updated:**
   - Removed tests expecting validation failures without PySpark
   - Added tests confirming config accepts profiling without validation
   - All tests pass with new behavior

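A hypothetical excerpt of how the extras could be wired in setup.py under this scheme (dependency sets are abbreviated and this is not the verbatim DataHub setup.py):

```python
# Hypothetical setup.py excerpt; pins match the PR description, the rest is abbreviated.
data_lake_profiling = {
    "pyspark~=3.5.6",
    "pydeequ>=1.1.0",
}

s3_base = {"boto3"}  # plus the other non-Spark S3 dependencies (elided here)

extras_require = {
    # Default extras keep PySpark so existing installs see no change.
    "s3": sorted(s3_base | data_lake_profiling),
    # New -slim variants omit PySpark for lightweight installs.
    "s3-slim": sorted(s3_base),
    # Standalone extra for adding profiling to any data lake source.
    "data-lake-profiling": sorted(data_lake_profiling),
}
```
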
**Rationale:**

This approach provides:
- **Backward compatibility**: Existing users see no changes
- **Migration path**: Users can opt into -slim variants now
- **Future flexibility**: DataHub 2.0 can flip defaults to -slim
- **No breaking changes**: Maintains pre-PR functionality
- **Naming consistency**: Aligns with Docker slim/full convention

**Installation examples:**

```bash
pip install 'acryl-datahub[s3]'
pip install 'acryl-datahub[gcs]'
pip install 'acryl-datahub[abs]'

pip install 'acryl-datahub[s3-slim]'
pip install 'acryl-datahub[gcs-slim]'
pip install 'acryl-datahub[abs-slim]'
```
@datahub-cyborg bot added the `needs-review` label and removed the `pending-submitter-response` label on Oct 29, 2025.

…yment

Introduces slim and locked Docker image variants for both
datahub-ingestion and datahub-actions, targeting environments with different
PySpark requirements and security constraints.

**Image Variants**:

1. **Full (default)**: With PySpark, network enabled
   - Includes PySpark for data profiling
   - Can install packages from PyPI at runtime
   - Backward compatible with existing deployments

2. **Slim**: Without PySpark, network enabled
   - Excludes PySpark (~500MB smaller)
   - Uses s3-slim, gcs-slim, abs-slim for data lake sources
   - Can still install packages from PyPI if needed

3. **Locked** (NEW): Without PySpark, network BLOCKED
   - Excludes PySpark
   - Blocks ALL network access to PyPI/UV indexes
   - datahub-actions: ONLY bundled venvs, no main ingestion install
   - Most secure/restrictive variant for production

**Additional Changes**:

**1. pyspark_utils.py**: Fixed module-level exports
   - Added SparkSession, DataFrame, AnalysisRunBuilder, PandasDataFrame as None
   - These can now be imported even when PySpark is unavailable (pattern sketched below)
   - Prevents ImportError in s3-slim installations

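Roughly, the export fix follows the pattern below (illustrative only; the exact import locations and names in `pyspark_utils.py` may differ):

```python
# Illustrative module-level exports with None placeholders; exact imports in
# pyspark_utils.py may differ.
try:
    import pandas
    from pydeequ.analyzers import AnalysisRunBuilder
    from pyspark.sql import DataFrame, SparkSession

    PandasDataFrame = pandas.DataFrame
    PYSPARK_AVAILABLE = True
except ImportError:
    # Placeholders keep "from ... import SparkSession" working in s3-slim
    # installs; callers must check PYSPARK_AVAILABLE before using them.
    SparkSession = None
    DataFrame = None
    AnalysisRunBuilder = None
    PandasDataFrame = None
    PYSPARK_AVAILABLE = False
```
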
**2. setup.py**: Moved cachetools to s3_base
   - operation_config.py uses cachetools unconditionally
   - Now available in s3-slim without requiring data_lake_profiling

**3. build_bundled_venvs_unified.py**: Added slim_mode support
   - BUNDLED_VENV_SLIM_MODE flag controls which package extras are installed (see the sketch below)
   - When true: installs s3-slim, gcs-slim, abs-slim (no PySpark)
   - When false: installs s3, gcs, abs (with PySpark)
   - Venv is named {plugin}-bundled (e.g., s3-bundled) for executor compatibility

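A sketch of the slim_mode switch: the `BUNDLED_VENV_SLIM_MODE` flag comes from the change above, while the helper functions below are illustrative and not the script's real API.

```python
# Illustrative sketch of slim_mode handling in build_bundled_venvs_unified.py;
# helper names are hypothetical.
import os

SLIM_MODE = os.environ.get("BUNDLED_VENV_SLIM_MODE", "false").lower() == "true"
PLUGINS = ["s3", "gcs", "abs"]


def requirement_for(plugin: str) -> str:
    # Slim mode installs the -slim extra (no PySpark); full mode keeps the default extra.
    extra = f"{plugin}-slim" if SLIM_MODE else plugin
    return f"acryl-datahub[{extra}]"


def venv_name(plugin: str) -> str:
    # The venv name stays "<plugin>-bundled" in both modes (e.g. s3-bundled),
    # so the executor resolves it the same way regardless of variant.
    return f"{plugin}-bundled"


for plugin in PLUGINS:
    print(f"{venv_name(plugin)} -> {requirement_for(plugin)}")
```
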
**4. datahub-actions/Dockerfile**: Three variant structure
   - bundled-venvs-full: s3 with PySpark
   - bundled-venvs-slim: s3-slim without PySpark
   - bundled-venvs-locked: s3-slim without PySpark
   - final-full: Has PySpark, network enabled, full install
   - final-slim: No PySpark, network enabled, slim install
   - final-locked: No PySpark, network BLOCKED, NO main install (bundled venvs only)

**5. datahub-ingestion/Dockerfile**: Added locked stage
   - install-full: All sources with PySpark
   - install-slim: Selected sources with s3-slim (no PySpark)
   - install-locked: Minimal sources with s3-slim, network BLOCKED

**6. build.gradle**: Updated variants and defaults
   - defaultVariant: "full" (restored to original)
   - Variants: full (no suffix), slim (-slim), locked (-locked)
   - Build args properly set for all variants

**Network Blocking in Locked Variant**:
```dockerfile
ENV UV_INDEX_URL=http://127.0.0.1:1/simple
ENV PIP_INDEX_URL=http://127.0.0.1:1/simple
```
This prevents all PyPI downloads at runtime while still allowing packages cached during the build.

**Bundled Venv Naming**:
- Venv named `s3-bundled` (not `s3-slim-bundled`)
- Recipe uses `type: s3` (standard plugin name)
- Executor finds `s3-bundled` venv automatically
- Slim/locked: venv uses s3-slim package internally (no PySpark)
- Full: venv uses s3 package (with PySpark)

**Testing**:
✅ Full variant: PySpark installed, network enabled
✅ Slim variant: PySpark NOT installed, network enabled, s3-bundled venv works
✅ Integration tests: 12 tests validate s3-slim functionality

**Build Commands**:
```bash
./gradlew :datahub-actions:docker
./gradlew :docker:datahub-ingestion:docker

./gradlew :datahub-actions:docker -PdockerTarget=slim
./gradlew :docker:datahub-ingestion:docker -PdockerTarget=slim

./gradlew :datahub-actions:docker -PdockerTarget=locked
./gradlew :docker:datahub-ingestion:docker -PdockerTarget=locked

./gradlew :datahub-actions:docker -PmatrixBuild=true
./gradlew :docker:datahub-ingestion:docker -PmatrixBuild=true
```

**Recipe Format** (works with all variants):
```yaml
source:
  type: s3  # uses the existing "s3" source type
  config:
    path_specs:
      - include: "s3://bucket/*.csv"
    profiling:
      enabled: false  # Required for slim/locked
```
codecov bot commented Oct 30, 2025

Bundle Report

Changes will increase total bundle size by 9.26kB (0.03%) ⬆️. This is within the configured threshold ✅

Detailed changes

| Bundle name | Size | Change |
| --- | --- | --- |
| datahub-react-web-esm | 28.58MB | 9.26kB (0.03%) ⬆️ |

Affected Assets, Files, and Routes (bundle: datahub-react-web-esm):

| Asset Name | Size Change | Total Size | Change (%) |
| --- | --- | --- | --- |
| assets/index-*.js | 10.73kB | 18.95MB | 0.06% |
| assets/theme_v2.config-*.js (Deleted) | -1.47kB | 0 bytes | -100.0% 🗑️ |

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs Issues and Improvements to docs ingestion PR or Issue related to the ingestion of metadata needs-review Label for PRs that need review from a maintainer. publish-docker

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants