feat(ingestion): Make PySpark optional for S3, ABS, and Unity Catalog sources #15123
Open

esteban wants to merge 15 commits into master from feat-make-pyspark-optional

+5,836 −192
    
  
Conversation
  
    
  
    
… sources

PySpark and PyDeequ have been required dependencies for S3, ABS, and Unity Catalog sources, even when profiling is disabled. This creates unnecessary installation overhead (~500MB) and potential dependency conflicts for users who don't need profiling capabilities.

**PySpark Detection Framework**
- Added `pyspark_utils.py` with centralized availability detection
- Graceful fallback when PySpark/PyDeequ are unavailable
- Clear error messages guiding users to install dependencies when needed

**Modular Installation Options**
- S3/ABS/GCS sources now work without PySpark when profiling is disabled
- New `data-lake-profiling` extra for modular PySpark installation
- Convenience extras: `s3-profiling`, `gcs-profiling`, `abs-profiling`
- Unity Catalog gracefully falls back to sqlglot when PySpark is unavailable

**Config Validation**
- Added validators to S3/ABS configs to check PySpark availability at config time
- Validates profiling dependencies before attempting to use them
- Provides actionable error messages with installation instructions

**Installation Examples**

```bash
pip install 'acryl-datahub[s3]'
pip install 'acryl-datahub[s3,data-lake-profiling]'
pip install 'acryl-datahub[s3-profiling]'
```

**Dependencies**
- PySpark ~=3.5.6 (in the `data-lake-profiling` extra)
- PyDeequ >=1.1.0 (data quality validation)

**Benefits**
- Reduced footprint: base installs are ~500MB smaller without PySpark
- Faster installs: no PySpark download for non-profiling users
- Better errors: clear messages when profiling needs PySpark
- Flexibility: users choose their level of profiling support
- Backward compatible: existing installations continue working

**Testing**
- Added 46+ unit tests validating optional PySpark functionality
- Tests cover availability detection, config validation, and graceful fallbacks
- All existing tests continue to pass

See docs/PYSPARK.md for a detailed installation and usage guide.
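For concreteness, the centralized availability check described above might look roughly like the following sketch. The function names, message wording, and use of `importlib.util.find_spec` are illustrative assumptions, not the PR's actual `pyspark_utils.py` code:

```python
# Hedged sketch of a centralized PySpark/PyDeequ availability check.
# Names and error text are assumptions; only the pattern mirrors the PR description.
import importlib.util


def is_pyspark_available() -> bool:
    """Return True when both PySpark and PyDeequ can be imported."""
    return (
        importlib.util.find_spec("pyspark") is not None
        and importlib.util.find_spec("pydeequ") is not None
    )


def require_pyspark(feature: str) -> None:
    """Fail fast with an actionable message when profiling needs PySpark."""
    if not is_pyspark_available():
        raise ImportError(
            f"{feature} requires PySpark and PyDeequ. "
            "Install them with: pip install 'acryl-datahub[data-lake-profiling]'"
        )
```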
              
treff7es reviewed on Oct 28, 2025:

metadata-ingestion/src/datahub/ingestion/source/ge_data_profiler.py (Outdated)
Flips the implementation to maintain backward compatibility while providing lightweight installation options. S3, GCS, and ABS sources now include PySpark by default, with new -slim variants for PySpark-less installations.

**Changes:**

1. **Setup.py - default PySpark inclusion:**
   - `s3`, `gcs`, `abs` extras now include `data-lake-profiling` by default
   - New `s3-slim`, `gcs-slim`, `abs-slim` extras without PySpark
   - Ensures existing users see no breaking changes
   - Naming aligns with Docker image conventions (slim/full)

2. **Config validation removed:**
   - Removed PySpark dependency validation from S3/ABS config
   - Profiling failures now occur at runtime (not config time)
   - Maintains pre-PR behavior for backward compatibility

3. **Documentation updated:**
   - Updated PYSPARK.md to reflect the new installation approach
   - Standard installation: `pip install 'acryl-datahub[s3]'` (with PySpark)
   - Lightweight installation: `pip install 'acryl-datahub[s3-slim]'` (no PySpark)
   - Added a migration-path note for future DataHub 2.0
   - Explained benefits for DataHub Actions with -slim variants

4. **Tests updated:**
   - Removed tests expecting validation failures without PySpark
   - Added tests confirming config accepts profiling without validation
   - All tests pass with the new behavior

**Rationale:** This approach provides:
- **Backward compatibility**: existing users see no changes
- **Migration path**: users can opt into -slim variants now
- **Future flexibility**: DataHub 2.0 can flip defaults to -slim
- **No breaking changes**: maintains pre-PR functionality
- **Naming consistency**: aligns with the Docker slim/full convention

**Installation examples:**

```bash
pip install 'acryl-datahub[s3]'
pip install 'acryl-datahub[gcs]'
pip install 'acryl-datahub[abs]'
pip install 'acryl-datahub[s3-slim]'
pip install 'acryl-datahub[gcs-slim]'
pip install 'acryl-datahub[abs-slim]'
```
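For illustration, the extras composition in item 1 might look roughly like the sketch below. The variable names (`s3_base`, `data_lake_profiling`) and the base dependency contents are assumptions inferred from the description, not the PR's actual `setup.py`:

```python
# Hypothetical sketch of the extras layout described above.
# Set names and the s3_base contents are assumptions.
data_lake_profiling = {
    "pyspark~=3.5.6",
    "pydeequ>=1.1.0",
}

s3_base = {
    "boto3",        # illustrative placeholder for the real S3 base dependencies
    "cachetools",
}

extras_require = {
    "data-lake-profiling": list(data_lake_profiling),
    "s3": list(s3_base | data_lake_profiling),  # PySpark included by default
    "s3-slim": list(s3_base),                   # lightweight variant, no PySpark
}
```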
…yment
Introduces slim and locked Docker image variants for both
datahub-ingestion and datahub-actions, for environments with different PySpark requirements
and security constraints.
**Image Variants**:
1. **Full (default)**: With PySpark, network enabled
   - Includes PySpark for data profiling
   - Can install packages from PyPI at runtime
   - Backward compatible with existing deployments
2. **Slim**: Without PySpark, network enabled
   - Excludes PySpark (~500MB smaller)
   - Uses s3-slim, gcs-slim, abs-slim for data lake sources
   - Can still install packages from PyPI if needed
3. **Locked** (NEW): Without PySpark, network BLOCKED
   - Excludes PySpark
   - Blocks ALL network access to PyPI/UV indexes
   - datahub-actions: ONLY bundled venvs, no main ingestion install
   - Most secure/restrictive variant for production
**Additional Changes**:
**1. pyspark_utils.py**: Fixed module-level exports
   - Added SparkSession, DataFrame, AnalysisRunBuilder, PandasDataFrame as None
   - These can now be imported even when PySpark is unavailable
   - Prevents ImportError in s3-slim installations (see the sketch after this list)
**2. setup.py**: Moved cachetools to s3_base
   - operation_config.py uses cachetools unconditionally
   - Now available in s3-slim without requiring data_lake_profiling
**3. build_bundled_venvs_unified.py**: Added slim_mode support
   - BUNDLED_VENV_SLIM_MODE flag controls package extras
   - When true: installs s3-slim, gcs-slim, abs-slim (no PySpark)
   - When false: installs s3, gcs, abs (with PySpark)
   - Venv named {plugin}-bundled (e.g., s3-bundled) for executor compatibility
**4. datahub-actions/Dockerfile**: Three variant structure
   - bundled-venvs-full: s3 with PySpark
   - bundled-venvs-slim: s3-slim without PySpark
   - bundled-venvs-locked: s3-slim without PySpark
   - final-full: Has PySpark, network enabled, full install
   - final-slim: No PySpark, network enabled, slim install
   - final-locked: No PySpark, network BLOCKED, NO main install (bundled venvs only)
**5. datahub-ingestion/Dockerfile**: Added locked stage
   - install-full: All sources with PySpark
   - install-slim: Selected sources with s3-slim (no PySpark)
   - install-locked: Minimal sources with s3-slim, network BLOCKED
**6. build.gradle**: Updated variants and defaults
   - defaultVariant: "full" (restored to original)
   - Variants: full (no suffix), slim (-slim), locked (-locked)
   - Build args properly set for all variants
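A minimal sketch of the fallback-export pattern from item 1, assuming a simple try/except layout (the actual module structure may differ):

```python
# Hedged sketch of module-level fallback exports for slim installations.
# When PySpark/PyDeequ are missing, the names resolve to None instead of
# raising ImportError at import time.
try:
    from pyspark.sql import DataFrame, SparkSession
    from pyspark.pandas import DataFrame as PandasDataFrame

    PYSPARK_AVAILABLE = True
except ImportError:
    DataFrame = None
    SparkSession = None
    PandasDataFrame = None
    PYSPARK_AVAILABLE = False

try:
    from pydeequ.analyzers import AnalysisRunBuilder
except ImportError:
    AnalysisRunBuilder = None
```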
**Network Blocking in Locked Variant**:
```dockerfile
ENV UV_INDEX_URL=http://127.0.0.1:1/simple
ENV PIP_INDEX_URL=http://127.0.0.1:1/simple
```
This prevents all PyPI downloads while allowing cached packages from build.
**Bundled Venv Naming**:
- Venv named `s3-bundled` (not `s3-slim-bundled`)
- Recipe uses `type: s3` (standard plugin name)
- Executor finds `s3-bundled` venv automatically
- Slim/locked: venv uses s3-slim package internally (no PySpark)
- Full: venv uses s3 package (with PySpark)
**Testing**:
✅ Full variant: PySpark installed, network enabled
✅ Slim variant: PySpark NOT installed, network enabled, s3-bundled venv works
✅ Integration tests: 12 tests validate s3-slim functionality
**Build Commands**:
```bash
./gradlew :datahub-actions:docker
./gradlew :docker:datahub-ingestion:docker
./gradlew :datahub-actions:docker -PdockerTarget=slim
./gradlew :docker:datahub-ingestion:docker -PdockerTarget=slim
./gradlew :datahub-actions:docker -PdockerTarget=locked
./gradlew :docker:datahub-ingestion:docker -PdockerTarget=locked
./gradlew :datahub-actions:docker -PmatrixBuild=true
./gradlew :docker:datahub-ingestion:docker -PmatrixBuild=true
```
**Recipe Format** (works with all variants):
```yaml
source:
  type: s3  # uses the existing "s3" source type
  config:
    path_specs:
      - include: "s3://bucket/*.csv"
    profiling:
      enabled: false  # Required for slim/locked
```
Bundle Report: changes will increase total bundle size by 9.26kB (0.03%) ⬆️. This is within the configured threshold ✅ (bundle: datahub-react-web-esm).
  
  
Labels

- docs: Issues and Improvements to docs
- ingestion: PR or Issue related to the ingestion of metadata
- needs-review: Label for PRs that need review from a maintainer
- publish-docker