feat(ingestion): Make PySpark optional for S3, ABS, and Unity Catalog sources #15123
Open

esteban wants to merge 16 commits into master from feat-make-pyspark-optional
+5,840 −192
Commits (16):
- 432e0ce feat(ingestion): Make PySpark optional for S3, ABS, and Unity Catalog…
- 208965c feat(ingestion): Make PySpark default for s3/gcs/abs with -slim variants
- 80f9806 feat(docker): Add slim and locked variants for PySpark-optional deplo…
- 40d0d25 feat(docker): Add slim and locked variants for PySpark-optional deplo…
- b051aa4 fix(metadata-ingestion): linting
- e019a52 fix(s3): fix support_status of s3 ingestion to CERTIFIED in feature b…
- be36f75 Merge branch 'master' into feat-make-pyspark-optional
- b1f8289 fix(s3): fix support_status of s3 ingestion in capability_summary.json
- 483e328 fix(metadata-ingest): udpate text to reflect supported pandas versions
- e1948f0 feat(ingestion): add unit tets for s3 and abs profiling
- 2aef4dc feat(ingestion): add additional tests for coverage
- 843625b feat(ingestion): additional test fixes
- 737fdbd feat(ingestion): additional test fixes
- db26200 feat(ingestion): additional test fixes
- 4c17af8 feat(ingestion): additional test coverage
- 151a758 feat(ingestion): additional test coverage
New file added by this PR (283 lines):

# Optional PySpark Support for Data Lake Sources

DataHub's S3, GCS, ABS, and Unity Catalog sources now support optional PySpark installation. This allows users to install only the dependencies they need, reducing installation size and complexity when data lake profiling is not required.

## Overview

Previously, PySpark was a required dependency for S3, GCS, ABS, and Unity Catalog sources, even when profiling was disabled. This created unnecessary installation overhead (~500MB) and potential dependency conflicts for users who only needed metadata extraction without profiling.

**Now you can choose:**

- **Lightweight installation**: Metadata extraction without PySpark (~500MB smaller)
- **Full installation**: Metadata extraction + profiling with PySpark and PyDeequ

## PySpark Version

> **Current Version:** PySpark 3.5.x (3.5.6)
>
> PySpark 4.0 support is planned for a future release. Until then, all DataHub components use PySpark 3.5.x for compatibility and stability.

## Installation Options

### Option 1: Modular Installation (Recommended)

Install base source support, then add profiling if needed:

```bash
# S3 without profiling
pip install 'acryl-datahub[s3]'

# S3 with profiling
pip install 'acryl-datahub[s3,data-lake-profiling]'

# Multiple sources with profiling
pip install 'acryl-datahub[s3,gcs,abs,data-lake-profiling]'
```

### Option 2: Convenience Variants

All-in-one extras that include profiling:

```bash
# S3 with profiling (convenience)
pip install 'acryl-datahub[s3-profiling]'

# GCS with profiling
pip install 'acryl-datahub[gcs-profiling]'

# ABS with profiling
pip install 'acryl-datahub[abs-profiling]'
```

### What's Included

**Base extras (`s3`, `gcs`, `abs`):**

- ✅ Metadata extraction (schemas, tables, file listing)
- ✅ Data format detection (Parquet, Avro, CSV, JSON, etc.)
- ✅ Schema inference from files
- ✅ Table and column-level metadata
- ✅ Tags and properties extraction
- ❌ Data profiling (min/max, nulls, distinct counts)
- ❌ Data quality checks (PyDeequ-based)

**With `data-lake-profiling` extra:**

- ✅ All base functionality
- ✅ Data profiling with PyDeequ
- ✅ Statistical analysis (min, max, mean, stddev)
- ✅ Null count and distinct count analysis
- ✅ Histogram generation
- Includes: `pyspark~=3.5.6`, `pydeequ>=1.1.0`

**Unity Catalog behavior:**

- Without PySpark: Uses sqlglot for SQL parsing (graceful fallback), as sketched below
- With PySpark: Uses PySpark's SQL parser for better accuracy
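
As a hedged illustration of this fallback pattern (the `parse_statement` helper and dialect choice are assumptions for this sketch, not DataHub's internal API), availability can be probed once at import time and parsing routed accordingly:

```python
# Illustrative sketch only: `parse_statement` is a hypothetical helper,
# not part of the DataHub codebase.
import logging

logger = logging.getLogger(__name__)

try:
    import pyspark  # noqa: F401  # imported only to probe availability

    _PYSPARK_AVAILABLE = True
except ImportError:
    _PYSPARK_AVAILABLE = False


def parse_statement(sql: str):
    """Parse a SQL statement, preferring PySpark's parser when installed."""
    if _PYSPARK_AVAILABLE:
        # The real source would invoke PySpark's SQL parser here; this
        # sketch always falls through to sqlglot to stay self-contained.
        logger.debug("PySpark detected; its parser would normally be used.")
    # sqlglot parses the Spark/Databricks SQL dialect without needing a JVM.
    import sqlglot

    return sqlglot.parse_one(sql, read="databricks")
```
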
## Feature Comparison

| Feature                 | Without PySpark    | With PySpark                  |
| ----------------------- | ------------------ | ----------------------------- |
| **S3/GCS/ABS metadata** | ✅ Full support    | ✅ Full support               |
| **Schema inference**    | ✅ Basic inference | ✅ Enhanced inference         |
| **Data profiling**      | ❌ Not available   | ✅ Full profiling             |
| **Unity Catalog**       | ✅ sqlglot parser  | ✅ PySpark parser             |
| **Installation size**   | ~200MB             | ~700MB                        |
| **Install time**        | Fast               | Slower (large PySpark download) |

## Configuration

### Enabling Profiling

When profiling is enabled in your recipe, DataHub validates that PySpark is installed:

```yaml
source:
  type: s3
  config:
    path_specs:
      - include: s3://my-bucket/data/**/*.parquet
    profiling:
      enabled: true # Requires data-lake-profiling extra
      profile_table_level_only: false
```

**If PySpark is not installed**, you'll see a clear error message:

```
ValueError: Data lake profiling is enabled but required dependencies are not installed.
PySpark and PyDeequ are required for S3 profiling.
Please install with: pip install 'acryl-datahub[s3,data-lake-profiling]'
See docs/PYSPARK.md for more information.
```
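
The check behind this error can be approximated as follows. This is a minimal sketch, assuming a hypothetical helper name; it is not DataHub's actual internal function:

```python
# Hypothetical sketch of the dependency validation; the helper name is
# illustrative and not part of DataHub's public API.
def ensure_profiling_dependencies(platform: str = "s3") -> None:
    """Raise a ValueError mirroring the message above if PySpark/PyDeequ are missing."""
    try:
        import pydeequ  # noqa: F401
        import pyspark  # noqa: F401
    except ImportError as e:
        raise ValueError(
            "Data lake profiling is enabled but required dependencies are not installed. "
            f"PySpark and PyDeequ are required for {platform.upper()} profiling. "
            f"Please install with: pip install 'acryl-datahub[{platform},data-lake-profiling]'"
        ) from e
```
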
### Disabling Profiling

To use S3/GCS/ABS without PySpark, simply disable profiling:

```yaml
source:
  type: s3
  config:
    path_specs:
      - include: s3://my-bucket/data/**/*.parquet
    profiling:
      enabled: false # No PySpark required
```

### Adding PySpark Support to New Sources

If you're developing a new data lake source, follow this pattern:

```python
import logging

from datahub.ingestion.source.data_lake_common import pyspark_utils

logger = logging.getLogger(__name__)

# At module level: detect availability once, at import time
_PYSPARK_AVAILABLE = pyspark_utils.is_pyspark_available()

# Inside your source class (a method body, where `self.config` exists):
if _PYSPARK_AVAILABLE and self.config.profiling.enabled:
    # Import PySpark modules conditionally, only when profiling runs
    from pyspark.sql import SparkSession

    # ... use PySpark for profiling
else:
    logger.info("Profiling disabled or PySpark not available")
```
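
Probing availability once at import time keeps per-run checks cheap, and keeping all `pyspark` imports inside the guarded branch ensures the module remains importable when PySpark is absent.
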
## Troubleshooting

### Error: "PySpark is not installed"

**Problem:** You're trying to use profiling but PySpark is not installed.

**Solution:**

```bash
pip install 'acryl-datahub[data-lake-profiling]'
```

Or use the convenience variant:

```bash
pip install 'acryl-datahub[s3-profiling]'
```

### Warning: "Data lake profiling disabled: PySpark/PyDeequ not available"

**Problem:** Profiling is enabled in config but PySpark is not installed.

**Solutions:**

1. Install profiling dependencies: `pip install 'acryl-datahub[data-lake-profiling]'`
2. Disable profiling in your recipe: `profiling.enabled: false`

### Verifying Installation

Check whether PySpark is installed:

```bash
# Check installed packages
pip list | grep pyspark

# Test import in Python
python -c "import pyspark; print(pyspark.__version__)"
```

Expected output:

- With `data-lake-profiling`: Shows `pyspark 3.5.x`
- Without `data-lake-profiling`: Import fails or package not found
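For a programmatic check inside Python, a standard-library probe also works. This snippet is illustrative, not a DataHub utility:

```python
# Standard-library probe for the optional profiling dependencies.
import importlib.util

for pkg in ("pyspark", "pydeequ"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'installed' if found else 'not installed'}")
```
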
## Migration Guide

### Upgrading from Previous Versions

If you were using S3/GCS/ABS with profiling before this change:

**Option 1: Keep existing behavior (with profiling)**

```bash
# Replace your old install command
pip install 'acryl-datahub[s3]'

# With the new profiling-inclusive variant
pip install 'acryl-datahub[s3-profiling]'
```

**Option 2: Reduce footprint (without profiling)**

```bash
# Use the base variant if profiling is not needed
pip install 'acryl-datahub[s3]'

# Update config to disable profiling
# profiling:
#   enabled: false
```

### No Breaking Changes

This change is backward compatible:

- Existing installations with PySpark continue to work
- Profiling behavior is unchanged when PySpark is installed
- Only new installations are affected, where users can now choose to exclude PySpark

## DataHub Actions

[DataHub Actions](https://github.com/datahub-project/datahub/tree/master/datahub-actions) depends on `acryl-datahub` and benefits significantly from optional PySpark support:

### Reduced Installation Size

DataHub Actions typically doesn't need data lake profiling capabilities, since it focuses on reacting to metadata events rather than extracting metadata from data lakes. With optional PySpark:

```bash
# Before: Actions pulled in PySpark unnecessarily
pip install acryl-datahub-actions
# Result: ~700MB installation

# After: Actions installs without PySpark by default
pip install acryl-datahub-actions
# Result: ~200MB installation (500MB saved)
```

### Faster Deployment

Actions services can now deploy faster in containerized environments:

- **Faster pip install**: No large PySpark package to download
- **Smaller Docker images**: Reduced base image size
- **Quicker cold starts**: Less code to load and initialize

### Fewer Dependency Conflicts

Actions workflows often integrate with other tools (Slack, Teams, email services). Removing PySpark reduces:

- Python version constraint conflicts
- Java/Spark runtime conflicts in restricted environments
- Transitive dependency version mismatches

### When Actions Needs Profiling

If your Actions workflow needs to trigger data lake profiling jobs, you can still install the full stack:

```bash
# Actions with data lake profiling capability
pip install 'acryl-datahub-actions'
pip install 'acryl-datahub[s3,data-lake-profiling]'
```

**Common Actions use cases that DON'T need PySpark:**

- Slack notifications on schema changes
- Propagating tags and terms to downstream systems
- Triggering dbt runs on metadata updates
- Sending emails on data quality failures
- Creating Jira tickets for governance issues
- Updating external catalogs (e.g., Alation, Collibra)

**Rare Actions use cases that MIGHT need PySpark:**

- Custom actions that programmatically trigger S3/GCS/ABS profiling
- Actions that directly process data lake files (not typical)