pygen-spark generates strongly-typed Python User-Defined Table Functions (UDTFs) from CDF Data Models, enabling you to query CDF data directly from Spark SQL. The generated UDTFs work with any Spark cluster (standalone, YARN, Kubernetes, or local development).
This approach is ideal for:
- Standalone Spark Clusters: Deploy UDTFs to standard Spark clusters without Databricks-specific features
- Development and Testing: Quickly test UDTFs in local or development environments
- Production Deployments: Use UDTFs in production Spark clusters with configuration file-based credential management
- Flexible Credential Management: Use TOML/YAML configuration files for secure credential handling
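As an illustration of file-based credential handling, the following is a minimal sketch that reads credentials from TOML text using only the standard library. The `[cognite]` section name, the key names, and the `SketchCredentials` class are assumptions made for this example; they are not the library's actual configuration schema.

```python
from dataclasses import dataclass

try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:  # older interpreters need the tomli backport
    import tomli as tomllib

# Example file content; the section name and keys are assumptions.
EXAMPLE_TOML = """
[cognite]
project = "my-project"
cdf_cluster = "api"
tenant_id = "my-tenant-id"
client_id = "my-client-id"
client_secret = "my-client-secret"
"""


@dataclass(frozen=True, repr=False)
class SketchCredentials:
    """Hypothetical credentials holder, not part of pygen-spark."""

    project: str
    cdf_cluster: str
    tenant_id: str
    client_id: str
    client_secret: str

    def __repr__(self) -> str:
        # Keep the secret out of logs and tracebacks.
        return f"SketchCredentials(project={self.project!r}, client_secret='***')"


def load_credentials(toml_text: str) -> SketchCredentials:
    """Parse the assumed [cognite] section into a credentials object."""
    section = tomllib.loads(toml_text)["cognite"]
    return SketchCredentials(**section)


creds = load_credentials(EXAMPLE_TOML)
print(creds)
```

Keeping credentials in a file that stays out of version control, and masking the secret in the object's representation, reduces the chance of leaking it through notebooks or logs.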
This documentation covers the complete workflow for using pygen-spark:
- Installation: Set up dependencies and verify your environment
- Generation: Generate UDTF code from CDF Data Models
- Registration: Register UDTFs in your Spark session
- Querying: Query UDTFs using SQL with credential parameters
- Filtering: Filter data using WHERE clauses with predicate pushdown
- Joining: Join data from different UDTFs based on external_id and space
- Time Series: Work with template-generated time series UDTFs (same template-based generation as Data Model UDTFs)
- Troubleshooting: Common issues and solutions
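To make the querying and filtering steps above more concrete, here is a plain-Python sketch of the general shape of a UDTF: a class whose `eval` method receives credential and filter arguments and yields one tuple per output row. The class name `PumpUDTF`, its parameters, and the in-memory rows are hypothetical; the code pygen-spark generates may look different.

```python
from typing import Iterator, Optional, Tuple

# Hypothetical in-memory stand-in for rows that would be fetched from CDF.
_FAKE_ROWS = [
    ("cdf_space", "pump-1", "Pump 1"),
    ("cdf_space", "pump-2", "Pump 2"),
    ("other_space", "valve-1", "Valve 1"),
]


class PumpUDTF:
    """Illustrative UDTF-shaped class; not generated by pygen-spark."""

    def eval(
        self,
        client_id: str,
        client_secret: str,
        space: Optional[str] = None,
        external_id: Optional[str] = None,
    ) -> Iterator[Tuple[str, str, str]]:
        # A real implementation would authenticate against CDF here.
        if not client_id or not client_secret:
            raise ValueError("credentials are required")
        for row in _FAKE_ROWS:
            # Filtering at the source means unmatched rows are never emitted.
            if space is not None and row[0] != space:
                continue
            if external_id is not None and row[1] != external_id:
                continue
            yield row


rows = list(PumpUDTF().eval("my-id", "my-secret", space="cdf_space"))
print(rows)
```

In PySpark 3.5 and later, a class of this shape can be decorated with `pyspark.sql.functions.udtf` and registered through `spark.udtf.register`, after which it is callable from SQL in a `FROM` clause. Applying the filter inside `eval` is the idea behind pushdown: less data leaves the source than if Spark filtered the full result afterwards.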
pygen-spark provides generic utilities that work with any Spark cluster:
- TypeConverter: Convert between CDF types, PySpark DataTypes, and SQL DDL
- CDFConnectionConfig: Pydantic model for managing CDF credentials from TOML/YAML files
- to_udtf_function_name(): Helper function for consistent UDTF naming
These utilities are available in cognite.pygen_spark and are generic (not Databricks-specific). See the README for usage examples.
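The following sketch illustrates the kind of work the type-conversion and naming utilities perform. It does not call the real `cognite.pygen_spark` API: the mapping table, the `sketch_udtf_name` helper, and its snake_case-plus-suffix rule are assumptions chosen for illustration, and the actual utilities may behave differently.

```python
import re

# Assumed mapping from CDF property type names to Spark SQL DDL types.
CDF_TO_SPARK_DDL = {
    "text": "STRING",
    "boolean": "BOOLEAN",
    "int32": "INT",
    "int64": "BIGINT",
    "float32": "FLOAT",
    "float64": "DOUBLE",
    "timestamp": "TIMESTAMP",
    "date": "DATE",
}


def sketch_udtf_name(view_external_id: str) -> str:
    """Hypothetical naming rule: CamelCase view ID to snake_case plus '_udtf'."""
    snake = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", view_external_id).lower()
    return f"{snake}_udtf"


def sketch_ddl(properties: dict) -> str:
    """Build a Spark DDL schema string from {property name: CDF type name}."""
    return ", ".join(
        f"{name} {CDF_TO_SPARK_DDL[cdf_type]}"
        for name, cdf_type in properties.items()
    )


name = sketch_udtf_name("SmallBoat")
ddl = sketch_ddl({"external_id": "text", "capacity": "float64"})
print(name)
print(ddl)
```

Deriving both the function name and the return schema from the view definition keeps generated UDTFs consistent, so the same view always produces the same SQL-visible name and column types.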
- Basic Generation: Generate UDTFs from a CDF Data Model
- Registration: Register and query UDTFs
- Querying Data: Query single/multiple UDTFs, named vs positional parameters
- Filtering Queries: Equality, range, NULL handling, multiple conditions
- Joining UDTFs: Joins on external_id, space+external_id, CROSS JOIN LATERAL
- pygen: Base code generation library for CDF Data Models
- cognite-databricks: Helper SDK for Databricks-specific features (Unity Catalog, Secret Manager)
- Technical Plan: CDF Databricks Integration (UDTF-Based)