pygen-spark User Guide

Introduction

pygen-spark generates strongly typed Python User-Defined Table Functions (UDTFs) from Cognite Data Fusion (CDF) Data Models, so you can query CDF data directly from Spark SQL. The generated UDTFs work with any Spark cluster: standalone, YARN, Kubernetes, or local development.

This approach is ideal for:

  • Standalone Spark Clusters: Deploy UDTFs to standard Spark clusters without Databricks-specific features
  • Development and Testing: Quickly test UDTFs in local or development environments
  • Production Deployments: Use UDTFs in production Spark clusters with configuration file-based credential management
  • Flexible Credential Management: Use TOML/YAML configuration files for secure credential handling

Overview

This documentation covers the complete workflow for using pygen-spark:

  1. Installation: Set up dependencies and verify your environment
  2. Generation: Generate UDTF code from CDF Data Models
  3. Registration: Register UDTFs in your Spark session
  4. Querying: Query UDTFs using SQL with credential parameters
  5. Filtering: Filter data using WHERE clauses with predicate pushdown
  6. Joining: Join data from different UDTFs based on external_id and space
  7. Time Series: Work with time series UDTFs, which are generated from the same templates as Data Model UDTFs
  8. Troubleshooting: Common issues and solutions
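Steps 3 to 6 can be illustrated with a hand-written stand-in for a generated UDTF. This is a minimal sketch, not the code pygen-spark generates: the class `PumpUDTF`, its columns, the in-memory rows, the `client_id`/`client_secret` parameters, and the second function `valve_udtf` in the join query are all hypothetical. It assumes only the PySpark 3.5+ Python UDTF protocol, where a class exposes an `eval` method that yields one tuple per output row.

```python
from typing import Iterator, Tuple

# Hypothetical output schema as a Spark DDL string (an assumption, not generated output).
PUMP_SCHEMA = "space: string, external_id: string, name: string"

# Hypothetical stand-in data; a generated UDTF would fetch rows from CDF instead.
_FAKE_ROWS = [
    ("plant_a", "pump_1", "Feed pump"),
    ("plant_a", "pump_2", "Booster pump"),
    ("plant_b", "pump_3", "Drain pump"),
]


class PumpUDTF:
    """Stand-in UDTF: eval() yields one tuple per output row."""

    def eval(self, client_id: str, client_secret: str) -> Iterator[Tuple[str, str, str]]:
        # Credentials arrive as SQL arguments; this sketch only checks they are present.
        if not client_id or not client_secret:
            raise ValueError("client_id and client_secret are required")
        yield from _FAKE_ROWS


def register(spark) -> None:
    """Register the stand-in on an existing SparkSession (needs PySpark 3.5+)."""
    # Imported lazily so the rest of the sketch loads without PySpark installed.
    from pyspark.sql.functions import udtf

    spark.udtf.register("pump_udtf", udtf(PumpUDTF, returnType=PUMP_SCHEMA))


# Step 4 and 5: query with credential parameters and filter with WHERE.
FILTER_QUERY = """
SELECT external_id, name
FROM pump_udtf('my-client-id', 'my-client-secret')
WHERE space = 'plant_a'
"""

# Step 6: join two UDTFs on space and external_id (valve_udtf and its columns are hypothetical).
JOIN_QUERY = """
SELECT p.name, v.name AS valve_name
FROM pump_udtf('my-client-id', 'my-client-secret') AS p
JOIN valve_udtf('my-client-id', 'my-client-secret') AS v
  ON v.pump_space = p.space AND v.pump_external_id = p.external_id
"""
```

With a live session, `register(spark)` followed by `spark.sql(FILTER_QUERY).show()` would run the first query. Calling `list(PumpUDTF().eval("id", "secret"))` exercises the row logic without Spark, which is convenient for unit tests.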

Generic Spark Utilities

pygen-spark provides generic utilities that work with any Spark cluster:

  • TypeConverter: Convert between CDF types, PySpark DataTypes, and SQL DDL
  • CDFConnectionConfig: Pydantic model for managing CDF credentials from TOML/YAML files
  • to_udtf_function_name(): Helper function for consistent UDTF naming

These utilities live in cognite.pygen_spark and are not Databricks-specific. See the README for usage examples.

Quick Links

Examples

Related Documentation

  • pygen: Base code generation library for CDF Data Models
  • cognite-databricks: Helper SDK for Databricks-specific features (Unity Catalog, Secret Manager)
  • Technical Plan: CDF Databricks Integration (UDTF-Based)