- Spark Cluster: Access to a Spark cluster (standalone, YARN, Kubernetes, or local)
- Python 3.9+: Python 3.9 or higher
- PySpark 3.5+: PySpark 3.5 or higher (required for UDTF support)
- CDF Credentials: Access to CDF credentials (client_id, client_secret, tenant_id, cdf_cluster, project)
- CDF Data Model: A CDF Data Model with Views (for Data Model UDTFs)
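As a quick sanity check on the Python prerequisite above, you can verify the interpreter version before installing anything (a minimal sketch; the 3.9 floor comes from the list above):

```python
import sys

# Fail fast if the interpreter is older than the required 3.9
assert sys.version_info >= (3, 9), "Python 3.9 or higher is required"
print(f"Python {sys.version_info.major}.{sys.version_info.minor} OK")
```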
Install `cognite-pygen-spark` and its dependencies:

```bash
pip install cognite-pygen-spark
```

This will automatically install:

- `cognite-pygen` (base code generation library)
- `cognite-sdk` (CDF Python SDK)
- `jinja2` (template engine)
Important: The cognite-sdk package must be installed on all Spark worker nodes. The generated UDTF code requires cognite-sdk to connect to CDF.
Install `cognite-sdk` on all Spark worker nodes:

```bash
# On each worker node
pip install cognite-sdk
```

If your Spark cluster uses a shared Python environment, install there instead:

```bash
# In the Spark Python environment
pip install cognite-sdk
```

For production deployments, you can package dependencies with your application. See your Spark cluster's documentation for details.
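One common packaging approach for the production case, sketched here under assumptions (the archive name, app filename, and use of the third-party `venv-pack` tool are illustrative, not specific to pygen-spark), is to ship a self-contained virtual environment with `spark-submit`:

```shell
# Build a relocatable virtual environment containing the worker-side dependency
python -m venv pyspark_env
pyspark_env/bin/pip install cognite-sdk venv-pack
pyspark_env/bin/venv-pack -o pyspark_env.tar.gz

# Ship the archive so each executor unpacks and uses its own Python environment
spark-submit \
  --archives pyspark_env.tar.gz#environment \
  --conf spark.pyspark.python=./environment/bin/python \
  your_app.py
```

This follows the standard Spark Python package management pattern; consult your cluster manager's documentation for YARN- or Kubernetes-specific variants.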
Verify that the package imports correctly:

```python
from cognite.pygen_spark import SparkUDTFGenerator
from cognite.pygen import load_cognite_client_from_toml

# Verify imports work
print("✓ pygen-spark installed successfully")
```

Create a TOML configuration file (`config.toml`) with your CDF credentials:
```toml
[cognite]
project = "<your-cdf-project>"
tenant_id = "<your-tenant-id>"
cdf_cluster = "<your-cdf-cluster>"
client_id = "<your-client-id>"
client_secret = "<your-client-secret>"
```

Security Note: Keep your `config.toml` file secure and never commit it to version control. Use environment variables or secure configuration management in production.
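In line with the security note, one way to keep secrets out of version control is to assemble the same five fields from environment variables. The helper and variable names below are illustrative, not part of pygen-spark:

```python
import os

# Hypothetical mapping from config.toml keys to environment variable names
ENV_VARS = {
    "project": "CDF_PROJECT",
    "tenant_id": "CDF_TENANT_ID",
    "cdf_cluster": "CDF_CLUSTER",
    "client_id": "CDF_CLIENT_ID",
    "client_secret": "CDF_CLIENT_SECRET",
}

def cdf_config_from_env() -> dict:
    """Build the [cognite] settings from the environment, failing loudly if any are unset."""
    missing = [var for var in ENV_VARS.values() if var not in os.environ]
    if missing:
        raise KeyError(f"Missing environment variables: {missing}")
    return {key: os.environ[var] for key, var in ENV_VARS.items()}
```

The resulting dict holds the same fields as the `[cognite]` section above and can feed whatever client construction your deployment uses.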
Check that your PySpark version supports UDTFs (3.5+):
```python
import pyspark

# Compare numeric version components; a plain string comparison
# would mis-order versions such as "3.10" vs "3.5"
major, minor = (int(part) for part in pyspark.__version__.split(".")[:2])
print(f"PySpark version: {pyspark.__version__}")
assert (major, minor) >= (3, 5), "PySpark 3.5+ required for UDTF support"
```

Once installation is complete, proceed to Generation to generate UDTF code from your CDF Data Model.