From 8fde9f46db9760ee6b5d220706485d595eccbaeb Mon Sep 17 00:00:00 2001
From: tvalentyn
Date: Tue, 25 Nov 2025 16:13:39 -0800
Subject: [PATCH 1/2] Update python-pipeline-dependencies.md

---
 .../sdks/python-pipeline-dependencies.md | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md b/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
index fefc2d12513e..b6ffac92f1da 100644
--- a/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
+++ b/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
@@ -163,22 +163,20 @@ Dataflow, see [Pre-building the python SDK custom container image with extra dep
 ## Pickling and Managing the Main Session
 
 When the Python SDK submits the pipeline for execution to a remote runner, the pipeline contents, such as transform user code, is serialized (or pickled) into a bytecode using
-libraries that perform the serialization (also called picklers). The default pickler library used by Beam is `dill`.
-To use the `cloudpickle` pickler, supply the `--pickle_library=cloudpickle` pipeline option.
+libraries that perform the serialization (also called picklers). On Apache Beam 2.64.0 or earlier, the default pickler library was `dill`.
 
-By default, global imports, functions, and variables defined in the main pipeline module are not saved during the serialization of a Beam job.
+When `dill` pickler is used, global imports, functions, and variables defined in the main pipeline module are not saved during the serialization of a Beam job by default.
 Thus, one might encounter an unexpected `NameError` when running a `DoFn` on any remote runner. To resolve this, supply the main session content with the pipeline by
 setting the `--save_main_session` pipeline option. This will load the pickled state of the global namespace onto the Dataflow workers (if using `DataflowRunner`).
 For example, see [Handling NameErrors](https://cloud.google.com/dataflow/docs/guides/common-errors#name-error) to set the main session on the `DataflowRunner`.
 
-Managing the main session in Python SDK is only necessary when using `dill` pickler on any remote runner. Therefore, this issue will
-not occur in `DirectRunner`.
-
 Since serialization of the pipeline happens on the job submission, and deserialization happens at runtime, it is imperative that the same version of pickling library is used at job submission and at runtime.
-To ensure this, Beam typically sets a very narrow supported version range for pickling libraries. If for whatever reason, users cannot use the version of `dill` or `cloudpickle` required by Beam, and choose to
-install a custom version, they must also ensure that they use the same custom version at runtime (e.g. in their custom container,
+To ensure this, Beam users who use `dill` and choose to install a custom version of dill, must also ensure that they use the same custom version at runtime (e.g. in their custom container,
 or by specifying a pipeline dependency requirement).
 
+The `--save_main_session` pipeline options is not necessary when `cloudpickle` pickler is used, which is the default pickler on Apache Beam 2.65.0 and later versions.
+To use the `cloudpickle` pickler on the earlier Beam versions, supply the `--pickle_library=cloudpickle` pipeline option.
+
 ## Control the dependencies the pipeline uses {#control-dependencies}
 
 ### Pipeline environments

From 1d47b8811dfba3d77402cec2b16ec02879386bca Mon Sep 17 00:00:00 2001
From: tvalentyn
Date: Tue, 25 Nov 2025 16:35:05 -0800
Subject: [PATCH 2/2] Update python-pipeline-dependencies.md

---
 .../en/documentation/sdks/python-pipeline-dependencies.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md b/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
index b6ffac92f1da..b0fd7b639935 100644
--- a/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
+++ b/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
@@ -174,7 +174,7 @@ Since serialization of the pipeline happens on the job submission, and deseriali
 To ensure this, Beam users who use `dill` and choose to install a custom version of dill, must also ensure that they use the same custom version at runtime (e.g. in their custom container,
 or by specifying a pipeline dependency requirement).
 
-The `--save_main_session` pipeline options is not necessary when `cloudpickle` pickler is used, which is the default pickler on Apache Beam 2.65.0 and later versions.
+The `--save_main_session` pipeline options is not necessary when `cloudpickle` pickler is used, which is the default pickler on Apache Beam 2.65.0 and later versions.
 To use the `cloudpickle` pickler on the earlier Beam versions, supply the `--pickle_library=cloudpickle` pipeline option.
 
 ## Control the dependencies the pipeline uses {#control-dependencies}