
Commit 6efda27

Merge branch 'master' of https://github.com/apache/beam into users/damccorm/prismByDefault

2 parents: 83472ce + cd01e34

57 files changed: +2892 −229 lines

.test-infra/mock-apis/poetry.lock

Lines changed: 11 additions & 6 deletions
(Generated file; diff not rendered.)

build.gradle.kts

Lines changed: 1 addition & 1 deletion
@@ -733,7 +733,7 @@ if (project.hasProperty("javaLinkageArtifactIds")) {

   val linkageCheckerJava by configurations.creating
   dependencies {
-    linkageCheckerJava("com.google.cloud.tools:dependencies:1.5.6")
+    linkageCheckerJava("com.google.cloud.tools:dependencies:1.5.15")
   }

   // We need to evaluate all the projects first so that we can find depend on all the

buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy

Lines changed: 2 additions & 1 deletion
@@ -651,6 +651,7 @@ class BeamModulePlugin implements Plugin<Project> {
     def arrow_version = "15.0.2"
     def jmh_version = "1.34"
     def jupiter_version = "5.7.0"
+    def spanner_grpc_proto_version = "6.95.1"

     // Export Spark versions, so they are defined in a single place only
     project.ext.spark3_version = spark3_version
@@ -860,7 +861,7 @@ class BeamModulePlugin implements Plugin<Project> {
         proto_google_cloud_pubsub_v1 : "com.google.api.grpc:proto-google-cloud-pubsub-v1", // google_cloud_platform_libraries_bom sets version
         proto_google_cloud_pubsublite_v1 : "com.google.api.grpc:proto-google-cloud-pubsublite-v1", // google_cloud_platform_libraries_bom sets version
         proto_google_cloud_secret_manager_v1 : "com.google.api.grpc:proto-google-cloud-secretmanager-v1", // google_cloud_platform_libraries_bom sets version
-        proto_google_cloud_spanner_v1 : "com.google.api.grpc:proto-google-cloud-spanner-v1", // google_cloud_platform_libraries_bom sets version
+        proto_google_cloud_spanner_v1 : "com.google.api.grpc:proto-google-cloud-spanner-v1:$spanner_grpc_proto_version", // google_cloud_platform_libraries_bom sets version
         proto_google_cloud_spanner_admin_database_v1: "com.google.api.grpc:proto-google-cloud-spanner-admin-database-v1", // google_cloud_platform_libraries_bom sets version
         proto_google_common_protos : "com.google.api.grpc:proto-google-common-protos", // google_cloud_platform_libraries_bom sets version
         qpid_jms_client : "org.apache.qpid:qpid-jms-client:$qpid_jms_client_version",

learning/tour-of-beam/learning-content/introduction/introduction-concepts/pipeline-concepts/overview-pipeline/description.md

Lines changed: 11 additions & 21 deletions
@@ -17,29 +17,19 @@ limitations under the License.
 To use Beam, you first need to first create a driver program using the classes in one of the Beam SDKs. Your driver program defines your pipeline, including all of the inputs, transforms, and outputs. It also sets execution options for your pipeline (typically passed by using command-line options). These include the Pipeline Runner, which, in turn, determines what back-end your pipeline will run on.

 The Beam SDKs provide several abstractions that simplify the mechanics of large-scale distributed data processing. The same Beam abstractions work with both batch and streaming data sources. When you create your Beam pipeline, you can think about your data processing task in terms of these abstractions. They include:
-
-`Pipeline`: A Pipeline encapsulates your entire data processing task, from start to finish. This includes reading input data, transforming that data, and writing output data. All Beam driver programs must create a Pipeline. When you create the Pipeline, you must also specify the execution options that tell the Pipeline where and how to run.
-
-`PCollection`: A PCollection represents a distributed data set that your Beam pipeline operates on. The data set can be bounded, meaning it comes from a fixed source like a file, or unbounded, meaning it comes from a continuously updating source via a subscription or other mechanism. Your pipeline typically creates an initial PCollection by reading data from an external data source, but you can also create a PCollection from in-memory data within your driver program. From there, PCollections are the inputs and outputs for each step in your pipeline.
-
-`PTransform`: A PTransform represents a data processing operation, or a step, in your pipeline. Every PTransform takes zero or more PCollection objects as the input, performs a processing function that you provide on the elements of that PCollection, and then produces zero or more output PCollection objects.
+* `Pipeline`: A Pipeline encapsulates your entire data processing task, from start to finish. This includes reading input data, transforming that data, and writing output data. All Beam driver programs must create a Pipeline. When you create the Pipeline, you must also specify the execution options that tell the Pipeline where and how to run.
+* `PCollection`: A PCollection represents a distributed data set that your Beam pipeline operates on. The data set can be bounded, meaning it comes from a fixed source like a file, or unbounded, meaning it comes from a continuously updating source via a subscription or other mechanism. Your pipeline typically creates an initial PCollection by reading data from an external data source, but you can also create a PCollection from in-memory data within your driver program. From there, PCollections are the inputs and outputs for each step in your pipeline.
+* `PTransform`: A PTransform represents a data processing operation, or a step, in your pipeline. Every PTransform takes zero or more PCollection objects as the input, performs a processing function that you provide on the elements of that PCollection, and then produces zero or more output PCollection objects.
 {{if (eq .Sdk "go")}}
-
-`Scope`: The Go SDK has an explicit scope variable used to build a `Pipeline`. A Pipeline can return it’s root scope with the `Root()` method. The scope variable is then passed to `PTransform` functions that place them in the `Pipeline` that owns the `Scope`.
+* `Scope`: The Go SDK has an explicit scope variable used to build a `Pipeline`. A Pipeline can return it’s root scope with the `Root()` method. The scope variable is then passed to `PTransform` functions that place them in the `Pipeline` that owns the `Scope`.
 {{end}}
-
-`I/O transforms`: Beam comes with a number of “IOs” - library PTransforms that read or write data to various external storage systems.
+* `I/O transforms`: Beam comes with a number of “IOs” - library PTransforms that read or write data to various external storage systems.

 A typical Beam driver program works as follows:
+* Create a Pipeline object and set the pipeline execution options, including the Pipeline Runner.
+* Create an initial `PCollection` for pipeline data, either using the IOs to read data from an external storage system, or using a Create transform to build a `PCollection` from in-memory data.
+* Apply `PTransforms` to each `PCollection`. Transforms can change, filter, group, analyze, or otherwise process the elements in a PCollection. A transform creates a new output PCollection without modifying the input collection. A typical pipeline applies subsequent transforms to each new output PCollection in turn until the processing is complete. However, note that a pipeline does not have to be a single straight line of transforms applied one after another: think of PCollections as variables and PTransforms as functions applied to these variables, so the shape of the pipeline can be an arbitrarily complex processing graph.
+* Use IOs to write the final, transformed PCollection(s) to an external source.
+* Run the pipeline using the designated Pipeline Runner.

-→ Create a Pipeline object and set the pipeline execution options, including the Pipeline Runner.
-
-→ Create an initial `PCollection` for pipeline data, either using the IOs to read data from an external storage system, or using a Create transform to build a `PCollection` from in-memory data.
-
-→ Apply `PTransforms` to each `PCollection`. Transforms can change, filter, group, analyze, or otherwise process the elements in a PCollection. A transform creates a new output PCollection without modifying the input collection. A typical pipeline applies subsequent transforms to each new output PCollection in turn until the processing is complete. However, note that a pipeline does not have to be a single straight line of transforms applied one after another: think of PCollections as variables and PTransforms as functions applied to these variables: the shape of the pipeline can be an arbitrarily complex processing graph.
-
-→ Use IOs to write the final, transformed PCollection(s) to an external source.
-
-→ Run the pipeline using the designated Pipeline Runner.
-
-When you run your Beam driver program, the Pipeline Runner that you designate constructs a workflow graph of your pipeline based on the PCollection objects you’ve created and the transforms that you’ve applied. That graph is then executed using the appropriate distributed processing back-end, becoming an asynchronous “job” (or equivalent) on that back-end.
+When you run your Beam driver program, the Pipeline Runner that you designate constructs a workflow graph of your pipeline based on the PCollection objects you’ve created and the transforms that you’ve applied. That graph is then executed using the appropriate distributed processing back-end, becoming an asynchronous “job” (or equivalent) on that back-end.
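As context for the driver-program steps listed in the updated text (and not something introduced by this commit), a minimal Java driver following them might look like the sketch below. It assumes the standard Beam SDK classes; the class name MinimalDriver and the output path are illustrative.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MinimalDriver {
  public static void main(String[] args) {
    // 1. Create a Pipeline object and set its execution options (including the runner) from the command line.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline pipeline = Pipeline.create(options);

    // 2. Create an initial PCollection by reading from an external storage system.
    PCollection<String> lines =
        pipeline.apply(TextIO.read().from("gs://apache-beam-samples/shakespeare/kinglear.txt"));

    // 3. Apply PTransforms; each transform produces a new PCollection and leaves its input unchanged.
    PCollection<String> upper =
        lines.apply(MapElements.into(TypeDescriptors.strings()).via((String line) -> line.toUpperCase()));

    // 4. Write the final PCollection back out with an IO.
    upper.apply(TextIO.write().to("/tmp/beam-output"));

    // 5. Run the pipeline on the designated runner; it executes as an asynchronous job on that back-end.
    pipeline.run().waitUntilFinish();
  }
}
```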

learning/tour-of-beam/learning-content/introduction/introduction-concepts/pipeline-concepts/setting-pipeline/java-example/Task.java

Lines changed: 2 additions & 2 deletions
@@ -41,7 +41,7 @@ public class Task {
   private static final Logger LOG = LoggerFactory.getLogger(Task.class);

   public interface MyOptions extends PipelineOptions {
-    // Default value if [--output] equal null
+    // Default value if [--inputFile] equal null
     @Description("Path of the file to read from")
     @Default.String("gs://apache-beam-samples/shakespeare/kinglear.txt")
     String getInputFile();
@@ -88,4 +88,4 @@ public void processElement(ProcessContext c) throws Exception {
       LOG.info(prefix + ": {}", c.element());
     }
   }
-}
+}
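For readers skimming this file: the getter shown above is exposed by Beam as a --inputFile command-line flag, and the @Default.String value is used when the flag is absent. The sketch below is illustrative only (not part of the commit) and assumes MyOptions is the nested interface shown in the diff.

```java
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class OptionsExample {
  public static void main(String[] args) {
    // Register the interface so validation and --help output know about it (optional but typical).
    PipelineOptionsFactory.register(Task.MyOptions.class);

    // Parse command-line flags into the options interface.
    Task.MyOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(Task.MyOptions.class);

    // Prints the @Default.String path unless --inputFile=... was passed on the command line.
    System.out.println(options.getInputFile());
  }
}
```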

learning/tour-of-beam/learning-content/introduction/introduction-concepts/runner-concepts/description.md

Lines changed: 3 additions & 3 deletions
@@ -293,11 +293,11 @@ $ wordcount --input gs://dataflow-samples/shakespeare/kinglear.txt \

 {{if (eq .Sdk "java")}}

-##### Non portable
+##### Portable
 1. Start the JobService endpoint:
    * with Docker (preferred): docker run --net=host apache/beam_spark_job_server:latest
    * or from Beam source code: ./gradlew :runners:spark:3:job-server:runShadow
-2. Submit the Python pipeline to the above endpoint by using the PortableRunner, job_endpoint set to localhost:8099 (this is the default address of the JobService), and environment_type set to LOOPBACK. For example:
+2. Submit the pipeline to the above endpoint by using the PortableRunner, job_endpoint set to localhost:8099 (this is the default address of the JobService), and environment_type set to LOOPBACK. For example:

 Console:
 ```
@@ -307,7 +307,7 @@ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \

 ##### Non Portable

-When using Java, you must specify your dependency on the Cloud Dataflow Runner in your `pom.xml`.
+When using Java, you must specify your dependency on the Apache Spark Runner in your `pom.xml`.

 ```
 <dependency>
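As a rough illustration of step 2 in the "Portable" section above (not taken from this commit), a Java pipeline can also point itself at the JobService programmatically. This assumes the standard PortablePipelineOptions and PortableRunner classes; the class name PortableSubmit is made up.

```java
import org.apache.beam.runners.portability.PortableRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.PortablePipelineOptions;

public class PortableSubmit {
  public static void main(String[] args) {
    PortablePipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(PortablePipelineOptions.class);
    options.setRunner(PortableRunner.class);
    options.setJobEndpoint("localhost:8099");       // default address of the JobService
    options.setDefaultEnvironmentType("LOOPBACK");  // run the SDK harness back in this process

    Pipeline pipeline = Pipeline.create(options);
    // ... build the pipeline here ...
    pipeline.run().waitUntilFinish();
  }
}
```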

runners/google-cloud-dataflow-java/worker/build.gradle

Lines changed: 1 addition & 1 deletion
@@ -90,7 +90,7 @@ applyJavaNature(
     "org/slf4j/jul/**"
   ],
   generatedClassPatterns: [
-    /^org\.apache\.beam\.runners\.dataflow\.worker\.windmill.*/,
+    /^org\.apache\.beam\.runners\.dataflow\.worker\.windmill\..*AutoBuilder.*/,
     /^org\.apache\.beam\.runners\.dataflow\.worker\..*AutoBuilder.*/,
   ],
   shadowClosure: {

runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/DataflowExecutionStateRegistry.java

Lines changed: 1 addition & 1 deletion
@@ -51,7 +51,7 @@ public abstract class DataflowExecutionStateRegistry {
   public DataflowOperationContext.DataflowExecutionState getState(
       final NameContext nameContext,
       final String stateName,
-      final MetricsContainer container,
+      final @Nullable MetricsContainer container,
       final ProfileScope profileScope) {
     return getStateInternal(nameContext, stateName, null, null, container, profileScope);
   }

runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/KeyTokenInvalidException.java

Lines changed: 3 additions & 4 deletions
@@ -17,17 +17,16 @@
  */
 package org.apache.beam.runners.dataflow.worker;

+import javax.annotation.Nullable;
+
 /** Indicates that the key token was invalid when data was attempted to be fetched. */
-@SuppressWarnings({
-  "nullness" // TODO(https://github.com/apache/beam/issues/20497)
-})
 public class KeyTokenInvalidException extends RuntimeException {
   public KeyTokenInvalidException(String key) {
     super("Unable to fetch data due to token mismatch for key " + key);
   }

   /** Returns whether an exception was caused by a {@link KeyTokenInvalidException}. */
-  public static boolean isKeyTokenInvalidException(Throwable t) {
+  public static boolean isKeyTokenInvalidException(@Nullable Throwable t) {
     while (t != null) {
       if (t instanceof KeyTokenInvalidException) {
         return true;
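The helper above walks the cause chain, so a wrapped failure is still detected, and the new @Nullable parameter makes it safe to pass a possibly-null getCause() result. A small illustrative usage (not part of the commit):

```java
import org.apache.beam.runners.dataflow.worker.KeyTokenInvalidException;

public class KeyTokenCheckExample {
  public static void main(String[] args) {
    // The invalid-token failure is one level down the cause chain, but it is still found.
    Throwable wrapped = new RuntimeException(new KeyTokenInvalidException("key-123"));
    System.out.println(KeyTokenInvalidException.isKeyTokenInvalidException(wrapped)); // true

    // A null argument simply fails the while (t != null) check instead of throwing a NullPointerException.
    System.out.println(KeyTokenInvalidException.isKeyTokenInvalidException(null)); // false
  }
}
```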

runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/logging/DataflowWorkerLoggingMDC.java

Lines changed: 4 additions & 2 deletions
@@ -17,6 +17,8 @@
  */
 package org.apache.beam.runners.dataflow.worker.logging;

+import javax.annotation.Nullable;
+
 /** Mapped diagnostic context for the Dataflow worker. */
 @SuppressWarnings({
   "nullness" // TODO(https://github.com/apache/beam/issues/20497)
@@ -34,7 +36,7 @@ public static void setJobId(String newJobId) {
   }

   /** Sets the Stage Name of the current thread, which will be inherited by child threads. */
-  public static void setStageName(String newStageName) {
+  public static void setStageName(@Nullable String newStageName) {
     stageName.set(newStageName);
   }

@@ -44,7 +46,7 @@ public static void setWorkerId(String newWorkerId) {
   }

   /** Sets the Work ID of the current thread, which will be inherited by child threads. */
-  public static void setWorkId(String newWorkId) {
+  public static void setWorkId(@Nullable String newWorkId) {
     workId.set(newWorkId);
   }
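The setters above store values in thread-local state that child threads inherit, and accepting @Nullable lets callers clear an entry by setting it to null. Below is a minimal sketch of that pattern, not the actual Beam implementation; the class and field names are illustrative, and it assumes the JSR-305 @Nullable annotation is on the classpath.

```java
import javax.annotation.Nullable;

public class MdcSketch {
  // Values set on a parent thread are inherited by threads created afterwards.
  private static final InheritableThreadLocal<String> stageName = new InheritableThreadLocal<>();

  public static void setStageName(@Nullable String newStageName) {
    stageName.set(newStageName);
  }

  public static void main(String[] args) throws InterruptedException {
    setStageName("F27-stage");
    Thread child = new Thread(() -> System.out.println("child sees: " + stageName.get()));
    child.start(); // prints "child sees: F27-stage"
    child.join();

    setStageName(null); // clearing the entry is why the setter accepts a @Nullable argument
  }
}
```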
