[docs] Add new Python multi-lang quickstart using the SchemaTransform framework #33360
ahmedabu98 wants to merge 2 commits into apache:master
|
Assigning reviewers. If you would like to opt out of this review, comment R: @kennknowles for label website. Available commands:
The PR bot will only process comments in the main thread (not review comments). |
|
Reminder, please take a look at this PR: @kennknowles |
|
Assigning a new set of reviewers because the PR has gone too long without review. If you would like to opt out of this review, comment R: @damccorm for label website. Available commands:
|
| #### 13.2.2. Using cross-language transforms in a Python pipeline | ||
|
|
||
| For Beam versions 2.60.0+, please follow [this guide](sdks/python-custom-multi-language-pipelines-guide.md#use-the-portable-transform-in-a-python-pipeline) instead. | ||
|
|
Does this section actually need this disclaimer? I think consuming schema transforms is basically the same, right? Nothing has changed for this section?
|
|
||
| #### 13.1.1. Creating cross-language Java transforms | ||
|
|
||
| For Beam versions 2.60.0+, please follow [this guide](sdks/python-custom-multi-language-pipelines-guide.md) instead. |
Does this apply to the whole section or just 13.1.1.2? Do we need to recommend away from JavaExternalTransform for cases where it works?
Also, should we update this section to recommend the new way by default (even if it's just linking to the full doc), and just link to the legacy page for <2.60.0 instead of leaving all the content here?
|
|
||
| ## Create a cross-language transform | ||
|
|
||
| Here's a Java transform provider, [ExtractWordsProvider](https://github.com/apache/beam/blob/master/examples/multi-language/src/main/java/org/apache/beam/examples/multilanguage/schematransforms/ExtractWordsProvider.java), that is uniquely identified with the URN `"beam:schematransform:org.apache.beam:extract_words:v1"`. Given a Configuration object, it will provide a transform: |
Could you describe what the URN does? (In this context, it allows the transform to be identified across the language barrier.)
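To make the URN's role concrete, here is a toy sketch of the lookup idea. It is not Beam's implementation: `PROVIDERS`, `register`, and `expand` are hypothetical names, and the word-splitting behaviour is only assumed from the provider's name. The point is that the Python side needs only a string identifier plus a configuration, and the other side resolves that string to a concrete transform.

```python
# Toy illustration only: none of these helper names are Beam APIs.
from typing import Callable, Dict, List

URN = "beam:schematransform:org.apache.beam:extract_words:v1"

# Stand-in for the remote (Java) side: providers registered under their URN.
PROVIDERS: Dict[str, Callable[[dict], Callable[[str], List[str]]]] = {}


def register(urn: str):
    """Register a provider factory under a URN string."""
    def decorator(factory):
        PROVIDERS[urn] = factory
        return factory
    return decorator


@register(URN)
def extract_words_provider(config: dict) -> Callable[[str], List[str]]:
    # Assumed behaviour: split a line into words and drop the configured ones.
    drop = set(config.get("drop", []))

    def transform(line: str) -> List[str]:
        return [word for word in line.split() if word not in drop]

    return transform


def expand(urn: str, config: dict) -> Callable[[str], List[str]]:
    """Stand-in for an expansion request: resolve the provider by URN."""
    if urn not in PROVIDERS:
        raise KeyError(f"No transform registered for URN {urn!r}")
    return PROVIDERS[urn](config)


words = expand(URN, {"drop": ["foo", "bar"]})("foo hello bar world")
```

Because the lookup key is a plain string, the caller never needs to import or reference the Java class itself.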
|
|
||
| Beam uses this configuration to generate a Python transform with the following signature: | ||
| ```python | ||
| Extract(drop=["foo", "bar"]) |
| Extract(drop=["foo", "bar"]) | |
| class Extract(): | |
| def __init__(self, drop: List[str]) |
Saying the existing code snippet is a signature is not quite right. Thoughts on providing the full Python class definition? This might be a bit clearer.
Alternatively, we could change "Beam uses this configuration to generate a Python transform with the following signature:" to "Beam uses this configuration to generate a Python transform which can be instantiated like:".
| Extract(drop=["foo", "bar"]) | ||
| ``` | ||
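A rough sketch of what the fuller class definition suggested above might look like. This is a plain-Python stand-in, assuming the generated wrapper exposes one keyword argument per configuration field. The real generated class would be a Beam transform and is not reproduced here.

```python
import inspect
from typing import List


class Extract:
    """Hypothetical stand-in for the Python wrapper generated from the
    Java configuration. Only the constructor shape is illustrated."""

    def __init__(self, drop: List[str]):
        # One constructor argument per configuration field.
        self.drop = list(drop)


# The signature, as opposed to an instantiation:
signature = inspect.signature(Extract)

# An instantiation, matching the snippet in the doc:
transform = Extract(drop=["foo", "bar"])
```

Showing both forms side by side keeps the distinction between "signature" and "how to instantiate it" explicit.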
|
|
||
| The transform can be any implementation of your choice, as long as it meets the requirements of a [SchemaTransform](../glossary.md#schematransform). For this example, the transform does the following: |
I think we need to similarly describe what a valid configuration is above. I assume not all field types are valid?
|
|
||
| When building a job for a multi-language pipeline, Beam uses an [expansion service](../glossary#expansion-service) to expand [composite transforms](../glossary#composite-transform). You must have at least one expansion service per remote SDK. | ||
|
|
||
| Before running a multi-language pipeline, you need to build an expansion service that can access your Java transform. It’s often easier to create a single shaded JAR that contains both. Both Python and Java dependencies will be staged for the runner by the Python SDK. |
"It’s often easier to create a single shaded JAR that contains both"
I'm not sure what this is saying. Both of what?
It might be nice to include an example command or additional info that shows how you can do this as well
| Then, initialize the `ExternalTransformProvider` with your expansion service. This can take two parameters: | ||
|
|
||
| * `expansion_services`: an expansion service, or list of expansion services | ||
| * `urn_pattern`: (optional) a regex pattern to match valid transforms |
| * `urn_pattern`: (optional) a regex pattern to match valid transforms | |
| * `urn_pattern`: (optional) a regex pattern to match valid transforms. If this is not provided... |
It would be good to add information on what this does and what happens if it is missing.
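For illustration, a regex filter over URNs could behave like the sketch below. The pattern and the `matching_urns` helper are assumptions made for this example and are not taken from `ExternalTransformProvider`. What Beam actually does when `urn_pattern` is omitted should be confirmed against the SDK source before documenting it.

```python
import re
from typing import Iterable, List

# Assumed pattern for illustration only; not necessarily Beam's default.
URN_PATTERN = r"^beam:schematransform:org\.apache\.beam:([\w-]+):(\w+)$"


def matching_urns(urns: Iterable[str], pattern: str = URN_PATTERN) -> List[str]:
    """Return only the URNs that match the given regex pattern."""
    return [urn for urn in urns if re.match(pattern, urn)]


candidates = [
    "beam:schematransform:org.apache.beam:extract_words:v1",
    "beam:transform:some_other_namespace:v1",  # hypothetical non-matching URN
]
selected = matching_urns(candidates)
```

Here only the first URN is selected, because the second does not follow the assumed `beam:schematransform:org.apache.beam:<name>:<version>` shape.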
|
|
||
| ### Run with direct runner | ||
|
|
||
| In the following command, `input1` is a file containing lines of text: |
Probably worth calling out that the expansion service needs to be started first (here and below in the Dataflow section)
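One way the doc could make the ordering explicit is a small readiness check that runs before the pipeline is constructed. The helper below is a standard-library sketch. `wait_for_expansion_service` is a hypothetical name, and the host and port are whatever the expansion service was started with.

```python
import socket
import time


def wait_for_expansion_service(
    host: str, port: int, timeout_s: float = 30.0, poll_s: float = 0.5
) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout expires.

    Returns True if the port accepted a connection, False otherwise.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=poll_s):
                return True
        except OSError:
            # Not up yet (or refused); wait and retry.
            time.sleep(poll_s)
    return False
```

A pipeline script could call this with the expansion service's address and fail fast with a clear message if it returns `False`. Note that it only confirms the port is open, not that the service behind it is healthy.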
|
Reminder, please take a look at this PR: @damccorm |
|
waiting on author |
|
This PR is still listed on the 2.63.0 milestone. Is this a release blocker? |
|
I don't think it should be since the website is versioned independently of the release. @ahmedabu98 I'll remove the blocker, feel free to comment/add it back if I'm wrong |
|
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions. |
|
This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
Part of #33358
Adding a new multi-lang quickstart and marking the old one as "legacy"