-
Notifications
You must be signed in to change notification settings - Fork 4.5k
Added documentation for return vs yield usage in DoFn.process() #34912
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 5 commits
068d6b5
380410d
401978e
9703fd4
271c840
6bb018e
d11a58e
2ac2fb3
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1184,6 +1184,101 @@ func init() { | |
|
|
||
| </span> | ||
|
|
||
| {{< paragraph class="language-python">}} | ||
| Proper Use of return vs yield in Python Functions. | ||
YashaswiniTB marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
| {{< /paragraph >}} | ||
|
|
||
| {{< highlight python >}} | ||
| # Returning a single string instead of a sequence | ||
YashaswiniTB marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
YashaswiniTB marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
| class ReturnIndividualElement(beam.DoFn): | ||
| def process(self, element): | ||
| return element | ||
YashaswiniTB marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
|
||
| with beam.Pipeline() as pipeline: | ||
| ( | ||
| pipeline | ||
| | "CreateExamples" >> beam.Create(["foo"]) | ||
| | "MapIncorrect" >> beam.ParDo(ReturnIndividualElement()) | ||
| | "Print" >> beam.Map(print) | ||
| ) | ||
| # prints: | ||
| # f | ||
| # o | ||
| # o | ||
| {{< /highlight >}} | ||
|
|
||
| <span class="language-python"> | ||
|
|
||
| > **Returning a single element (e.g., `return element`) is incorrect** | ||
|
||
| > The `process` method in Beam must return an *iterable* of elements. Returning a single value like an integer or string | ||
| > (e.g., `return element`) leads to a runtime error (`TypeError: 'int' object is not iterable`) or incorrect results since the return value | ||
| > will be treated as an iterable. Always ensure your return type is iterable. | ||
|
|
||
| </span> | ||
|
|
||
| {{< highlight python >}} | ||
| # Returning a list of strings | ||
| class ReturnWordsFn(beam.DoFn): | ||
| def process(self, element): | ||
| # Split the sentence and return all words longer than 2 characters as a list | ||
| return [word for word in element.split() if len(word) > 2] | ||
|
|
||
| with beam.Pipeline() as pipeline: | ||
| ( | ||
| pipeline | ||
| | "CreateSentences_Return" >> beam.Create([ # Create a collection of sentences | ||
| "Apache Beam is powerful", # Sentence 1 | ||
| "Try it now" # Sentence 2 | ||
| ]) | ||
| | "SplitWithReturn" >> beam.ParDo(ReturnWordsFn()) # Apply the custom DoFn to split words | ||
| | "PrintWords_Return" >> beam.Map(print) # Print each List of words | ||
| ) | ||
| # prints: | ||
| # ['Apache', 'Beam', 'powerful'] | ||
| # ['Try', 'now'] | ||
damccorm marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
| {{< /highlight >}} | ||
|
|
||
| <span class="language-python"> | ||
|
|
||
| > **Returning a list (e.g., `return [element1, element2]`) is valid because List is Iterable** | ||
| > This approach works well when emitting multiple outputs from a single call and is easy to read for small datasets. | ||
|
|
||
| </span> | ||
|
|
||
| {{< highlight python >}} | ||
| # Yielding each line one at a time | ||
| class YieldWordsFn(beam.DoFn): | ||
| def process(self, element): | ||
| # Splitting the sentence and yielding words that have more than 2 characters | ||
| for word in element.split(): | ||
| if len(word) > 2: | ||
| yield word | ||
|
|
||
| with beam.Pipeline() as pipeline: | ||
| ( | ||
| pipeline | ||
| | "CreateSentences_Yield" >> beam.Create([ # Create a collection of sentences | ||
| "Apache Beam is powerful", # Sentence 1 | ||
| "Try it now" # Sentence 2 | ||
| ]) | ||
| | "SplitWithYield" >> beam.ParDo(YieldWordsFn()) # Apply the custom DoFn to split words | ||
| | "PrintWords_Yield" >> beam.Map(print) # Print each word | ||
| ) | ||
| # prints: | ||
| # Apache | ||
| # Beam | ||
| # powerful | ||
| # Try | ||
| # now | ||
| {{< /highlight >}} | ||
|
|
||
| <span class="language-python"> | ||
|
|
||
| > **Using `yield` (e.g., `yield element`) is also valid** | ||
| > This approach can be useful for generating multiple outputs more flexibly, especially in cases where conditional logic or loops are involved. | ||
|
|
||
| </span> | ||
|
|
||
| A given `DoFn` instance generally gets invoked one or more times to process some | ||
| arbitrary bundle of elements. However, Beam doesn't guarantee an exact number of | ||
| invocations; it may be invoked multiple times on a given worker node to account | ||
|
|
||
Uh oh!
There was an error while loading. Please reload this page.