Commit 3863b50

Add comments and update README.md
1 parent a635995 commit 3863b50

File tree

2 files changed (+36, -4 lines)

sdks/python/apache_beam/yaml/examples/transforms/ml/inference/README.md

Lines changed: 5 additions & 4 deletions

@@ -52,8 +52,8 @@ A hosted model on Vertex AI is needed before being able to use
 the Vertex AI model handler. One of the current state-of-the-art
 NLP models is HuggingFace's DistilBERT, a distilled version of
 BERT model and is faster at inference. To deploy DistilBERT on
-Vertex AI, use the [notebook](
-https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/model_garden/model_garden_huggingface_pytorch_inference_deployment.ipynb).
+Vertex AI, run this [notebook](
+https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/model_garden/model_garden_huggingface_pytorch_inference_deployment.ipynb) in Colab Enterprise.
 
 BigQuery is the pipeline's sink for the inference result output.
 A BigQuery dataset needs to exist first before the pipeline can
@@ -69,14 +69,15 @@ remote inference with the Vertex AI model handler and DistilBERT
 deployed to a Vertex AI endpoint. The inference result is then
 parsed and written to a BigQuery table.
 
-Run the pipeline (remove the necessary variables in the command):
+Run the pipeline (replace with appropriate variables in the command
+below):
 
 ```sh
 export PROJECT="$(gcloud config get-value project)"
 export TEMP_LOCATION="gs://YOUR-BUCKET/tmp"
 export REGION="us-central1"
 export JOB_NAME="streaming-sentiment-analysis-`date +%Y%m%d-%H%M%S`"
-export NUM_WORKERS="1"
+export NUM_WORKERS="3"
 
 python -m apache_beam.yaml.main \
   --yaml_pipeline_file transforms/ml/inference/streaming_sentiment_analysis.yaml \

sdks/python/apache_beam/yaml/examples/transforms/ml/inference/streaming_sentiment_analysis.yaml

Lines changed: 31 additions & 0 deletions
@@ -25,6 +25,12 @@
 
 pipeline:
   transforms:
+    # The YouTube comments dataset contains rows that
+    # have an unexpected schema (e.g. rows with more
+    # fields, or fields that contain a string instead
+    # of an integer). PyTransform helps construct the
+    # logic to properly read in the csv dataset as a
+    # schema'd PCollection.
     - type: PyTransform
       name: ReadFromGCS
       input: {}
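The comment above describes rows with extra fields or string-typed values where integers are expected. A minimal pure-Python sketch of that per-row cleanup follows; the actual PyTransform body is not shown in this hunk, so the helper name and the `video_id` field are hypothetical (the diff only confirms `comment_text`, `likes`, and `replies`):

```python
import csv
import io

EXPECTED_FIELDS = 4  # hypothetical: video_id, comment_text, likes, replies

def parse_row(line):
    """Parse one csv line; return None for rows with an unexpected schema."""
    fields = next(csv.reader(io.StringIO(line)))
    if len(fields) != EXPECTED_FIELDS:
        return None  # row has more (or fewer) fields than expected
    video_id, comment_text, likes, replies = fields
    try:
        # some rows carry strings where integers are expected
        return {"video_id": video_id, "comment_text": comment_text,
                "likes": int(likes), "replies": int(replies)}
    except ValueError:
        return None
```

Rows that fail either check are dropped rather than crashing the read, which is the behavior the comment's "properly read in the csv dataset" implies.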
@@ -56,6 +62,8 @@ pipeline:
           )
       file_pattern: "{{ GCS_PATH }}"
 
+    # Send the rows as Kafka records to an existing
+    # Kafka topic.
     - type: WriteToKafka
       name: SendRecordsToKafka
       input: ReadFromGCS
@@ -70,6 +78,7 @@ pipeline:
         security.protocol: "SASL_PLAINTEXT"
         sasl.mechanism: "PLAIN"
 
+    # Read Kafka records from an existing Kafka topic.
     - type: ReadFromKafka
       name: ReadFromMyTopic
       config:
@@ -94,6 +103,9 @@ pipeline:
         security.protocol: "SASL_PLAINTEXT"
         sasl.mechanism: "PLAIN"
 
+    # Remove unexpected characters from the YouTube
+    # comment string, e.g. emojis and ASCII characters
+    # outside everyday English text.
     - type: MapToFields
       name: RemoveWeirdCharacters
       input: ReadFromMyTopic
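The mapping expression behind RemoveWeirdCharacters is not shown in this hunk; a minimal sketch of the cleanup the comment describes, assuming it keeps only printable ASCII, is:

```python
import re

def remove_weird_characters(comment_text):
    """Drop emojis and other non-ASCII characters, keeping plain English text."""
    # \x20-\x7E is the printable ASCII range; everything else (emojis,
    # control characters, non-Latin scripts) is removed.
    return re.sub(r"[^\x20-\x7E]", "", comment_text)
```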
@@ -112,6 +124,8 @@ pipeline:
         likes: likes
         replies: replies
 
+    # Remove rows that have empty comment text
+    # after previously removing unexpected characters.
     - type: Filter
       name: FilterForProperComments
       input: RemoveWeirdCharacters
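The Filter transform keeps only elements for which its Python callable returns True. The same predicate can be exercised in plain Python, with `SimpleNamespace` standing in for a schema'd Beam Row:

```python
from types import SimpleNamespace

# Predicate from the pipeline's Filter transform (named `filter` in the YAML).
def keep_comment(row):
    return len(row.comment_text) > 0

# Stand-in rows; the real elements are Beam Rows with a comment_text field.
rows = [SimpleNamespace(comment_text="nice video"), SimpleNamespace(comment_text="")]
kept = [row for row in rows if keep_comment(row)]
```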
@@ -122,6 +136,12 @@ pipeline:
           def filter(row):
             return len(row.comment_text) > 0
 
+    # HuggingFace's distilbert-base-uncased is used for inference;
+    # it accepts strings with a maximum limit of 250 tokens.
+    # Some of the comment strings can be large and are well over
+    # this limit after tokenization.
+    # This transform truncates the comment string and ensures
+    # every comment satisfies the maximum token limit.
     - type: MapToFields
       name: Truncating
       input: FilterForProperComments
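The actual truncation expression is not shown in this hunk, and the 250-token limit applies to WordPiece tokens; as a rough illustration, a whitespace-token sketch (a hypothetical helper, not the pipeline's real expression) looks like:

```python
MAX_TOKENS = 250  # limit quoted in the pipeline comment

def truncate(comment_text, max_tokens=MAX_TOKENS):
    """Rough sketch: cap the comment at max_tokens whitespace-split tokens.
    A production version would count WordPiece tokens with the model's
    tokenizer instead, since one word can expand to several subword tokens."""
    tokens = comment_text.split()
    return " ".join(tokens[:max_tokens])
```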
@@ -149,6 +169,10 @@ pipeline:
         likes: likes
         replies: replies
 
+    # HuggingFace's distilbert-base-uncased does not distinguish
+    # between 'english' and 'English'.
+    # This pipeline does the same by converting all words
+    # to lowercase.
     - type: MapToFields
       name: LowerCase
       input: Truncating
@@ -160,6 +184,10 @@ pipeline:
         likes: likes
         replies: replies
 
+    # With the VertexAIModelHandlerJSON model handler,
+    # the RunInference transform performs remote inference by
+    # sending POST requests to the Vertex AI endpoint that
+    # our distilbert-base-uncased model is deployed to.
     - type: RunInference
       name: DistilBERTRemoteInference
       input: LowerCase
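A Vertex AI endpoint's `:predict` method accepts a JSON body of the form `{"instances": [...]}`; a minimal sketch of the request the handler would build, mirroring the pipeline's `callable: 'lambda x: x.comment_text'` preprocess step (dict access stands in for Beam Row attribute access here):

```python
# Extract the text field before sending, as the pipeline's preprocess does.
def preprocess(element):
    return element["comment_text"]

def build_predict_request(batch):
    """Assemble the JSON body for a Vertex AI :predict call."""
    return {"instances": [preprocess(element) for element in batch]}
```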
@@ -174,6 +202,7 @@ pipeline:
       preprocess:
         callable: 'lambda x: x.comment_text'
 
+    # Parse the inference results output.
     - type: MapToFields
       name: FormatInferenceOutput
       input: DistilBERTRemoteInference
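RunInference emits pairs of the input example and the model's inference (Beam's `PredictionResult`). A sketch of the parsing step follows; the sentiment payload shape (`{"label": ...}`) is an assumption, since the real schema depends on the deployed DistilBERT serving container:

```python
from collections import namedtuple

# Stand-in for Beam's PredictionResult: (input example, model inference).
PredictionResult = namedtuple("PredictionResult", ["example", "inference"])

def format_output(result):
    """Flatten a prediction result into a row ready for BigQuery."""
    return {"comment_text": result.example["comment_text"],
            "sentiment": result.inference["label"],  # assumed payload shape
            "likes": result.example["likes"],
            "replies": result.example["replies"]}
```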
@@ -199,6 +228,7 @@ pipeline:
           expression: replies
           output_type: integer
 
+    # Assign windows to each element of the unbounded PCollection.
     - type: WindowInto
       name: Windowing
       input: FormatInferenceOutput
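Fixed windows of `size: 30s` assign each element to the half-open 30-second interval containing its event timestamp; the assignment arithmetic can be sketched as:

```python
WINDOW_SIZE = 30  # seconds, matching `size: 30s` in the pipeline

def assign_fixed_window(timestamp):
    """Return the [start, end) fixed window containing the timestamp."""
    start = timestamp - (timestamp % WINDOW_SIZE)
    return (start, start + WINDOW_SIZE)
```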
@@ -207,6 +237,7 @@ pipeline:
         type: fixed
         size: 30s
 
+    # Write all inference results to a BigQuery table.
     - type: WriteToBigQuery
       name: WriteInferenceResultsToBQ
       input: Windowing

Comments (0)