
Commit 3208d1e

Update 20200824-e2e-text-preprocessing.md
more formatting, add missing authors
1 parent 9a1e66d commit 3208d1e

File tree

1 file changed (+5, −5 lines changed)


rfcs/20200824-e2e-text-preprocessing.md

Lines changed: 5 additions & 5 deletions
@@ -5,7 +5,7 @@ This RFC will be open for comment until Friday, September 4th, 2020.
 | Status | (Proposed) |
 :-------------- |:----------------------------------------------------------------------|
 | **RFC #** | [NNN](https://github.com/tensorflow/community/pull/NNN) (update when you have community PR #)|
-| **Author(s)** | Terry Huang (Google) |
+| **Author(s)** | Terry Huang (Google), Arno Eigenwillig (Google), Chen Chen (Google) |
 | **Sponsor** | Xiaodan Song(Google), Greg Billock (Google), Mark Omernick (Google) |
 | **Updated** | 2020-08-24 |


@@ -27,7 +27,7 @@ Additionally, many existing Python methods write out processed outputs to files
 The proposed new set of text preprocessing APIs will allow users to:
 - **Assemble TF input pipelines w/ reusable, well-tested, standard building blocks** that transform their text datasets into model inputs. Being part of the TF graph also enables users to make preprocessing choices dynamically on the fly.
 - **Drastically simplify their model’s inputs to just text.** Users will be able to easily expand to new datasets for training, evaluation or inference. Models deployed to TF Serving can start from text inputs and encapsulate the details of preprocessing.
-- **Reduce risks of training/serving skew** by giving models stronger ownership of the entire preprocessing and postprocessing process.
+- **Reduce risks of training/serving skew** by giving models stronger ownership of the entire preprocessing process.
 - **Reduced complexity and improved input pipeline efficiency** by removing an extra read & write step to transform their datasets and improved efficiency w/ vectorized mapping by processing inputs in batches.


@@ -151,8 +151,8 @@ def bert_pretrain_preprocess(vocab_lookup_table, features):
 }
 ```
 
-The output of the tf.data pipeline is integer inputs transformed from the raw text and can be fed directly to the model (e.g., bert_pretraining model in model_garden):
-
+The outputs of the tf.data pipeline are integer inputs transformed from the raw text and can be fed directly to the model:
+
 ```
 {
 'input_ids': [
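The hunk above ends mid-snippet, but the surrounding text is clear: the pipeline emits a dict of integer features keyed by names like 'input_ids'. A hedged sketch of consuming such a dict with a Keras model follows; only the 'input_ids' key comes from the document, while the layer choices and vocabulary size are assumptions for illustration:

```python
# Sketch: a toy Keras model consuming dict-valued integer features.
import tensorflow as tf

inputs = tf.keras.Input(shape=(None,), dtype=tf.int32, name="input_ids")
x = tf.keras.layers.Embedding(input_dim=30522, output_dim=128)(inputs)  # assumed vocab size
x = tf.keras.layers.GlobalAveragePooling1D()(x)
outputs = tf.keras.layers.Dense(2)(x)
model = tf.keras.Model(inputs, outputs)

# Because the input layer is named 'input_ids', batches shaped like
# {'input_ids': <int tensor>} from the text pipeline feed it directly.
logits = model({"input_ids": tf.constant([[101, 7592, 102]])})
```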
@@ -231,7 +231,7 @@ class SplitterWithOffsets(Splitter):
 """
 ```
 
-Splitter subclasses can implement different algorithms for segmenting strings and can even be a trained TF model. We also introduce two concrete implementations of Splitter: RegexSplitter and StateBasedSentenceBreaker).
+Splitter subclasses can implement different algorithms for segmenting strings and can even be a trained TF model. We also introduce two concrete implementations of Splitter: `RegexSplitter` and `StateBasedSentenceBreaker`).
 
 
 #### RegexSplitter
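Since the hunk introduces `RegexSplitter` by name, a small usage sketch may help. It assumes the signature from released tensorflow_text versions, where `split_regex` is the delimiter pattern:

```python
# Sketch of the Splitter API referenced above, assuming the
# tensorflow_text package that ships RegexSplitter.
import tensorflow_text as tf_text

# Split on runs of whitespace instead of the default newline pattern.
splitter = tf_text.RegexSplitter(split_regex=r"\s+")
pieces = splitter.split(["Hello world.", "One two three"])
# pieces is a RaggedTensor:
# [[b'Hello', b'world.'], [b'One', b'two', b'three']]
```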
