Commit 605e8e3

authored
contributing update (NVIDIA#251)
* contributing update
  Signed-off-by: tbartley94 <tbartley@nvidia.com>
* adding edits
  Signed-off-by: tbartley94 <tbartley@nvidia.com>
* spelling
  Signed-off-by: tbartley94 <tbartley@nvidia.com>

Signed-off-by: tbartley94 <tbartley@nvidia.com>
1 parent 3cbdf91 commit 605e8e3

File tree

1 file changed

+32
-1
lines changed

CONTRIBUTING.md

Lines changed: 32 additions & 1 deletion
@@ -1,6 +1,6 @@
11
# Contributions are welcome!
22

3-
We do all of NeMo-text-processing's development in the open. Contributions from the open-source community are welcome.
3+
We do all of NeMo-Text-Processing's development in the open. Contributions from the open-source community are welcome.
44

55

66
# Pull Requests (PR) Guidelines
@@ -22,4 +22,35 @@ We do all of NeMo-text-processing's development in the open. Contributions from
2222
11) Optional: if you added a new language or a new feature please update the [NeMo documentation](https://github.com/NVIDIA/NeMo/blob/main/docs/source/nlp/text_normalization/wfst/wfst_text_normalization.rst) (lives in different repo).
2323
12) Send your PR and request a review
2424

25+
# Notes for Language Contribution
26+
1) `en/graph_utils.py` and `en/utils.py` are the de facto parents for all other `graph_utils` and `utils` functions, respectively. Please do not duplicate code logic in new `graph_utils.py` and `utils.py` files; import from `en/graph_utils.py` and `en/utils.py` instead. `LANG/graph_utils.py` and `LANG/utils.py` files should contain only new methods and variables. Not every new language will need a submodule-specific `graph_utils.py` or `utils.py` file.
2527

28+
2) NeMo-Text-Processing supports building FST graphs through two backends: the Python-based library itself (via the [Pynini](https://www.opengrm.org/twiki/bin/view/GRM/Pynini) backend) and the C++-based [Sparrowhawk](https://github.com/google/sparrowhawk/tree/master) in an upstream repo. Because of the usual tradeoffs between Python and C++ as development [languages](https://www.youtube.com/watch?v=VioxsWYzoJk), the NeMo-Text-Processing library assumes that graphs are developed with the Python library and deployed in Sparrowhawk/C++. This dual-framework approach can cause problems during development, notably when tagging additional properties during tokenization.
29+
30+
When writing taggers for semiotic classes, you may need to tag additional token properties (e.g. grammatical gender, case) for accurate verbalization. For example, the Spanish ordinal `21.º` carries masculine gender and is verbalized with a gender-specific spelling. The TN tagger should therefore include the gender property when it tokenizes the string.
31+
32+
Naively, one may be tempted to simply include the property string `gender: "masc"` and check for this string during the verbalization phase. **This is not advised.** While the NeMo-Text-Processing library itself permits any custom string in the tagger, Sparrowhawk accepts only a fixed set of property strings and fails on custom ones. Because dropping Sparrowhawk support means a performance loss, we cannot integrate new graphs that cause Sparrowhawk to fail. Tagged properties should therefore be limited to Sparrowhawk-supported strings.
33+
34+
For all classes, Sparrowhawk supports the `morphosyntactic_features` property, and it is recommended to default to this property for tagging additional features. For example:
35+
36+
`21.º -> ordinal { integer: "vigésimo primero" morphosyntactic_features: "masc" }`
37+
38+
For additional Sparrowhawk-supported properties by class, see [here](https://github.com/yzhang123/sparrowhawk/blob/test/src/proto/semiotic_classes.proto).
39+
40+
N.B. The same limitation applies to novel semiotic classes. Only predefined classes are supported in Sparrowhawk.
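
The restriction on property strings can be checked before a graph ever reaches Sparrowhawk. The sketch below is illustrative only: `ALLOWED_PROPERTIES` and `unsupported_properties` are hypothetical names, not part of NeMo-Text-Processing or Sparrowhawk, and the allow-list contains only the `ordinal` properties named in this guide. The authoritative list is the linked `semiotic_classes.proto`.

```python
import re

# Assumption: an illustrative allow-list limited to the properties this guide
# mentions for `ordinal`. Consult semiotic_classes.proto for the real set.
ALLOWED_PROPERTIES = {
    "ordinal": {"integer", "morphosyntactic_features"},
}

# Matches `name: "value"` pairs inside a tagged token string.
PROPERTY_NAME = re.compile(r'(\w+):\s*"[^"]*"')


def unsupported_properties(class_name, tagged):
    """Return the property names in `tagged` that are not in the allow-list."""
    if class_name not in ALLOWED_PROPERTIES:
        # Mirrors the N.B. above: novel semiotic classes are not supported.
        raise ValueError(f"semiotic class {class_name!r} is not in the allow-list")
    allowed = ALLOWED_PROPERTIES[class_name]
    return [name for name in PROPERTY_NAME.findall(tagged) if name not in allowed]


good = 'ordinal { integer: "vigésimo primero" morphosyntactic_features: "masc" }'
bad = 'ordinal { integer: "vigésimo primero" gender: "masc" }'

print(unsupported_properties("ordinal", good))  # no unsupported properties
print(unsupported_properties("ordinal", bad))   # reports the custom `gender` property
```

A check of this kind could run in a language's unit tests, so that a custom property such as `gender` is caught in Python rather than at Sparrowhawk deployment time.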
41+
42+
3) Between the tagging and verbalizing stages, both the NeMo-Text-Processing and Sparrowhawk engines permute the order of tagged properties. That is, assuming the tagger parsed `1ᵉʳ juillet` as:
43+
44+
`date { month: "juillet" day: "1" }`
45+
46+
the verbalizer will receive as input both
47+
48+
`date { month: "juillet" day: "1" }`
49+
50+
and
51+
52+
`date { day: "1" month: "juillet" }`
53+
54+
While this eases construction of verbalization graphs, permutation can be computationally expensive. If you know that the tagger output will not require permutation of token properties, you can improve model performance by including the `preserve_order: "true"` property:
55+
56+
`date { day: "1" month: "juillet" preserve_order: "true" }`
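
To see why permutation gets expensive, the sketch below enumerates the orderings a verbalizer might receive for a tagged token. It illustrates the idea only and is not the library's implementation: `property_orderings` and `render` are hypothetical helpers, and the `year` property is added purely to show the growth in orderings.

```python
from itertools import permutations


def property_orderings(properties, preserve_order=False):
    """Return every ordering of (name, value) pairs the verbalizer may see.

    Hypothetical helper for illustration; not part of NeMo-Text-Processing.
    """
    if preserve_order:
        # Only the tagger's original ordering is passed on.
        return [tuple(properties)]
    # Without preserve_order, n properties yield n! orderings.
    return list(permutations(properties))


def render(class_name, ordering, preserve_order=False):
    """Format one ordering in the tagged-token notation used in this guide."""
    body = " ".join(f'{name}: "{value}"' for name, value in ordering)
    if preserve_order:
        body += ' preserve_order: "true"'
    return f"{class_name} {{ {body} }}"


date = [("day", "1"), ("month", "juillet")]

# Two properties give two orderings, matching the two inputs shown above.
for ordering in property_orderings(date):
    print(render("date", ordering))

# Adding a third (assumed) property already triples the count.
print(len(property_orderings(date + [("year", "2024")])))

# With preserve_order, a single ordering is passed through.
print(render("date", property_orderings(date, preserve_order=True)[0], preserve_order=True))
```

Because the number of orderings grows factorially with the number of properties, classes that tag several properties benefit most from `preserve_order: "true"`.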
