CONTRIBUTING.md (32 additions & 1 deletion)
@@ -1,6 +1,6 @@
# Contributions are welcome!
-
We do all of NeMo-text-processing's development in the open. Contributions from the open-source community are welcome.
+
We do all of NeMo-Text-Processing's development in the open. Contributions from the open-source community are welcome.
# Pull Requests (PR) Guidelines
@@ -22,4 +22,35 @@ We do all of NeMo-text-processing's development in the open. Contributions from
11) Optional: if you added a new language or a new feature, please update the [NeMo documentation](https://github.com/NVIDIA/NeMo/blob/main/docs/source/nlp/text_normalization/wfst/wfst_text_normalization.rst) (it lives in a different repo).
12) Send your PR and request a review
+
# Notes for Language Contribution
+
1) `en/graph_utils.py` and `en/utils.py` are the de facto parents of all other `graph_utils` and `utils` modules, respectively. Please do not duplicate their logic in new `graph_utils.py` and `utils.py` files; import from `en/graph_utils.py` and `en/utils.py` instead. `LANG/graph_utils.py` and `LANG/utils.py` files should contain only new methods and variables. Not every new language will require a submodule-specific `graph_utils.py` or `utils.py` file.
+
2) NeMo-Text-Processing allows creation of FST graphs through two backends: the Python-based library itself (via the [Pynini](https://www.opengrm.org/twiki/bin/view/GRM/Pynini) backend) and the C++-based [Sparrowhawk](https://github.com/google/sparrowhawk/tree/master) in an upstream repo. Given the typical tradeoffs between Python and C++ as development [languages](https://www.youtube.com/watch?v=VioxsWYzoJk), the NeMo-Text-Processing library assumes that development is done with the Python library and that final deployment happens in Sparrowhawk/C++. This dual-framework approach can cause issues during development, notably when tagging additional properties during tokenization.
+
+
When writing taggers for semiotic classes, you may need to tag additional token properties (e.g. grammatical gender, case) for accurate verbalization. For example, the Spanish ordinal `21.º` carries masculine gender and is verbalized with a gender-specific spelling. Ideally, the TN tagger would therefore include the gender property when it tokenizes the string.
+
+
Naively, one may be tempted to simply include the property string `gender: "masc"` and check for it during the verbalization phase. **This is not advised.** While the NeMo-Text-Processing library itself permits any custom string in the tagger, Sparrowhawk restricts the permissible strings and will fail on custom property strings. Given the performance loss from not providing Sparrowhawk support, we cannot integrate new graphs that cause Sparrowhawk to fail. Tagged properties should therefore be limited to Sparrowhawk-supported strings.
+
+
For all classes, Sparrowhawk supports the `morphosyntactic_features` property, and it is recommended to default to this property for tagging additional features. For example:
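As a minimal sketch of the difference, the snippet below compares a tag that uses a custom `gender` property with one that folds the same information into `morphosyntactic_features`. The value string `gender_masc`, the helper names, and the two-field allowlist are assumptions made for illustration only; the proto file linked in this section is the authoritative list of supported fields.

```python
import re
from typing import List, Set

# Assumed, illustrative subset of fields for the ordinal class.
# The real list lives in Sparrowhawk's semiotic_classes.proto.
ASSUMED_ORDINAL_FIELDS: Set[str] = {"integer", "morphosyntactic_features"}


def property_names(tagged: str) -> List[str]:
    """Return the property names in a tagged token such as 'ordinal { integer: "21" }'."""
    # A property looks like `name: "value"`, so capture the word before `: "`.
    return re.findall(r'(\w+):\s*"', tagged)


def unsupported_properties(tagged: str, allowed: Set[str]) -> List[str]:
    """List the property names that are not in the allowed set."""
    return [name for name in property_names(tagged) if name not in allowed]


# Custom property string: accepted by the Python library, rejected downstream.
custom_tag = 'ordinal { integer: "21" gender: "masc" }'

# Same information carried by a supported property (value string is hypothetical).
supported_tag = 'ordinal { integer: "21" morphosyntactic_features: "gender_masc" }'

print(unsupported_properties(custom_tag, ASSUMED_ORDINAL_FIELDS))
print(unsupported_properties(supported_tag, ASSUMED_ORDINAL_FIELDS))
```

Under these assumptions the first call reports `gender` as unsupported and the second reports nothing, which is the shape of check a contributor could run before testing against Sparrowhawk itself.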
For additional Sparrowhawk-supported properties by class, see the [`semiotic_classes.proto` definition](https://github.com/yzhang123/sparrowhawk/blob/test/src/proto/semiotic_classes.proto).
+
+
N.B. The same limitation applies to novel semiotic classes: only predefined classes are supported in Sparrowhawk.
+
+
3) Between the tagging and verbalizing stages, both the NeMo-Text-Processing and Sparrowhawk engines permute the order of tagged properties. That is, assuming the tagger parsed `1ᵉʳ juillet` as:
+
+
`date { month: "juillet" day: "1" }`
+
+
the verbalizer will receive as input both
+
+
`date { month: "juillet" day: "1" }`
+
+
and
+
+
`date { day: "1" month: "juillet" }`
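The permutation step can be pictured with a small, pure-Python sketch. `property_orders` is a hypothetical helper, not part of NeMo-Text-Processing or Sparrowhawk; it only enumerates the orderings that a verbalizer graph would need to accept. The `year` value in the second call is likewise made up for illustration.

```python
from itertools import permutations
from typing import Iterator, List, Tuple


def property_orders(token_class: str, properties: List[Tuple[str, str]]) -> Iterator[str]:
    """Yield one tagged string per ordering of the token's properties."""
    for order in permutations(properties):
        body = " ".join(f'{name}: "{value}"' for name, value in order)
        yield f"{token_class} {{ {body} }}"


# The two orderings of the `1ᵉʳ juillet` tag discussed above.
two_props = list(property_orders("date", [("month", "juillet"), ("day", "1")]))
for tagged in two_props:
    print(tagged)

# Adding a third property triples the number of orderings (3! = 6).
three_props = list(
    property_orders("date", [("month", "juillet"), ("day", "1"), ("year", "1789")])
)
print(len(three_props))
```

Because the number of orderings grows factorially with the number of properties, tokens with many tagged fields are where permutation becomes most costly.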
+
+
While this eases construction of verbalization graphs, permutation can be computationally expensive. If you know that the tagger output will not require permutation of token properties, you can improve model performance by including the `preserve_order: "true"` property:
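As a rough model of what the property saves, the sketch below skips permutation whenever the token carries `preserve_order: "true"`. This is a toy illustration under stated assumptions, not the engines' actual implementation: `render` and `verbalizer_inputs` are hypothetical helpers, and the property is placed last in the tag purely for readability.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

Property = Tuple[str, str]


def render(token_class: str, properties: Sequence[Property]) -> str:
    """Format properties as a tagged string, e.g. 'date { day: "1" }'."""
    body = " ".join(f'{name}: "{value}"' for name, value in properties)
    return f"{token_class} {{ {body} }}"


def verbalizer_inputs(token_class: str, properties: List[Property]) -> List[str]:
    """Return every tagged string the verbalizer may receive for this token."""
    if ("preserve_order", "true") in properties:
        # Order is fixed by the tagger, so only one input is possible.
        return [render(token_class, properties)]
    # Otherwise every ordering of the properties must be handled.
    return [render(token_class, order) for order in permutations(properties)]


free = [("day", "1"), ("month", "juillet")]
fixed = free + [("preserve_order", "true")]

free_inputs = verbalizer_inputs("date", free)
fixed_inputs = verbalizer_inputs("date", fixed)

print(len(free_inputs), free_inputs)
print(len(fixed_inputs), fixed_inputs)
```

In this model the unconstrained token yields two candidate inputs, while the token tagged with `preserve_order: "true"` yields exactly one, in the order the tagger emitted it.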