
On punctuation and capitalization #3761

@1-800-BAD-CODE


Here are a few feature requests and bugs related to the punctuation and capitalization model.

Punctuation issues

Inverted punctuation

For languages like Spanish, we need two predictions per token to account for the possibility of inverted punctuation marks (e.g., '¿') preceding a word.
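
A minimal sketch of one way to support this (not the current NeMo architecture): two classifiers over the same encoder output, one predicting the punctuation that precedes each token and one predicting the punctuation that follows it. All names below are illustrative.

```python
import torch
import torch.nn as nn

class TwoSidedPunctuationHead(nn.Module):
    """Hypothetical head predicting pre- and post-token punctuation."""

    def __init__(self, hidden_size: int, num_punct_labels: int):
        super().__init__()
        # Punctuation preceding the token: '¿', '¡', none, ...
        self.pre_classifier = nn.Linear(hidden_size, num_punct_labels)
        # Punctuation following the token: '?', '.', ',', none, ...
        self.post_classifier = nn.Linear(hidden_size, num_punct_labels)

    def forward(self, encoded: torch.Tensor):
        # encoded: [batch, seq_len, hidden_size] from the encoder
        return self.pre_classifier(encoded), self.post_classifier(encoded)

head = TwoSidedPunctuationHead(hidden_size=768, num_punct_labels=5)
pre_logits, post_logits = head(torch.randn(2, 16, 768))
print(pre_logits.shape, post_logits.shape)  # both [2, 16, 5]
```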

Subword masking

Because subtoken masks are always applied, continuous-script languages (e.g., Chinese) cannot be punctuated without some sort of pre-processing, such as word segmentation. It would be useful if the model processed text in its native script.
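
A toy illustration of the problem (not NeMo code): deriving the subtoken mask from whitespace-split words yields one prediction slot per space-delimited word, so an unsegmented Chinese sentence collapses to a single prediction position. The helper below is hypothetical.

```python
def first_subtoken_mask(words, tokenize):
    """Predict only at the first subtoken of each whitespace-delimited word."""
    mask = []
    for word in words:
        pieces = tokenize(word)
        mask.extend([True] + [False] * (len(pieces) - 1))
    return mask

# Pretend subword tokenizer: one "subtoken" per character.
char_tokenize = list

english = "how are you".split()  # 3 words -> 3 prediction slots
chinese = "你好吗".split()        # no spaces -> 1 "word"
print(sum(first_subtoken_mask(english, char_tokenize)))  # 3
print(sum(first_subtoken_mask(chinese, char_tokenize)))  # 1
```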

Arbitrary punctuation tokens

The text-based data set does not allow punctuating languages such as Thai, where a space character is itself a punctuation token. These languages could work with a token-based data set and the removal of subword masks (essentially, resolving the other issues resolves this one).

Capitalization issues

Acronyms

The capitalization prediction is simply whether a word starts with a capital letter, so acronyms like 'amc' will not be correctly capitalized to 'AMC'.

Names that begin with a particle

Similar to the acronym issue, words that begin with a particle, e.g., 'mcdonald', cannot be properly capitalized to 'McDonald'. A per-character (or per-subtoken) case prediction would cover both cases; see the sketch below.
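
One possible fix, sketched under an assumed per-character case label scheme (illustrative, not NeMo's): predicting a case label for every character instead of a single first-letter bit makes both 'AMC' and 'McDonald' representable.

```python
def apply_char_case_labels(word: str, labels: str) -> str:
    """Apply per-character case labels: 'U' = uppercase, 'L' = lowercase."""
    return "".join(
        c.upper() if label == "U" else c.lower()
        for c, label in zip(word, labels)
    )

print(apply_char_case_labels("amc", "UUU"))            # AMC
print(apply_char_case_labels("mcdonald", "ULULLLLL"))  # McDonald
```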

Capitalization is independent of punctuation

Currently, the two heads are conditioned only on the encoder's output and are independent of each other. But in many cases capitalization depends on punctuation.

As an example of what might go wrong, we may end up with "Hello, world, What's up?" because the capitalization model expects a period after 'world'. Essentially, the capitalization head is predicting what the punctuation head will do.

In practice I have found this problem to be uncommon, but to be correct, capitalization should take the punctuator's output into account. As it stands, we are implicitly forcing the capitalization head to learn punctuation (and to predict the punctuation head's output).
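
A minimal sketch of such conditioning (hypothetical, not the current model): the capitalization classifier consumes the encoder output concatenated with the punctuation logits of the preceding position, so it no longer has to re-learn what the punctuator will emit.

```python
import torch
import torch.nn as nn

class ConditionedHeads(nn.Module):
    def __init__(self, hidden_size: int, num_punct: int, num_case: int):
        super().__init__()
        self.punct_classifier = nn.Linear(hidden_size, num_punct)
        # Capitalization sees the encoder output *and* punctuation logits.
        self.case_classifier = nn.Linear(hidden_size + num_punct, num_case)

    def forward(self, encoded: torch.Tensor):
        punct_logits = self.punct_classifier(encoded)
        # Capitalization of token t depends mainly on the punctuation
        # predicted after token t-1, so shift the logits right by one.
        shifted = torch.roll(punct_logits, shifts=1, dims=1)
        shifted[:, 0, :] = 0.0  # no previous token at the first position
        case_logits = self.case_classifier(torch.cat([encoded, shifted], dim=-1))
        return punct_logits, case_logits

heads = ConditionedHeads(hidden_size=768, num_punct=5, num_case=2)
punct_logits, case_logits = heads(torch.randn(2, 16, 768))
```

During training, the shifted logits could be replaced with teacher-forced gold punctuation labels; at inference, the capitalization head conditions on the punctuator's own predictions.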

Data set issues

Dataset as text

First (this may be a personal preference), it is unnatural to store a preprocessed data set as text rather than as token IDs.
More importantly, text data sets are incompatible with the other issues mentioned in this thread (subword masking, space as a punctuation token).

Supported data set classes

The dataset is fixed to one class, but it would be more convenient to expect an abstract base class, let the user specify a _target_ in the config, and instantiate it with hydra.utils.instantiate, as in some other models. E.g.,
https://github.com/NVIDIA/NeMo/blob/bc6215f166e69502fd7784fc73a5c2c39b465819/nemo/collections/tts/models/melgan.py#L298

For example, a user may wish to implement a different dataset that generates examples on the fly, or use a ConcatDataset with multiple languages and temperature sampling, etc.
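
A minimal sketch of the suggested pattern, using a placeholder _target_ (in practice it would point at the user's own dataset class):

```python
import torch
from hydra.utils import instantiate
from omegaconf import OmegaConf

# In the real config this would live in YAML; TensorDataset is just a
# stand-in for any class implementing the expected dataset interface.
cfg = OmegaConf.create({"_target_": "torch.utils.data.TensorDataset"})

dataset = instantiate(cfg, torch.arange(10))  # extra args go to the constructor
print(len(dataset))  # 10
```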

Paragraph segmentation

A primary benefit of this model is to improve NMT results on unpunctuated, uncased ASR output. However, running MT on arbitrarily long inputs will inevitably end poorly.

For this model to be complete, I would argue it needs a third token-classification task: paragraph segmentation (splitting a paragraph into its constituent sentences). Translating sentences as separate units would improve results in many cases; furthermore, a Transformer's runtime complexity is O(N^2) in the sequence length.
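
A minimal sketch of the proposed third task (names are illustrative): a binary per-token classifier marking sentence boundaries, whose predictions are then used to split the input before translation.

```python
import torch
import torch.nn as nn

class SentenceBoundaryHead(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        # Two classes per token: "sentence ends here" vs. "does not".
        self.classifier = nn.Linear(hidden_size, 2)

    def forward(self, encoded: torch.Tensor) -> torch.Tensor:
        return self.classifier(encoded)

def split_sentences(tokens, boundary_preds):
    """Group tokens into sentences at predicted boundaries."""
    sentences, current = [], []
    for token, is_end in zip(tokens, boundary_preds):
        current.append(token)
        if is_end:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences

print(split_sentences(["hi", "there", "how", "are", "you"], [0, 1, 0, 0, 1]))
# [['hi', 'there'], ['how', 'are', 'you']]
```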
