
On punctuation and capitalization #3761

@1-800-BAD-CODE


Here are a few feature requests and bugs related to the punctuation and capitalization model.

Punctuation issues

Inverted punctuation

For languages like Spanish, we need two predictions per token to account for the possibility of inverted punctuation marks (e.g., '¿') preceding a word.
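
A minimal sketch of one way to support this (not the current NeMo architecture): two classifiers over the same encoder output, one predicting the punctuation that precedes each token and one predicting the punctuation that follows it. All names below are illustrative.

```python
import torch
import torch.nn as nn

class TwoSidedPunctuationHead(nn.Module):
    """Hypothetical head predicting pre- and post-token punctuation."""

    def __init__(self, hidden_size: int, num_punct_labels: int):
        super().__init__()
        # Punctuation preceding the token: '¿', '¡', none, ...
        self.pre_classifier = nn.Linear(hidden_size, num_punct_labels)
        # Punctuation following the token: '?', '.', ',', none, ...
        self.post_classifier = nn.Linear(hidden_size, num_punct_labels)

    def forward(self, encoded: torch.Tensor):
        # encoded: [batch, seq_len, hidden_size] from the encoder
        return self.pre_classifier(encoded), self.post_classifier(encoded)

head = TwoSidedPunctuationHead(hidden_size=768, num_punct_labels=5)
pre_logits, post_logits = head(torch.randn(2, 16, 768))
print(pre_logits.shape, post_logits.shape)  # both [2, 16, 5]
```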

Subword masking

Because subtoken masks are always applied, continuous-script languages (e.g., Chinese) cannot be punctuated without some sort of pre-processing, such as word segmentation. It would be useful if the model processed text in its native script.
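
A toy illustration of the problem (not NeMo code): deriving the subtoken mask from whitespace-split words yields one prediction slot per space-delimited word, so an unsegmented Chinese sentence collapses to a single prediction position. The helper below is hypothetical.

```python
def first_subtoken_mask(words, tokenize):
    """Predict only at the first subtoken of each whitespace-delimited word."""
    mask = []
    for word in words:
        pieces = tokenize(word)
        mask.extend([True] + [False] * (len(pieces) - 1))
    return mask

# Pretend subword tokenizer: one "subtoken" per character.
char_tokenize = list

english = "how are you".split()  # 3 words -> 3 prediction slots
chinese = "你好吗".split()        # no spaces -> 1 "word"
print(sum(first_subtoken_mask(english, char_tokenize)))  # 3
print(sum(first_subtoken_mask(chinese, char_tokenize)))  # 1
```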

Arbitrary punctuation tokens

The text-based data set does not allow punctuating languages such as Thai, where a space character is itself a punctuation token. These languages could work with a token-based data set and the removal of subword masks (essentially, resolving the other issues resolves this one).

Capitalization issues

Acronyms

The capitalization prediction is simply whether a word starts with a capital letter, so acronyms like 'amc' will not be correctly capitalized to 'AMC'.

Names that begin with a particle

Similar to the acronym issue, words that begin with a particle, e.g., 'mcdonald', cannot be properly capitalized to 'McDonald'. A per-character (or per-subtoken) case prediction would cover both cases; see the sketch below.
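
One possible fix, sketched under an assumed per-character case label scheme (illustrative, not NeMo's): predicting a case label for every character instead of a single first-letter bit makes both 'AMC' and 'McDonald' representable.

```python
def apply_char_case_labels(word: str, labels: str) -> str:
    """Apply per-character case labels: 'U' = uppercase, 'L' = lowercase."""
    return "".join(
        c.upper() if label == "U" else c.lower()
        for c, label in zip(word, labels)
    )

print(apply_char_case_labels("amc", "UUU"))            # AMC
print(apply_char_case_labels("mcdonald", "ULULLLLL"))  # McDonald
```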

Capitalization is independent of punctuation

Currently, the two heads are conditioned only on the encoder's output and are independent of each other. But in many cases capitalization depends on punctuation.

As an example of what might go wrong, we may end up with "Hello, world, What's up?" because the capitalization model expects a period after 'world'. Essentially, the capitalization head is predicting what the punctuation head will do.

In practice I have found this problem to be uncommon, but to be correct, capitalization should take the punctuator's output into account. As it stands, we are implicitly forcing the capitalization head to learn punctuation (and to predict the punctuation head's output).
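
A minimal sketch of such conditioning (hypothetical, not the current model): the capitalization classifier consumes the encoder output concatenated with the punctuation logits of the preceding position, so it no longer has to re-learn what the punctuator will emit.

```python
import torch
import torch.nn as nn

class ConditionedHeads(nn.Module):
    def __init__(self, hidden_size: int, num_punct: int, num_case: int):
        super().__init__()
        self.punct_classifier = nn.Linear(hidden_size, num_punct)
        # Capitalization sees the encoder output *and* punctuation logits.
        self.case_classifier = nn.Linear(hidden_size + num_punct, num_case)

    def forward(self, encoded: torch.Tensor):
        punct_logits = self.punct_classifier(encoded)
        # Capitalization of token t depends mainly on the punctuation
        # predicted after token t-1, so shift the logits right by one.
        shifted = torch.roll(punct_logits, shifts=1, dims=1)
        shifted[:, 0, :] = 0.0  # no previous token at the first position
        case_logits = self.case_classifier(torch.cat([encoded, shifted], dim=-1))
        return punct_logits, case_logits

heads = ConditionedHeads(hidden_size=768, num_punct=5, num_case=2)
punct_logits, case_logits = heads(torch.randn(2, 16, 768))
```

During training, the shifted logits could be replaced with teacher-forced gold punctuation labels; at inference, the capitalization head conditions on the punctuator's own predictions.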

Data set issues

Dataset as text

First (this may be a personal preference), it is unnatural to store a preprocessed data set as text rather than as token IDs.
More importantly, text data sets are incompatible with the other issues mentioned in this thread (subword masking, space as a punctuation token).

Supported data set classes

The dataset is fixed to one class, but it would be more convenient to expect an abstract base class, let the user specify a _target_ in the config, and instantiate it with hydra.utils.instantiate, as in some other models. E.g.,
https://github.com/NVIDIA/NeMo/blob/bc6215f166e69502fd7784fc73a5c2c39b465819/nemo/collections/tts/models/melgan.py#L298

For example, a user may wish to implement a different dataset that generates examples on the fly, or use a ConcatDataset with multiple languages and temperature sampling, etc.
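
A minimal sketch of the suggested pattern, using a placeholder _target_ (in practice it would point at the user's own dataset class):

```python
import torch
from hydra.utils import instantiate
from omegaconf import OmegaConf

# In the real config this would live in YAML; TensorDataset is just a
# stand-in for any class implementing the expected dataset interface.
cfg = OmegaConf.create({"_target_": "torch.utils.data.TensorDataset"})

dataset = instantiate(cfg, torch.arange(10))  # extra args go to the constructor
print(len(dataset))  # 10
```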

Paragraph segmentation

A primary benefit of this model is to improve NMT results on unpunctuated, uncased ASR output. However, running MT on arbitrarily long inputs will inevitably end poorly.

For this model to be complete, I would argue it needs a third token-classification task: paragraph segmentation (splitting a paragraph into its constituent sentences). Translating sentences as separate units would improve results in many cases; furthermore, a Transformer's runtime complexity is O(N^2) in the sequence length.
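
A minimal sketch of the proposed third task (names are illustrative): a binary per-token classifier marking sentence boundaries, whose predictions are then used to split the input before translation.

```python
import torch
import torch.nn as nn

class SentenceBoundaryHead(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        # Two classes per token: "sentence ends here" vs. "does not".
        self.classifier = nn.Linear(hidden_size, 2)

    def forward(self, encoded: torch.Tensor) -> torch.Tensor:
        return self.classifier(encoded)

def split_sentences(tokens, boundary_preds):
    """Group tokens into sentences at predicted boundaries."""
    sentences, current = [], []
    for token, is_end in zip(tokens, boundary_preds):
        current.append(token)
        if is_end:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences

print(split_sentences(["hi", "there", "how", "are", "you"], [0, 1, 0, 0, 1]))
# [['hi', 'there'], ['how', 'are', 'you']]
```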
