Specification of the value-to-form processing in Lexibank datasets:
The value-to-form processing is divided into two steps, implemented as methods:
- FormSpec.split: Splits a string into individual form chunks.
- FormSpec.clean: Normalizes a form chunk.

These methods use the attributes of a FormSpec instance to configure their behaviour.

- brackets: {'(': ')', '{': '}', '[': ']', '（': '）', '【': '】', '『': '』', '«': '»', '⁽': '⁾', '₍': '₎'}
  Pairs of strings that should be recognized as brackets, specified as a dict mapping each opening string to its closing string.
- separators: (';', '/', ',')
  Iterable of single-character tokens that should be recognized as word separators.
- missing_data: ('?', '-')
  Iterable of strings that are used to mark missing data.
- strip_inside_brackets: False
  Flag signaling whether to strip content in brackets (and strip leading and trailing whitespace).
- replacements: []
  List of pairs (source, target) used to replace occurrences of source in forms with target (before stripping content in brackets).
- first_form_only: False
  Flag signaling whether at most one form should be returned from split, effectively ignoring any spelling variants, etc.
- normalize_whitespace: True
  Flag signaling whether to normalize whitespace: stripping leading and trailing whitespace and collapsing multi-character whitespace to single spaces.
- normalize_unicode: None
  Unicode normalization form to apply to the input of split (None, 'NFD' or 'NFC').
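
To make the interplay of these attributes concrete, here is a minimal, self-contained sketch of the two steps. It is not the pylexibank implementation: the class name SketchFormSpec, the reduced set of default brackets, and the details of how brackets are tracked are assumptions made for illustration. In particular, the sketch assumes that separators inside brackets do not split a value, that split calls clean on every chunk, and that chunks cleaned to missing data are dropped.

```python
import re
import unicodedata
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SketchFormSpec:
    """Hypothetical stand-in for FormSpec, for illustration only."""

    # Assumption: a reduced bracket set; the documented default has more pairs.
    brackets: dict = field(default_factory=lambda: {"(": ")", "{": "}", "[": "]"})
    separators: tuple = (";", "/", ",")
    missing_data: tuple = ("?", "-")
    strip_inside_brackets: bool = False
    replacements: list = field(default_factory=list)
    first_form_only: bool = False
    normalize_whitespace: bool = True
    normalize_unicode: Optional[str] = None

    def clean(self, form):
        """Normalize one form chunk; return None if it marks missing data."""
        # Replacements are applied before bracket content is stripped.
        for source, target in self.replacements:
            form = form.replace(source, target)
        if self.strip_inside_brackets:
            # Simplification: non-nested bracketed spans only.
            for opening, closing in self.brackets.items():
                pattern = re.escape(opening) + r".*?" + re.escape(closing)
                form = re.sub(pattern, "", form)
            form = form.strip()
        if self.normalize_whitespace:
            # Strip both ends and collapse whitespace runs to single spaces.
            form = " ".join(form.split())
        if not form or form in self.missing_data:
            return None
        return form

    def split(self, value):
        """Split a value into cleaned form chunks."""
        if self.normalize_unicode:
            value = unicodedata.normalize(self.normalize_unicode, value)
        chunks, current, closers = [], "", []
        for char in value:
            # Track open brackets so separators inside them are ignored.
            if closers and char == closers[-1]:
                closers.pop()
            elif char in self.brackets:
                closers.append(self.brackets[char])
            if char in self.separators and not closers:
                chunks.append(current)
                current = ""
            else:
                current += char
        chunks.append(current)
        forms = [f for f in (self.clean(c) for c in chunks) if f is not None]
        return forms[:1] if self.first_form_only else forms


value = "hand; arm (upper, lower) / ?"
print(SketchFormSpec().split(value))
# ['hand', 'arm (upper, lower)']
print(SketchFormSpec(strip_inside_brackets=True).split(value))
# ['hand', 'arm']
print(SketchFormSpec(first_form_only=True).split(value))
# ['hand']
```

In this sketch the comma inside the parentheses does not start a new chunk, the trailing '?' is discarded as missing data, and first_form_only keeps only the first surviving form. The behaviour of the real methods may differ in edge cases such as nested or unbalanced brackets.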