lt.seg/README.MD: 7 additions & 5 deletions
@@ -66,16 +66,18 @@ lt.seg comes with a number of parameters, run `seg -?` to get a list of options
* `--normalize <level>` (`-nl`)
* `0` (default): no normalization, each segment is printed as it appears in the input
* `1`: collapse runs of the same consecutive non-word character, e.g. multiple consecutive blanks are merged into one. Example: "\t\t\n\t\t" -> "\t\n\t"
- * `2`: `1` + replace consecutive numbers and digits within words and number segments themselves with the symbol `0`. Example 'He11o World. I am Johnny 5.' -> 'He0o World . I am Johnny 0 .'
- * `3`: `2` + replace all non-word segments with its symbol.
- * `4`: `3` + lowercase words.
+ * `2`: `1` + replace empty space and punctuation characters with its symbol
+ * `3`: `2` + replace consecutive numbers and digits within words and number segments themselves with the symbol `0`. Example 'He11o World. I am Johnny 5.' -> 'He0o World . I am Johnny 0 .'
+ * `4`: `3` + replace all non-word segments with its symbol.
+ * `5`: `4` + lowercase words.
* `--filter <level>` (`-fl`): *Note: examples below use normalization level (`-nl`) `2` and DiffTokenizer*
* `0`: no filtering; each segment is printed separated by blanks (this also includes empty space segments; in most cases you probably want to use at least `1` or `2`)
* `3`: `2` + filter unclassified and non-readable segments (attention: results heavily depend on tokenizer)
- * `4`: `3` + filter punctuation characters. Example: "The number is 534 423 or 43. ? :-/ " -> "The number is 0 or 0"
- * `5`: `4` + filter numbers and words with numbers. Example: "The number is 534 423 or 43. ? :-/ " -> "The number is or" (Only useful with proper token normalization level.)
+ * `4`: `3` + filter punctuation characters
+ * `5`: `4` + filter meta data like URLs, file descriptors, emails, wiki markup, emoticons, etc.
+ * `6`: `4` + filter numbers and words with numbers. Example: "The number is 534 423 or 43. ? :-/ " -> "The number is or" (Only useful with proper token normalization level.)
* `--merge [<level>]` (`-ml`): *Note: examples below use normalization level (`-nl`) `2`*
* `0`: no merging (default when not specified)
* `1`: merge consecutive tokens of the same type unless they are words or words with numbers (default when only `-ml` is specified). Example: "The number is 534 423 or 43. ? :-/ " -> "The number is 0 or 0 . "
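
To make the two `--normalize` examples above concrete, here is a minimal Python sketch that reproduces them. It is not lt.seg's implementation: the function names are hypothetical, and the regex-based tokenizer is a naive stand-in assumed only for illustration (lt.seg's own tokenizers will segment text differently).

```python
import re


def collapse_repeats(text: str) -> str:
    """Collapse runs of the same non-word character into a single one.

    Mirrors the example given for normalization level 1.
    """
    return re.sub(r"(\W)\1+", r"\1", text)


def replace_digits(text: str) -> str:
    """Replace each run of digits with the symbol 0, token by token.

    Mirrors the 'He11o World.' example. The tokenizer below is an
    assumption: words are runs of word characters, and every other
    non-space character is its own token.
    """
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Segments are joined by single blanks, as in the README's example output.
    return " ".join(re.sub(r"\d+", "0", tok) for tok in tokens)


if __name__ == "__main__":
    print(repr(collapse_repeats("\t\t\n\t\t")))
    print(replace_digits("He11o World. I am Johnny 5."))
```

The sketch treats each step in isolation; in lt.seg the levels are cumulative, so a higher level also applies everything the lower levels do.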