README.md: 20 additions & 19 deletions
@@ -4,7 +4,7 @@ TICCLTOOLS
  TICCLtools is a collection of programs to process text data files towards fully-automatic lexical corpus post-correction. Together they constitute the bulk of TICCL: Text Induced Corpus-Cleanup. This software is usually invoked by the pipeline system PICCL: https://github.com/LanguageMachines/PICCL ,
  consult there for installation and usage instructions unless you really want to invoke the individual tools manually.

- The workflows in PICCL, the Philosophical Integrator of Computational and Corpus Libraries are schematically visualised here, TICCL being the one to the right:
+ The workflows in PICCL, the Philosophical Integrator of Computational and Corpus Libraries are schematically visualised here, TICCL being the one to the right:
@@ -58,49 +58,50 @@ created to house the required files for the specific language(s).
  Should you want or need to build your own TICCL alphabet and character confusion files yourself, the tool to do that is:

  - TICCL-lexstat
- - Creates a character frequency ranked 'alphabet' file of the unicode characters that are present in a lexicon for
+ - Creates a character frequency ranked 'alphabet' file of the unicode characters that are present in a lexicon for
  the specific language.
  - Convert an 'alphabet' file (in a second step) into a list of character confusion hash values and an example of the
  particular character confusion, optionally a list of all possible character confusions given the set of characters involved.
-
- Note that each extra character allowed to be an actual character used in the language expands the search space for lexical variants. The tool therefore allows you to 'clip' or apply a frequency cut-off to the character frequency
+
+ Note that each extra character allowed to be an actual character used in the language expands the search space for lexical variants. The tool therefore allows you to 'clip' or apply a frequency cut-off to the character frequency
  list for your particular language.

  The actual TICCL post-correction programs in this collection are:
  - TICCL-stats
  - A tool to derive word frequency lists from text files or corpora. Its companion tool for corpora in FoLiA XML format,
  FoLiA-stats, is far more developed and recommended.
  - TICCL-unk
- - a cleanup tool for word frequency lists. Creates a 'clean' file with desirable words, an 'unk' file with
+ - a cleanup tool for word frequency lists. Creates a 'clean' file with desirable words, an 'unk' file with
  uncorrectable words and a 'punct' file with words that would be clean after removing punctuation before and after.
  - TICCL-anahash
- - a tool to create anagram hash values from a word frequency file. All anagrams formable given a particular bag
- of characters and observed in the frequency file are assigned a distinguishing anagram value, based on the
+ - a tool to create anagram hash values from a word frequency file. All anagrams formable given a particular bag
+ of characters and observed in the frequency file are assigned a distinguishing anagram value, based on the
  individual character values assigned by tool TICCL-lexstat to each character in the alphabet.
  - TICCL-indexer and TICCL-indexerNT
  - a tool to create an exhaustive numerical index to all lexical
- variants present in a corpus within the distances defined by the character
+ variants present in a corpus within the distances defined by the character
  confusion values, given a particular Levenshtein or edit distance.
  - TICCL-LDcalc
- - a preprocessing tool for TICCL-rank. Gathers the info from TICCL-anahash, TICCL-indexer or TICCL-indexerNT,
- TICCL-lexstat and TICCL-unk. Retrieves and pre-filters the symbolic pairs of word variants linked to their
+ - a preprocessing tool for TICCL-rank. Gathers the info from TICCL-anahash, TICCL-indexer or TICCL-indexerNT,
+ TICCL-lexstat and TICCL-unk. Retrieves and pre-filters the symbolic pairs of word variants linked to their
  Correction Candidates.
  - TICCL-rank
- - ranks a word variant list on the basis of a wide range of criteria and the actual set of ranking features
+ - ranks a word variant list on the basis of a wide range of criteria and the actual set of ranking features
  specified to be used in the Correction Candidate ranking.
  - TICCL-chain
- - Tool designed to gather variants that lie outside the edit distance to the perceived best Correction Candidate,
+ - Tool designed to gather variants that lie outside the edit distance to the perceived best Correction Candidate,
  given the Levenshtein distance set earlier in the work flow. This distance is usually two characters.
- - Chaining = "my friends' friends are my friends" : the highest frequency Correction Candidate in the TICCL-rank output
- list with best-first ranked variants within the set Levenshtein Distance (LD) that act as CCs for further variants
+ - Chaining = "my friends' friends are my friends" : the highest frequency Correction Candidate in the TICCL-rank output
+ list with best-first ranked variants within the set Levenshtein Distance (LD) that act as CCs for further variants
  beyond this LD (and so on for even greater LDs) is directly linked to these larger LD variants.
  - TICCL-chainclean
- - After correction of n-gram frequency lists (with n > 1) word strings may have been assigned different Correction
+ - After correction of n-gram frequency lists (with n > 1) word strings may have been assigned different Correction
  Candidates on the unigram, bi- or trigram correction levels, leading to inconsistencies. This experimental tool tries
  to solve the inconsistencies.
-
+
  Post-TICCLtools: actual text editing:
-
- We currently only provide for post-editing of texts based on the list of correction candidates collected
- by TICCLtools for texts or corpora in FoLiA XML. Please see the FoLiA-tools collection for the tool:
+
+ We currently only provide for post-editing of texts based on the list of correction candidates collected
+ by TICCLtools for texts or corpora in FoLiA XML. Please see the FoLiA-utils
+ (https://github.com/LanguageMachines/foliautils) collection for the tool:
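
The TICCL-lexstat and TICCL-anahash entries in the README text above describe an idea that a small example may make more concrete: every character gets a numeric value, a word's anagram value is the sum of its character values, and a character confusion is the difference between two such sums. The Python sketch below is illustrative only and is not the TICCLtools implementation. The function names, the `clip` parameter, and the value scheme `(rank + 100) ** 5` are all assumptions made for this example; the actual tools define their own values and file formats.

```python
from collections import Counter, defaultdict


def build_alphabet(lexicon, clip=0):
    """Rank characters by frequency and assign each a numeric value.

    Characters seen fewer than `clip` times are dropped (a frequency cut-off).
    The value scheme (rank + 100) ** 5 is a hypothetical choice for this sketch.
    """
    counts = Counter(ch for word in lexicon for ch in word)
    ranked = [ch for ch, n in counts.most_common() if n >= clip]
    return {ch: (rank + 100) ** 5 for rank, ch in enumerate(ranked)}


def anagram_value(word, alphabet, unknown=1):
    """Sum the character values; character order does not matter."""
    return sum(alphabet.get(ch, unknown) for ch in word)


def group_anagrams(words, alphabet):
    """Group words that share the same anagram value."""
    groups = defaultdict(list)
    for word in words:
        groups[anagram_value(word, alphabet)].append(word)
    return dict(groups)


def confusion_value(a, b, alphabet):
    """Absolute difference between two anagram values.

    For a single substitution this depends only on the two characters
    involved, not on the rest of the word.
    """
    return abs(anagram_value(a, alphabet) - anagram_value(b, alphabet))


lexicon = ["listen", "silent", "enlist", "lister", "cat", "cot", "hat", "hot"]
alphabet = build_alphabet(lexicon)
groups = group_anagrams(lexicon, alphabet)
```

In this sketch "listen", "silent" and "enlist" end up in one group because they are built from the same bag of characters, and the confusion value for "cat"/"cot" equals that for "hat"/"hot" because both pairs differ by the same substitution. Raising `clip` shrinks the alphabet, and with it the set of character confusions that has to be searched.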