Commit 77fe765

Ko van der Sloot authored and committed
added a link
1 parent 5bf695a commit 77fe765

1 file changed: +20 -19 lines changed

README.md

Lines changed: 20 additions & 19 deletions
@@ -4,7 +4,7 @@ TICCLTOOLS
  TICCLtools is a collection of programs to process text data files towards fully-automatic lexical corpus post-correction. Together they constitute the bulk of TICCL: Text Induced Corpus-Cleanup. This software is usually invoked by the pipeline system PICCL: https://github.com/LanguageMachines/PICCL ,
  consult there for installation and usage instructions unless you really want to invoke the individual tools manually.

- The workflows in PICCL, the Philosophical Integrator of Computational and Corpus Libraries are schematically visualised here, TICCL being the one to the right:
+ The workflows in PICCL, the Philosophical Integrator of Computational and Corpus Libraries are schematically visualised here, TICCL being the one to the right:

  ![PICCL Architecture](https://raw.githubusercontent.com/LanguageMachines/PICCL/master/architecture.png)

@@ -58,49 +58,50 @@ created to house the required files for the specific language(s).
  Should you want or need to build your own TICCL alphabet and character confusion files yourself, the tool to do that is:

  - TICCL-lexstat
- - Creates a character frequency ranked 'alphabet' file of the unicode characters that are present in a lexicon for
+ - Creates a character frequency ranked 'alphabet' file of the unicode characters that are present in a lexicon for
  the specific language.
  - Convert an 'alphabet' file (in a second step) into a list of character confusion hash values and an example of the
  particular character confusion, optionally a list of all possible character confusions given the set of characters involved.
-
- Note that each extra character allowed to be an actual character used in the language expands the search space for lexical variants. The tool therefore allows you to 'clip' or apply a frequency cut-off to the character frequency
+
+ Note that each extra character allowed to be an actual character used in the language expands the search space for lexical variants. The tool therefore allows you to 'clip' or apply a frequency cut-off to the character frequency
  list for your particular language.

  The actual TICCL post-correction programs in this collection are:
  - TICCL-stats
  - A tool to derive word frequency lists from text files or corpora. Its companion tool for corpora in FoLiA XML format,
  FoLiA-stats, is far more developed and recommended.
  - TICCL-unk
- - a cleanup tool for word frequency lists. Creates a 'clean' file with desirable words, an 'unk' file with
+ - a cleanup tool for word frequency lists. Creates a 'clean' file with desirable words, an 'unk' file with
  uncorrectable words and a 'punct' file with words that would be clean after removing punctuation before and after.
  - TICCL-anahash
- - a tool to create anagram hash values from a word frequency file. All anagrams formable given a particular bag
- of characters and observed in the frequency file are assigned a distinguishing anagram value, based on the
+ - a tool to create anagram hash values from a word frequency file. All anagrams formable given a particular bag
+ of characters and observed in the frequency file are assigned a distinguishing anagram value, based on the
  individual character values assigned by tool TICCL-lexstat to each character in the alphabet.
  - TICCL-indexer and TICCL-indexerNT
  - a tool to create an exhaustive numerical index to all lexical
- variants present in a corpus within the distances defined by the character
+ variants present in a corpus within the distances defined by the character
  confusion values, given a particular Levenshtein or edit distance.
  - TICCL-LDcalc
- - a preprocessing tool for TICCL-rank. Gathers the info from TICCL-anahash, TICCL-indexer or TICCL-indexerNT,
- TICCL-lexstat and TICCL-unk. Retrieves and pre-filters the symbolic pairs of word variants linked to their
+ - a preprocessing tool for TICCL-rank. Gathers the info from TICCL-anahash, TICCL-indexer or TICCL-indexerNT,
+ TICCL-lexstat and TICCL-unk. Retrieves and pre-filters the symbolic pairs of word variants linked to their
  Correction Candidates.
  - TICCL-rank
- - ranks a word variant list on the basis of a wide range of criteria and the actual set of ranking features
+ - ranks a word variant list on the basis of a wide range of criteria and the actual set of ranking features
  specified to be used in the Correction Candidate ranking.
  - TICCL-chain
- - Tool designed to gather variants that lie outside the edit distance to the perceived best Correction Candidate,
+ - Tool designed to gather variants that lie outside the edit distance to the perceived best Correction Candidate,
  given the Levenshtein distance set earlier in the work flow. This distance is usually two characters.
- - Chaining = "my friends' friends are my friends" : the highest frequency Correction Candidate in the TICCL-rank output
- list with best-first ranked variants within the set Levenshtein Distance (LD) that act as CCs for further variants
+ - Chaining = "my friends' friends are my friends" : the highest frequency Correction Candidate in the TICCL-rank output
+ list with best-first ranked variants within the set Levenshtein Distance (LD) that act as CCs for further variants
  beyond this LD (and so on for even greater LDs) is directly linked to these larger LD variants.
  - TICCL-chainclean
- - After correction of n-gram frequency lists (with n > 1) word strings may have been assigned different Correction
+ - After correction of n-gram frequency lists (with n > 1) word strings may have been assigned different Correction
  Candidates on the unigram, bi- or trigram correction levels, leading to inconsistencies. This experimental tool tries
  to solve the inconsistencies.
-
+
  Post-TICCLtools: actual text editing:
-
- We currently only provide for post-editing of texts based on the list of correction candidates collected
- by TICCLtools for texts or corpora in FoLiA XML. Please see the FoLiA-tools collection for the tool:
+
+ We currently only provide for post-editing of texts based on the list of correction candidates collected
+ by TICCLtools for texts or corpora in FoLiA XML. Please see the FoLiA-utils
+ (https://github.com/LanguageMachines/foliautils) collection for the tool:
  FoLiA-correct.
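
The README text in this diff describes three linked ideas: a frequency-ranked and optionally clipped 'alphabet' (TICCL-lexstat), an anagram value shared by all words built from the same bag of characters (TICCL-anahash), and character confusion values derived from the alphabet. The sketch below illustrates those ideas only and is not the TICCLtools implementation. The function names, the rank-based character values starting at 100, the power of 5, and the choice to compute a confusion value as the difference of two anagram values are all assumptions made for illustration.

```python
from collections import Counter

POWER = 5          # assumption: exponent applied to each character value
BASE_VALUE = 100   # assumption: value given to the most frequent character


def build_alphabet(lexicon, clip=0):
    """Rank characters by frequency and drop those seen fewer than `clip` times.

    Returns a dict mapping each kept character to a hypothetical numeric value.
    """
    counts = Counter(ch for word in lexicon for ch in word.lower())
    ranked = [ch for ch, n in counts.most_common() if n >= clip]
    return {ch: BASE_VALUE + rank for rank, ch in enumerate(ranked)}


def anagram_value(word, alphabet):
    """Sum of powered character values; order of characters does not matter."""
    return sum(alphabet[ch] ** POWER for ch in word.lower())


def confusion_value(a, b, alphabet):
    """Absolute difference between the anagram values of two words."""
    return abs(anagram_value(a, alphabet) - anagram_value(b, alphabet))


if __name__ == "__main__":
    lexicon = ["listen", "silent", "cat", "cut"]
    alphabet = build_alphabet(lexicon)
    # Anagrams share one value because the sum ignores character order.
    print(anagram_value("listen", alphabet) == anagram_value("silent", alphabet))
    # A single substitution (a -> u) yields a fixed, reusable confusion value.
    print(confusion_value("cat", "cut", alphabet))
```

Under these assumptions, a character that falls below the `clip` threshold never enters the alphabet, so it contributes no confusion values and the search space for lexical variants stays smaller, which is the trade-off the README describes.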
