TICCL-rank: 'filter out' unigram correction variants where a bigram-to-unigram CC is present. #26

@kosloot

@martinreynaert provided the following examples:

<mre> veroor_zaakt#1#veroorzaakt#100000002#1#0.815385
<mre> veroor_zaakt_door#1#veroorzaakt_door#100000001#1#1
<mre> veroor#1#verloor#100000024#1#0.998869

The last entry is undesirable.

<mre> veroor_zaakt#1#veroorzaakt#100000002#1#0.815385
<mre> veroor_zaakt_door#1#veroorzaakt_door#100000001#1#1
<mre> zaakt_door#1#zaak_voor#100000001#2#1
<mre> zaakt#1#nazakt#100000000#2#0.998757

The last entry is undesirable.

<mre> verlaa_ten#1#verlaaten#100000010#1#0.984416
<mre> verlaa#1#verlaan#100000000#1#0.998726

Again, the last entry is undesirable.

<mre> acobs_Nakomelingen#1#j_acobs_Nakomelingen#1#2#1
<mre> acobs#1#Jacobs#100000001#1#0.993398
<mre> j_acobs#1#Jacobs#100000001#1#0.977545

Here the second entry is undesirable.

This last example also illustrates why filtering out is not that easy.
It would be handy if it were a sequential process, but unfortunately it is not.

At the moment TICCL-rank processes its input and output in chunks, but we have to change that and store all results so we can filter out the above cases afterwards.
A major change! It consumes more memory and is harder to handle multi-threaded.
Some more investigation is needed.
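
To make the "store everything, filter afterwards" idea concrete, here is a minimal Python sketch. It is not TICCL-rank code. It assumes that the first `#`-separated field of a record is the variant and the third is the correction candidate, that `_` joins the parts of an n-gram, and that a unigram should be dropped whenever it is a part of some bigram variant whose candidate is a unigram. The function names are hypothetical, and the rule is a guess derived from the four examples above, so it may well be too aggressive on real data.

```python
def split_record(line):
    """Return (variant, candidate) from one '#'-separated record.

    Assumption: field 0 is the variant and field 2 is the correction
    candidate; all other fields are left untouched.
    """
    fields = line.strip().split("#")
    return fields[0], fields[2]


def filter_unigram_variants(lines):
    """Drop unigram records whose variant is part of a bigram variant
    that is corrected to a unigram (hypothetical post-filter)."""
    records = [line.strip() for line in lines if line.strip()]

    # Pass 1: collect the parts of every bigram variant whose
    # correction candidate is a single word.
    covered = set()
    for record in records:
        variant, candidate = split_record(record)
        parts = variant.split("_")
        if len(parts) == 2 and "_" not in candidate:
            covered.update(parts)

    # Pass 2: keep everything except the covered unigram variants.
    # Two passes are needed because the bigram record may come
    # after the unigram record it invalidates.
    kept = []
    for record in records:
        variant, _ = split_record(record)
        if "_" not in variant and variant in covered:
            continue
        kept.append(record)
    return kept


if __name__ == "__main__":
    sample = [
        "acobs_Nakomelingen#1#j_acobs_Nakomelingen#1#2#1",
        "acobs#1#Jacobs#100000001#1#0.993398",
        "j_acobs#1#Jacobs#100000001#1#0.977545",
    ]
    for record in filter_unigram_variants(sample):
        print(record)
```

On the last example above this keeps the first and third records and drops `acobs`, even though the `j_acobs` record that triggers the removal comes later in the input. The price is exactly the one mentioned: all records have to be held in memory before anything can be written out.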
