Trying to increase accuracy with floret #10391

mehmetilker · 2022-02-28T09:35:21Z

mehmetilker
Feb 28, 2022

I wanted to give floret a try after examining results here: https://spacy.io/usage/v3-2#vectors

First I trained a model with this dataset without word vectors with the following result:
https://github.com/UniversalDependencies/UD_Turkish-Kenet

E    #       LOSS TOK2VEC  LOSS TAGGER  LOSS MORPH...  LOSS PARSER  TAG_ACC  POS_ACC  MORPH_ACC  LEMMA_ACC  DEP_UAS  DEP_LAS  SENTS_F  SCORE 
---  ------  ------------  -----------  -------------  -----------  -------  -------  ---------  ---------  -------  -------  -------  ------
  0       0          0.00        98.93         105.83       149.87    42.35    38.02      23.01      81.28    26.70     7.00     0.61    0.44
  0     200       1441.68      8558.74       13176.99     16624.49    77.24    72.73      56.58      81.28    59.37    32.00    87.13    0.68
  0     400       2653.99      5116.15        9726.29     14466.41    83.57    83.16      67.69      81.28    64.08    39.18    79.87    0.73
  0     600       3659.42      4907.80        9416.03     15937.51    86.31    85.03      73.45      81.28    69.28    45.92    88.51    0.76
  0     800       5683.84      6493.59       12377.14     22119.27    87.56    86.79      76.65      81.28    70.78    49.56    91.93    0.78
  0    1000       7228.31      7116.70       13718.58     26347.50    88.45    87.73      78.40      81.28    75.10    54.36    92.35    0.80
wandb: Adding directory to artifact (./training/UD_Turkish-Kenet/model-last)... Done. 0.4s
  1    1200       9124.81      7526.35       15282.42     30983.31    89.21    88.86      80.13      81.28    75.99    55.40    92.85    0.80
  1    1400      12053.56      9195.38       18233.87     38347.88    89.87    89.40      81.06      81.28    77.00    58.21    93.23    0.81
  2    1600      15051.45     10283.81       21152.24     45872.76    90.25    90.00      82.17      81.28    76.61    58.63    91.72    0.81
  3    1800      19774.66     12021.56       24774.21     56624.26    90.57    90.27      82.56      81.28    78.94    61.24    93.96    0.82
  3    2000      25359.57     12666.83       26975.33     67331.34    90.69    90.57      83.13      81.28    79.19    61.61    93.57    0.82
wandb: Adding directory to artifact (./training/UD_Turkish-Kenet/model-last)... Done. 0.2s
  5    2200      30610.27     14117.30       30483.70     76126.03    90.70    90.66      83.36      81.28    79.82    62.07    94.99    0.83
  6    2400      37610.09     14810.11       32260.73     88460.75    90.73    90.97      83.71      81.28    80.61    63.00    94.84    0.83
  7    2600      40498.39     13884.57       30208.00     86937.24    90.59    90.73      83.43      81.28    80.41    62.97    94.63    0.83
  8    2800      41881.44     13148.82       28470.28     83925.69    90.59    90.73      83.48      81.28    80.61    63.04    95.29    0.83
 10    3000      42934.23     11774.21       25497.70     82309.27    90.44    90.51      83.45      81.28    80.47    63.18    94.54    0.83
wandb: Adding directory to artifact (./training/UD_Turkish-Kenet/model-last)... Done. 0.1s
 11    3200      43877.18     11377.67       24394.14     77525.34    90.50    90.52      83.56      81.28    80.26    62.38    95.32    0.83
 12    3400      45002.19     10551.75       22846.30     74810.87    90.46    90.53      83.54      81.28    80.67    62.92    94.95    0.83
 14    3600      45606.01      9908.86       21106.09     72398.24    90.31    90.35      83.25      81.28    80.99    63.36    95.63    0.83
 15    3800      48484.76      9293.65       19885.67     74135.15    90.32    90.39      83.45      81.28    80.60    62.39    95.11    0.83
 16    4000      49756.19      9379.76       19402.75     71205.52    90.26    90.35      83.40      81.28    81.09    63.14    95.43    0.83
wandb: Adding directory to artifact (./training/UD_Turkish-Kenet/model-last)... Done. 0.2s


wandb: Run summary:
wandb:              dep_las 0.63137
wandb:              dep_uas 0.81087
wandb:            lemma_acc 0.81283
wandb:   loss_morphologizer 19402.74821
wandb:          loss_parser 71205.52032
wandb:          loss_tagger 9379.75532
wandb:         loss_tok2vec 49756.1933
wandb:            morph_acc 0.83403
wandb:        morph_micro_f 0.89778
wandb:        morph_micro_p 0.92065
wandb:        morph_micro_r 0.87602
wandb:              pos_acc 0.90353
wandb:                score 0.82758
wandb:              sents_f 0.95431
wandb:              sents_p 0.96347
wandb:              sents_r 0.94532
wandb:                speed 11606.26679
wandb:              tag_acc 0.90256
wandb:            token_acc 1.0
wandb:              token_f 0.99972
wandb:              token_p 0.9996
wandb:              token_r 0.99983




TOK      100.00
TAG      91.60 
POS      91.48 
MORPH    84.69 
LEMMA    81.58 
UAS      78.54 
LAS      60.73 
SENT P   95.97 
SENT R   95.62 
SENT F   95.79 
SPEED    10333

Then I tried with floret vectors trained with parameters from example here: https://github.com/explosion/projects/tree/v3/pipelines/floret_fi_core_demo

=========================== Initializing pipeline ===========================
[2022-02-24 12:10:50,095] [INFO] Set up nlp object from config
[2022-02-24 12:10:50,107] [INFO] Pipeline: ['tok2vec', 'tagger', 'morphologizer', 'lemmatizer', 'parser']
[2022-02-24 12:10:50,129] [INFO] Created vocabulary
[2022-02-24 12:10:50,239] [INFO] Added vectors: ../vectors/unshuffled_deduplicated_tr.1000000.n4-5_floret_model
[2022-02-24 12:10:50,492] [INFO] Finished initializing nlp object
[2022-02-24 12:11:13,891] [INFO] Initialized pipeline components: ['tok2vec', 'tagger', 'morphologizer', 'lemmatizer', 'parser']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'tagger', 'morphologizer', 'lemmatizer',
'parser']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS TAGGER  LOSS MORPH...  LOSS PARSER  TAG_ACC  POS_ACC  MORPH_ACC  LEMMA_ACC  DEP_UAS  DEP_LAS  SENTS_F  SCORE 
---  ------  ------------  -----------  -------------  -----------  -------  -------  ---------  ---------  -------  -------  -------  ------
  0       0          0.00        98.93         105.83       149.87    38.88    40.27      24.27      81.28    27.14     8.64     0.05    0.43
  0     200       1667.87      7473.89       12493.98     16444.57    82.74    80.98      63.43      81.28    65.91    39.94    88.79    0.73
  0     400       2912.03      4360.64        8719.22     13768.10    86.33    86.25      72.87      81.28    70.02    48.67    67.72    0.77
  0     600       3775.13      4266.90        8350.77     15134.64    87.87    86.90      77.05      81.28    74.42    53.20    87.74    0.79
  0     800       5575.90      5788.21       11018.57     21264.35    89.02    88.08      79.52      81.28    73.23    54.92    89.22    0.80
  0    1000       7025.91      6431.20       12113.73     25137.39    89.84    89.25      81.68      81.28    77.22    58.45    91.41    0.81
  1    1200       8860.58      6970.58       13576.36     30005.38    90.13    89.94      83.32      81.28    80.02    61.51    92.89    0.82
  1    1400      11581.13      8802.21       16206.11     37024.17    90.76    90.40      84.19      81.28    80.63    63.52    91.71    0.83
  2    1600      14117.31      9933.36       18921.47     44206.25    90.87    90.76      84.91      81.28    79.44    62.67    93.18    0.83
  3    1800      18623.84     11752.79       21983.39     55285.32    91.16    91.12      85.77      81.28    81.77    65.11    93.44    0.84
  3    2000      23449.39     12932.01       24502.07     66037.83    91.27    91.23      86.32      81.28    81.43    64.99    92.98    0.84
  5    2200      28112.74     14816.78       27808.07     75329.03    91.42    91.28      86.27      81.28    82.56    66.13    93.91    0.84
  6    2400      34258.87     15906.00       29968.62     87770.51    91.36    91.43      86.56      81.28    82.38    65.58    95.07    0.84
  7    2600      36236.84     15018.21       28198.10     86472.63    91.16    91.30      86.19      81.28    82.33    66.49    95.21    0.84
  8    2800      37663.94     14424.94       26876.22     83818.16    91.32    91.30      86.57      81.28    82.98    66.38    95.37    0.84
 10    3000      38564.14     13114.61       24450.11     82723.55    91.21    91.35      86.55      81.28    83.46    67.30    95.91    0.84
 11    3200      39981.61     12428.82       23297.17     77799.75    91.40    91.50      86.78      81.28    83.38    66.95    96.09    0.84
 12    3400      41131.58     11807.83       22083.27     75499.70    91.30    91.34      86.56      81.28    83.16    67.19    96.21    0.84
 14    3600      41936.60     10839.98       20591.42     73367.62    91.12    91.27      86.34      81.28    83.12    66.72    95.85    0.84
 15    3800      44504.19     10219.53       19396.03     74883.28    91.16    91.25      86.43      81.28    82.95    66.16    95.79    0.84
 16    4000      45762.92      9944.07       18701.55     71490.58    90.97    90.94      86.06      81.28    83.52    66.83    95.74    0.84
 18    4200      46497.80      9127.85       17543.11     70158.71    91.14    91.32      86.51      81.28    82.79    66.70    95.53    0.84
 19    4400      47542.82      8745.55       16853.62     67679.41    90.79    90.96      86.07      81.28    82.47    65.98    95.72    0.84
 20    4600      49385.74      8445.74       15988.43     67671.22    91.07    91.07      86.05      81.28    82.91    66.02    95.94    0.84
 22    4800      48987.24      7645.34       14944.40     64322.30    91.10    91.03      86.05      81.28    82.91    66.04    96.06    0.84
✔ Saved pipeline to output directory
training/UD_Turkish-Kenet/model-last


================================== evaluate ==================================
Running command: /opt/conda/bin/python -m spacy evaluate ./training/UD_Turkish-Kenet/model-best ./corpus/UD_Turkish-Kenet/test.spacy --output ./metrics/UD_Turkish-Kenet.json --gpu-id -1
ℹ Using CPU

================================== Results ==================================

TOK      100.00
TAG      92.27 
POS      92.33 
MORPH    87.43 
LEMMA    81.58 
UAS      81.42 
LAS      64.04 
SENT P   96.56 
SENT R   95.62 
SENT F   96.09 
SPEED    9979

The only difference I could find in the config (from my first try end Finnish example) - no major changes:

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96 > 256
depth = 4 > 8

As you can see from the results, there is no meaningful change... I was expecting more since I was trying with another what is called agglutinative language...

Is there anything else I could try (parameters or data set ) to see at least 3 point increase like in Finnish example ?

Edit: Another thing, maybe it is related, although I can get similarity result, there is a warning:

['tok2vec', 'tagger', 'morphologizer', 'lemmatizer', 'parser', 'ner']
0.74184476898971

/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:21: UserWarning: [W007] The model you're using has no word vectors loaded, so the result of the Doc.similarity method will be based on the tagger, parser and NER, which may not give useful similarity judgements. This may happen if you're using one of the small models, e.g. `en_core_web_sm`, which don't ship with word vectors and only use context-sensitive tensors. You can always add your own word vectors, or use one of the larger models instead if available.

Thanks.

adrianeboyd · 2022-02-28T16:47:25Z

adrianeboyd
Feb 28, 2022

Thanks for the note about the warning, that was a mistake in Vectors that should be fixed by #10394.

It's hard to say exactly why you're not seeing a lot of improvement.

I would definitely train on more text than is used by default in the demo project, which was intended to run relatively quickly for demo purposes. I'd try the full OSCAR subset for Turkish if you have the space/time to process it, which might mean it takes a couple hours to download/tokenize the corpus and closer to a day to train the vectors, depending largely on your CPU and n_process.

Also consider testing with larger vector_size bucket sizes and adjusting the minn/maxn settings for Turkish. In what we've seen so far, it looks like a good rule of thumb is to have maxn as the size of the largest frequently used affix + 1, so 4/5 might be a bit short for some Turkish affixes and maybe 4/6 or 5/6 would be better. You could also try skipgram instead of cbow.

We'd be interested in hearing about the results if you find some particular good (or bad) settings!

0 replies

mehmetilker · 2022-02-28T18:03:57Z

mehmetilker
Feb 28, 2022
Author

Thanks @adrianeboyd ,

I will try with larger text and arrange parameters. I hope will have good results to report back here...
I am also planning to test "Edit tree lemmatizer"

By the way I though text size is enough since you got result you stated here (https://spacy.io/usage/v3-2#vectors) with demo project. Maybe problem is the quality of corpus...

3 replies

adrianeboyd Mar 1, 2022

It could be a number of things, also the lengths of the Turkish texts in OSCAR (I don't know how much training data you end up with for Turkish based on the text count cutoff) or the need for more training data given more variation in surface forms due to details about vowel harmony in Turkish, etc.

In this case, I think it makes sense to try more data and larger bucket sizes (closer in size to the vector tables in lg models, so something like ~500K-1M entries) initially to see if you get the kinds of results you want, and then see if you can maintain similar performance with smaller tables.

mehmetilker Mar 1, 2022
Author

I have tried with little bit more text and changed parameters and almost same result:

I have changed max_texts from 1M to 4M > Corpus text size increased from 368MB to 446MB (I expected more since x4 but something is wrong with dataset reading I guess, I got memory error for bigger text query multiple times...)

Other parameters:

  vector_size: 100000 #50000
  vector_dim: 300
  n_process: 8
  minn: 4
  maxn: 6 # 5
  model_type: "skipgram" # cbow

Well, I expected at least little bit increase on accuracy since bigger text and different parameters... Maybe there is something else I am missing...

By the way, cbow is better suited for training I guess. When I query nearest_neighbors with cbow model I got words with different affixes for the same word but when I query with skip gram I got, for example other cities related with queried city name.
Am I right about cbow ?


E    #       LOSS TOK2VEC  LOSS TAGGER  LOSS MORPH...  LOSS PARSER  LOSS NER  TAG_ACC  POS_ACC  MORPH_ACC  LEMMA_ACC  DEP_UAS  DEP_LAS  SENTS_F  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  -----------  -------------  -----------  --------  -------  -------  ---------  ---------  -------  -------  -------  ------  ------  ------  ------
  0       0          0.00        98.93         105.83       149.87     53.50    44.84    41.91      24.60      81.30    10.54    10.18     0.00    0.00    0.00    0.00    0.37
  0     200        991.71      6252.29       10923.71     15846.44     80.84    85.54    84.67      71.13      81.30    68.02    43.15    84.60    0.00    0.00    0.00    0.52
  0     400       1573.07      3975.76        7510.96     13492.13      0.00    86.23    87.42      76.42      81.30    70.53    48.02    82.92    0.00    0.00    0.00    0.53
  0     600       2299.58      3949.09        7307.34     15128.73      0.00    88.52    88.42      80.92      81.30    74.85    53.87    92.05    0.00    0.00    0.00    0.54
  0     800       3662.33      5249.43        9501.81     20429.86      0.00    89.54    89.13      82.61      81.30    78.02    57.58    92.79    0.00    0.00    0.00    0.55
  0    1000       4841.36      5837.67       10410.14     24528.23      0.47    90.32    90.23      84.23      81.30    79.12    61.15    91.67    0.00    0.00    0.00    0.56
wandb: Adding directory to artifact (./training/UD_Turkish-Kenet/model-last)... Done. 1.0s
  1    1200       6379.42      6430.34       11348.59     28800.33      0.00    90.93    91.02      85.60      81.30    79.84    62.32    93.61    0.00    0.00    0.00    0.56
  1    1400       8543.27      7566.99       13609.10     36051.79      0.00    91.06    91.16      86.25      81.30    81.80    64.60    95.30    0.00    0.00    0.00    0.56
  2    1600      10427.53      8180.32       14322.78     40902.85      0.00    90.92    91.18      86.13      81.30    81.39    64.53    94.49    0.00    0.00    0.00    0.56
  3    1800      13543.44      9280.51       16305.70     49142.18      0.00    91.36    91.35      86.39      81.30    82.05    66.10    95.84    0.00    0.00    0.00    0.56
  3    2000      16534.34      9187.72       15648.07     56633.88      0.00    91.39    91.47      86.51      81.30    82.69    66.70    96.05    0.00    0.00    0.00    0.57
wandb: Adding directory to artifact (./training/UD_Turkish-Kenet/model-last)... Done. 0.4s
  5    2200      20390.72      9436.59       15910.95     64943.36      0.00    91.48    91.59      86.81      81.30    83.49    66.94    96.25    0.00    0.00    0.00    0.57
  6    2400      23847.97      8400.09       14066.51     69585.70      0.00    91.24    91.23      86.41      81.30    83.30    65.90    95.87    0.00    0.00    0.00    0.56
  7    2600      25207.03      7375.08       12489.39     65128.44      0.00    91.34    91.50      86.62      81.30    83.25    66.13    96.38    0.00    0.00    0.00    0.57
  8    2800      25937.22      6332.66       10682.15     59939.30      0.00    91.10    91.24      86.35      81.30    83.47    65.96    95.72    0.00    0.00    0.00    0.56
 10    3000      26021.27      5271.06        8839.10     55106.21      0.00    91.13    91.20      86.35      81.30    83.03    65.95    95.12    0.00    0.00    0.00    0.56
wandb: Adding directory to artifact (./training/UD_Turkish-Kenet/model-last)... Done. 0.4s
 11    3200      26966.49      5043.35        8409.12     51710.94      0.00    90.87    91.03      86.21      81.30    83.16    65.83    95.12    0.00    0.00    0.00    0.56
 12    3400      28041.43      4590.34        7751.03     50270.81      0.00    90.71    90.94      86.23      81.30    83.25    65.78    96.00    0.00    0.00    0.00    0.56
 14    3600      27128.76      4013.20        6816.87     45263.97      0.00    90.89    90.97      86.06      81.30    83.29    66.06    96.49    0.00    0.00    0.00    0.56
 15    3800      28359.73      3854.96        6496.75     44927.43      0.00    91.00    91.09      86.06      81.30    83.02    65.53    95.98    0.00    0.00    0.00    0.56



TOK      100.00
TAG      92.11 
POS      92.28 
MORPH    87.81 
LEMMA    81.58 
UAS      81.02 
LAS      64.07 
NER P    -     
NER R    -     
NER F    -     
SENT P   96.74 
SENT R   95.74 
SENT F   96.24 
SPEED    2703

adrianeboyd Mar 7, 2022

I think the reason you're not seeing much difference is that <500MB is not really enough data for floret. The demo examples were trained on ~5GB text and there's still room for improvement.

Figure out what's causing the memory error (maybe too many processes for tokenization?) and make sure that the tokenization step runs without errors before moving onto the floret training. I think there is a total of at least ~25GB for Turkish from the original OSCAR release?

The fasttext docs mention that skipgram is better-suited for subword information, but in their own provided models they use cbow, so I have to say I'm not really sure, and I haven't done any minimal comparisons.

mehmetilker · 2022-03-13T19:41:49Z

mehmetilker
Mar 13, 2022
Author

Hello again,
No success to improve accuracy even though I worked with 8GB corpus.

I produced to vector model for cbow and skipgram with following parameters and I got almost save accuracy (%91-92) for these two training (one for cbow and another for skipgram) which is again almost same with previous trainings.

Much bigger corpus and different parameters did not give and meaningful change which leads me to think I am doing something wrong.

  max_texts: 3000000
  vector_size: 100000 #50000
  vector_dim: 300
  n_process: 8
  minn: 4
  maxn: 7
  model_type: "skipgram"

By the way, the reason I could not produce bigger corpus is that, increasing max_texts do not change anything if you do not have too much RAM. I ranted a machine with 16GB/8 core. But after producing 1 GB corpus, app crashed without error, and every time I had restart it again skipping first part like [dataset.skip(1280609 + 328750 + 411257)] (I have check if I duplicate the corpus)

Is it possible there is something wrong with vector model?

When I run len(lang.model.vocab) >258
len(lang.model.vocab.vectors) > 100000

Another thing is here I do not understand, when I run [orth for orth in lang.model.vocab.strings]
I see following result. shouldn't be there 100K vocabulary list ?

['-', '.', '.C.', '.D.', '.K.', '.S.', '.iz', '.Ö.', 'A', 'A.B.D.', 'AB', 'ABD', 'ABS', 'AOÇ', 'ASKİ', 'ATA', 'Alb', 'Alb.', 'Amerika', 'Ank', 'Ank.', 'Ankara', 'Apt', 'Apt.', 'Ar', 'Ar.Gör.', 'Arş', 'Arş.Gör.', 'As', 'As.İz.', 'Asb', 'Asb.', 'Astsb', 'Astsb.', 'Atğm', 'Av', 'Av.', 'Avrupa Birliği', 'Avrupa uzay ajansı', 'B', 'BDDK', 'BJK', 'BMM', 'Bağ', 'Bağ-kur', 'Bağkur', 'Beşiktaş', 'Bnb', 'Bnb.', 'Bul', 'Bul.', 'Bçvş', 'Bçvş.', 'Böl', 'Böl.', 'Bşk', 'Bşk.', 'Bştbp', 'Bştbp.', 'C', 'Cad', 'Cad.', 'D', 'DDK', 'Doç', 'Doç.', 'Dr', 'Dr.', 'Dz', 'Dz.', 'Dz.K.K.lığı', 'Dz.Kuv.', 'Dz.Kuv.K.', 'E', 'ESA', 'Ecz', 'Ecz.', 'F', 'FB', 'Fak', 'Fak.', 'Fenerbahçe', 'G', 'GATA', 'GS', 'Galatasaray', 'Genelkurmay', 'Gn', 'Gn.', 'Gn.Kur.', 'Gnkur', 'Gnkur.', 'Gör', 'H', 'Hs', 'Hs.Uzm.', 'Hst', 'Hst.', 'Hv', 'Hv.', 'Hv.K.K.lığı', 'Hv.Kuv.', 'Hv.Kuv.K.', 'Hz', 'Hz.', 'Hz.Öz.', 'J', 'Jeol', 'Jeol.', 'K', 'K.', 'KBB', 'Korg', 'Korg.', 'Kur', 'Kur.', 'Kur.Bşk.', 'Kuv', 'Kuv.', 'L', 'Ltd', 'Ltd.', 'M', 'M.S.', 'M.Ö.', 'Mah', 'Mah.', 'Mim', 'Müh', 'Müh.', 'O', 'Onb', 'Onb.', 'Ord', 'Ord.', 'Org', 'Org.', 'P', 'Ped', 'Ped.', 'Prof', 'Prof.', 'R', 'RTÜK', 'S', 'SKİ', 'Sb', 'Sb.', 'Sn', 'Sn.', 'Sok', 'Sok.', 'T', 'T.C.', 'TBMM', 'TC', 'Tbp', 'Tbp.', 'Tel', 'Tel.', 'Tug', 'Tug.', 'Tuğg', 'Tuğg.', 'TÜK', 'TÜİK', 'Tümg', 'Tümg.', 'Türkiye istatistik kurumu', 'Tğm', 'Tğm.', 'U', 'Uzm', 'Uzm.', 'X.', 'X.X.', 'X.X.X.', 'X.Xxx', 'X.Xxx.', 'XX', 'XXX', 'XXXX', 'Xx', 'Xx.', 'Xx.X.X.xxxx', 'Xx.Xx.', 'Xx.Xxx.', 'Xx.Xxx.X.', 'Xxx', 'Xxx-xxx', 'Xxx.', 'Xxx.Xxx.', 'Xxxx', 'Xxxx.', 'Xxxx.Xxx.', 'Xxxxx', 'Xxxxx.', 'Y', 'Y.Mim', 'Y.Mim.', 'Y.Müh', 'Y.Müh.', 'Yar', 'Yar.', 'Yar.Doç.', 'Yard', 'Yard.', 'Yard.Doç.', 'Yb', 'Yb.', 'Yd', 'Yd.Sb.', 'Yrd', 'Yrd.', 'Yrd.Doç.', 'YÖK', 'a', 'a.b.d.', 'ab', 'abd', 'abs', 'ad.', 'ah.', 'ak.', 'alb', 'alb.', 'albay', 'ank', 'ank.', 'aoç', 'apartmanı', 'apt', 'apt.', 'ar', 'ar.', 'ar.gör.', 'ard', 'arş', 'arş.gör.', 'as', 'as.iz', 'as.iz.', 'as.i̇z.', 'asb', 'asb.', 'aski̇', 'asteğmen', 'astsb', 'astsb.', 'astsubay', 'atğm', 'av', 'av.', 'avukat', 'ax.', 'b', 'bakınız', 'bağ', 'bağ-kur', 'başkanlığı', 'baştabip', 'başçavuş', 'bddk', 'binbaşı', 'bjk', 'bk', 'bk.', 'bknz', 'bknz.', 'bnb', 'bnb.', 'bp.', 'bul', 'bul.', 'bulvarı', 'bçvş', 'bçvş.', 'böl', 'böl.', 'bölümü', 'bşk', 'bşk.', 'bştbp', 'bştbp.', 'c', 'cad', 'cad.', 'caddesi', 'cz.', 'd', 'dak', 'dak.', 'dakika', 'deniz', 'derleyen', 'dk', 'dk.', 'doktor', 'doç', 'doç.', 'doçent', 'doğ', 'doğ.', 'dr', 'dr.', 'drl', 'drl.', 'dz', 'dz.', 'dz.k.k.lığı', 'dz.kuv.', 'dz.kuv.k.', 'dzl', 'dzl.', 'düzenleyen', 'e', 'ecz', 'ecz.', 'eczanesi', 'ed.', 'ekon', 'ekon.', 'ekonomi', 'el.', 'elg', 'eol', 'esa', 'ev.', 'fak', 'fak.', 'fakültesi', 'fb', 'fren', 'g', 'gata', 'genel', 'gn', 'gn.', 'gn.kur.', 'gnkur', 'gnkur.', 'gr', 'gr.', 'gram', 'gs', 'gör', 'h', 'hastanesi', 'hava', 'hazreti', 'hs', 'hs.uzm.', 'hst', 'hst.', 'huk', 'huk.', 'hukuk', 'hv', 'hv.', 'hv.k.k.lığı', 'hv.kuv.', 'hv.kuv.k.', 'hz', 'hz.', 'hz.öz.', 'ic.', 'im.', 'in.', 'ingilizce', 'iz.', 'i̇ng', 'i̇ng.', 'i̇ski̇', 'i̇st', 'i̇st.', 'i̇z', 'j', 'jeol', 'jeol.', 'jeoloji', 'k', 'k.', 'kbb', 'knz', 'kon', 'korg', 'korg.', 'korgeneral', 'kur', 'kur.', 'kur.bşk.', 'kurmay', 'kuv', 'kuv.', 'kuvvetleri', 'l', 'lb.', 'lg.', 'limited', 'ltd', 'ltd.', 'm', 'm.s.', 'm.ö.', 'mah', 'mah.', 'mahallesi', 'maksimum', 'max', 'max.', 'mg.', 'min', 'min.', 'minimum', 'müh', 'müh.', 'mühendisliği', 'nb.', 'ng.', 'ni.', 'nk.', 'nz.', 'of.', 'ok.', 'ol.', 'on.', 'onb', 'onb.', 'onbaşı', 'ord', 'ord.', 'ordinaryüs', 'org', 'org.', 'orgeneral', 'oç.', 'oğ.', 'p', 'ped', 'ped.', 'pedagoji', 'prof', 'prof.', 'profesör', 'pt.', 'radyo ve televizyon üst kurulu', 'rd.', 'rg.', 'rl.', 'rof', 'rtük', 's', 'saniye', 'sayın', 'sb', 'sb.', 'sn', 'sn.', 'sok', 'sok.', 'sokak', 'st.', 'subay', 't', 't.c.', 'tabip', 'tbmm', 'tbp', 'tbp.', 'tc', 'td.', 'tel', 'tel.', 'telefon', 'telg', 'telg.', 'telgraf', 'teğmen', 'ti.', 'tic', 'tic.', 'ticaret', 'tr', 'tsb', 'tug', 'tug.', 'tugay', 'tuğg', 'tuğg.', 'tuğgeneral', 'tüi̇k', 'tümg', 'tümg.', 'tümgeneral', 'tğm', 'tğm.', 'ug.', 'uk.', 'ul.', 'ur.', 'uv.', 'uzm', 'uzm.', 'uzman', 'uğg', 'v', 'vb', 'vb.', 'vesaire', 'vs', 'vs.', 'vş.', 'xx', 'xx.', 'xx.xx', 'xx.xx.', 'xxx', 'xxx.', 'xxxx', 'xxxx.', 'y', 'y.mim', 'y.mim.', 'y.müh', 'y.müh.', 'yar', 'yar.', 'yar.doç.', 'yarbay', 'yard', 'yard.', 'yard.doç.', 'yardımcı', 'yb', 'yb.', 'yd', 'yd.sb.', 'yrd', 'yrd.', 'yrd.doç.', 'yy', 'yy.', 'yök', 'yüzyıl', 'zl.', 'zm.', 'Ç', 'Çvş', 'Çvş.', 'Ö', 'Öz', 'Öz.', 'Ü', 'Üni', 'Üni.', 'Ütğm', 'Ütğm.', 'Üçvş', 'Üçvş.', 'ÜİK', 'ç', 'çavuş', 'çev', 'çev.', 'çeviren', 'çvş', 'çvş.', 'öl.', 'ör.', 'öz', 'üh.', 'ümg', 'üni', 'üni.', 'üniversitesi', 'üsteğmen', 'üstçavuş', 'ütğm', 'ütğm.', 'üçvş', 'üçvş.', 'ğg.', 'ğm.', 'İ', 'İSKİ', 'İng', 'İng.', 'İst', 'İst.', 'İstanbul', 'İz', 'İz.', 'ığı', 'Ş', 'Şb', 'Şb.', 'Şti', 'Şti.', 'ş', 'şb', 'şb.', 'şirketi', 'şk.', 'şti', 'şti.', 'şube', 'kadın', 'dın']

1 reply

adrianeboyd Mar 14, 2022

I also experimented a bit with cbow and skipgram for Finnish and Korean and overall there wasn't much difference. In most cases the performance was slightly better with cbow and cbow trains faster, so at this point I would recommend cbow.

The nlp objects do grow in size due to the vocab caches as lots of texts are processed and the provided tokenization script isn't optimized for limited RAM. The general recommended solution is to periodically reload nlp, but this would mean you'd have to adjust how it loads the dataset to break it into smaller chunks for processing. (Also nlp.pipe is slow and uses more RAM than some other alternatives, but it makes the script a bit simpler for this example. Some related discussion: #10306. We're working on a spacy project for larger data sizes, which will hopefully be published with v3.3.)

The internals of floret aren't like a BPE vocab where you'd have a fixed list of 100K substrings. The hash values can be calculated for any arbitrary string. Oversimplified it looks like this:

hash("apple") % 100000 -> 20789

We don't need to store apple because we can always recalculate this hash (and some other strings will also have the same hash given only 50000 entries).

mehmetilker · 2022-03-14T10:18:59Z

mehmetilker
Mar 14, 2022
Author

Thanks @adrianeboyd for advises and explanation on floret. I guess there isn't much to try. I will continue with transformer model. Though can't get more then %92 with it as well...

1 reply

adrianeboyd Mar 15, 2022

It looks like the current stanza model's UPOS accuracy is 93.4, so this is relatively close given that it's a much smaller/faster model. There may not be much more improvement with only default vs. floret vectors.

Uh oh!

Trying to increase accuracy with floret #10391

Uh oh!

Uh oh!

mehmetilker Feb 28, 2022

Replies: 4 comments · 5 replies

Uh oh!

adrianeboyd Feb 28, 2022

Uh oh!

mehmetilker Feb 28, 2022 Author

Uh oh!

adrianeboyd Mar 1, 2022

Uh oh!

Uh oh!

mehmetilker Mar 1, 2022 Author

Uh oh!

adrianeboyd Mar 7, 2022

Uh oh!

mehmetilker Mar 13, 2022 Author

Uh oh!

adrianeboyd Mar 14, 2022

Uh oh!

mehmetilker Mar 14, 2022 Author

Uh oh!

adrianeboyd Mar 15, 2022

mehmetilker
Feb 28, 2022

Replies: 4 comments 5 replies

adrianeboyd
Feb 28, 2022

mehmetilker
Feb 28, 2022
Author

mehmetilker Mar 1, 2022
Author

mehmetilker
Mar 13, 2022
Author

mehmetilker
Mar 14, 2022
Author