|
1 | 1 | # Development |
2 | 2 |
|
3 | | -## Setup |
4 | | - |
5 | | -To be able to train the model |
6 | | - |
7 | | -- Download the [Tatoeba sentence export](https://downloads.tatoeba.org/exports/sentences.tar.bz2) |
8 | | -- Extract in `data/tatoeba.csv` |
9 | | - |
10 | | -- Download the [UDHR](https://unicode.org/udhr/assemblies/udhr_txt.zip) |
11 | | -- Extract in `data/udhr/` |
12 | | - |
13 | 3 | ## Commands |
14 | 4 |
|
15 | 5 | ```sh |
16 | | -# install deps |
| 6 | +# Install |
17 | 7 | yarn |
18 | 8 |
|
19 | | -# train and generate language profiles |
20 | | -yarn train |
21 | | - |
22 | | -# build the library |
| 9 | +# Build |
23 | 10 | yarn build |
24 | 11 |
|
25 | | -# code style linting |
| 12 | +# Test |
| 13 | +yarn test |
| 14 | + |
| 15 | +# Lint / Auto-fix code style problems |
26 | 16 | yarn lint |
| 17 | +``` |
27 | 18 |
|
28 | | -# test |
29 | | -yarn test |
| 19 | +--- |
| 20 | + |
| 21 | +## Install issues |
| 22 | + |
| 23 | +For the moment, the library has a lot of dev-dependencies that exist purely for the benchmark process. |
| 24 | +Some of those libraries need to compile native code, which can be problematic (gcc, gyp, python, ...). |
| 25 | + |
| 26 | +If you run into those issues, one of the easiest solutions is to remove the problematic dependencies from `package.json`, then try installing again. |
| 27 | + |
| 28 | +[like here](https://github.com/komodojp/tinyld/issues/10#issuecomment-1019085476) |
| 29 | + |
| 30 | +This will only cause issues with `yarn bench`; everything else should still work normally. |
| 31 | + |
| 32 | +--- |
| 33 | + |
| 34 | +## Optional |
| 35 | + |
| 36 | +### 1. Generate profiles (`yarn train`) |
| 37 | + |
| 38 | +This step requires a lot of data and time, so it is optional; the results are stored directly in git. |
| 39 | + |
| 40 | +It analyses a lot of text in different languages and builds statistics to identify the best features for each language. |
| 41 | + |
| 42 | +To train the model, you first need to have the datasets available locally. |
| 43 | + |
| 44 | +``` |
| 45 | +Download Datasets |
| 46 | + - Download the [Tatoeba sentence export](https://downloads.tatoeba.org/exports/sentences.tar.bz2) |
| 47 | + - Extract in `data/tatoeba.csv` |
| 48 | + - Download the [UDHR](https://unicode.org/udhr/assemblies/udhr_txt.zip) |
| 49 | + - Extract in `data/udhr/` |
| 50 | +
|
| 51 | +Run yarn train |
| 52 | + - For each language, it will build statistics for words and n-grams |
| 53 | + - This goes through a massive amount of data and will take time, so prepare a few coffees |
| 54 | +
|
| 55 | +When your profile files are generated, you can run `yarn build` to get a build that uses the new data |
30 | 56 | ``` |
| 57 | + |
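To give a feel for what "build statistics for words and n-grams" can mean, here is a minimal sketch of per-language character n-gram counting. It is not the library's training code: the names `countNgrams`, `buildProfiles` and `topNgrams`, the sample shape, and the lowercasing step are all assumptions made for illustration.

```typescript
// Illustrative sketch only: these helper names are hypothetical and are not
// taken from the library's training script.
type NgramCounts = Map<string, number>;

// Count every character n-gram of length `n` in one text,
// adding to an existing table when one is passed in.
function countNgrams(
  text: string,
  n: number,
  counts: NgramCounts = new Map<string, number>()
): NgramCounts {
  // Array.from splits by code point, so characters outside the BMP stay whole.
  const chars = Array.from(text.toLowerCase());
  for (let i = 0; i + n <= chars.length; i++) {
    const gram = chars.slice(i, i + n).join("");
    counts.set(gram, (counts.get(gram) ?? 0) + 1);
  }
  return counts;
}

// Build one n-gram table per language from labelled sentences.
function buildProfiles(
  samples: Array<{ lang: string; text: string }>,
  n: number
): Map<string, NgramCounts> {
  const profiles = new Map<string, NgramCounts>();
  for (const sample of samples) {
    const counts = profiles.get(sample.lang) ?? new Map<string, number>();
    profiles.set(sample.lang, countNgrams(sample.text, n, counts));
  }
  return profiles;
}

// Return the `limit` most frequent n-grams (ties broken alphabetically).
function topNgrams(counts: NgramCounts, limit: number): string[] {
  return Array.from(counts.entries())
    .sort((a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0))
    .slice(0, limit)
    .map((entry) => entry[0]);
}
```

A real training run would stream the Tatoeba and UDHR files instead of holding sentences in memory, and would likely compare tables across languages to keep only the most distinctive features.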
| 58 | +### 2. Generate benchmark data (`yarn bench`) |
| 59 | + |
| 60 | +This step requires a bit of time: it runs a lot of different tests against a set of libraries to generate the benchmark page and diagrams. |
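
As a rough picture of what running tests against a set of libraries involves, the sketch below scores several detector functions on the same labelled samples. The `Detector`, `BenchSample` and `BenchResult` types and the `benchmark` function are hypothetical; the actual `yarn bench` scripts may measure and report things differently.

```typescript
// Illustrative sketch only: these types and this function are assumptions,
// not the project's benchmark code.
type Detector = (text: string) => string;

interface BenchSample {
  lang: string; // expected language code
  text: string; // sentence to classify
}

interface BenchResult {
  name: string;
  accuracy: number; // share of samples classified correctly, from 0 to 1
  msPerSample: number; // coarse wall-clock time per sample
}

// Run every detector over the same samples and collect accuracy and timing.
function benchmark(
  detectors: Record<string, Detector>,
  samples: BenchSample[]
): BenchResult[] {
  return Object.keys(detectors).map((name) => {
    const detect = detectors[name];
    const start = Date.now();
    let correct = 0;
    for (const sample of samples) {
      if (detect(sample.text) === sample.lang) correct++;
    }
    const elapsed = Date.now() - start;
    const total = samples.length;
    return {
      name,
      accuracy: total === 0 ? 0 : correct / total,
      msPerSample: total === 0 ? 0 : elapsed / total,
    };
  });
}
```

Each library under test would be wrapped in a small adapter matching `Detector`, so that all of them are measured on identical input.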