Commit 6c7e847

Bugfix/drive downloads (#41)
1 parent aad3d87 commit 6c7e847

File tree

6 files changed: +490 −731 lines changed

README.md

Lines changed: 6 additions & 4 deletions
@@ -8,12 +8,14 @@ You can install Sherlock by cloning this repository, and run `pip install .`.
 
 
 ## Demonstration of usage
-The notebooks in `notebooks/` prefixed with `01-data processing.ipynb` and `02-1-train-and-test-sherlock.ipynb` can be used to reproduce the results, and demonstrate the usage of Sherlock (from data preprocessing to model training and evaluation). The `00-WIP-use-sherlock-out-of-the-box.ipynb` notebook demonstrates usage of the readily trained model for a given table (WIP).
+The `00-use-sherlock-out-of-the-box.ipynb` notebook demonstrates usage of the readily trained model for a given table (WIP).
+
+The notebooks in `notebooks/` prefixed with `01-data processing.ipynb` and `02-1-train-and-test-sherlock.ipynb` can be used to reproduce the results, and demonstrate the usage of Sherlock (from data preprocessing to model training and evaluation).
 
 
 ## Data
 The raw data (corresponding to annotated table columns) can be downloaded using the `download_data()` function in the `helpers` module.
-This will download 3.6GB of data into the `data` directory. Use the `01-data-preprocessing.ipynb` notebook to preprocess this data. Each column is then represented by a feature vector of dimensions 1x1588. The extracted features per column are based on "paragraph" embeddings (full column), word embeddings (aggregated from each column cell), character count statistics (e.g. average number of "." in a column's cells) and column-level statistics (e.g. column entropy).
+This will download +/- 500MB of data into the `data` directory. Use the `01-data-preprocessing.ipynb` notebook to preprocess this data. Each column is then represented by a feature vector of dimensions 1x1588. The extracted features per column are based on "paragraph" embeddings (full column), word embeddings (aggregated from each column cell), character count statistics (e.g. average number of "." in a column's cells) and column-level statistics (e.g. column entropy).
 
 
 ## The Sherlock model
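The README text in this hunk names two of the cheaper feature families: character-count statistics (e.g. the average number of "." in a column's cells) and column-level statistics (e.g. column entropy). A minimal sketch of what such features look like; the function names are hypothetical illustrations, not Sherlock's actual feature-extraction code:

```python
import math
from collections import Counter

def avg_char_count(cells, char="."):
    """Average number of `char` occurrences per cell in a column."""
    return sum(cell.count(char) for cell in cells) / len(cells)

def column_entropy(cells):
    """Shannon entropy (bits) of the value distribution in a column."""
    counts = Counter(cells)
    total = len(cells)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

column = ["1.0", "2.5", "2.5", "3.75"]
print(avg_char_count(column))   # one "." per cell -> 1.0
print(column_entropy(column))   # counts {1: "1.0", 2: "2.5", 1: "3.75"} -> 1.5
```

In the real pipeline many such scalars, together with paragraph and word embeddings, are concatenated into the 1x1588 feature vector per column.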
@@ -36,14 +38,14 @@ The notebook `02-1-train-and-test-sherlock.ipynb` illustrates how Sherlock, as c
 ├── model_files <- Files with trained model weights and specification.
 ├── sherlock_model.json
    └── sherlock_weights.h5
-
+
 ├── notebooks <- Notebooks demonstrating data preprocessing and train/test of Sherlock.
 └── 00-WIP-use-sherlock-out-of-the-box.ipynb
 └── 01-data-preprocessing.ipynb
 └── 02-1-train-and-test-sherlock.ipynb
 └── 02-2-train-and-test-sherlock-rf-ensemble.ipynb
 └── 03-train-paragraph-vector-features-optional.ipynb
-
+
 ├── sherlock <- Package.
     ├── deploy <- Code for (re)training Sherlock, as well as model specification.
 └── helpers.py
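The `model_files` entries above (`sherlock_model.json` plus `sherlock_weights.h5`) follow the common Keras split of architecture JSON and HDF5 weights. A hypothetical loading sketch under that assumption; this helper is not part of the sherlock package, and the actual loading path may differ:

```python
import os

def load_sherlock_model(model_dir="model_files"):
    """Rebuild a network from its JSON spec and load its HDF5 weights.

    Hypothetical helper: assumes the standard Keras JSON + HDF5 layout
    shown in the repository tree, and that TensorFlow/Keras is installed.
    """
    from tensorflow.keras.models import model_from_json  # deferred import

    with open(os.path.join(model_dir, "sherlock_model.json")) as f:
        model = model_from_json(f.read())
    model.load_weights(os.path.join(model_dir, "sherlock_weights.h5"))
    return model
```

The TensorFlow import is deferred into the function body so the module can be imported without TensorFlow present; the heavy dependency is only needed when a model is actually loaded.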
