Commit 99477f2

docs: add end-to-end example for the MANE expanding coverage process
1 parent ed27e67 commit 99477f2

1 file changed

+58
-3
lines changed

tools/preprocessing/README.md

Lines changed: 58 additions & 3 deletions
@@ -98,8 +98,8 @@ Non-runtime paths still live in `config.yaml`:
 python -m tools.preprocessing.update_samplesheet_and_structures \
 --samplesheet-folder /path/to/mane_missing/data \
 --mane-dataset-dir /path/to/mane_only_dataset \
-[--predicted-dir /path/to/nfcore/pdbs] \
-[--canonical-dir /path/to/af_canonical_pdbs]
+[--canonical-dir /path/to/af_canonical_pdbs] \
+[--predicted-dir /path/to/nfcore/pdbs]
 ```

 Arguments:
@@ -151,9 +151,64 @@ After each run, `<samplesheet_folder>` contains:

 ## End-to-end loop

-0. **Bootstrap the datasets** – run `oncodrive3d build-datasets --mane_only` to generate the MANE-only baseline and mapping files, and run `oncodrive3d build-datasets --mane` (or a default `build-datasets`) if you also wish to retrieve structures from the canonical AlphaFold download (recommended).
+0. **Initialize the datasets** – run `oncodrive3d build-datasets --mane_only` to generate the MANE-only baseline and mapping files, and run `oncodrive3d build-datasets --mane` (or a default `build-datasets`) if you also wish to retrieve structures from the canonical AlphaFold download (recommended).
 1. **Prepare the missing set** – run `prepare_samplesheet.py`.
 2. **Harvest canonical matches (optional)** – invoke `update_samplesheet_and_structures.py` with `--canonical-dir` to reuse any AlphaFold canonical structures and shrink the missing set before prediction.
 3. **Predict the remaining structures** – run nf-core/proteinfold on `missing/samplesheet.csv` + `missing/fasta/`.
 4. **Ingest predictions** – re-run `update_samplesheet_and_structures.py` with `--predicted-dir` (and optionally `--canonical-dir` again) to fold new PDBs into `predicted/` and refresh the missing set.
 5. **Rebuild Oncodrive3D datasets** – point `oncodrive3d build-datasets --mane_only` at `final_bundle/pdbs` (`--custom_mane_pdb_dir`) and `final_bundle/samplesheet.csv` (`--custom_mane_metadata_path`) so every subsequent `oncodrive3d run` benefits from the extended MANE coverage.
+
+### Example
+
+1. **Initialize datasets**
+```bash
+# Activate the main O3D environment
+source .venv/bin/activate
+
+# Build O3D datasets
+oncodrive3d build-datasets --mane_only --output_dir <path/to/o3d_datasets-mane_only-date>
+oncodrive3d build-datasets --output_dir <path/to/o3d_datasets-date>
+```
+
+2. **Prepare missing set**
+```bash
+# Activate the tools environment
+cd tools/preprocessing
+source .venv/bin/activate
+
+# Initialize the set of missing MANE structures
+python -m tools.preprocessing.prepare_samplesheet \
+--mane-dataset-dir <path/to/o3d_datasets-mane_only-date> \
+--output-dir <path/to/mane_missing-date>
+```
+
+3. **Harvest canonical matches (first iteration, optional but recommended)**
+```bash
+# Retrieve missing MANE structures whose sequences overlap with canonical ones
+python -m tools.preprocessing.update_samplesheet_and_structures \
+--samplesheet-folder <path/to/mane_missing-date> \
+--mane-dataset-dir <path/to/o3d_datasets-mane_only-date> \
+--canonical-dir <path/to/o3d_datasets-date>
+```
+
+4. **Predict remaining structures**
+- Feed `<path/to/mane_missing-date>/missing/{samplesheet.csv,fasta/}` to nf-core/proteinfold.
+
+5. **Ingest predictions + canonical reuse**
+```bash
+# Merge retrieved + predicted structures into a final_bundle
+python -m tools.preprocessing.update_samplesheet_and_structures \
+--samplesheet-folder <path/to/mane_missing-date> \
+--mane-dataset-dir <path/to/o3d_datasets-mane_only-date> \
+--canonical-dir <path/to/o3d_datasets-date> \
+--predicted-dir <path/to/predicted/pdbs> # PDBs produced by nf-core/proteinfold
+```
+
+6. **Rebuild MANE-only datasets with the final bundle**
+```bash
+# Build new MANE-only datasets that include the added structures from the final bundle
+oncodrive3d build-datasets --mane_only \
+--custom_mane_pdb_dir <path/to/mane_missing-date>/final_bundle/pdbs \
+--custom_mane_metadata_path <path/to/mane_missing-date>/final_bundle/samplesheet.csv \
+--output_dir <path/to/mane_missing-new_date>
+```