@@ -16,23 +16,19 @@ the JSONs from S3 instead of hooking on a live API.
 The loaded data is organised like so in the S3 bucket:
 ```bash
 <DESTINATION__FILESYSTEM__BUCKET_URL>
-├── category # "category" resource from /category endpoint
-│   └── 2026-02-16
-│       └── 1771268036.7864842.3722039a90.jsonl # JSONL data from /category for 2026-02-16
-├── category_data # "category_data" resource from /data/category/{categoryId}
-│   └── 2026-02-16
-│       └── 1771268036.7864842.4a41d98fad.jsonl # JSONL data from /data/category/{categoryId} for 2026-02-16
-├── _dlt_loads # One file per pipeline run (load), describes the load
-│   └── submission_source__1771268036.7864842.jsonl
-├── _dlt_pipeline_state # Pipeline state files
-│   └── submission-snapshot__1771267844.1206408__998e553c0cea456594bce118ab30fc8850159efc09fbfb1e5179df2b13293c46.jsonl
-├── _dlt_version # Dataset schema versioning
-│   └── submission_source__1771267974.1898882__998e553c0cea456594bce118ab30fc8850159efc09fbfb1e5179df2b13293c46.jsonl
+├── category
+│   └── 2026-03-03-data.jsonl # JSONL data from /category for 2026-03-03
+├── category_data
+│   └── 2026-03-03-data.jsonl # JSONL data from /data/category/{categoryId} for 2026-03-03
+├── _dlt_loads # Pipeline run metadata files
+├── _dlt_pipeline_state # Pipeline state files
+├── _dlt_version # Dataset schema versioning
 └── init
 ```
 
 > [!NOTE]
-> We include the `Category.id` and `Category.studyId` values from the `/category` endpoint in the
+> We include the `Category.id` and `Category.studyId` values from the `/category` endpoint in the `category_data` items,
+> so that downstream ingestions can take the full JSONL file and load each item into the appropriate dataset.
 
 ### Getting a Submission API OIDC bearer token
 
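As the note added above points out, each `category_data` item carries `Category.id` and `Category.studyId`, so a downstream ingestion can demultiplex one daily JSONL file into per-study loads. A minimal sketch of that idea — the field names (`id`, `studyId`) and the sample items here are assumptions for illustration, not taken from the real API:

```python
import json
from collections import defaultdict

# Hypothetical stand-in for one day's category_data JSONL file fetched
# from the bucket; real items come from /data/category/{categoryId}.
sample_jsonl = "\n".join([
    json.dumps({"id": "cat-1", "studyId": "study-A", "value": 42}),
    json.dumps({"id": "cat-1", "studyId": "study-B", "value": 7}),
])

def group_by_study(jsonl_text: str) -> dict:
    """Bucket JSONL items by studyId so each study's items can be
    loaded into the appropriate downstream dataset."""
    buckets = defaultdict(list)
    for line in jsonl_text.splitlines():
        if line.strip():
            item = json.loads(line)
            buckets[item["studyId"]].append(item)
    return dict(buckets)

grouped = group_by_study(sample_jsonl)
print(sorted(grouped))  # → ['study-A', 'study-B']
```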
@@ -67,7 +63,7 @@ base_url = "<BASE SUBMISSION API URL>"
 [destination.filesystem]
 # s3 bucket, use 'file://<ABSOLUTE PATH>' to use the local filesystem
 bucket_url = "s3://<BUCKET NAME>" # replace with bucket name/path
-layout = "{table_name}/{YYYY}-{MM}-{DD}/{load_id}.{file_id}.{ext}"
+layout = "{table_name}/{YYYY}-{MM}-{DD}-data.{ext}"
 
 [destination.filesystem.credentials]
 # doesn't matter if using a local filesystem
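The new `layout` collapses each table's daily load into a single date-stamped object, which is what produces the `<table_name>/<date>-data.jsonl` keys shown in the tree earlier. A minimal sketch of how the placeholders expand — illustrative only, not dlt's actual rendering code:

```python
from datetime import date

# Illustrative expansion of the filesystem layout
# "{table_name}/{YYYY}-{MM}-{DD}-data.{ext}"; not dlt's implementation.
def render_layout(table_name: str, day: date, ext: str = "jsonl") -> str:
    return f"{table_name}/{day:%Y}-{day:%m}-{day:%d}-data.{ext}"

# One object per table per day, e.g. for the "category" table:
print(render_layout("category", date(2026, 3, 3)))  # → category/2026-03-03-data.jsonl
```

Note that without `{load_id}`/`{file_id}` in the layout, a second run on the same day writes to the same key, so this layout keeps at most one snapshot per table per day.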
@@ -215,7 +211,12 @@ spec:
 
 ### Ingest snapshots into Bento with Bento-ETL
 
-TODO: Document how to do this.
+The `submission-snapshot` workflow is currently deployed in PCGL's `dev` cluster.
 
-~~Need to implement S3 source first.~~
-Bento-ETL S3 source has been implemented, ready to integrate.
+The Kustomization base is defined [here](https://github.com/Pan-Canadian-Genome-Library/deployment/blob/main/base/research-portal/submission-snapshots-cronjob/kustomization.yaml).
+
+| Environment | Repo location | ArgoCD Application |
+| ----------- | ------------- | ------------------ |
+| `dev`       | [Kustomization link](https://github.com/Pan-Canadian-Genome-Library/deployment/blob/main/dev/research/submission-snapshots/kustomization.yaml) | [App link](https://argocd.ingress.dev.k8s.pcgl.dev-sd4h.ca/applications/argocd/submission-snapshots?view=tree&resource=) |
+| `staging`   | n/a | n/a |
+| `prod`      | n/a | n/a |