Snakemake workflow to set up external data for data analyses. The data sources can be local or remote files.
The easiest way to install is via pip:
python -m pip install git+https://github.com/percyfal/datasources-smk@main
Alternatively, clone the source repository and make a local editable install:
git clone https://github.com/percyfal/datasources-smk.git
cd datasources-smk
python -m pip install -e .
The workflow and additional commands run via the main entry point:
datasources -h
datasources run -j 1
datasources run --configfile datasources.yaml
See the subcommand help for more information.
This workflow reads a datasources yaml file whose list elements
consist of data and source keys, or alternatively a
tab-separated file with columns data and source. Each pair of data and
source values maps a source URI to a snakemake
target. The currently supported URI schemes are rsync, file, sftp,
http and https.
There are two optional keys: description is a free-text field for
provenance information, and tag is a label for grouping data types so that
subsets of datasources can be targeted.
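As a rough illustration of how the tag key could be used to pick a subset of datasources, the sketch below filters a list of records by tag. The tag value "raw" and the helper select_by_tag are hypothetical and not part of the package; how the workflow itself selects by tag is documented in the subcommand help.

```python
# Hypothetical records using the keys described above; the tag values
# are made up for illustration.
datasources = [
    {
        "data": "data/foo1.txt",
        "source": "rsync:external_resources/foo1.txt",
        "tag": "raw",
    },
    {
        "data": "data/foo2.txt",
        "source": "file:external_resources/foo2.txt",
        "description": "foo2 data file to link",
        "tag": "raw",
    },
    {
        "data": "data/README.md",
        "source": "https://raw.githubusercontent.com/percyfal/datasources-smk/main/README.md",
    },
]


def select_by_tag(records, tag):
    """Return the data targets of all records that carry the given tag."""
    # tag is optional, so use .get() to skip records without it
    return [rec["data"] for rec in records if rec.get("tag") == tag]


raw_targets = select_by_tag(datasources, "raw")
print(raw_targets)
```

Because both description and tag are optional, the sketch never assumes that either key is present.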
The datasources file can be provided via the --configfile option. If
unset, the workflow will look for files datasources.yaml,
datasources.tsv, config/datasources.yaml and
config/datasources.tsv, in that order.
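The lookup order above can be sketched as follows. This is a minimal illustration, not the package's actual implementation; the function find_datasources and its root parameter are hypothetical names.

```python
from pathlib import Path

# Candidate files, in the order stated above.
CANDIDATES = (
    "datasources.yaml",
    "datasources.tsv",
    "config/datasources.yaml",
    "config/datasources.tsv",
)


def find_datasources(configfile=None, root="."):
    """Return the explicit configfile if given, else the first existing candidate.

    Returns None when no candidate file exists under root.
    """
    if configfile is not None:
        return Path(configfile)
    for name in CANDIDATES:
        candidate = Path(root) / name
        if candidate.is_file():
            return candidate
    return None
```

With both datasources.tsv and config/datasources.yaml present, the sketch returns datasources.tsv, since it comes earlier in the list.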
URIs are given according to the URI generic
syntax.
For instance, a local file is given as file:relative/path/to/source,
whereas examples of remote files are
rsync://example.com:80/absolute/path/to/source and
sftp://example.com:80/absolute/path/to/source.
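To see how such URIs decompose, the following sketch splits them with Python's standard urllib.parse and checks the scheme against the supported list. The helper describe_uri is a hypothetical name used for illustration only; the workflow may parse URIs differently.

```python
from urllib.parse import urlsplit

# Schemes listed as supported in this README.
SUPPORTED_SCHEMES = {"rsync", "file", "sftp", "http", "https"}


def describe_uri(uri):
    """Split a URI into scheme, host, port and path, rejecting unknown schemes."""
    parts = urlsplit(uri)
    if parts.scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported URI scheme: {parts.scheme!r}")
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,  # None when the URI has no authority part
        "port": parts.port,      # None when no port is given
        "path": parts.path,
    }


# A local, relative source has no host or port.
print(describe_uri("file:relative/path/to/source"))
# A remote source carries host, port and an absolute path.
print(describe_uri("rsync://example.com:80/absolute/path/to/source"))
```

Note that file:relative/path/to/source has no // after the scheme, so the path is kept relative and the host is empty.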
A tsv-formatted datasources file can look like
data	source
data/foo1.txt	rsync:external_resources/foo1.txt
data/foo2.txt	file:external_resources/foo2.txt
data/README.md	https://raw.githubusercontent.com/percyfal/datasources-smk/main/README.md
data/foo/foo*txt	file:external_resources/
and the corresponding yaml file
- data: data/foo1.txt
  source: rsync:external_resources/foo1.txt
  description: foo1 data file to copy
- data: data/foo2.txt
  source: file:external_resources/foo2.txt
  description: foo2 data file to link
- data: data/README.md
  source: https://raw.githubusercontent.com/percyfal/datasources-smk/main/README.md
  description: Grab readme file from github
- data: data/foo/foo*txt
  source: file:external_resources/
  description: >-
    link all *txt files from directory external_resources to directory
    data/foo
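The two formats carry the same information for the required keys. As a small sketch of that correspondence, the snippet below reads tab-separated text with Python's csv module and produces the same list-of-mappings shape as the yaml file. The function read_tsv is a hypothetical helper, not part of the package.

```python
import csv
import io

# Two rows from the tsv example, with real tab characters as separators.
TSV_TEXT = (
    "data\tsource\n"
    "data/foo1.txt\trsync:external_resources/foo1.txt\n"
    "data/foo2.txt\tfile:external_resources/foo2.txt\n"
)


def read_tsv(handle):
    """Read a tab-separated datasources file into a list of data/source records."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [{"data": row["data"], "source": row["source"]} for row in reader]


records = read_tsv(io.StringIO(TSV_TEXT))
for rec in records:
    print(rec["source"], "->", rec["data"])
```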
- Per Unneberg (@percyfal)
Test cases are in the subfolder src/datasources/.test. They are automatically
executed via continuous integration with GitHub
Actions.