Rather than the current system, in which each sub-corpus lives in its own folder with its own code, create a top-level downloads.sh that can re-assemble the sub-corpora.
Separately, have the downloaded and pre-processed sub-corpora ready to be referenced from the ADR and NMT repos, for example as submodules.
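
A rough sketch of what the top-level downloads.sh could look like. It assumes a layout that is not confirmed here: each sub-corpus keeps a per-corpus `download.sh` one directory below the root. The function name `assemble_corpora` and that layout are both hypothetical.

```shell
#!/bin/sh
# Sketch of a top-level downloads.sh (layout and names are assumptions).
# Assumed layout: <root>/<sub-corpus>/download.sh, one script per sub-corpus.

assemble_corpora() {
    root="${1:-.}"
    found=0
    for script in "$root"/*/download.sh; do
        # Skip the unexpanded glob when no sub-corpus scripts exist.
        [ -f "$script" ] || continue
        found=$((found + 1))
        dir=$(dirname "$script")
        echo "Assembling sub-corpus: $(basename "$dir")"
        # Run in a subshell so the cd does not leak into the next iteration.
        ( cd "$dir" && sh ./download.sh ) || {
            echo "Failed: $dir" >&2
            return 1
        }
    done
    echo "Assembled $found sub-corpora"
}

# Example invocation from the repository root:
# assemble_corpora .
```

Stopping at the first failing sub-corpus keeps a partial assembly from being mistaken for a complete one. Continuing past failures and reporting them at the end would be an equally reasonable choice.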