Commit 72e9e79

Merge pull request #19 from MPI-EVA-Archaeogenetics/dev
Add ethical sample scrubbing. v1.3.0
2 parents 99b3259 + d33cadd commit 72e9e79

10 files changed

Lines changed: 303 additions & 6 deletions

.gitignore

Lines changed: 2 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -15,3 +15,5 @@ test_data/
1515
eager_inputs_old/
1616
eager_outputs_old/
1717
array_Logs/
18+
poseidon_packages/
19+
debug_tables/

CHANGELOG.md

Lines changed: 14 additions & 0 deletions
@@ -3,6 +3,20 @@
33
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
44
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
55

6+
## [1.3.0] - 12/07/2023
7+
8+
### `Added`
9+
- `scripts/ethical_sample_scrub.sh`: A script to hide and restrict access to eager inputs/outputs for samples that were marked as ethically sensitive after the pipelines picked them up.
10+
- `scripts/cron_ethical_scrub.sh`: A cron-able script to run `ethical_sample_scrub.sh` daily.
11+
- `scripts/clear_work_dirs.sh`: A bash script to `rm -r` the work directories of an individual ID for both `SG` and `TF` processing.
12+
- `scripts/clear_results.sh`: A bash script that deletes the results for an individual while maintaining the nextflow process cache for them.
13+
14+
### `Fixed`
15+
- `scripts/cron_daily_prepare.sh`: Silenced permission errors due to ethical sample scrubbing.
16+
### `Dependencies`
17+
18+
### `Deprecated`
19+
620
## [1.2.0] - 21/03/2023
721

822
### `Added`

README.md

Lines changed: 43 additions & 0 deletions
@@ -146,3 +146,46 @@ Comparing the timestamp of the Autorun_eager genotypes and those in the poseidon
146146
5. Use `trident update` to bump the package version (`1.0.0` if the package is newly created), and create a Changelog.
147147
6. Validate the resulting package.
148148
7. If validation passes, publish the (updated) version of the package to the central repository in `poseidon_packages/` and remove any temporary files created.
149+
150+
## ethical_sample_scrub.sh
151+
152+
A shell script that scrubs the Autorun_eager input and output directories of all individuals in a specified list of sensitive sequencing IDs. This is used daily with the most up-to-date list of sensitive sequencing IDs to ensure that no results are available even if marking samples as sensitive was done late.
153+
154+
```
155+
usage: ethical_sample_scrub.sh [options] <sensitive_seqIds_list>
156+
157+
This script pulls the Pandora individual IDs from the list of sensitive sequencing IDs, and
158+
removes all Autorun_eager input and outputs from those individuals (if any).
159+
This ensures that no results are available even if marking samples as sensitive was done late.
160+
161+
Options:
162+
-h, --help Print this text and exit.
163+
```
164+
165+
## clear_work_dirs.sh
166+
167+
A shell script that will clear the work directories of individuals in a specified individual ID list from both the SG and TF results directories.
168+
169+
```
170+
usage: clear_work_dirs.sh [options] <ind_id_list>
171+
172+
This script clears the work directories of individuals in a specified individual ID list from both the SG and TF results directories.
173+
174+
Options:
175+
-h, --help Print this text and exit.
176+
```
177+
178+
## clear_results.sh
179+
180+
A shell script that clears the results directories of all individuals in a specified list while maintaining nextflow's cache of already-run processes. This is useful for refreshing the results directories of individuals when changes to the input might have changed the merging of libraries, thus making the directory structure inconsistent.
181+
182+
```
183+
usage: clear_results.sh [options] <ind_id_list>
184+
185+
This script removes all output directory contents for the provided individuals, without clearing out caching, allowing for the results to be re-published.
186+
This enables refreshing of result directories when changes to the input might have changed the merging of libraries, thus making the directory structure inconsistent.
187+
188+
Options:
189+
-h, --help Print this text and exit.
190+
-a, --analysis_type Set the analysis type. Options: TF, SG.
191+
```

scripts/clear_results.sh

Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
1+
#!/usr/bin/env bash
2+
3+
## This script removes the results for an individual while maintaining the nextflow process cache for them.
4+
## It is intended as a way to refresh the results directories of an individual. This can be useful either
5+
## to remove older files after additional libraries appear and are therefore merged, or to remove results
6+
## with misleading names in cases where Pandora entries get updated (e.g. protocol mixup leading to changes
7+
## in strandedness for a library).
8+
9+
## Helptext function
10+
function Helptext() {
11+
echo -ne "\t usage: $0 [options] <ind_id_list>\n\n"
12+
echo -ne "This script removes all output directory contents for the provided individuals, without clearing out caching, allowing for the results to be re-published.\n This enables refreshing of result directories when changes to the input might have changed the merging of libraries, thus making the directory structure inconsistent.\n\n"
13+
echo -ne "Options:\n"
14+
echo -ne "-h, --help\t\tPrint this text and exit.\n"
15+
echo -ne "-a, --analysis_type\t\tSet the analysis type. Options: TF, SG.\n"
16+
}
17+
18+
## Print messages to stderr, optionally with colours
19+
function errecho() {
20+
local Normal
21+
local Red
22+
local Yellow
23+
local colour
24+
25+
Normal=$(tput sgr0)
26+
Red=$(tput sgr0)'\033[1;31m' ## Red normal face
27+
Yellow=$(tput sgr0)'\033[1;33m' ## Yellow normal face
28+
29+
colour=''
30+
if [[ ${1} == '-y' ]]; then
31+
colour="${Yellow}"
32+
shift 1
33+
elif [[ ${1} == '-r' ]]; then
34+
colour="${Red}"
35+
shift 1
36+
fi
37+
echo -e "${colour}$*${Normal}" 1>&2 ## Quoted to avoid word splitting and glob expansion of the message
38+
}
39+
40+
## Parse CLI args.
41+
TEMP=`getopt -q -o ha: --long analysis_type:,help -n 'clear_results.sh' -- "$@"`
42+
eval set -- "$TEMP"
43+
44+
## Default parameters
45+
ind_id_list_fn=''
46+
analysis_type=''
47+
48+
## Read in CLI arguments
49+
while true ; do
50+
case "$1" in
51+
-h|--help) Helptext; exit 0 ;;
52+
-a|--analysis_type) analysis_type="${2}"; shift 2;;
53+
--) ind_id_list_fn="${2}"; break ;;
54+
*) echo -e "invalid option provided: $1.\n"; Helptext; exit 1;;
55+
esac
56+
done
57+
58+
## Validate inputs
59+
if [[ ${ind_id_list_fn} == '' ]]; then
60+
errecho "No individual ID list provided.\n"
61+
Helptext
62+
exit 1
63+
fi
64+
65+
if [[ ${analysis_type} == '' ]]; then
66+
errecho "No --analysis_type was provided.\n"
67+
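Helptext; exit 1 ## Abort: continuing without an analysis type would build an invalid results path.
68+
elif [[ ${analysis_type} != "SG" && ${analysis_type} != "TF" ]]; then
69+
errecho "analysis_type must be SG or TF. You provided: ${analysis_type}\n"
70+
Helptext; exit 1 ## Abort on an unsupported analysis type.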
71+
fi
72+
73+
root_eager_dir='/mnt/archgen/Autorun_eager/eager_outputs' ## Directory should include subdirectories for each analysis type (TF/SG) and sub-subdirectories for each site and individual.
74+
75+
## Read all individual IDs into an array
76+
input_iids=($(cat ${ind_id_list_fn}))
77+
78+
## Remove all dirs except for 'work' and 'pipeline_info'.
79+
## Both needed for caching.
80+
## Also leave '1240k.imputed' and 'GTL_output' alone.
81+
for ind_id in ${input_iids[@]}; do
82+
site_id=${ind_id:0:3} ## Site id is the first three characters of the individual ID
83+
dirs_to_delete=$(ls -1 -d ${root_eager_dir}/${analysis_type}/${site_id}/${ind_id}/* | grep -vw -e 'work' -e '1240k.imputed' -e 'GTL_output' -e 'pipeline_info')
84+
for dir in ${dirs_to_delete}; do
85+
errecho "Deleting results in: ${dir}"
86+
rm -r ${dir} ## Delete the specific result directory and all its contents
87+
done
88+
done
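
The selection logic in `clear_results.sh` can be tried out safely before pointing it at real results. The following is a minimal dry-run sketch, not the script itself: it builds a throwaway directory tree under a temporary directory and only prints the subdirectories that would be removed. The helper name `list_dirs_to_delete`, the individual ID `ABC001`, and the example result directories `genotyping` and `multiqc` are assumptions made for illustration. Unlike the original `ls | grep -vw` pipeline, this variant matches directory names exactly with a `case` statement and considers directories only.

```shell
#!/usr/bin/env bash
## Dry-run sketch: print the result subdirectories that would be deleted for
## one individual, keeping the directories that the script leaves alone.

list_dirs_to_delete() {
  local ind_dir="$1" dir name
  for dir in "${ind_dir}"/*/; do
    [[ -d "${dir}" ]] || continue   ## Skip when the glob matched nothing
    name=$(basename "${dir}")
    case "${name}" in
      work|pipeline_info|1240k.imputed|GTL_output) ;;  ## Kept
      *) echo "${name}" ;;                              ## Would be deleted
    esac
  done
}

## Throwaway example layout: <root>/<analysis_type>/<site_id>/<ind_id>/
demo_root=$(mktemp -d)
trap 'rm -rf "${demo_root}"' EXIT
ind_id='ABC001'            ## Hypothetical individual ID
site_id=${ind_id:0:3}      ## First three characters, as in the script
ind_dir="${demo_root}/TF/${site_id}/${ind_id}"
mkdir -p "${ind_dir}"/{work,pipeline_info,GTL_output,genotyping,multiqc}

list_dirs_to_delete "${ind_dir}"
```

Only once the printed list looks right would the `echo` be swapped for an actual `rm -r`.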

scripts/clear_work_dirs.sh

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
1+
#!/usr/bin/env bash
2+
3+
## This script accepts a list of individual IDs and clears the nextflow work directories for both SG and TF data processing of each ID.
4+
5+
## Helptext function
6+
function Helptext() {
7+
echo -ne "\t usage: $0 [options] <ind_id_list>\n\n"
8+
echo -ne "This script clears the work directories of individuals in a specified individual ID list from both the SG and TF results directories.\n\n"
9+
echo -ne "Options:\n"
10+
echo -ne "-h, --help\t\tPrint this text and exit.\n"
11+
}
12+
13+
## Print messages to stderr
14+
function errecho() { echo -e $* 1>&2 ;}
15+
16+
## Parse CLI args.
17+
TEMP=`getopt -q -o h --long help -n 'clear_work_dirs.sh' -- "$@"`
18+
eval set -- "$TEMP"
19+
20+
ind_id_list_fn=''
21+
22+
## Read in CLI arguments
23+
while true ; do
24+
case "$1" in
25+
-h|--help) Helptext; exit 0 ;;
26+
--) ind_id_list_fn="${2}"; break ;;
27+
*) echo -e "invalid option provided: $1.\n"; Helptext; exit 1;;
28+
esac
29+
done
30+
31+
if [[ ${ind_id_list_fn} == '' ]]; then
32+
echo -e "No individual ID list provided.\n"
33+
Helptext
34+
exit 1
35+
fi
36+
37+
root_eager_dir='/mnt/archgen/Autorun_eager/eager_outputs' ## Directory should include subdirectories for each analysis type (TF/SG) and sub-subdirectories for each site and individual.
38+
39+
## Read all individual IDs into an array
40+
input_iids=($(cat ${ind_id_list_fn}))
41+
42+
for ind_id in ${input_iids[@]}; do
43+
site_id=${ind_id:0:3} ## Site id is the first three characters of the individual ID
44+
errecho -ne "Clearing work directories for ${ind_id}..."
45+
for analysis_type in SG TF; do
46+
if [[ -d ${root_eager_dir}/${analysis_type}/${site_id}/${ind_id}/work ]]; then
47+
errecho -ne " ${analysis_type}..."
48+
# ls -d ${root_eager_dir}/${analysis_type}/${site_id}/${ind_id}/work
49+
rm -rf ${root_eager_dir}/${analysis_type}/${site_id}/${ind_id}/work
50+
fi
51+
done
52+
errecho ''
53+
done
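
Both cleanup scripts derive the site ID from the first three characters of each individual ID and build paths from it. A minimal sketch of that step follows, assuming one ID per line in the list file. The helper names `read_ids` and `work_dir_for`, the IDs, and the root path `/tmp/example_root` are hypothetical. Reading line by line (instead of `$(cat file)`) is a variation that skips blank lines and strips stray whitespace such as carriage returns.

```shell
#!/usr/bin/env bash
## Sketch: read individual IDs from a list file and print the work directory
## path for each analysis type, without touching the file system.

read_ids() {
  local list_fn="$1" line
  while IFS= read -r line || [[ -n "${line}" ]]; do
    line="${line//[[:space:]]/}"   ## Drop whitespace, including CR characters
    if [[ -n "${line}" ]]; then
      echo "${line}"
    fi
  done < "${list_fn}"
}

work_dir_for() {
  local root="$1" analysis_type="$2" ind_id="$3"
  ## The site ID is the first three characters of the individual ID
  echo "${root}/${analysis_type}/${ind_id:0:3}/${ind_id}/work"
}

list_fn=$(mktemp)
trap 'rm -f "${list_fn}"' EXIT
printf 'ABC001\n\nXYZ002\n' > "${list_fn}"   ## Hypothetical IDs with a blank line

ids=()
while IFS= read -r id; do
  ids+=("${id}")
done < <(read_ids "${list_fn}")

for id in "${ids[@]}"; do
  for analysis_type in SG TF; do
    work_dir_for /tmp/example_root "${analysis_type}" "${id}"
  done
done
```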

scripts/cron_daily_prepare.sh

Lines changed: 4 additions & 4 deletions
@@ -7,15 +7,15 @@
77
cd /mnt/archgen/Autorun_eager
88

99
# 1240k
10-
# Note: this find only checks runs starting from 2020
11-
find /mnt/archgen/Autorun/Results/Human_1240k/2* -name '*.bam' -mtime -1 | cut -f 7 -d "/"| sort -u| while read RUN ; do
10+
# Note: this find only checks runs starting from 2020. Silence stderr to avoid 'permission denied'.
11+
find /mnt/archgen/Autorun/Results/Human_1240k/2* -name '*.bam' -mtime -1 2>/dev/null | cut -f 7 -d "/" | sort -u | while read RUN ; do
1212
echo "Processing TF data from run: ${RUN}"
1313
scripts/prepare_eager_tsv.R -s $RUN -a TF -o eager_inputs/ -d .eva_credentials
1414
done
1515

1616
# Shotgun
17-
# Note: this find only checks runs starting from 2020
18-
find /mnt/archgen/Autorun/Results/Human_Shotgun/2* -name '*.bam' -mtime -1 | cut -f 7 -d "/"| sort -u| while read RUN ; do
17+
# Note: this find only checks runs starting from 2020. Silence stderr to avoid 'permission denied'.
18+
find /mnt/archgen/Autorun/Results/Human_Shotgun/2* -name '*.bam' -mtime -1 2>/dev/null | cut -f 7 -d "/" | sort -u | while read RUN ; do
1919
echo "Processing SG data from run: ${RUN}"
2020
scripts/prepare_eager_tsv.R -s $RUN -a SG -o eager_inputs/ -d .eva_credentials
2121
done
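
The `cut -f 7 -d "/"` step in the pipeline above relies on the run name being the seventh `/`-separated field of each BAM path (the first field is empty because the path starts with `/`). A small sketch of that extraction on made-up paths is shown here. The run names and file names are hypothetical, and `extract_runs` is a helper introduced only for this example.

```shell
#!/usr/bin/env bash
## Sketch: extract unique run names from BAM paths the same way the cron
## script does, using example paths instead of a real `find`.

extract_runs() {
  cut -f 7 -d "/" | sort -u
}

example_paths() {
  printf '%s\n' \
    '/mnt/archgen/Autorun/Results/Human_1240k/230701_EXAMPLE_RUN_A/s1/a.bam' \
    '/mnt/archgen/Autorun/Results/Human_1240k/230701_EXAMPLE_RUN_A/s2/b.bam' \
    '/mnt/archgen/Autorun/Results/Human_1240k/230702_EXAMPLE_RUN_B/s1/c.bam'
}

example_paths | extract_runs
```

In the real script, `2>/dev/null` is attached to `find` only, so it discards `find`'s error messages (such as 'permission denied' on scrubbed directories) while leaving the rest of the pipeline untouched.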

scripts/cron_ethical_scrub.sh

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
1+
#!/bin/bash
2+
3+
## Use ethically_culturally_sensitive list to scrub any sensitive sample results
4+
5+
cd /mnt/archgen/Autorun_eager
6+
7+
list_fn="/mnt/archgen/Autorun/Pandora_Tables/Ethically_Sensitive.txt"
8+
9+
scripts/ethical_sample_scrub.sh ${list_fn}

scripts/ethical_sample_scrub.sh

Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
1+
#!/usr/bin/env bash
2+
3+
## Helptext function
4+
function Helptext() {
5+
echo -ne "\t usage: $0 [options] <sensitive_seqIds_list>\n\n"
6+
echo -ne "This script pulls the Pandora individual IDs from the list of sensitive sequencing IDs, and\n removes all Autorun_eager input and outputs from those individuals (if any).\n This ensures that no results are available even if marking samples as sensitive was done late.\n\n"
7+
echo -ne "Options:\n"
8+
echo -ne "-h, --help\t\tPrint this text and exit.\n"
9+
}
10+
11+
## Print messages to stderr, optionally with colours
12+
function errecho() {
13+
local Normal
14+
local Red
15+
local Yellow
16+
local colour
17+
18+
Normal=$(tput sgr0)
19+
Red=$(tput sgr0)'\033[1;31m' ## Red normal face
20+
Yellow=$(tput sgr0)'\033[1;33m' ## Yellow normal face
21+
22+
colour=''
23+
if [[ ${1} == '-y' ]]; then
24+
colour="${Yellow}"
25+
shift 1
26+
elif [[ ${1} == '-r' ]]; then
27+
colour="${Red}"
28+
shift 1
29+
fi
30+
echo -e "${colour}$*${Normal}" 1>&2 ## Quoted to avoid word splitting and glob expansion of the message
31+
}
32+
33+
## Parse CLI args.
34+
TEMP=`getopt -q -o h --long help -n 'ethical_sample_scrub.sh' -- "$@"`
35+
eval set -- "$TEMP"
36+
37+
## Read in CLI arguments
38+
while true ; do
39+
case "$1" in
40+
-h|--help) Helptext; exit 0 ;;
41+
--) sensitive_seq_id_list="${2}"; break ;;
42+
*) echo -e "invalid option provided: $1.\n"; Helptext; exit 1;;
43+
esac
44+
done
45+
46+
## Hardcoded paths
47+
root_input_dir='/mnt/archgen/Autorun_eager/eager_inputs' ## Directory should include subdirectories for each analysis type (TF/SG) and sub-subdirectories for each site and individual.
48+
root_output_dir='/mnt/archgen/Autorun_eager/eager_outputs' ## Directory should include subdirectories for each analysis type (TF/SG) and sub-subdirectories for each site and individual.
49+
50+
51+
if [[ ${sensitive_seq_id_list} = '' ]]; then
52+
echo -e "No input file provided.\n"
53+
Helptext
54+
exit 1
55+
fi
56+
57+
if [[ ! -f ${sensitive_seq_id_list} ]]; then
58+
echo "File not found: ${sensitive_seq_id_list}"
59+
exit 1
60+
else
61+
## Create list of unique individual IDs from the list of sensitive seq_ids
62+
scrub_me=($(cut -d '.' -f 1 ${sensitive_seq_id_list} | sort -u ))
63+
64+
## If the individuals were flagged as sensitive AFTER processing started, both the inputs and outputs should be made inaccessible.
65+
for raw_iid in ${scrub_me[@]}; do
66+
for analysis_type in "SG" "TF"; do
67+
## EAGER_INPUTS
68+
site_id="${raw_iid:0:3}"
69+
eager_input_tsv="${root_input_dir}/${analysis_type}/${site_id}/${raw_iid}/${raw_iid}.tsv"
70+
## If the eager input exists, hide the entire directory and make it inaccessible
71+
if [[ -f ${eager_input_tsv} ]]; then
72+
old_name=$(dirname ${eager_input_tsv})
73+
new_name=$(dirname ${old_name})/.${raw_iid}
74+
mv -v ${old_name} ${new_name} ## Hide the input directory
75+
chmod 0700 ${new_name} ## Restrict the directory contents
76+
fi
77+
78+
## EAGER_OUTPUTS
79+
eager_output_dir="${root_output_dir}/${analysis_type}/${site_id}/${raw_iid}/"
80+
if [[ -d ${eager_output_dir} ]]; then
81+
new_outdir_name=$(dirname ${eager_output_dir})/.${raw_iid}
82+
mv -v ${eager_output_dir} ${new_outdir_name} ## Hide the output directory
83+
chmod 0700 ${new_outdir_name} ## Restrict the directory contents
84+
fi
85+
done
86+
done
87+
fi
88+
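
The two core steps of the scrub, deriving individual IDs from sequencing IDs and hiding a directory, can be exercised in isolation. The sketch below runs entirely inside a temporary directory. It assumes sequencing IDs of the form `<individual_id>.<rest>`, which is inferred from the `cut -d '.' -f 1` call in the script; the concrete IDs and the helper names `seq_ids_to_ind_ids` and `hide_dir` are hypothetical.

```shell
#!/usr/bin/env bash
## Sketch: derive unique individual IDs from sequencing IDs, then hide one
## example directory by renaming it to a dot-directory and restricting it.

seq_ids_to_ind_ids() {
  ## Keep the part before the first '.', de-duplicated
  cut -d '.' -f 1 "$1" | sort -u
}

hide_dir() {
  local dir="${1%/}" hidden       ## Strip one trailing slash, if present
  hidden="$(dirname "${dir}")/.$(basename "${dir}")"
  mv "${dir}" "${hidden}" && chmod 0700 "${hidden}"
  echo "${hidden}"
}

demo_root=$(mktemp -d)
trap 'rm -rf "${demo_root}"' EXIT

## Hypothetical sequencing IDs, two of them from the same individual
list_fn="${demo_root}/sensitive.txt"
printf 'ABC001.A0101.SG1\nABC001.B0101.TF1\nXYZ002.A0101.SG1\n' > "${list_fn}"
seq_ids_to_ind_ids "${list_fn}"

## Example output directory for one individual
ind_dir="${demo_root}/SG/ABC/ABC001"
mkdir -p "${ind_dir}"
hidden_dir=$(hide_dir "${ind_dir}/")
echo "Hidden as: ${hidden_dir}"
```

Note that renaming and `chmod 0700` make the data inaccessible to other users but do not delete it; the owner of the directory can still read it.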

scripts/prepare_eager_tsv.R

Lines changed: 1 addition & 1 deletion
@@ -46,7 +46,7 @@ save_ind_tsv <- function(data, rename, output_dir, ...) {
4646
data %>% select(-individual.Full_Individual_Id) %>% readr::write_tsv(file=paste0(ind_dir,"/",ind_id,".tsv")) ## Output structure can be changed here.
4747

4848
## Print Autorun_eager version to file
49-
AE_version <- "1.2.0"
49+
AE_version <- "1.3.0"
5050
cat(AE_version, file=paste0(ind_dir,"/autorun_eager_version.txt"), fill=T, append = F)
5151
}
5252

scripts/update_poseidon_package.sh

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
11
#!/usr/bin/env bash
22

3-
VERSION="1.2.0"
3+
VERSION="1.3.0"
44

55
## Colours for printing to terminal
66
Yellow=$(tput sgr0)'\033[1;33m' ## Yellow normal face
