Insert a list from a dataset into a section of a config.yml file! #6

@ccbaumler

Please see the documentation at dib-lab/genome-grist#284 for a quick way to insert lists into a config file.

This takes the sample TSV below and inserts its list of Assembly Accession identifiers directly into the config file for the spacegraphcats workflow:

The config:
# This is the file path to the metadata file.
# In this case, the file is the full metadata
# output of the SRA Run Selector.

metadata_file_path: metadata/SraRunTable.txt

# Directories
workdir: ~/dissertation-project/seqs

outdir: dissertation-project/seqs

prevent_sra_download: False

# The kmer size within the database (`sourmash sig fileinfo`)
k_size:
  - 21
  - 31
#  - 51 is too large for khmer abundtrimming

# Query genomes for spacegraphcats
query_genomes:
 - GCA_000349525.1

query_radius:
  - 1
  - 5
  - 10

# The amount to scale representative kmer set
scale:
  - 1000

The tsv:
Assembly Accession	Assembly Name	Organism Name	Annotation Name	Assembly Stats Total Sequence Length	Assembly Level	Assembly Release Date	WGS project accession
GCA_000143535.4	ASM14353v4	Botrytis cinerea B05.10	Annotation submitted by Syngenta Biotechnology, Inc.	42630066	Complete Genome	2015-02-05	
GCF_000143535.2	ASM14353v4	Botrytis cinerea B05.10	Annotation submitted by Syngenta Biotechnology, Inc.	42630066	Complete Genome	2015-02-05	
GCA_019186565.1	ASM1918656v1	Botrytis cinerea		42721243	Contig	2021-07-09	JAHHFM01
GCA_019186575.1	ASM1918657v1	Botrytis cinerea		42739314	Contig	2021-07-09	JAHHFN01
GCA_031205075.1	Bcin_M3a_1.1	Botrytis cinerea		43592014	Contig	2023-09-07	JARWBL01
GCA_015148055.1	ASM1514805v1	Botrytis cinerea		41439596	Contig	2020-10-30	JACVFN01

The code:

awk -F'\t' 'NR>1 && NF {print " - " $1}' assembly-test.tsv | sed "/query_genomes:/r /dev/stdin" -i sgc-prep-config.yml
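
Because `sed -i` rewrites the config in place, it may be worth previewing the result first. The sketch below runs the same two steps against small stand-in files created in a temporary directory (the file contents and the `tmp`, `items`, and `preview` names are illustrative assumptions, not part of the original workflow). Dropping `-i` makes `sed` print the merged config to stdout and leaves the file untouched. The `r /dev/stdin` form is assumed to behave as it does in GNU sed.

```shell
# Hypothetical demo files in a temp dir, so no real config is modified.
tmp=$(mktemp -d)
printf 'Assembly Accession\tAssembly Name\nGCA_000143535.4\tASM14353v4\nGCF_000143535.2\tASM14353v4\n\n' > "$tmp/assembly-test.tsv"
printf 'query_genomes:\n - GCA_000349525.1\n\nquery_radius:\n  - 1\n' > "$tmp/sgc-prep-config.yml"

# Step 1: skip the header row (NR>1) and blank lines (NF),
# then print column 1 as a YAML list item.
items=$(awk -F'\t' 'NR>1 && NF {print " - " $1}' "$tmp/assembly-test.tsv")

# Step 2: without -i, sed only prints the merged config; the new items
# are appended right after the line matching query_genomes:.
preview=$(printf '%s\n' "$items" | sed '/query_genomes:/r /dev/stdin' "$tmp/sgc-prep-config.yml")
printf '%s\n' "$preview"
```

Once the preview looks right, the original one-liner with `-i` applies the same change to the real file.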

The updated config:
# This is the file path to the metadata file.
# In this case, the file is the full metadata
# output of the SRA Run Selector.

metadata_file_path: metadata/SraRunTable.txt

# Directories
workdir: ~/dissertation-project/seqs

outdir: dissertation-project/seqs

prevent_sra_download: False

# The kmer size within the database (`sourmash sig fileinfo`)
k_size:
  - 21
  - 31
#  - 51 is too large for khmer abundtrimming

# Query genomes for spacegraphcats
query_genomes:
 - GCA_000143535.4
 - GCF_000143535.2
 - GCA_019186565.1
 - GCA_019186575.1
 - GCA_031205075.1
 - GCA_015148055.1
 - GCA_000349525.1

query_radius:
  - 1
  - 5
  - 10

# The amount to scale representative kmer set
scale:
  - 1000
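
To confirm that every accession from the TSV landed in the config, a check along these lines may help. It is a minimal sketch using stand-in files (the `check_dir` and `missing` names and the file contents are illustrative assumptions), and it assumes list items are written exactly as ` - ACCESSION` with a single leading space, as in the config above.

```shell
# Hypothetical stand-ins for assembly-test.tsv and the updated sgc-prep-config.yml.
check_dir=$(mktemp -d)
printf 'Assembly Accession\tAssembly Name\nGCA_019186565.1\tASM1918656v1\nGCA_019186575.1\tASM1918657v1\n' > "$check_dir/assembly-test.tsv"
printf 'query_genomes:\n - GCA_019186565.1\n - GCA_000349525.1\n' > "$check_dir/sgc-prep-config.yml"

# Collect every accession in column 1 that has no exact matching
# list item (-x whole line, -F fixed string) in the config.
missing=$(awk -F'\t' 'NR>1 && NF {print $1}' "$check_dir/assembly-test.tsv" |
  while IFS= read -r acc; do
    grep -qxF -e " - $acc" "$check_dir/sgc-prep-config.yml" || printf '%s\n' "$acc"
  done)

if [ -n "$missing" ]; then
  printf 'Missing from config:\n%s\n' "$missing"
else
  echo "All accessions present"
fi
```

In this stand-in data the second accession is deliberately left out of the config, so the check reports it as missing.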
