Description
Background
I am using MEGAHIT to assemble metagenomic data and need to use the assembly graph (FASTG/GFA) for downstream analysis (specifically for GraphBin). Since MEGAHIT does not produce these files by default, I used megahit_toolkit to convert intermediate contigs:
megahit_toolkit contig2fastg 141 intermediate_contigs/k141.contigs.fa > k141.fastg
The Problem
The contig IDs in the generated k141.fastg do not match the IDs in final.contigs.fa.
Example in final.contigs.fa: >k141_126935
Example in k141.fastg: >NODE_1_length_151_cov_1.0000_ID_1:NODE_536013_length_143_cov_1.0000_ID_1072025;
Because of this mismatch, downstream tools like GraphBin cannot map the binning results (based on final.contigs.fa) back to the assembly graph.
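To make the mismatch easy to reproduce, here is a minimal shell sketch that pulls the IDs out of both files so they can be compared. The helper names fastg_node_ids and fasta_ids are hypothetical (they are not part of MEGAHIT), and the sketch assumes the FASTG headers follow the >NODE_...:NODE_...; pattern shown above.

```shell
# Hypothetical helper: print the leading node name of every FASTG header.
# Assumes headers look like ">NODE_a...:NODE_b...,NODE_c...;" as in the example above.
fastg_node_ids() {
  grep '^>' "$1" | sed -e 's/^>//' -e 's/[:;].*$//' -e "s/'\$//"
}

# Hypothetical helper: print the contig ID (text up to the first whitespace)
# of every FASTA header.
fasta_ids() {
  grep '^>' "$1" | sed -e 's/^>//' -e 's/[[:space:]].*$//'
}
```

Running fastg_node_ids k141.fastg and fasta_ids final.contigs.fa, sorting both outputs, and comparing them with comm shows whether any IDs are shared. With the headers above, none are.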
Questions to the Developers
Why do the IDs change? Does megahit_toolkit contig2fastg rename nodes during conversion, or does it use an internal indexing system?
Standard Workflow: What is the recommended way to generate a GFA or FASTG file that maintains 100% ID consistency with the final assembly output?
Alternative: Is there a way to output the assembly graph directly during the megahit run instead of converting intermediate files post-assembly?