Mismatch Between adapter_trimmed_reads and the Number of split-by-adapter Sequences #36

@Stacyhwr

Hello,

I encountered a problem when performing quality control and adapter trimming using fastplong.
The JSON report indicates that 737 reads contain adapter sequences, but when I count the reads in the filtered output whose names contain split-by-adapter, I get 193,512.

Here is the command I used:

fastplong -l 20 -q 7 -w 6 -i test.fastq.gz -o test.filter.fq.gz -j test.fastplong.json -s ATCATGCGAGGGCTAATTGTATATCACC

Below is the relevant part of the JSON report:

{
	"summary": {
		"fastplong_version": "0.3.0",
		"before_filtering": {
			"total_reads":743644,
			"total_bases":10389662359,
			"q20_bases":7603655307,
			"q30_bases":2685202180,
			"q20_rate":0.731848,
			"q30_rate":0.258449,
			"read_mean_length":13971,
			"gc_content":0.367994
		},
		"after_filtering": {
			"total_reads":745767,
			"total_bases":10361933317,
			"q20_bases":7584624643,
			"q30_bases":2678339372,
			"q20_rate":0.73197,
			"q30_rate":0.258479,
			"read_mean_length":13894,
			"gc_content":0.367975
		}
	},
	"filtering_result": {
		"passed_filter_reads": 745767,
		"low_quality_reads": 963,
		"too_many_N_reads": 0,
		"too_short_reads": 795,
		"too_long_reads": 0
	},
	"adapter_cutting": {
		"adapter_trimmed_reads": 737,
		"adapter_trimmed_bases": 92832,
		"read_start_adapter": "ATCATGCGAGGGCTAATTGTATATCACC",
		"read_end_adapter": "GGTGATATACAATTAGCCCTCGCATGAT",
		"read_adapter_counts": {"CTAATTGTATATCACC":17, "GGTGATATACAATTAGC":8, "GGTGATATACAATTAGCC":16, "GGTGATATACAATTAGCCCTCGCA":10, "GGTGATATACAATTAGCCCTCGCATG":8, "ATCATGCGAGGGCTAATTGTATATCACC":456, "GGTGATATACAATTAGCCCTCGCATGAT":187, "others":35}
	}
}
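
As a rough cross-check of the report above, the following Python sketch recomputes two figures from the pasted values only. It assumes (this is not confirmed by the fastplong documentation) that every read counted under filtering_result removes exactly one record from the output. The function names are illustrative and not part of fastplong.

```python
# Values copied from the JSON excerpt above; only the fields needed here.
REPORT = {
    "summary": {
        "before_filtering": {"total_reads": 743644},
        "after_filtering": {"total_reads": 745767},
    },
    "filtering_result": {
        "low_quality_reads": 963,
        "too_many_N_reads": 0,
        "too_short_reads": 795,
        "too_long_reads": 0,
    },
    "adapter_cutting": {
        "adapter_trimmed_reads": 737,
        "read_adapter_counts": {
            "CTAATTGTATATCACC": 17,
            "GGTGATATACAATTAGC": 8,
            "GGTGATATACAATTAGCC": 16,
            "GGTGATATACAATTAGCCCTCGCA": 10,
            "GGTGATATACAATTAGCCCTCGCATG": 8,
            "ATCATGCGAGGGCTAATTGTATATCACC": 456,
            "GGTGATATACAATTAGCCCTCGCATGAT": 187,
            "others": 35,
        },
    },
}


def net_read_gain(report):
    """Output records minus the input reads that survive filtering.

    Assumption: each discarded read removes exactly one output record.
    """
    before = report["summary"]["before_filtering"]["total_reads"]
    after = report["summary"]["after_filtering"]["total_reads"]
    discarded = sum(
        report["filtering_result"][key]
        for key in ("low_quality_reads", "too_many_N_reads",
                    "too_short_reads", "too_long_reads")
    )
    return after - (before - discarded)


def adapter_counts_total(report):
    """Sum of the per-adapter counts listed under read_adapter_counts."""
    return sum(report["adapter_cutting"]["read_adapter_counts"].values())


if __name__ == "__main__":
    print("net gain in records:", net_read_gain(REPORT))
    print("sum of read_adapter_counts:", adapter_counts_total(REPORT))
```

With these values, the per-adapter counts add up to exactly 737, so adapter_trimmed_reads is at least consistent with read_adapter_counts. Under the stated assumption, the output holds 3,881 more records than filtering alone would leave. Neither figure is anywhere near 193,512.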

After QC, my statistics on the output FASTQ show:

  1. 193,512 reads whose names contain the tag split-by-adapter.
  2. Among them, 4,015 reads were split into both left and right parts.
  3. The rest are single-side splits (only left or only right).
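
For reference, counts like these can be reproduced with a short Python sketch along the following lines. It assumes a standard four-line-per-record FASTQ and the two name prefixes seen in the output (split-by-adapter-left- and split-by-adapter-right-). The function names are hypothetical, and any other naming that fastplong might produce is not handled.

```python
import gzip

LEFT = "split-by-adapter-left-"
RIGHT = "split-by-adapter-right-"


def read_names(fastq_path):
    """Yield read names from a 4-line-per-record FASTQ (plain or gzipped)."""
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    with opener(fastq_path, "rt") as handle:
        for index, line in enumerate(handle):
            if index % 4 == 0:  # header line of each record
                fields = line[1:].split()
                if fields:
                    yield fields[0]


def count_split_names(names):
    """Classify tagged names by the original read ID they derive from."""
    left, right = set(), set()
    records = 0
    for name in names:
        if name.startswith(LEFT):
            left.add(name[len(LEFT):])
            records += 1
        elif name.startswith(RIGHT):
            right.add(name[len(RIGHT):])
            records += 1
    return {
        "split_records": records,            # output records carrying the tag
        "both_sides": len(left & right),     # originals with left and right parts
        "left_only": len(left - right),
        "right_only": len(right - left),
    }


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:  # e.g. python count_splits.py test.filter.fq.gz
        print(count_split_names(read_names(sys.argv[1])))
```

Note that split_records counts output records, whereas both_sides, left_only and right_only count original read IDs, so a read split into two parts contributes two records but one ID.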

Example:

Original read
ID: 111_65_3798_1359_574291794_77069_2_13.46
Length: 6093 bp

After QC
ID: split-by-adapter-left-111_65_3798_1359_574291794_77069_2_13.46
Length: 5340 bp
ID: split-by-adapter-right-111_65_3798_1359_574291794_77069_2_13.46
Length: 705 bp


My Questions

  1. Why does the JSON report only 737 adapter_trimmed_reads, while the output FASTQ contains 193,512 split-by-adapter sequences? Could this discrepancy be caused by other parameters or by a different mechanism in fastplong?
  2. Some reads in my dataset contain adapter sequences at both ends and are therefore split into two fragments. Is there any way to optimize the handling of such “dual-end adapter” cases?

Thank you!
