Extreme memory consumption when reading certain SRA records? #31

@jgans

I am reading sequence data from SRA records by first downloading each record with the prefetch command and then iterating through the file using the NGS C++ interface (version 2.10.8):

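#include <string>
#include <ncbi-vdb/NGS.hpp>
#include <ngs/ReadCollection.hpp>
#include <ngs/ReadIterator.hpp>

// Open the run by accession through the ncbi-vdb engine
ngs::ReadCollection run = ncbi::NGS::openReadCollection("DRR001375");
const size_t num_read = run.getReadCount(ngs::Read::all);
ngs::ReadIterator run_iter = run.getReadRange(1, num_read, ngs::Read::all);

size_t read_count = 0;
while( run_iter.nextRead() ){

	++read_count;
	// A read may contain more than one fragment
	while( run_iter.nextFragment() ){

		const std::string seq = run_iter.getFragmentBases().toString();

		// Process the read sequence ...
		process_read(seq);
	}
}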

In general, this approach seems to work well. However, I have noticed that for some SRA records (like ERR191522), there is (a) significant memory consumption and (b) a dramatic slow-down when iterating through the file. The following plot shows the speed (in reads per second) and memory consumption (from /proc/meminfo, reported as a fraction of total system memory):
[image: reads per second and memory consumption while iterating through ERR191522]

Other SRA records seem to be fine. For DRR001375, the following graph shows fairly constant speed and memory usage:
[image: reads per second and memory consumption while iterating through DRR001375]

Is there a way to read SRA records like ERR191522 without the large memory consumption? If not, is there a way to identify, in advance, SRA records that will exhibit this behavior? The available RAM on cluster instances can easily be exhausted while processing a single SRA record.
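
For anyone who wants to reproduce the two quantities plotted above, here is a minimal sketch of how they could be sampled. It is not the exact instrumentation behind the plots. The function names `used_memory_fraction` and `reads_per_second` are hypothetical, and the choice of the `MemTotal` and `MemAvailable` fields of /proc/meminfo is an assumption (the plots may have been based on other fields, such as `MemFree`).

```cpp
#include <chrono>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Parse /proc/meminfo-style text ("Key:   value kB") and return the fraction
// of total system memory in use, computed as 1 - MemAvailable / MemTotal.
// Returns -1.0 if either field is missing.
// Assumption: MemAvailable is the right notion of "free" for this purpose.
double used_memory_fraction(std::istream &meminfo)
{
    double total_kb = -1.0;
    double avail_kb = -1.0;
    std::string line;

    while (std::getline(meminfo, line)) {
        std::istringstream fields(line);
        std::string key;
        double value = 0.0;

        // Skip lines that are not of the form "Key: number ..."
        if (!(fields >> key >> value)) {
            continue;
        }
        if (key == "MemTotal:") {
            total_kb = value;
        } else if (key == "MemAvailable:") {
            avail_kb = value;
        }
    }

    if (total_kb <= 0.0 || avail_kb < 0.0) {
        return -1.0;
    }
    return 1.0 - avail_kb / total_kb;
}

// Average throughput over one sampling interval; 0.0 for an empty interval.
double reads_per_second(std::uint64_t reads,
                        std::chrono::duration<double> elapsed)
{
    const double seconds = elapsed.count();
    return seconds > 0.0 ? static_cast<double>(reads) / seconds : 0.0;
}
```

In the read loop, one could open `std::ifstream("/proc/meminfo")` every N reads, pass it to `used_memory_fraction`, and pass the number of reads processed since the last sample together with a `std::chrono::steady_clock` time difference to `reads_per_second`. Note that this reports system-wide memory, so other processes on the same machine will affect the numbers.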
