Description
I am reading sequence data from SRA records by first downloading the SRA record with the prefetch command and then iterating through the file using the C++ interface (version 2.10.8), i.e.:
ngs::ReadCollection run("DRR001375");
const size_t num_read = run.getReadCount(ngs::Read::all);
ngs::ReadIterator run_iter = ngs::ReadIterator( run.getReadRange ( 1, num_read, ngs::Read::all ) );
size_t read_count = 0;
while( run_iter.nextRead() ){
    ++read_count;
    while( run_iter.nextFragment() ){
        const std::string seq = run_iter.getFragmentBases().toString();
        // Process the read sequence ...
        process_read(seq);
    }
}
In general, this approach seems to work well. However, I have noticed that for some SRA records (like ERR191522), there is (a) significant memory consumption and (b) a dramatic slow-down when iterating through the file. The following plot shows the speed (in reads per second) and memory consumption (from /proc/meminfo, reported as a fraction of total system memory):

Other SRA records seem to be fine. For DRR001375, the following graph shows fairly constant speed and memory usage:

Is there a way to read SRA records like ERR191522 without the large memory consumption? If not, is there a way to identify, in advance, SRA records that will exhibit this behavior? The available RAM on cluster instances can easily be exhausted while processing a single SRA record.
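For anyone trying to reproduce the measurements, here is a minimal sketch of how a memory fraction could be derived from /proc/meminfo. It is an assumption, not necessarily the exact method used for the plots: it takes the fraction in use to be 1 - MemAvailable / MemTotal, and the function names `meminfo_kb` and `used_memory_fraction` are hypothetical helpers, not part of the NGS API.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Return the value (in kB) of a /proc/meminfo field such as "MemTotal",
// or -1 if the field is not present.
// Lines in /proc/meminfo look like "MemTotal:       16316412 kB".
long meminfo_kb(std::istream& in, const std::string& key) {
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string name;
        long value = 0;
        if ((fields >> name >> value) && name == key + ":") {
            return value;
        }
    }
    return -1;
}

// Fraction of total system memory in use, computed (by assumption) as
// 1 - MemAvailable / MemTotal. Returns -1.0 if either field is missing.
double used_memory_fraction(const std::string& meminfo_text) {
    std::istringstream total_in(meminfo_text);
    std::istringstream avail_in(meminfo_text);
    const long total = meminfo_kb(total_in, "MemTotal");
    const long avail = meminfo_kb(avail_in, "MemAvailable");
    if (total <= 0 || avail < 0) {
        return -1.0;
    }
    return 1.0 - static_cast<double>(avail) / static_cast<double>(total);
}
```

In the read loop, the contents of /proc/meminfo could be read into a string every N reads and passed to `used_memory_fraction`, with the result logged next to `read_count` divided by the elapsed time to get reads per second.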