-
Notifications
You must be signed in to change notification settings - Fork 24
Description
Good morning,
I'm a student in Computer Science at Università degli Studi di Milano and for my thesis I am assessing some pipeline for the analysis of SARS-CoV-2 samples.
In order to select the best pipeline for our requirements, I'm using the benchmark datasets available here.
I found the Supplementary_table2 in your paper (Xiaoli L, Hagey JV, Park DJ, Gulvik CA, Young EL, Alikhan N-F, Lawsin A, Hassell N, Knipe K, Oakeson KF, Retchless AC, Shakya M, Lo C-C, Chain P, Page AJ, Metcalf BJ, Su M, Rowell J, Vidyaprakash E, Paden CR, Huang AD, Roellig D, Patel K, Winglee K, Weigand MR, Katz LS. 2022. Benchmark datasets for SARS-CoV-2 surveillance bioinformatics. PeerJ 10:e13821 http://doi.org/10.7717/peerj.13821) and I would like to use also the data contained there for evaluations (and not only the file in.tsv available for every dataset).
I'm writing here because I can't understand how the column 'Total reads' is calculated. In particular, I used FastQC (the value of the field 'Total Sequences') to compute this value and I also counted the reads in the original .FASTQ file but the numbers don't correspond to the ones published in the Supplementary_table2.
Do you know why the numbers are different? Is it possible that Supplementary_table2 is outdated with respect to the current version of the dataset?
If this is the case, which version of the dataset is matched to Supplementary_table2 and used in your paper?
Thank you very much for your time :)
Best regards,
Sara Manfredi