The training data and the data needed to reproduce the figures are available at 10.5281/zenodo.10597358. To preserve directory structure, we packaged the data into tar files, divided roughly by figure/analysis. Below is a description of the files:
- procap_library_prefixes.txt: Prefixes for the PRO-cap libraries (n=67) used to train and evaluate CLIPNET.
- procap_to_1kGP_conversion.json: Lists the individual ID for each PRO-cap library (for extracting genotypes from 1kGP). Note that some libraries were ultimately excluded from CLIPNET, so this file has more than 67 entries.
- training_data.tar.gz: Contains processed data used to train the CLIPNET models.
    - individual_pints_peaks/: Contains the PINTS peaks for each individual PRO-cap library.
    - individual_jittered_windows/: Contains the jittered (uniformly random, +/- 250 bp around the center of each peak) 1 kb windows for each individual PRO-cap library.
    - processed_data/: Contains the processed data used to train the models, including the individualized sequences and PRO-cap signal (RPM normalized). Packaged as npz arrays. Data were concatenated across libraries, then split into the data folds described in processed_data/data_fold_assignments.csv.gz. We note that the PRO-cap data are structured as N x 2000 arrays (1000 bp pl strand, 1000 bp mn strand). The sequence data are structured as N x 1000 x 4 arrays (N = number of sequences, 1000 = sequence length, 4 = two-hot encoding of sequences).
- evaluation_metric.tar.gz: Contains the evaluation metrics for the CLIPNET models. Supporting data are in evaluation_data.tar.gz.
    - ensemble_test/: Contains the evaluation metrics for the individual models on the complete hold out data set (fold 0).
    - individual_test/: Contains the evaluation metrics for the model folds on the individual model hold out folds (model 1 used fold 1 as a hold out, model 2 used fold 2, etc).
    - fixed_uniq_windows.bed.gz: A fixed set of 1 kb windows used to evaluate the models. We selected PRO-cap peaks that were present in at least 20 of the 67 libraries, then selected 1 kb windows around each of them (with 250 bp jittering).
    - mean_predictor_corrs.csv.gz: Correlation between an averaged PRO-cap track (across loci) against individual tracks.
    - replicate_pearsons.csv.gz: Correlation between tracks from isogenic replicates (n=9).
    - clipnet_test_predictions.h5: Prediction of the ensembled model on data fold 0.
    - puffin_clipnet_test_perf.csv.gz: Track correlations for Puffin's PRO-cap head.
- evaluation_data.tar.gz: Contains data and predictions used to evaluate the performance of the CLIPNET models.
    - processed_data/: Contains the processed data used to evaluate the models.
        - procap/: Contains the processed PRO-cap signal (csv) for each data fold.
        - sequences/: Contains the sequences (fasta) for each data fold.
    - merged_pl_rpm.bw: bigWig file containing RPM-normalized plus strand signals, averaged across all individuals.
    - merged_mn_rpm.bw: bigWig file containing RPM-normalized minus strand signals, averaged across all individuals.
- deepshap_scores.tar.gz: Contains DeepSHAP contribution scores.
    - merged_windows_all.bed.gz: A nonredundant set of 212,777 windows around PRO-cap peaks (union across all libraries) used for calculating DeepSHAP scores.
    - all_tss_windows_reference_seq.fna.gz: The reference (hg38) sequence for the windows in merged_windows_all.bed.gz.
    - all_seqs_onehot.npz: A one-hot encoded version of the reference sequence. This and the score arrays are structured as N x 4 x 1000 arrays for compatibility with TF-MoDISco.
    - mean_across_folds_all_profile.npz: The profile contribution scores (mean across model folds).
    - mean_across_folds_all_quantity.npz: The quantity contribution scores (mean across model folds).
- tfmodisco_results.tar.gz: Contains TF-MoDISco results.
    - mean_across_folds_all_profile_modisco.h5: The TF-MoDISco results for the profile contribution scores.
    - mean_across_folds_all_quantity_modisco.h5: The TF-MoDISco results for the quantity contribution scores.
    - mean_across_folds_all_profile_modisco/: A report of the TF-MoDISco results for the profile contribution scores.
    - mean_across_folds_all_quantity_modisco/: A report of the TF-MoDISco results for the quantity contribution scores.
    - mean_across_folds_all_modisco_positions.h5: Distribution of TF-MoDISco motif positions around the max TSS for each window.
- qtl_analysis.tar.gz: Contains the finished QTL analysis (log L2 ref - alt scores). Supporting data are in qtl_data.tar.gz.
- qtl_data.tar.gz: Contains analysis of both tiQTLs and diQTLs.
    - tiqtl/: Contains the tiQTL analysis.
        - predictions/: Contains predictions for each individual centered on each tiQTL.
            - ensemble_predictions/: Contains the predictions of the ensemble model.
            - individual_predictions/: Contains the predictions of the individual models.
        - tiQTL_snps.bed.gz: The SNPs used for the tiQTL analysis (note that we dropped multiallelic SNPs).
        - tiqtl_windows.bed.gz: The windows used for the tiQTL analysis.
    - diqtl/: Contains the diQTL analysis. Identical structure to tiqtl/.
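
The N x 2000 layout of the PRO-cap arrays in processed_data/ can be illustrated with a short sketch. It uses synthetic arrays rather than the real npz files, because the key names inside those files are not listed here. The helper `split_strands` is hypothetical, and it assumes the plus (pl) strand occupies the first 1000 columns and the minus (mn) strand the last 1000, following the order in which the strands are described above.

```python
import numpy as np

WINDOW = 1000  # bp per strand, as described for the processed arrays


def split_strands(procap):
    """Split an N x 2000 signal array into plus and minus strand halves.

    Assumption: columns 0-999 hold the pl strand and columns 1000-1999
    hold the mn strand.
    """
    procap = np.asarray(procap)
    if procap.ndim != 2 or procap.shape[1] != 2 * WINDOW:
        raise ValueError(f"expected N x {2 * WINDOW}, got {procap.shape}")
    return procap[:, :WINDOW], procap[:, WINDOW:]


# Synthetic stand-in for a real signal array (3 windows).
rng = np.random.default_rng(0)
procap = rng.random((3, 2 * WINDOW))
pl, mn = split_strands(procap)
print(pl.shape, mn.shape)
```

With the real data, the array passed to `split_strands` would come from `numpy.load` on one of the npz files, after inspecting its `.files` attribute to find the relevant key.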
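
The training sequences are stored as N x 1000 x 4 arrays, whereas the DeepSHAP arrays are stored as N x 4 x 1000 for TF-MoDISco. Converting between the two layouts is a single axis swap. The sketch below uses a synthetic one-hot array; the helper name `to_channels_first` is hypothetical, and the example does not depend on the nucleotide order of the four channels, which is not stated here.

```python
import numpy as np


def to_channels_first(seqs):
    """Convert N x L x 4 sequence arrays to the N x 4 x L layout."""
    seqs = np.asarray(seqs)
    if seqs.ndim != 3 or seqs.shape[2] != 4:
        raise ValueError(f"expected N x L x 4, got {seqs.shape}")
    return np.transpose(seqs, (0, 2, 1))


# Synthetic one-hot sequences: 2 windows of 1000 bp.
rng = np.random.default_rng(0)
base_index = rng.integers(0, 4, size=(2, 1000))
onehot = np.eye(4)[base_index]            # shape (2, 1000, 4)
channels_first = to_channels_first(onehot)  # shape (2, 4, 1000)
print(onehot.shape, channels_first.shape)
```

The same transpose with axes `(0, 2, 1)` converts back, since swapping the last two axes is its own inverse.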
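
The jittered windows are described as 1 kb windows whose centers are shifted uniformly at random by up to 250 bp from each peak center. A minimal sketch of that idea follows. It is not the code used to produce the files: the function `jittered_window` is hypothetical, and it assumes integer offsets drawn from -250 to +250 inclusive, BED-style half-open coordinates, and no clipping at chromosome ends.

```python
import numpy as np

WINDOW = 1000     # window width in bp
MAX_JITTER = 250  # maximum shift of the window center in bp


def jittered_window(peak_start, peak_end, rng):
    """Return (start, end) of a 1 kb window jittered around a peak center."""
    center = (peak_start + peak_end) // 2
    # Generator.integers excludes the upper bound, hence the + 1.
    offset = int(rng.integers(-MAX_JITTER, MAX_JITTER + 1))
    start = center + offset - WINDOW // 2
    return start, start + WINDOW


rng = np.random.default_rng(0)
start, end = jittered_window(10_000, 10_400, rng)
print(start, end, end - start)
```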
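
Several of the evaluation files report correlations between tracks (for example, replicate_pearsons.csv.gz). As an illustration of a per-window Pearson correlation between two sets of tracks, the sketch below computes one coefficient per row of two equally shaped arrays. The function `rowwise_pearson` is hypothetical, and whether the reported metrics were computed exactly this way (for instance, over both strands concatenated) is an assumption.

```python
import numpy as np


def rowwise_pearson(a, b):
    """Pearson correlation between matching rows of two N x L arrays.

    Rows with zero variance yield nan.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.ndim != 2 or a.shape != b.shape:
        raise ValueError("expected two 2-D arrays of the same shape")
    a_c = a - a.mean(axis=1, keepdims=True)
    b_c = b - b.mean(axis=1, keepdims=True)
    num = (a_c * b_c).sum(axis=1)
    den = np.sqrt((a_c ** 2).sum(axis=1) * (b_c ** 2).sum(axis=1))
    with np.errstate(invalid="ignore", divide="ignore"):
        return num / den


# Synthetic tracks: 4 windows, 2000 positions each.
rng = np.random.default_rng(0)
observed = rng.random((4, 2000))
predicted = observed + 0.1 * rng.random((4, 2000))
print(rowwise_pearson(observed, predicted))
```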