Contents of Zenodo archive of processed CLIPNET data and results

The training data and the data needed to reproduce the figures are available on Zenodo (DOI: 10.5281/zenodo.10597358). To preserve directory structure, we packaged the data into tar files, divided roughly by figure or analysis. The files are described below:

procap_library_prefixes.txt: Prefixes for the PRO-cap libraries (n=67) used to train and evaluate CLIPNET.
procap_to_1kGP_conversion.json: Lists the individual ID for each PRO-cap library (for extracting genotypes from 1kGP). Note that some libraries were ultimately excluded from CLIPNET, so this file has more than 67 entries.
training_data.tar.gz: Contains processed data used to train the CLIPNET models.
- individual_pints_peaks/: Contains the PINTS peaks for each individual PRO-cap library.
- individual_jittered_windows/: Contains the jittered (uniformly random, +/- 250bp around center of each peak) 1 kb windows for each individual PRO-cap library.
- processed_data/: Contains the processed data used to train the models, including the individualized sequences and the RPM-normalized PRO-cap signal, packaged as npz arrays. Data were concatenated across libraries, then split into the data folds described in processed_data/data_fold_assignments.csv.gz. With N denoting the number of sequences, the PRO-cap data are structured as N x 2000 arrays (1000 bp plus (pl) strand, 1000 bp minus (mn) strand), and the sequence data are structured as N x 1000 x 4 arrays (1000 = sequence length, 4 = two-hot encoding of the nucleotides).
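
The array layout described for processed_data/ can be illustrated with a minimal sketch on synthetic arrays. Several details here are assumptions rather than facts about the archive: the npz key names (`signal`, `sequence`) are hypothetical, the plus strand is assumed to occupy the first 1000 columns, and "two-hot" is assumed to mean that each position stores the counts of its two alleles, so every row of four sums to 2. Listing `archive.files` on a real npz file shows the actual key names.

```python
import io

import numpy as np

N, SEQ_LEN = 3, 1000  # tiny synthetic example, not real data

rng = np.random.default_rng(0)

# Synthetic stand-in for the PRO-cap signal: N x 2000.
signal = rng.random((N, 2 * SEQ_LEN))

# Synthetic stand-in for two-hot sequences: N x 1000 x 4.
# Assumption: the encoding sums the one-hot vectors of two alleles,
# e.g. A/A -> [2, 0, 0, 0] and A/C -> [1, 1, 0, 0].
alleles = rng.integers(0, 4, size=(2, N, SEQ_LEN))
eye = np.eye(4, dtype=np.int8)
sequence = eye[alleles[0]] + eye[alleles[1]]

# Round-trip through an in-memory npz; the key names are hypothetical.
buf = io.BytesIO()
np.savez_compressed(buf, signal=signal, sequence=sequence)
buf.seek(0)
with np.load(buf) as archive:
    keys = archive.files  # inspect this on the real files
    loaded_signal = archive["signal"]
    loaded_sequence = archive["sequence"]

# Split the concatenated signal into strands (assumed order: plus, then minus).
plus_strand = loaded_signal[:, :SEQ_LEN]
minus_strand = loaded_signal[:, SEQ_LEN:]
```
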
evaluation_metric.tar.gz: Contains the evaluation metrics for the CLIPNET models. Supporting data are in evaluation_data.tar.gz.
- ensemble_test/: Contains the evaluation metrics for the individual models on the complete held-out data set (fold 0).
- individual_test/: Contains the evaluation metrics for the model folds on the individual model held-out folds (model 1 used fold 1 as its held-out fold, model 2 used fold 2, etc.).
- fixed_uniq_windows.bed.gz: A fixed set of 1 kb windows used to evaluate the models. We selected PRO-cap peaks that were present in at least 20 of the 67 libraries, then selected 1 kb windows around each of them (with 250 bp jittering).
- mean_predictor_corrs.csv.gz: Correlation between an averaged PRO-cap track (across loci) and individual tracks.
- replicate_pearsons.csv.gz: Correlation between tracks from isogenic replicates (n=9).
- clipnet_test_predictions.h5: Prediction of the ensembled model on data fold 0.
- puffin_clipnet_test_perf.csv.gz: Track correlations for Puffin's PRO-cap head.
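
The jittering used for the 1 kb windows (a uniformly random shift of up to 250 bp around the peak center) might look roughly like the sketch below. This is not the code used to build the archive: the function name `jittered_window`, the use of the peak midpoint as the center, the inclusive shift bounds, and the half-open coordinates are all assumptions made for illustration.

```python
import random

WINDOW = 1000     # 1 kb window, as described above
MAX_JITTER = 250  # +/- 250 bp, as described above


def jittered_window(peak_start, peak_end, rng):
    """Return (start, end) of a 1 kb window around a randomly shifted peak center.

    Hypothetical helper: coordinates are treated as half-open, and windows
    that would run past a chromosome boundary are not handled.
    """
    center = (peak_start + peak_end) // 2
    # Uniform integer shift, inclusive on both ends (assumption).
    shift = rng.randint(-MAX_JITTER, MAX_JITTER)
    start = center + shift - WINDOW // 2
    return start, start + WINDOW


rng = random.Random(0)  # seeded so the example is reproducible
windows = [jittered_window(10_000, 10_400, rng) for _ in range(5)]
```

Every window is exactly 1000 bp long, and its midpoint lies within 250 bp of the peak center (10,200 in this example).
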
evaluation_data.tar.gz: Contains data and predictions used to evaluate the performance of the CLIPNET models.
- processed_data/: Contains the processed data used to evaluate the models.
  - procap/: Contains the processed PRO-cap signal (csv) for each data fold.
  - sequences/: Contains the sequences (fasta) for each data fold.
- merged_pl_rpm.bw: bigWig file containing RPM-normalized plus strand signals, averaged across all individuals.
- merged_mn_rpm.bw: bigWig file containing RPM-normalized minus strand signals, averaged across all individuals.
deepshap_scores.tar.gz: Contains DeepSHAP contribution scores.
- merged_windows_all.bed.gz: A nonredundant set of 212,777 windows around PRO-cap peaks (union across all libraries) used for calculating DeepSHAP scores.
- all_tss_windows_reference_seq.fna.gz: The reference (hg38) sequence for the windows in merged_windows_all.bed.gz.
- all_seqs_onehot.npz: A one-hot encoded version of the reference sequence. This and the score arrays are structured as N x 4 x 1000 arrays for compatibility with TF-MoDISco.
- mean_across_folds_all_profile.npz: The profile contribution scores (mean across model folds).
- mean_across_folds_all_quantity.npz: The quantity contribution scores (mean across model folds).
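
Because the training sequences are stored as N x 1000 x 4 while the DeepSHAP arrays are stored as N x 4 x 1000, moving between the two layouts only requires swapping the last two axes. The sketch below uses synthetic one-hot data; the helper name `swap_length_and_base_axes` is hypothetical, and the A/C/G/T ordering of the four channels is not specified here.

```python
import numpy as np


def swap_length_and_base_axes(x):
    """Swap axes 1 and 2: (N, 1000, 4) <-> (N, 4, 1000).

    Hypothetical helper; the same call converts in either direction.
    """
    return np.swapaxes(x, 1, 2)


# Synthetic one-hot sequences in the N x 1000 x 4 layout.
rng = np.random.default_rng(0)
bases = rng.integers(0, 4, size=(2, 1000))
onehot_length_first = np.eye(4, dtype=np.int8)[bases]  # (2, 1000, 4)

# Layout described above for the DeepSHAP and TF-MoDISco inputs.
onehot_base_first = swap_length_and_base_axes(onehot_length_first)  # (2, 4, 1000)

# Swapping again recovers the original layout.
round_trip = swap_length_and_base_axes(onehot_base_first)
```
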
tfmodisco_results.tar.gz: Contains TF-MoDISco results.
- mean_across_folds_all_profile_modisco.h5: The TF-MoDISco results for the profile contribution scores.
- mean_across_folds_all_quantity_modisco.h5: The TF-MoDISco results for the quantity contribution scores.
- mean_across_folds_all_profile_modisco/: A report of the TF-MoDISco results for the profile contribution scores.
- mean_across_folds_all_quantity_modisco/: A report of the TF-MoDISco results for the quantity contribution scores.
- mean_across_folds_all_modisco_positions.h5: Distribution of TF-MoDISco motif positions around the max TSS for each window.
qtl_analysis.tar.gz: Contains the finished QTL analysis (log L2 ref - alt scores). Supporting data are in qtl_data.tar.gz.
qtl_data.tar.gz: Contains analysis of both tiQTLs and diQTLs.
- tiqtl/: Contains the tiQTL analysis.
  - predictions/: Contains predictions for each individual centered on each tiQTL.
    - ensemble_predictions/: Contains the predictions of the ensemble model.
    - individual_predictions/: Contains the predictions of the individual models.
  - tiQTL_snps.bed.gz: The SNPs used for the tiQTL analysis (note that we dropped multiallelic SNPs).
  - tiqtl_windows.bed.gz: The windows used for the tiQTL analysis.
- diqtl/: Contains the diQTL analysis. Identical structure to tiqtl/.
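
Since the archives are tar files packaged to preserve directory structure, they can be unpacked with standard tools. The following is a minimal sketch using Python's standard library; `extract_archive` is a hypothetical helper, and the demo builds a tiny stand-in archive instead of downloading anything, so the file it contains is a placeholder and not part of the real data.

```python
import tarfile
import tempfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive and return the relative paths of its files."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Safer extraction on Python versions that support filters.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return sorted(
        p.relative_to(dest).as_posix() for p in dest.rglob("*") if p.is_file()
    )


# Demo on a stand-in archive with a nested directory (placeholder content).
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    nested = tmp / "qtl_data" / "tiqtl"
    nested.mkdir(parents=True)
    (nested / "tiqtl_windows.bed").write_text("chr1\t0\t1000\n")

    archive = tmp / "demo.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(tmp / "qtl_data", arcname="qtl_data")

    extracted = extract_archive(archive, tmp / "out")
```

For a downloaded file, the call would take the form `extract_archive("training_data.tar.gz", "clipnet_data")`, where the destination directory name is arbitrary.
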
