To facilitate the cost-effective generation of large scATAC-seq atlases for deep learning model training, we developed a new version of the open-source microfluidic system HyDrop with increased sensitivity and scale: HyDrop v2.
Deciphering the cis-regulatory logic underlying cell type identity is a fundamental question in biology. Single-cell chromatin accessibility (scATAC-seq) atlases have enabled training of sequence-to-function (S2F) deep learning models, allowing decoding of enhancer logic and design of synthetic enhancers. The number of single cells sampled per cell type and the number of unique ATAC fragments per cell are expected to be the main factors to consider when building a training data set for S2F modeling. However, precise criteria to optimize S2F training data have not yet been evaluated. In addition, it is unknown whether custom scATAC-seq technologies, which often have lower sensitivity per cell but are more cost-effective, can be used to train S2F models. Here, we introduce an improved custom scATAC-seq method, called HyDrop v2, with increased sensitivity and scale, and benchmark this method against other scATAC-seq methods in the context of S2F model training. We find that lower fragment counts per cell can be compensated for by adding more cells to a dataset. S2F models trained on either custom or commercial scATAC-seq atlases are comparable in terms of enhancer prediction, sequence explainability, and transcription factor footprinting. Data generated with different scATAC-seq platforms can be combined into large-scale atlases that serve as cost-effective training data for deep learning models.
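One way to read the trade-off between fragments per cell and cell number is in terms of aggregate (pseudobulk) depth per cell type. The sketch below assumes, purely for illustration, that aggregate depth is the product of cell count and unique fragments per cell, and asks how many cells a lower-sensitivity platform would need to match a reference dataset. This linear approximation is not a criterion stated in the abstract, and the function names and numbers are hypothetical.

```python
import math


def pseudobulk_fragments(n_cells: int, fragments_per_cell: int) -> int:
    """Aggregate unique fragments for one cell type (assumed: simple product)."""
    return n_cells * fragments_per_cell


def cells_needed_to_match(
    reference_cells: int,
    reference_fragments_per_cell: int,
    fragments_per_cell: int,
) -> int:
    """Cells required at a lower per-cell depth to reach the reference aggregate depth.

    Hypothetical helper: assumes aggregate depth is all that matters,
    which is a simplification of the benchmark described in the text.
    """
    if fragments_per_cell <= 0:
        raise ValueError("fragments_per_cell must be positive")
    target = pseudobulk_fragments(reference_cells, reference_fragments_per_cell)
    # Round up: a fractional cell cannot be sampled.
    return math.ceil(target / fragments_per_cell)


if __name__ == "__main__":
    # Illustrative numbers only, not values reported in the study.
    needed = cells_needed_to_match(
        reference_cells=1000,
        reference_fragments_per_cell=10_000,
        fragments_per_cell=4_000,
    )
    print(f"Cells needed at 4,000 fragments/cell: {needed}")
```

Under this assumption, a platform that recovers fewer fragments per cell needs proportionally more cells per cell type; whether model performance actually follows aggregate depth this closely is what the benchmark evaluates empirically.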
The sequencing data and count matrix generated in this study have been deposited in the Gene Expression Omnibus database under the accession code GSE293575 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE293575). For mouse cortex data, this includes the public data from 10x Genomics (https://www.10xgenomics.com/datasets/fresh-cortex-from-adult-mouse-brain-p-50-1-standard-1-1-0, https://www.10xgenomics.com/datasets/8k-adult-mouse-cortex-cells-atac-v1-1-chromium-x-1-1-standard, https://www.10xgenomics.com/datasets/8k-adult-mouse-cortex-cells-atac-v2-chromium-controller-2-standard) and previously published data from De Rop et al. (2022), with raw data available under GSE175684 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi). The scATAC coverage bigwigs can be downloaded at https://ucsctracks.aertslab.org/papers/hydrop_v2_paper/ or https://zenodo.org/communities/aertslab_hydrop_v2_paper/. The sciATAC-seq data of Drosophila embryos aged 16-20 h after egg laying were downloaded from Calderon et al. (2022, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE190149). All processed fragment files are available at https://resources.aertslab.org/papers/hydrop_v2/. Source data are provided with this paper.
The code used to develop the model, perform the analyses, and generate the results in this study is publicly available and has been deposited in a GitHub repository under a CC-BY license. The specific version of the code associated with this publication is listed in Key Resource Table S5 and archived on Zenodo at https://zenodo.org/records/17434493. A detailed explanation of the PUMATAC pipeline used to process HyDrop v2 and 10x data can be found at https://github.com/aertslab/PUMATAC. Detailed instructions on CREsted can be found at https://github.com/aertslab/CREsted and https://crested.readthedocs.io/en/latest/changelog.html. For Seq2PRINT (scPRINTER in its Python implementation), we refer to the tutorial by Hu et al. (2025) at https://github.com/buenrostrolab/scPrinter.