joss/paper.md: 4 additions & 4 deletions
@@ -35,7 +35,7 @@ bibliography: paper.bib
# Summary
-
Cryogenic electron microscopy (cryo-EM) [@cryoem-drug-review; @cryoem-challenges] is an imaging technique used to obtain the structure of objects of near-atomic scales experimentally via transmission electron microscopy of cryogenically frozen samples. The Electron Microscopy Public Image Archive (EMPIAR) [@empiar] is a public resource for the raw image data collected by cryo-EM experiments and facilitates free access to this data, allowing it to be used for methods development and validation. Deep learning-based image processing approaches have been applied to many steps of the cryo-EM reconstruction workflow [@ai-in-cryoem]. Many of the resulting algorithms have been widely adopted as they enable quicker processing or improved interpretation of the data. Deep learning-based approaches require large amounts of data to train the algorithms. However, as datasets can have hundreds of files and sizes on the order of terabytes or hundreds of gigabytes, downloading and managing these datasets can become a barrier to the development of deep-learning methods. Additionally, the currently recommended tools to download data from EMPIAR either use proprietary software, require a user account or necessitate a web browser.
+
Cryogenic electron microscopy (cryo-EM) [@cryoem-drug-review; @cryoem-challenges] is an imaging technique used to obtain the structure of biomolecular objects at near-atomic scales experimentally via transmission electron microscopy of cryogenically frozen samples. The Electron Microscopy Public Image Archive (EMPIAR) [@empiar] is a public resource for the raw image data collected by cryo-EM experiments and facilitates free access to this data, allowing it to be used for methods development and validation. Deep learning-based image processing approaches have been applied to many steps of the cryo-EM reconstruction workflow [@ai-in-cryoem]. Many of the resulting algorithms have been widely adopted as they enable quicker processing and/or improved interpretation of the data. Deep learning-based approaches require large amounts of data to train the algorithms. However, as datasets can have hundreds of files and sizes on the order of terabytes or hundreds of gigabytes, downloading and managing these datasets can become a barrier to the development of deep-learning methods. Additionally, the currently recommended tools to download data from EMPIAR either use proprietary software, require a user account or necessitate a web browser.
To address this and to provide a way to integrate EMPIAR data into machine learning codebases, we have developed EMPIARreader. This is an open-source tool which provides a Python library to allow lazy loading of EMPIAR datasets into a machine learning-compatible format. It parses EMPIAR metadata, uses the mrcfile library [@mrcfile] to interpret MRC files, supports common image file formats and uses the starfile library [@starfile] to interpret STAR files. To our knowledge, there are no other tools to effectively make use of EMPIAR in a dynamic manner for data-intensive tasks such as machine learning. EMPIARreader additionally provides a simple, lightweight command line interface (CLI) which allows users to search EMPIAR entries using glob patterns or regular expressions and then download files via FTP or HTTP(S).
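The glob and regular-expression filtering mentioned above can be illustrated with a minimal sketch that uses only the Python standard library. This is not EMPIARreader's own API: the `filter_paths` helper and the file listing are hypothetical and stand in for whatever directory listing an entry exposes.

```python
import fnmatch
import re


def filter_paths(paths, pattern, use_regex=False):
    """Return the paths that match a glob pattern or a regular expression.

    Hypothetical helper for illustration only; not part of EMPIARreader.
    """
    if use_regex:
        compiled = re.compile(pattern)
        # search() lets the expression match anywhere in the path
        return [p for p in paths if compiled.search(p)]
    # fnmatchcase() is case-sensitive on every platform; "*" also matches "/"
    return [p for p in paths if fnmatch.fnmatchcase(p, pattern)]


# Assumed example listing; real entries follow no enforced directory layout
listing = [
    "data/micrographs/mic_0001.mrc",
    "data/micrographs/mic_0002.mrc",
    "data/particles.star",
    "data/README.txt",
]

mrc_files = filter_paths(listing, "*.mrc")
star_files = filter_paths(listing, r"\.star$", use_regex=True)
```

Selecting by pattern before any transfer starts is what allows a subset of a multi-terabyte entry to be fetched rather than the whole directory tree.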
EMPIARreader is easily installed in a Python environment via the standard Python package management tools pip and Poetry and has been released as a PyPI [@pypi] package ([EMPIARreader](https://pypi.org/project/empiarreader/)).
@@ -44,9 +44,9 @@ EMPIARreader is easily installed in a Python environment via the standard Python
In cryo-EM, the scattering of the electron beam by the electrostatic potential of the molecules in the sample is recorded in the images captured by the detector.
Due to advancements in hardware and software since 2013, the resolution achievable via cryo-EM reconstruction rivals that possible through x-ray crystallography [@cryoem-resolution], with cryo-EM being the preferable technique for determining the conformations of many macromolecules [@cryoem-development].
The images which make up cryo-EM datasets commonly have a very low signal-to-noise ratio (SNR) because exposure is limited to minimise radiation-damage-induced disorder. Consequently, the structures are obtained by averaging over thousands of examples of the structures in the samples, which necessitates a very large dataset per experiment.
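Why averaging over many examples helps can be seen in a toy one-dimensional sketch. It assumes independent, zero-mean Gaussian noise and perfectly aligned copies of the same signal, which is a large simplification of a real cryo-EM reconstruction; the signal shape, noise level and copy counts below are arbitrary illustrative values.

```python
import random
import statistics


def average_noisy_copies(signal, noise_sigma, n_copies, rng):
    """Average n_copies of the signal, each with independent Gaussian noise."""
    total = [0.0] * len(signal)
    for _ in range(n_copies):
        for i, value in enumerate(signal):
            total[i] += value + rng.gauss(0.0, noise_sigma)
    return [t / n_copies for t in total]


def residual_noise(signal, estimate):
    """Standard deviation of the error between the estimate and the true signal."""
    return statistics.pstdev([e - s for e, s in zip(estimate, signal)])


rng = random.Random(0)  # fixed seed so the sketch is reproducible
signal = [1.0 if 20 <= i < 40 else 0.0 for i in range(64)]  # assumed toy signal
noise_sigma = 5.0  # noise much stronger than the signal, i.e. a very low SNR

noise_single = residual_noise(signal, average_noisy_copies(signal, noise_sigma, 1, rng))
noise_averaged = residual_noise(signal, average_noisy_copies(signal, noise_sigma, 400, rng))

print(f"residual noise, 1 copy:     {noise_single:.2f}")
print(f"residual noise, 400 copies: {noise_averaged:.2f}")
```

For independent noise the residual falls roughly as the square root of the number of copies averaged, so 400 copies reduce it by a factor of about 20. This is the reason a single experiment needs so many images.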
-
Raw image datasets are deposited into the online public image archive, EMPIAR [@empiar]. There is a loose structure to follow, but generally each deposited dataset is structured according to the needs or preferences of the depositing user with no particular directory structure enforced. With over 1300 entries and >3PB of data hosted, EMPIAR has become an important resource for the structural biology community, amassing over 700 citations in published works.
+
Raw image datasets can be deposited into the online public image archive, EMPIAR [@empiar]. There is a loose structure to follow, but generally each deposited dataset is structured according to the needs or preferences of the depositing user with no particular directory structure enforced. With over 1300 entries and >3PB of data hosted, EMPIAR has become an important resource for the structural biology community, amassing over 700 citations in published works.
-
Deep-learning-based methods have developed significantly in recent years [@dl-development] and a number of algorithms have been developed for use in cryo-EM data processing. Deep-learning has been applied to the particle picking [@topaz; @cryolo], 3D classification and dynamics [@cryodrgn; @3dflex; @dynamight], postprocessing [@deepemhancer]and model building [@jamali2023automated; @backbonepred] stages of the reconstruction pipeline among many more examples [@ai-in-cryoem]. Datasets from EMPIAR have been used extensively for training and validating cryo-EM related deep learning algorithms, particularly for those which rely on raw image data. To make optimal use of the archive it is essential that the datasets are easily accessible and their size does not hinder accessibility or algorithm performance.
+
Deep-learning-based methods have developed significantly in recent years [@dl-development] and a number of algorithms have been developed for use in cryo-EM data processing. Deep learning has been applied to stages of the image processing and reconstruction pipeline, including particle picking [@topaz; @cryolo], 3D classification and dynamics [@cryodrgn; @3dflex; @dynamight] and model building [@jamali2023automated; @backbonepred], among many more examples [@ai-in-cryoem]. Datasets from EMPIAR have been used extensively for training and validating cryo-EM related deep learning algorithms, particularly for those which rely on raw image data. To make optimal use of the archive it is essential that the datasets are easily accessible and their size does not hinder accessibility or algorithm performance.
The current recommended methods to download data from EMPIAR are via:
@@ -61,7 +61,7 @@ EMPIARreader allows the granularity of downloads to be configured from an entire
# Licensing and userbase
-
EMPIARreader is offered under a BSD 3-clause license and can be utilised either from a CLI or via a Python library. It is currently in active use by researchers at the Alan Turing Institute and the STFC Scientific Computing Department.
+
EMPIARreader is offered under a BSD 3-clause license and can be utilised either from a CLI or via a Python library. It is currently in active use by researchers at the Alan Turing Institute and STFC Scientific Computing Department.