All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Added support for Bruker .d (TDF) files in parsers and datasets.
- Added the `hf_converter` keyword to `SpectrumDataset._to_tensor` and `AnnotatedSpectrumDataset._to_tensor` for pylance compatibility.
- Added Jupyter to documentation dependencies.
- Fixed peptide tokenizer detokenizing in reverse.
- Handle MGF `index=` prefixes during parsing.
- Handle missing precursor charge in the MzML parser.
- Adjusted the Lance URL in datasets and documentation.
- `Tokenizer.detokenize()` now truncates the output to the first stop token it finds, if `trim_stop_token=True`.
- Add stop and start tokens for `AnnotatedSpectrumDataset`, when available.
- When `reverse` is used for the `PeptideTokenizer`, automatically reverse the decoded peptide.
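
The stop-token truncation and automatic reversal described above can be pictured with a small sketch. This is not Depthcharge's implementation: `detokenize_sketch`, the `"$"` stop token, and the choice to drop the stop token itself are assumptions made only for illustration.

```python
def detokenize_sketch(tokens, stop_token="$", trim_stop_token=True, reverse=False):
    """Join a token sequence into a string (hypothetical helper, not the library API).

    Assumption: when trimming, everything from the first stop token onward
    is dropped, including the stop token itself.
    """
    tokens = list(tokens)
    if trim_stop_token and stop_token in tokens:
        # Keep only the tokens that precede the first stop token.
        tokens = tokens[: tokens.index(stop_token)]
    if reverse:
        # A tokenizer that encodes peptides in reverse undoes that on decoding.
        tokens = tokens[::-1]
    return "".join(tokens)


trimmed = detokenize_sketch(["P", "E", "P", "$", "K", "$"])
restored = detokenize_sketch(["K", "P", "E", "$"], reverse=True)
```

Here `trimmed` ignores anything generated after the first stop token, and `restored` is returned in the original residue order.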
- Added support for unsigned modification masses that don't quite conform to the Proforma standard.
- The `scan_id` column for parsed spectra is now a string instead of an integer. This is less space efficient, but we ran into issues with Sciex indexing when trying to use only an integer.
- Partially revert length changes to `SpectrumDataset` and `AnnotatedSpectrumDataset`. We removed `__len__` from both due to problems with PyTorch Lightning compatibility.
- Simplify dataset code by removing redundancy with `lance.pytorch.LanceDataset`.
- Improved warning message for skipped spectra.
- The length of `SpectrumDataset` and `AnnotatedSpectrumDataset` now reflects the `samples` parameter of the `lance.pytorch.LanceDataset` parent class.
- The length of `SpectrumDataset` and `AnnotatedSpectrumDataset` is now the number of batches, not the number of spectra. This lets tools like PyTorch Lightning create their progress bars properly.
- Parsing a dataset no longer requires reading essentially the whole first file. The schema is now inferred from the first 128 spectra.
- Significant updates to documentation, including how to model mass spectra.
- Reading and writing from cloud storage on everything!
- Migrated to Mike for mkdocs to manage multiple versions.
- Moved test GitHub Action from pip to uv.
We have completely reworked the data module. Depthcharge now uses Apache Arrow-based formats instead of HDF5; spectra are either converted to Parquet or streamed with PyArrow, optionally into Lance datasets.
We now also have full support for small molecules, with the MoleculeTokenizer,
AnalyteTransformerEncoder, and AnalyteTransformerDecoder classes.
- `PeptideTransformer*` are now `AnalyteTransformer*`, providing full support for small molecule analytes. Additionally, the interface has been completely reworked.
- Mass spectrometry data parsers now function as iterators, yielding batches of spectra as `pyarrow.RecordBatch` objects.
- Parsers can now be told to read arbitrary fields from their respective file formats with the `custom_fields` parameter.
- The parsing functionality of `SpectrumDataset` and its subclasses has been moved to the `spectra_to_*` functions in the data module.
- `SpectrumDataset` and its subclasses now return dictionaries of data rather than a tuple of data. This allows us to incorporate arbitrary additional data.
- `SpectrumDataset` and its subclasses are now `lance.torch.data.LanceDataset` subclasses, providing native PyTorch integration.
- All dataset classes no longer have a `loader()` method.
- Support for small molecules.
- Added the `StreamingSpectrumDataset` for fast inference.
- Added `spectra_to_df`, `spectra_to_df`, `spectra_to_stream` to the `depthcharge.data` module.
- Determining the mass spectrometry data file format is now less fragile. It now looks for known line contents, rather than relying on the extension.
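
As a rough illustration of content-based format detection, the sketch below scans the first lines of a file for well-known markers. The function name `sniff_format`, the marker table, and the 50-line limit are assumptions for this example and are not taken from Depthcharge's code.

```python
def sniff_format(path, max_lines=50):
    """Guess a mass spectrometry file format from its first lines.

    Hypothetical helper: the markers are common format signatures
    (the MGF "BEGIN IONS" line and the mzML/mzXML root elements).
    """
    markers = {
        "mgf": "BEGIN IONS",
        "mzml": "<mzML",
        "mzxml": "<mzXML",
    }
    with open(path, errors="replace") as handle:
        # zip() stops after max_lines, so large files are never fully read.
        for _, line in zip(range(max_lines), handle):
            for fmt, marker in markers.items():
                if marker in line:
                    return fmt
    raise ValueError(f"Could not determine the file format of {path!r}")
```

Because only the contents are inspected, a file named `spectra.txt` that starts with `BEGIN IONS` would still be recognized as MGF.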
- Support for fine-tuning the wavelengths used for encoding floating point numbers like m/z and intensity to the `FloatEncoder` and `PeakEncoder`.
- The `tgt_mask` in the `PeptideTransformerDecoder` was the incorrect type. Now it is `bool` as it should be. Thanks @justin-a-sanders!
- Providing a proper tokenization class (also resolves #24 and #18)
- First-class support for ProForma peptide annotations, thanks to `spectrum_utils` and `pyteomics`.
- Adding primitive dataclasses for peptides, peptide ions, mass spectra ... and even small molecules 🚀
- Adding type hints to everything and stricter linting with Ruff.
- Adding a ton of tests.
- Tight integration with `spectrum_utils` 💪
- Moving preprocessing onto parsing instead of data loading (similar to @bittremieux's proposal in #31)
- Combining the SpectrumIndex and SpectrumDataset classes into one.
- Changing peak encodings. Instead of encoding the intensity using a linear projection and summing with the sinusoidal m/z encodings, now the intensity is also sinusoidally encoded and is combined with the sinusoidal m/z encodings using a linear layer.
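
The new peak encoding scheme can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the `PeakEncoder` implementation: the function names, the geometric wavelength spacing, the default wavelength range, and the use of concatenation before the linear layer are all hypothetical choices.

```python
import math


def sinusoidal_encode(value, d_model=8, min_wavelength=0.001, max_wavelength=10000.0):
    """Encode one float as d_model sine and cosine features (illustrative only)."""
    if d_model % 2:
        raise ValueError("d_model must be even in this sketch")
    half = d_model // 2
    ratio = max_wavelength / min_wavelength
    # Assumption: wavelengths are spaced geometrically between the two bounds.
    wavelengths = [min_wavelength * ratio ** (i / max(half - 1, 1)) for i in range(half)]
    angles = [2 * math.pi * value / w for w in wavelengths]
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]


def encode_peak(mz, intensity, weights, bias):
    """Combine the m/z and intensity encodings with a linear layer.

    `weights` has one row per output feature and one column per input
    feature (the two encodings placed side by side).
    """
    features = sinusoidal_encode(mz) + sinusoidal_encode(intensity)
    return [sum(w * x for w, x in zip(row, features)) + b for row, b in zip(weights, bias)]


# A fixed toy weight matrix standing in for learned parameters.
toy_weights = [[0.1] * 16 for _ in range(8)]
peak_vector = encode_peak(500.25, 0.8, toy_weights, [0.0] * 8)
```

In contrast to the previous scheme, the intensity is not linearly projected and summed with the m/z encoding; both values receive a sinusoidal encoding and the linear layer learns how to mix them.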
- Applied hotfix from v0.3.1
- Fixed retrieving version information.
- Change target mask from float to boolean.
- Log the number of spectra that are skipped due to an invalid precursor charge.
- Dropped pytorch-lightning as a dependency.
- Removed SpectrumDataModule
- Removed full-blown models (depthcharge.models)
- Fixed sinusoidal encoders (Issue #27)
- `MassEncoder` is now `FloatEncoder`, because it's generally useful for encoding floating-point numbers.
- pre-commit hooks and linting with Ruff.
- Tensorboard is now an optional dependency.
- The example de novo peptide sequencing model.
- The `detokenize()` method now returns a list instead of a string.
- This is the first release! All changes from this point forward will be recorded in this changelog.