All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Renamed `annbatch.Loader.add_anndatas` to {meth}`annbatch.Loader.add_adatas`.
- Renamed `annbatch.Loader.add_anndata` to {meth}`annbatch.Loader.add_adata`.
- The `sparse_chunk_size`, `sparse_shard_size`, `dense_chunk_size`, and `dense_shard_size` parameters of {func}`annbatch.write_sharded` have been replaced by `n_obs_per_chunk` (number of observations per chunk, automatically converted to element counts for sparse arrays) and `shard_size` (number of observations per shard, or a size string). The corresponding parameters in {meth}`annbatch.DatasetCollection.add_adatas` are `n_obs_per_chunk` and `shard_size`.
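The observation-to-element conversion for sparse arrays mentioned above could look roughly like the following sketch. The function name and the mean-nnz-per-row heuristic are assumptions for illustration, not annbatch's actual implementation:

```python
def sparse_chunk_elements(n_obs_per_chunk: int, nnz: int, n_obs_total: int) -> int:
    """Scale a per-observation chunk size into a count of stored elements.

    Sparse data/indices arrays are chunked in elements, not rows, so a
    row-based chunk size must be scaled by the average row density.
    """
    mean_nnz_per_row = nnz / n_obs_total
    return max(1, round(n_obs_per_chunk * mean_nnz_per_row))

# e.g. a matrix with 1,000,000 rows and 50,000,000 stored values
# averages 50 elements per row, so 4096 rows -> 204800 elements:
print(sparse_chunk_elements(4096, 50_000_000, 1_000_000))  # 204800
```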
- Formatted progress bar descriptions to be more readable.
- {class}`annbatch.DatasetCollection` now accepts an `rng` argument to the {meth}`annbatch.DatasetCollection.add_adatas` method.
- `shard_size` in {meth}`annbatch.DatasetCollection.add_adatas` and `shard_size` in {func}`annbatch.write_sharded` now accept a human-readable size string (e.g. `'1GB'`, `'512MB'`) in addition to an integer number of observations. When a string is provided, the observation count is derived independently for each array element from its uncompressed bytes-per-row so that every shard stays close to the target size.
- `dataset_size` in {meth}`annbatch.DatasetCollection.add_adatas` now accepts a human-readable size string (e.g. `'20GB'`, `'512MB'`) in addition to an integer number of observations. When a string is provided, the per-row byte size is estimated from the on-disk metadata of the input datasets during validation and used to derive the observation count. The default has changed from `2_097_152` to `'20GB'`.
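How a size string can be turned into an observation count from a bytes-per-row estimate can be sketched as below. The parser, the unit table, and both function names are assumptions for illustration, not annbatch's actual code:

```python
# Hypothetical helpers sketching the size-string -> observation-count logic.
_UNITS = {"KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}

def parse_size(size: str) -> int:
    """Parse e.g. '512MB' or '20GB' into a byte count (binary units assumed)."""
    for suffix, factor in _UNITS.items():
        if size.upper().endswith(suffix):
            return int(float(size[: -len(suffix)]) * factor)
    raise ValueError(f"unrecognized size string: {size!r}")

def obs_per_shard(shard_size: str, bytes_per_row: int) -> int:
    """Derive a per-array observation count so shards stay near the target size."""
    return max(1, parse_size(shard_size) // bytes_per_row)

# A float32 dense array with 2,000 columns occupies 8,000 bytes per row,
# so a '1GB' shard holds 1024**3 // 8000 = 134217 observations:
print(obs_per_shard("1GB", 2_000 * 4))  # 134217
```

Because the count is derived per array element, a dense `X` and a sparse layer in the same dataset would end up with different observation counts per shard, each near the byte target.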
- {class}`~annbatch.Loader` now accepts an `rng` argument.
- Made the in-memory concatenation strategy configurable for {meth}`annbatch.Loader.__iter__` via a `concat_strategy` argument to `__init__`: on-disk sparse data will be concatenated and then shuffled/yielded (faster, higher memory usage), while dense data will be shuffled and then concatenated/yielded (lower memory usage).
- Downcast `indices` of sparse matrices if possible when writing to disk via {attr}`anndata.settings.write_csr_csc_indices_with_min_possible_dtype`.
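The `concat_strategy` tradeoff (concatenate-then-shuffle vs. shuffle-then-concatenate) can be sketched with plain Python lists; the function names here are hypothetical, and annbatch operates on sparse/dense arrays rather than lists:

```python
import random

def concat_then_shuffle(chunks: list[list[int]], rng: random.Random) -> list[int]:
    """Materialize one concatenated buffer first, then shuffle it in place.
    Fast, but briefly holds the full concatenated copy in memory."""
    buffer = [x for chunk in chunks for x in chunk]
    rng.shuffle(buffer)
    return buffer

def shuffle_then_concat(chunks: list[list[int]], rng: random.Random) -> list[int]:
    """Shuffle a lightweight (chunk, row) index first, then gather rows.
    Avoids building an unshuffled concatenated copy as an intermediate."""
    index = [(i, j) for i, chunk in enumerate(chunks) for j in range(len(chunk))]
    rng.shuffle(index)
    return [chunks[i][j] for i, j in index]

chunks = [[0, 1, 2], [3, 4], [5, 6, 7]]
print(sorted(shuffle_then_concat(chunks, random.Random(0))))  # same elements, new order
```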
- Don't concatenate all I/O-ed chunks in memory; instead, yield from individual chunks as though they were concatenated (i.e., not a breaking change with respect to the {class}`annbatch.abc.Sampler` API). This should improve memory performance, especially for dense data.
- Fixed a bug with bringing nullable/categorical columns into memory by default.
- {class}`annbatch.Loader` now expects `preload_nchunks * chunk_size % batch_size == 0` for simplification and efficiency.
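A minimal sketch of the divisibility expectation above; the parameter names come from the changelog, but this standalone check is an illustration, not annbatch's internal validation:

```python
def check_batch_alignment(preload_nchunks: int, chunk_size: int, batch_size: int) -> None:
    """Raise if preloaded data cannot be split into whole batches."""
    total = preload_nchunks * chunk_size
    if total % batch_size != 0:
        raise ValueError(
            f"preload_nchunks * chunk_size ({total}) must be divisible "
            f"by batch_size ({batch_size})"
        )

# 8 preloaded chunks of 256 rows = 2048 rows, an exact multiple of 512:
check_batch_alignment(preload_nchunks=8, chunk_size=256, batch_size=512)
```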
- Introduced an {class}`annbatch.abc.Sampler` abstract base class. Users can implement and pass any subclass instance to the `batch_sampler` argument of {class}`annbatch.Loader`.
- Exposed the older default sampling scheme as {class}`annbatch.ChunkSampler`, which is used internally to match the older behavior when `batch_sampler` isn't provided to {class}`annbatch.Loader`.
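A custom sampler along these lines might look as follows. The abstract interface shown (a single `__iter__` yielding index batches) is an assumption for illustration; check {class}`annbatch.abc.Sampler` for the actual required methods:

```python
from abc import ABC, abstractmethod
from collections.abc import Iterator

# Hypothetical stand-in for annbatch.abc.Sampler; the real ABC may differ.
class Sampler(ABC):
    @abstractmethod
    def __iter__(self) -> Iterator[list[int]]:
        """Yield lists of row indices, one list per batch."""

class EveryOtherRowSampler(Sampler):
    """Toy sampler that batches only the even-numbered rows."""

    def __init__(self, n_obs: int, batch_size: int):
        self.indices = list(range(0, n_obs, 2))
        self.batch_size = batch_size

    def __iter__(self) -> Iterator[list[int]]:
        for start in range(0, len(self.indices), self.batch_size):
            yield self.indices[start : start + self.batch_size]

print(list(EveryOtherRowSampler(n_obs=10, batch_size=3)))  # [[0, 2, 4], [6, 8]]
```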
- Load nullables/categoricals from `obs` into memory by default when shuffling (i.e., when no custom `load_adata` argument is passed to `annbatch.DatasetCollection.add_adatas`).
- Reverted `h5ad` shuffling into one big store (i.e., went back to sharding into individual files) and added a warning that `h5ad` is not fully supported by `annbatch`. The `is_collection_h5ad` argument must now be passed when initializing {class}`annbatch.DatasetCollection` in order to use a preshuffled collection of `h5ad` files, whether reading or writing.
- Renamed {class}`annbatch.types.LoaderOutput` keys `["labels"]` and `["data"]` to `["obs"]` and `["X"]` respectively.
- `ZarrSparseDataset` and `ZarrDenseDataset` have been consolidated into {class}`annbatch.Loader`.
- `create_anndata_collection` and `add_to_collection` have been moved into the `annbatch.DatasetCollection.add_adatas` method.
- Default reading of input data is now fully lazy in `annbatch.DatasetCollection.add_adatas`; the shuffle process may therefore be slower, but with better memory properties. Use the `load_adata` argument of `annbatch.DatasetCollection.add_adatas` to customize this behavior.
- Files shuffled under the old `create_anndata_collection` will not be recognized by {class}`annbatch.DatasetCollection` and are therefore not usable with the new {class}`annbatch.Loader.use_collection` API. At the moment, the file metadata we maintain is only for internal purposes; however, if you wish to migrate so that you can use {class}`annbatch.DatasetCollection` in conjunction with {class}`annbatch.Loader.use_collection`, the root folder of the old collection must have the attrs `{"encoding-type": "annbatch-preshuffled", "encoding-version": "0.1.0"}` and be a {class}`zarr.Group`. The subfolders (i.e., datasets) must be named `dataset_([0-9]*)`. Otherwise you can use `annbatch.DatasetCollection.add_adatas` as before.
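The migrated layout described above can be sanity-checked with a short stdlib-only sketch; `check_collection_layout` is a hypothetical helper, not an annbatch API, and it inspects plain dicts/lists rather than an actual `zarr.Group`:

```python
import re

# Expected root attrs and subfolder naming scheme, as documented above.
EXPECTED_ATTRS = {"encoding-type": "annbatch-preshuffled", "encoding-version": "0.1.0"}
DATASET_NAME = re.compile(r"dataset_([0-9]*)")

def check_collection_layout(attrs: dict[str, str], subfolders: list[str]) -> bool:
    """Return True if root attrs and subfolder names match the migration scheme."""
    if {k: attrs.get(k) for k in EXPECTED_ATTRS} != EXPECTED_ATTRS:
        return False
    return all(DATASET_NAME.fullmatch(name) for name in subfolders)

print(check_collection_layout(EXPECTED_ATTRS, ["dataset_0", "dataset_1"]))  # True
print(check_collection_layout(EXPECTED_ATTRS, ["old_shard_0"]))             # False
```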
- `preload_to_gpu` now depends on whether `cupy` is installed, instead of defaulting to `True`.
- First release