Releases: argonne-lcf/dlio_benchmark
Release v2.0.1
What's Changed
- Fixed mocking for DFTracer by @hariharan-devarajan in #220
- Fixed iterator to only store data for that rank. by @hariharan-devarajan in #216
- Fix PyPI Publish Issue and Improve Project Metadata by @izzet in #224
- Fix missing import for chunking. by @hariharan-devarajan in #223
- Improve CI Performance. by @hariharan-devarajan in #227
- For sample indexing we fix the uneven sampling by @hariharan-devarajan in #226
- fix misleading generator message by @rayandrew in #231
- fix negative value of computation time when stdev exists by @rayandrew in #233
- Bugfix: fix type of number for offset and size by @hariharan-devarajan in #229
- Fix wrong configuration for hdf5 chunking by @rayandrew in #237
- fix last step is not executed by @rayandrew in #236
- fix wrong tracing location of fetch data by @rayandrew in #238
- enable option to disable pin_memory in pytorch by @rayandrew in #239
- Change max to abs for preprocess time by @rayandrew in #240
- New improved modelling for LLM Deepspeed. by @hariharan-devarajan in #230
- Add user config to specify type of distribution of time configuration by @rayandrew in #241
- fixed bug on doc action by @zhenghh04 in #251
- Update jekyll-gh-pages.yml by @zhenghh04 in #254
- upgrade pydftracer package by @rayandrew in #242
- Support for setting different DLIO_LOG_LEVEL by @zhenghh04 in #222
- Enhancing metric calculation and output functionality by @zhenghh04 in #253
- Checkpointing support for transformer type models by @zhenghh04 in #247
- Update jekyll-gh-pages.yml by @zhenghh04 in #257
- Fix doc deployment issue by @zhenghh04 in #258
- left over logging fix by @zhenghh04 in #259
- docker: use pip install to match readme by @glimchb in #265
- ci: add docker build and publish by @glimchb in #263
- Darshan preload environment variable removed by @zhenghh04 in #260
- Copyright update by @zhenghh04 in #261
- Update docker.yml by @zhenghh04 in #266
- ci: also publish docker image on releases by @glimchb in #267
- Fix saving checkpoint print by @LouisDDN in #270
- docs: small readme typo by @glimchb in #268
- Fixed loading checkpoint timer by @zhenghh04 in #273
- Refactor: move pydftracer dependency to extras for better management by @hariharan-devarajan in #275
- Enhancement for checkpoint feature by @zhenghh04 in #276
- Separate read and write checkpoints. by @zhenghh04 in #278
- configs by @zhenghh04 in #284
- docker: improve docker cache and remove sources by @glimchb in #287
- Fixes for v2.0 benchmark by @johnugeorge in #289
- Reorganized the code provided by YardenMa for O_DIRECT support with NPY and NPZ formats and pytorch by @timothy-chau in #286
- RAM optimisations for checkpointing by @LouisDDN in #283
- Randomize tensor data by default (checkpoint) by @LouisDDN in #291
- S3 Fix by @zhenghh04 in #294
- Mlperf storage v2.0 by @zhenghh04 in #303
- Dimension-based Dataset Generation by @rayandrew in #301
- docs(profiling): fix dftracer repo location by @glimchb in #304
- Add DFTracer AI logging support with dftracer by @rayandrew in #302
- increase tests timeout to 600s (10 minutes) by @rayandrew in #312
New Contributors
- @rayandrew made their first contribution in #231
- @glimchb made their first contribution in #265
- @timothy-chau made their first contribution in #286
Full Changelog: v2.0.0...v2.0.1
Release v2.0.0
What's Changed
- Add docker image with CPU only dependencies by @johnugeorge in #8
- Add dlio fixes by @johnugeorge in #10
- Fixed issues related to checkpointing and profiling by @zhenghh04 in #13
- Config parameters fixes by @johnugeorge in #11
- Fixing folder number for evaluation by @johnugeorge in #14
- fixed checkpoint issues by @zhenghh04 in #16
- Adding PR unit tests for testing different data format and fixing issues for reading png and jpeg with pytorch data folder. by @zhenghh04 in #17
- A bunch of minor fixes by @zhenghh04 in #18
- Minor fixes by @zhenghh04 in #22
- Add ckpting to UNET3D workload, remove old prefetch param by @lhovon in #23
- Minor modification of configuration options to remove some confusion by @zhenghh04 in #25
- Adding Storage interface for supporting multiple storage backends by @johnugeorge in #20
- Code Fixes by @johnugeorge in #26
- Add the UNET3D sleep time for V100 32GB batch size 4 by @lhovon in #29
- Minor config changes by @johnugeorge in #31
- Make hydra config folder configurable by @johnugeorge in #32
- Mlperf storage v0.5 by @zhenghh04 in #33
- Changes to support segregation of data loader and reader by @hariharan-devarajan in #37
- Added application-level profile support for DLIO by @hariharan-devarajan in #39
- Multithreading issue with TensorFlow and PyTorch dataloader by @hariharan-devarajan in #44
- bug fix to free memory once file is completely read by @hariharan-devarajan in #51
- Pull changes from mlperf_storage_v0.5.1 by @zhenghh04 in #52
- Improved tracing utility added preprocessing support by @zhenghh04 in #53
- Trace improvement. by @hariharan-devarajan in #48
- Moved resize image to config by @zhenghh04 in #55
- instead of using direct methods using enter and exit. by @hariharan-devarajan in #54
- Reorganizing output files by @zhenghh04 in #56
- Generator fixed random seed by @zhenghh04 in #58
- Merging branch mlperf_storage_v0.5.1 by @zhenghh04 in #57
- fixing mistakes in calculating total number of steps by @zhenghh04 in #59
- Mlperf storage v0.5.1 by @zhenghh04 in #60
- Added support for Dali data loader by @hariharan-devarajan in #49
- Changed datatype to be np.uint8 universally in the call by @zhenghh04 in #61
- Adding support for training on a subset of dataset by @zhenghh04 in #63
- DLIO profiler integration by @hariharan-devarajan in #62
- Added Support Power9PC by @hariharan-devarajan in #65
- Update unet3d.yaml to correct the sample size for unet3d by @zhenghh04 in #68
- For X86 and AMD machines, we can create a pip based dlio installations by @hariharan-devarajan in #66
- Added validation to check enough core available for reading by @hariharan-devarajan in #73
- Added custom plugin code for custom data loader and reader. by @hariharan-devarajan in #74
- Changes required within DLIO Benchmark for creating a pip wheel by @hariharan-devarajan in #77
- Update bert.yaml to be consistent with mlperf storage by @zhenghh04 in #79
- Fixing subfolder issues and added subset tests by @zhenghh04 in #82
- Documentation: Instructions to compile and run on Lassen machine. by @OlgaKogiou in #85
- Changes to improve documentation by @hariharan-devarajan in #89
- Fixed dali data loader execution. by @hariharan-devarajan in #91
- Enhancing Dali data loader support by @zhenghh04 in #94
- Fixing Dali Data loader Parallelism and Pipelining. by @hariharan-devarajan in #93
- Update typo which gives issue for pytorch 1.3.1 by @hariharan-devarajan in #103
- Added documentation for the JPEG generator issue by @kaushikvelusamy in #100
- Workloads by @zhenghh04 in #97
- Added Info logging for profiler and removed unnecessary bracket calls. by @hariharan-devarajan in #104
- Fix the data dir path by @hariharan-devarajan in #108
- Making DLIO Profiler default for dlio_benchmark. by @hariharan-devarajan in #111
- Adding dlp logger. by @hariharan-devarajan in #109
- Workloads by @zhenghh04 in #112
- fixed readthedoc build issue by @zhenghh04 in #115
- fix Docker file to use venv. by @hariharan-devarajan in #119
- Switch dlio_profiler to use pypi instead of github by @hariharan-devarajan in #120
- Added force install for profiler for avoiding caching issues by @hariharan-devarajan in #123
- Update README.md by @venkat-1 in #121
- torch checkpoint creation should use storage class methods by @krehm in #126
- Reducing Github actions time by @zhenghh04 in #128
- Create output_folder using os.makedirs() by @krehm in #124
- Adding Native Dali Data Loader support for TFRecord, Images, and NPZ files by @zhenghh04 in #118
- Add support for pytorch spawn and forkserver multiprocessing_context by @krehm in #129
- Reopen dlio.log in non-fork reader_threads child processes by @krehm in #130
- added checkpointing to support LLMs by @hariharan-devarajan in #114
- added dlp for spawned workers pytorch by @hariharan-devarajan in #136
- Fix MPI finalization. by @hariharan-devarajan in #139
- Adding dlio_profiler to requirements.txt by @johnugeorge in #144
- Fix dataloader initialization to only happen once. Not on every epoch. by @hariharan-devarajan in #143
- Fix random sampling pytorch non-determinism. by @hariharan-devarajan in #145
- Fixed printing for DLIO output. by @hariharan-devarajan in #142
- Doc changes to fix DLIO profiler and remove IOStat by @hariharan-devarajan in #146
- Support for custom checkpointing. by @hariharan-devarajan in #137
- Feature/parallel io generator by @hariharan-devarajan in #148
- fix random bugs and printing by @hariharan-devarajan in #147
- Release for v2.0 by @zhenghh04 in #113
- Fix requirements file by @johnugeorge in #150
- fixed sample distribution bugs by @zhenghh04 in #152
- Fix sample shuffling by @hariharan-devarajan in #154
- Optimization to sample distribution by @TheAssembler1 in https://github.com/argonne-lcf/dlio_benchmark/pull...
DLIO v1.1
This release includes the following changes and enhancements:
- Added support for S3 storage
- Updated config files for MLPerf Storage workloads: UNet3D and Bert.
- Changes to configuration options:
  - Added variability support for sample size and for training and validation computation time.
  - Changed the shuffling and prefetching settings.
  - Moved batch_size and batch_size_eval to the reader section.
This release corresponds to the MLPerf Storage v0.5 prerelease: https://github.com/mlcommons/storage/releases/tag/v0.5-rc0
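As a rough illustration of what "variability support" for computation time can mean, the sketch below draws an emulated per-step compute time from a normal distribution with a given mean and standard deviation. The function name, its parameters, and the clamping at zero are assumptions made for this example; they are not taken from DLIO's code, and DLIO may handle negative draws differently.

```python
import random


def sample_computation_time(mean, stdev=0.0, rng=random):
    """Draw one emulated computation time in seconds (hypothetical helper)."""
    if stdev <= 0.0:
        # No variability requested: every step takes exactly `mean` seconds.
        return mean
    # A normal draw can be negative when stdev is large relative to mean,
    # so this sketch clamps the result at zero (an assumption, see above).
    return max(0.0, rng.gauss(mean, stdev))


# Usage: a seeded generator keeps the sequence reproducible across runs.
rng = random.Random(0)
times = [sample_computation_time(1.36, 0.5, rng) for _ in range(1000)]
```

Passing a seeded `random.Random` instance rather than relying on the module-level generator keeps the sampled times repeatable, which matters when comparing benchmark runs.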
DLIO v1.0
DLIO v1.0 Release Notes
We are excited to announce the release of DLIO 1.0! It brings many new features and enhancements compared to the previous 0.0.1 version:
- YAML files are now used to configure DLIO through the Hydra (hydra.cc) framework. The configuration options are organized hierarchically into sections including model, framework, workflow, dataset, train, evaluation, checkpoint, and profiling. A set of YAML files for some workloads is included.
- Data loader support enhancement:
  - Added a data loader layer above the data format so that users can choose the data loader and the data format independently.
  - Added PyTorch data loader support, with full support for datasets that store one sample per file.
  - Enhanced the TensorFlow tf.data loader to support generic file formats beyond tfrecord (currently only the one-sample-per-file case is supported for generic formats).
- New dataset support
  - Added support for png and jpeg formats.
  - Added support for multiple subfolders in the training and validation datasets.
  - Added support for generating a validation dataset.
- Profiling and logging
  - Added support for iostat profiling
  - Added detailed logging info
- Added support for validation.
- Added post processing python script
- Added unit tests and GitHub Actions tests.
- User and developer documentation on github.io: https://argonne-lcf.github.io/dlio_benchmark
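The hierarchical configuration described in these notes can be pictured as a nested mapping with one entry per section. The sketch below is only an illustration in plain Python: the top-level section names come from the notes above, while every key and value inside them, and the `apply_override` helper, are hypothetical and do not reproduce DLIO's or Hydra's actual implementation.

```python
import copy

# Top-level section names are taken from the release notes; the keys and
# values inside each section are hypothetical placeholders.
BASE_CONFIG = {
    "model": "unet3d",
    "framework": "pytorch",
    "workflow": {"generate_data": False, "train": True},
    "dataset": {"format": "npz", "num_files_train": 8},
    "train": {"epochs": 1},
    "evaluation": {"epochs_between_evals": 1},
    "checkpoint": {"steps_between_checkpoints": 10},
    "profiling": {"enabled": False},
}


def apply_override(config, dotted_key, value):
    """Return a copy of config with the entry at 'section.key' replaced."""
    result = copy.deepcopy(config)
    node = result
    *parents, leaf = dotted_key.split(".")
    for name in parents:
        node = node[name]  # walk down one section per path component
    if not isinstance(node, dict) or leaf not in node:
        raise KeyError(dotted_key)  # refuse to create unknown options
    node[leaf] = value
    return result


# Usage: change one option by its dotted path without touching the base.
updated = apply_override(BASE_CONFIG, "train.epochs", 5)
```

Returning a modified copy instead of mutating the base mapping keeps the shipped defaults intact, so several overridden variants of one workload can coexist in the same process.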