diff --git a/Rules.md b/Rules.md
new file mode 100644
index 0000000..e15c5c7
--- /dev/null
+++ b/Rules.md
@@ -0,0 +1,431 @@
+# MLPerf™ Storage V2.0 Benchmark Validation Rules
+——————————————————————————————————————————
+
+- [MLPerf Storage V2.0 Benchmark Validation Rules](#mlperf-storage-v20-benchmark-validation-rules)
+  - [1. Introduction](#1-introduction)
+  - [2. Directory Structure for All Submissions](#2-directory-structure-for-all-submissions)
+  - [3. Validating the Training Workloads](#3-validating-the-training-workloads)
+    - [3.1. Datasize Options](#31-datasize-options)
+    - [3.2. Datagen Options](#32-datagen-options)
+    - [3.3. Run Options](#33-run-options)
+  - [4. Validating the Checkpointing Workloads](#4-validating-the-checkpointing-workloads)
+    - [4.1. Benchmark Run Options](#41-benchmark-run-options)
+    - [4.2. Storage System Must Be Simultaneously R/W or Remappable](#42-storage-system-must-be-simultaneously-rw-or-remappable)
+
+
+# 1. Introduction
+
+These are the requirements for the *submission validation checker* for version 2.0 of the MLPerf™ Storage benchmark,
+but since the `mlpstorage` tool will be responsible for generating the vast majority (if not all) of the contents of a submission, it is also a spec for what `mlpstorage` should generate.
+
+The *submission validation checker* should check that the tested directory hierarchy matches the below requirements and output messages for all cases where it does not match.
+The tool should make its best effort to continue testing all the other aspects of the directory hierarchy after any given failure.
+If the tested directory hierarchy does not meet all of the below requirements, then it should be labelled as invalid and the validation check should fail.
+
+Even if the structure of a submission package matches the spec, the options that were used to run the benchmark may not fall within acceptable bounds,
+so we need the *submission validation checker* to check for illegal/inappropriate option settings,
+and for semantic mismatches between the different options that were used.
+
+The `mlpstorage` tool must be used to run the benchmarks; submitters are not allowed to run the underlying tools (eg: DLIO) directly to generate a submission package.
+
+1.1. **mlpstorageGeneratesHierarchy** -- The `mlpstorage` command must obtain (somehow) the pathname of the output file directory hierarchy and directly create and/or append to the files within that hierarchy to successively build out the submission folder. We don't want the submitter to manually create anything in that hierarchy except for the SystemDescription.* files (if we can help it).
+
+# 2. Directory Structure for All Submissions
+
+2.1. **submitterRootDirectory** -- The submission structure must start from a single directory whose name is the name of the submitter. This can be any string, but a blank or any other character that cannot appear in a POSIX filename must be replaced 1-for-1 with a dash character, as sketched just below.
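+
+For illustration, a minimal sketch of that 1-for-1 replacement, assuming the allowed characters are the POSIX portable filename character set (the function name is illustrative, not part of the spec):
+```python
+import re
+
+def sanitize_submitter_name(name: str) -> str:
+    """Replace, 1-for-1, every character that cannot appear in a POSIX
+    filename (assumed here to be anything outside the POSIX portable
+    filename character set) with a dash."""
+    return re.sub(r'[^A-Za-z0-9._-]', '-', name)
+```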
+
+2.2. **topLevelSubdirectories** -- Within the top-level directory of the submission structure there must be a directory named "closed" and/or one named "open", and nothing more. These names are case-sensitive.
+
+2.3. **openMatchesClosed** -- The "open" directory hierarchy should be constructed identically to the "closed" directory hierarchy described just below.
+
+2.4. **closedSubmitterDirectory** -- Within the "closed" directory there must be a single directory whose name is the name of the submitter (the same as the top-level directory).
+
+2.5. **requiredSubdirectories** -- Within the submitter directory mentioned just above, there must be exactly three directories: "code", "results", and "systems". These names are case-sensitive.
+
+2.6. **codeDirectoryContents** -- The "code" directory must include a complete copy of the MLPerf Storage GitHub repo that was used to run the test that resulted in the "results" directory's contents.
+If this is in the "open" hierarchy, any modifications made to the benchmark code must be included here, and if this is in the "closed" hierarchy, there must be no changes to the benchmark code.
+Note that in both cases this must be the code that was actually run to generate those results. In a CLOSED submission, the *submission validator* should do an md5sum of the code directory hierarchy, compare that to a value hard-coded into the validator code, and fail the validation if there is a difference (one way to compute such a checksum is sketched just below).
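+
+A minimal sketch of a stable tree checksum, assuming the check hashes each file's relative path and contents in sorted order (the exact traversal and hashing recipe would need to be pinned down in the validator):
+```python
+import hashlib
+import os
+
+def md5_of_tree(root: str) -> str:
+    """Hash every file's relative path and contents in sorted order so
+    the digest is stable across platforms and traversal orders."""
+    digest = hashlib.md5()
+    for dirpath, dirnames, filenames in os.walk(root):
+        dirnames.sort()  # deterministic traversal
+        for name in sorted(filenames):
+            full = os.path.join(dirpath, name)
+            digest.update(os.path.relpath(full, root).encode())
+            with open(full, 'rb') as f:
+                digest.update(f.read())
+    return digest.hexdigest()
+
+# EXPECTED_MD5 stands in for the value hard-coded into the validator:
+# assert md5_of_tree('code') == EXPECTED_MD5
+```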
+
+2.7. **systemsDirectoryFiles** -- The "systems" directory must contain two files for each "system name", a .yaml file and a .pdf file, and nothing more. Each of those files must be named with the "system name".
+Eg: for a system-under-test named "Big_and_Fast_4000_buffered", there must be a "Big_and_Fast_4000_buffered.yaml" and a "Big_and_Fast_4000_buffered.pdf" file. These names are case-sensitive.
+
+2.8. **resultsDirectorySystems** -- The "results" directory, whether it is within the "closed" or "open" hierarchy, must include one or more directories that are the names of the systems-under-test. Eg: a system name could be "Big_and_Fast_4000_buffered".
+This name can be anything the submitter wants; it is just a name that both identifies the set of results that were collected from a given
+configuration of storage system and links those results to the .pdf and .yaml files that describe the system-under-test.
+
+2.9. **identicalSystemConfig** -- All the configuration parameters and hardware and software components of the system-under-test that are part of a given *system name* must be identical. Any changes to those configuration parameters or hardware or software must be submitted as a separate *system name*. We should therefore compare the configuration parameters and hardware and software components, to the extent that we can, to verify that they are the same across all the tests and runs within the given *system name* directory hierarchy. The *system names* are case-sensitive.
+
+2.10. **workloadCategories** -- Within a *system name* directory in the "results" directory, there must be one or both of the following directories, and nothing else: "training" and/or "checkpointing". These names are case-sensitive.
+
+2.11. **trainingWorkloads** -- Within the "training" directory, there must be one or more of the following *workload directories*, and nothing else: "unet3d", "resnet50", and/or "cosmoflow". These names are case-sensitive.
+
+2.12. **trainingPhases** -- Within the *workload directories* in the "training" hierarchy, there must exist *phase directories* named "datagen" and "run", and nothing else. These names are case-sensitive.
+
+2.13. **datagenTimestamp** -- Within the "datagen" *phase directory* within the "training" directory hierarchy, there must be exactly one *timestamp directory* named "YYYYMMDD_HHmmss" that represents a *timestamp* of when that part of the test run was completed, where the Y's are replaced with the year the run was performed, the M's with the month, the D's with the day, the H's with the hour (in 24-hour format), the m's with the minute, and the s's with the second.
+The timestamps should be relative to the local timezone where the test was actually run. (A parsing sketch follows just below.)
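+
+A minimal sketch of parsing and validating such a *timestamp directory* name (the function name is illustrative, not part of the spec):
+```python
+import re
+from datetime import datetime
+
+TS_PATTERN = re.compile(r'^\d{8}_\d{6}$')
+
+def parse_timestamp_dir(name: str) -> datetime:
+    """Return the datetime encoded in a YYYYMMDD_HHmmss directory name;
+    raise ValueError if the name does not match the required format."""
+    if not TS_PATTERN.match(name):
+        raise ValueError(f'bad timestamp directory name: {name}')
+    return datetime.strptime(name, '%Y%m%d_%H%M%S')
+```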
+
+2.14. **datagenFiles** -- Within the *timestamp directory* within the "datagen" *phase*, there must exist the following files: "training_datagen.stdout.log", "training_datagen.stderr.log", "*output.json", "*per_epoch_stats.json", "*summary.json", and "dlio.log", plus a subdirectory named "dlio_config". These names are case-sensitive.
+
+2.15. **datagenDlioConfig** -- The "dlio_config" subdirectory in each *timestamp directory* must contain the following list of files, and nothing else: "config.yaml", "hydra.yaml", and "overrides.yaml". These names are case-sensitive.
+
+2.16. **runResultsJson** -- Within the "run" *phase directory* within the "training" directory hierarchy, there must be one "results.json" file. This name is case-sensitive.
+
+2.17. **runTimestamps** -- Within the "run" *phase directory* within the "training" directory hierarchy, there must also be exactly 6 *timestamp directories* named "YYYYMMDD_HHmmss", in the same format described in 2.13, each representing when that part of the test run was completed. The timestamps should be relative to the local timezone where the test was actually run. Note that the 1st of those 6 is the *warm up* run and will not be included in the reported performance.
+
+2.18. **runTimestampGap** -- The timestamp (the day and time) represented by the name of each *timestamp directory* must be separated by less than the duration of a single *timestamp directory* from its neighboring *timestamp directories*. I.e.: the gap between a consecutive pair of *timestamp directories* must be short enough that we can be sure that there was no benchmark activity between them. (A gap-check sketch follows 2.24 below.)
+
+2.19. **runFiles** -- Within each *timestamp directory* within the "run" *phase*, there must exist the following files: "training_run.stdout.log", "training_run.stderr.log", "*output.json", "*per_epoch_stats.json", "*summary.json", and "dlio.log", plus a subdirectory named "dlio_config". These names are case-sensitive.
+
+2.20. **runDlioConfig** -- The "dlio_config" subdirectory in each *timestamp directory* must contain the following list of files, and nothing else: "config.yaml", "hydra.yaml", and "overrides.yaml". These names are case-sensitive.
+
+2.21. **checkpointingWorkloads** -- Within the "checkpointing" directory, there must be one or more of the following *workload directories*, and nothing else: "llama3-8b", "llama3-70b", "llama3-405b", and/or "llama3-1t". These names are case-sensitive.
+
+2.22. **checkpointingResultsJson** -- Within the *workload directories* within the "checkpointing" directory hierarchy, there must be one "results.json" file. This name is case-sensitive.
+
+2.23. **checkpointingTimestamps** -- Within the *workload directories* within the "checkpointing" directory hierarchy, there must also be exactly ten *timestamp directories* named "YYYYMMDD_HHmmss", in the same format described in 2.13, each representing when that part of the test run was completed. The timestamps should be relative to the local timezone where the test was actually run.
+
+2.24. **checkpointingTimestampGap** -- The timestamp (the day and time) represented by the name of each *timestamp directory* must be separated by less than the duration of a single *timestamp directory* from its neighboring *timestamp directories*. I.e.: the gap between a consecutive pair of *timestamp directories* must be short enough that we can be sure that there was no benchmark activity between them, as sketched just below.
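+
+A minimal sketch of the gap check for rules 2.18 and 2.24; how `max_gap` is derived from the run durations recorded in the logs is an assumption left open here:
+```python
+from datetime import datetime, timedelta
+
+def check_timestamp_gaps(dir_names, max_gap: timedelta):
+    """Return a list of complaints for consecutive completion timestamps
+    that are far enough apart that other benchmark activity could have
+    occurred between the runs."""
+    times = sorted(datetime.strptime(n, '%Y%m%d_%H%M%S') for n in dir_names)
+    problems = []
+    for earlier, later in zip(times, times[1:]):
+        if later - earlier >= max_gap:
+            problems.append(f'gap of {later - earlier} before {later:%Y%m%d_%H%M%S}')
+    return problems
+```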
+
+2.25. **checkpointingFiles** -- Within the *timestamp directories* within the "checkpointing" directory hierarchy, there must exist the following files: "checkpointing_run.stdout.log", "checkpointing_run.stderr.log", "*output.json", "*per_epoch_stats.json", "*summary.json", and "dlio.log", plus a subdirectory named "dlio_config". These names are case-sensitive.
+
+2.26. **checkpointingDlioConfig** -- The "dlio_config" subdirectory in each *timestamp directory* must contain the following list of files, and nothing else: "config.yaml", "hydra.yaml", and "overrides.yaml". These names are case-sensitive.
+
+2.27. **directoryDiagram** -- Pictorially, here is what this looks like. Note that the "closed" and "open" directory names are lowercase, per 2.2, and `<submitter-name>` stands for the submitter directory described in 2.4:
+```
+root_folder (or any name you prefer)
+├── closed
+│   └── <submitter-name>
+│       ├── code
+│       ├── results
+│       │   └── system-name-1
+│       │       ├── training
+│       │       │   ├── unet3d
+│       │       │   │   ├── datagen
+│       │       │   │   │   └── YYYYMMDD_HHmmss
+│       │       │   │   │       └── dlio_config
+│       │       │   │   └── run
+│       │       │   │       ├── results.json
+│       │       │   │       ├── YYYYMMDD_HHmmss
+│       │       │   │       │   └── dlio_config
+│       │       │   │       ... (5x Runs per Emulated Accelerator Type)
+│       │       │   │       └── YYYYMMDD_HHmmss
+│       │       │   │           └── dlio_config
+│       │       │   ├── resnet50
+│       │       │   │   ├── datagen
+│       │       │   │   │   └── YYYYMMDD_HHmmss
+│       │       │   │   │       └── dlio_config
+│       │       │   │   └── run
+│       │       │   │       ├── results.json
+│       │       │   │       ├── YYYYMMDD_HHmmss
+│       │       │   │       │   └── dlio_config
+│       │       │   │       ... (5x Runs per Emulated Accelerator Type)
+│       │       │   │       └── YYYYMMDD_HHmmss
+│       │       │   │           └── dlio_config
+│       │       │   └── cosmoflow
+│       │       │       ├── datagen
+│       │       │       │   └── YYYYMMDD_HHmmss
+│       │       │       │       └── dlio_config
+│       │       │       └── run
+│       │       │           ├── results.json
+│       │       │           ├── YYYYMMDD_HHmmss
+│       │       │           │   └── dlio_config
+│       │       │           ... (5x Runs per Emulated Accelerator Type)
+│       │       │           └── YYYYMMDD_HHmmss
+│       │       │               └── dlio_config
+│       │       └── checkpointing
+│       │           ├── llama3-8b
+│       │           │   ├── results.json
+│       │           │   ├── YYYYMMDD_HHmmss
+│       │           │   │   └── dlio_config
+│       │           │   ... (10x Runs for Read and Write. May be combined in a single run)
+│       │           │   └── YYYYMMDD_HHmmss
+│       │           │       └── dlio_config
+│       │           ├── llama3-70b
+│       │           │   ├── results.json
+│       │           │   ├── YYYYMMDD_HHmmss
+│       │           │   │   └── dlio_config
+│       │           │   ... (10x Runs for Read and Write. May be combined in a single run)
+│       │           │   └── YYYYMMDD_HHmmss
+│       │           │       └── dlio_config
+│       │           ├── llama3-405b
+│       │           │   ├── results.json
+│       │           │   ├── YYYYMMDD_HHmmss
+│       │           │   │   └── dlio_config
+│       │           │   ... (10x Runs for Read and Write. May be combined in a single run)
+│       │           │   └── YYYYMMDD_HHmmss
+│       │           │       └── dlio_config
+│       │           └── llama3-1t
+│       │               ├── results.json
+│       │               ├── YYYYMMDD_HHmmss
+│       │               │   └── dlio_config
+│       │               ... (10x Runs for Read and Write. May be combined in a single run)
+│       │               └── YYYYMMDD_HHmmss
+│       │                   └── dlio_config
+│       └── systems
+│           ├── system-name-1.yaml
+│           ├── system-name-1.pdf
+│           ├── system-name-2.yaml
+│           └── system-name-2.pdf
+│
+└── open
+    └── <submitter-name>
+        ├── code
+        ├── results
+        │   └── system-name-1
+        │       ├── training
+        │       │   ├── unet3d
+        │       │   │   ├── datagen
+        │       │   │   │   └── YYYYMMDD_HHmmss
+        │       │   │   │       └── dlio_config
+        │       │   │   └── run
+        │       │   │       ├── results.json
+        │       │   │       ├── YYYYMMDD_HHmmss
+        │       │   │       │   └── dlio_config
+        │       │   │       ... (5x Runs per Emulated Accelerator Type)
+        │       │   │       └── YYYYMMDD_HHmmss
+        │       │   │           └── dlio_config
+        │       │   ├── resnet50
+        │       │   │   ├── datagen
+        │       │   │   │   └── YYYYMMDD_HHmmss
+        │       │   │   │       └── dlio_config
+        │       │   │   └── run
+        │       │   │       ├── results.json
+        │       │   │       ├── YYYYMMDD_HHmmss
+        │       │   │       │   └── dlio_config
+        │       │   │       ... (5x Runs per Emulated Accelerator Type)
+        │       │   │       └── YYYYMMDD_HHmmss
+        │       │   │           └── dlio_config
+        │       │   └── cosmoflow
+        │       │       ├── datagen
+        │       │       │   └── YYYYMMDD_HHmmss
+        │       │       │       └── dlio_config
+        │       │       └── run
+        │       │           ├── results.json
+        │       │           ├── YYYYMMDD_HHmmss
+        │       │           │   └── dlio_config
+        │       │           ... (5x Runs per Emulated Accelerator Type)
+        │       │           └── YYYYMMDD_HHmmss
+        │       │               └── dlio_config
+        │       └── checkpointing
+        │           ├── llama3-8b
+        │           │   ├── results.json
+        │           │   ├── YYYYMMDD_HHmmss
+        │           │   │   └── dlio_config
+        │           │   ... (10x Runs for Read and Write. May be combined in a single run)
+        │           │   └── YYYYMMDD_HHmmss
+        │           │       └── dlio_config
+        │           ├── llama3-70b
+        │           │   ├── results.json
+        │           │   ├── YYYYMMDD_HHmmss
+        │           │   │   └── dlio_config
+        │           │   ... (10x Runs for Read and Write. May be combined in a single run)
+        │           │   └── YYYYMMDD_HHmmss
+        │           │       └── dlio_config
+        │           ├── llama3-405b
+        │           │   ├── results.json
+        │           │   ├── YYYYMMDD_HHmmss
+        │           │   │   └── dlio_config
+        │           │   ... (10x Runs for Read and Write. May be combined in a single run)
+        │           │   └── YYYYMMDD_HHmmss
+        │           │       └── dlio_config
+        │           └── llama3-1t
+        │               ├── results.json
+        │               ├── YYYYMMDD_HHmmss
+        │               │   └── dlio_config
+        │               ... (10x Runs for Read and Write. May be combined in a single run)
+        │               └── YYYYMMDD_HHmmss
+        │                   └── dlio_config
+        └── systems
+            ├── system-name-1.yaml
+            ├── system-name-1.pdf
+            ├── system-name-2.yaml
+            └── system-name-2.pdf
+```
+2.28. **dlioLog** -- Since every *timestamp directory* has the same internal structure in all cases, that structure is described pictorially just below:
+```
+└── YYYYMMDD_HHmmss
+    ├── [training|checkpointing]_[datagen|run].stdout.log
+    ├── [training|checkpointing]_[datagen|run].stderr.log
+    ├── *[output|per_epoch_stats|summary].json
+    ├── dlio.log
+    └── dlio_config
+        ├── config.yaml
+        ├── hydra.yaml
+        └── overrides.yaml
+```
+
+# 3. Validating the Training Workloads
+
+## 3.1. Datasize Options
+
+3.1.1. **verifyDatasizeUsage** -- The *submission validator* must verify that the *datasize* option was used by finding the entry or entries in the log file showing its use.
+
+3.1.2. **recalculateDatasetSize** -- The *submission validator* must recalculate the minimum dataset size by using the provided number of simulated accelerators and the sizes of all of the host nodes' memory as reported in the logfiles, as described below, and fail the run if the size recorded in the run's logfile doesn't exactly match the recalculated value (a sketch of the recalculation follows this list):
+  * Calculate the required minimum samples given the number of steps per epoch (NB: `num_steps_per_epoch` is a minimum of 500):
+    * `min_samples_steps_per_epoch = num_steps_per_epoch * batch_size * num_accelerators_across_all_nodes`
+  * Calculate the required minimum samples given the host memory, to eliminate client-side caching effects (NB: `HOST_MEMORY_MULTIPLIER` = 5):
+    * `min_samples_host_memory_across_all_nodes = number_of_hosts * memory_per_host_in_GB * HOST_MEMORY_MULTIPLIER * 1024 * 1024 * 1024 / record_length`
+  * Ensure we meet both constraints:
+    * `min_samples = max(min_samples_steps_per_epoch, min_samples_host_memory_across_all_nodes)`
+  * Calculate the minimum files to generate:
+    * `min_total_files = min_samples / num_samples_per_file`
+    * `min_files_size = min_samples * record_length / 1024 / 1024 / 1024`
+  * A minimum of `min_total_files` files are required, which will consume `min_files_size` GB of storage.
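+
+A direct transcription of that recalculation into Python; whether the divisions round up (assumed here for the file count) must match what `mlpstorage` actually records:
+```python
+import math
+
+HOST_MEMORY_MULTIPLIER = 5
+GIB = 1024 * 1024 * 1024
+
+def minimum_dataset_size(num_steps_per_epoch, batch_size,
+                         num_accelerators_across_all_nodes,
+                         number_of_hosts, memory_per_host_in_gb,
+                         record_length, num_samples_per_file):
+    # Minimum samples required by the steps-per-epoch constraint
+    min_samples_steps = (num_steps_per_epoch * batch_size
+                         * num_accelerators_across_all_nodes)
+    # Minimum samples required to defeat client-side caching
+    min_samples_memory = (number_of_hosts * memory_per_host_in_gb
+                          * HOST_MEMORY_MULTIPLIER * GIB / record_length)
+    min_samples = max(min_samples_steps, min_samples_memory)
+    min_total_files = math.ceil(min_samples / num_samples_per_file)
+    min_files_size_gb = min_samples * record_length / GIB
+    return min_total_files, min_files_size_gb
+```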
+
+## 3.2. Datagen Options
+
+3.2.1. **datagenMinimumSize** -- The amount of data generated during the *datagen* phase must be equal to **or larger than** the amount of data calculated during the *datasize* phase, or the run must be failed.
+
+## 3.3. Run Options
+
+3.3.1. **runDataMatchesDatasize** -- The amount of data the *run* phase is told to use must be exactly equal to the *datasize* value calculated earlier, but can be less than the value used in the *datagen* phase. To express that, you can run the benchmark on a subset of the dataset by setting `num_files_train` or `num_files_eval` smaller than the number of files available in the dataset folder, but `num_subfolders_train` and `num_subfolders_eval` must be equal to the actual number of subfolders inside the dataset folder in order to generate valid results.
+
+3.3.2. **acceleratorUtilizationCheck** -- To pass a benchmark run, the AU (Accelerator Utilization) must be equal to or greater than the minimum value (a sketch of the calculation follows this list):
+  * `total_compute_time = (records_per_file * total_files) / simulated_accelerators / batch_size * computation_time * epochs`
+  * `AU = (total_compute_time / total_benchmark_running_time) * 100`
+  * All the I/O operations from the first step are excluded from the AU calculation. The I/O operations that are excluded from the AU calculation are included in the samples/second reported by the benchmark, however.
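+
+The AU arithmetic transcribed into Python; the first-step I/O exclusion is assumed to have already been applied to `total_benchmark_running_time`:
+```python
+def au_percent(records_per_file, total_files, simulated_accelerators,
+               batch_size, computation_time, epochs,
+               total_benchmark_running_time):
+    # Ideal compute time if the simulated accelerators never stalled on I/O
+    total_compute_time = ((records_per_file * total_files)
+                          / simulated_accelerators / batch_size
+                          * computation_time * epochs)
+    return (total_compute_time / total_benchmark_running_time) * 100
+```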
+
+3.3.3. **singleHostSimulatedAccelerators** -- For single-host submissions, increase the number of simulated accelerators by changing the `--num-accelerators` parameter to the benchmark.sh script. Note that the benchmarking tool requires approximately 0.5GB of host memory per simulated accelerator.
+
+3.3.4. **singleHostClientLimit** -- For single-host submissions, in both CLOSED and OPEN division results, the validator should fail the run if more than one client node was used during that run.
+
+3.3.5. **distributedDataAccessibility** -- For distributed Training submissions, all the data must be accessible to all the host nodes. **_(not clear how to check this, so maybe remove?)_**
+
+3.3.6. **identicalAcceleratorsPerNode** -- For distributed Training submissions, the number of simulated accelerators in each host node must be identical.
+
+3.3.7. **nodeCapabilityConsistency** -- For distributed Training submissions, the *submission validation checker* should emit a warning (not fail the validation) if the physical nodes that run the benchmark code differ widely enough in their capabilities. **_(not clear we should do this, so maybe remove?)_**
+
+3.3.8. **closedSubmissionChecksum** -- For CLOSED submissions of this benchmark, the MLPerf Storage codebase cannot be changed, so the *submission validation checker* SHOULD do an `md5sum` of the code directory hierarchy in the submission package and verify that it matches a precalculated checksum stored as a literal in the validator's codebase (see the sketch after 2.6).
+
+3.3.9. **closedSubmissionParameters** -- For CLOSED submissions of this benchmark, only a small number of parameters can be modified, and those parameters are listed in the table below. Any other parameter being modified must generate a message and fail the validation.
+
+**Table: Training Workload Tunable Parameters for CLOSED**
+
+| Parameter                    | Description                                                                                                                               | Default  |
+|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------|----------|
+| *Dataset parameters*         |                                                                                                                                           |          |
+| dataset.num_files_train      | Number of files for the training set                                                                                                     | --       |
+| dataset.num_subfolders_train | Number of subfolders that the training set is stored in                                                                                  | 0        |
+| dataset.data_folder          | The path where the dataset is stored                                                                                                     | --       |
+|                              |                                                                                                                                           |          |
+| *Reader parameters*          |                                                                                                                                           |          |
+| reader.read_threads          | Number of threads to load the data                                                                                                       | --       |
+| reader.computation_threads   | Number of threads to preprocess the data (only for ResNet-50)                                                                             | --       |
+| reader.transfer_size         | An int64 scalar representing the number of bytes in the read buffer (only supported for the Tensorflow models -- ResNet-50 and Cosmoflow) |          |
+| reader.prefetch_size         | An int64 scalar representing the amount of prefetching done, with values of 0, 1, or 2                                                   |          |
+| reader.odirect               | Enable O_DIRECT mode for 3D U-Net training                                                                                                | False    |
+|                              |                                                                                                                                           |          |
+| *Storage parameters*         |                                                                                                                                           |          |
+| storage.storage_root         | The storage root directory                                                                                                                | ./       |
+| storage.storage_type         | The storage type                                                                                                                          | local_fs |
+
+3.3.10. **openSubmissionParameters** -- For OPEN submissions of this benchmark, only a few additional parameters can be modified over those allowed in CLOSED, and those additional parameters are listed in the table below. Any other parameter being modified must generate a message and fail the validation.
+
+**Table: Training Workload Tunable Parameters for OPEN**
+
+| Parameter                    | Description                               | Default                                                             |
+|------------------------------|-------------------------------------------|---------------------------------------------------------------------|
+| framework                    | The machine learning framework.           | 3D U-Net: PyTorch<br>ResNet-50: Tensorflow<br>Cosmoflow: Tensorflow |
+|                              |                                           |                                                                     |
+| *Dataset parameters*         |                                           |                                                                     |
+| dataset.format               | Format of the dataset.                    | 3D U-Net: .npz<br>ResNet-50: .tfrecord<br>Cosmoflow: .tfrecord      |
+| dataset.num_samples_per_file | Number of samples per file.               | 3D U-Net: 1<br>ResNet-50: 1251<br>Cosmoflow: 1                      |
+|                              |                                           |                                                                     |
+| *Reader parameters*          |                                           |                                                                     |
+| reader.data_loader           | Supported options: Tensorflow or PyTorch. | 3D U-Net: PyTorch<br>ResNet-50: Tensorflow<br>Cosmoflow: Tensorflow |
+
+3.3.11. **mlpstoragePathArgs** -- The arguments to `mlpstorage` that set the directory pathname where the dataset is stored and the directory where the output logfiles are stored must both be set, and must be set to different values.
+
+3.3.12. **mlpstorageFilesystemCheck** -- The `mlpstorage` command should do a "df" command on the directory pathname where the dataset is stored and another one on the directory pathname where the output logfiles are stored, and record those values in the logfile. The *submission validator* should find those entries in the run's logfile and verify that they are different filesystems (a local version of the check is sketched just below). We don't want the submitter to, by accident, place the logfiles onto the storage system under test, since that would skew the results.
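+
+A minimal sketch of a local version of the different-filesystems check used by 3.3.12 (and by 4.1.10 below); the real validator must work from the df output recorded in the logfile rather than from live paths:
+```python
+import os
+
+def on_different_filesystems(data_path: str, results_path: str) -> bool:
+    """True when the two paths live on different filesystems; comparing
+    st_dev is one local approximation of comparing df output."""
+    return os.stat(data_path).st_dev != os.stat(results_path).st_dev
+```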
+
+# 4. Validating the Checkpointing Workloads
+
+## 4.1. Benchmark Run Options
+
+4.1.1. **checkpointDataSizeRatio** -- The checkpoint data written per client node must be more than 3x the client node's memory capacity; otherwise, the filesystem cache needs to be cleared between the write and read phases.
+
+4.1.2. **fsyncVerification** -- We must verify that all the benchmark workload configuration files have been set to do an fsync call at the end of each of the 10 checkpoint writes.
+
+4.1.3. **modelConfigurationReq** -- The benchmark must be run with one of the four model configurations detailed in Table 2 below.
+
+4.1.4. **closedMpiProcesses** -- For CLOSED submissions, the number of MPI processes must be set to 8, 64, 512, and 1024 for the respective models. (See Table 2.)
+
+4.1.5. **closedAcceleratorsPerHost** -- For CLOSED submissions, submitters may adjust the number of simulated accelerators **per host**, as long as each host uses more than 4 simulated accelerators and the total number of simulated accelerators (the total number of processes) matches the requirement. (See Table 2.)
+
+4.1.6. **aggregateAcceleratorMemory** -- The aggregate simulated accelerator memory across all nodes must be sufficient to accommodate the model's checkpoint size. That is, the GB of memory associated with the chosen accelerator (eg: H100) times the accelerator count must be equal to or greater than the total checkpoint size for that scale of checkpoint. (See Table 2 and the sketch just after it.)
+
+**Table 2: LLM models**
+
+| Model                  | 8B     | 70B    | 405B    | 1T     |
+|------------------------|--------|--------|---------|--------|
+| Hidden dimension       | 4096   | 8192   | 16384   | 25872  |
+| FFN size               | 14336  | 28672  | 53248   | 98304  |
+| num_attention_heads    | 32     | 128    | 128     | 192    |
+| num_kv_heads           | 8      | 8      | 8       | 32     |
+| Num layers             | 32     | 80     | 126     | 128    |
+| Parallelism (TP×PP×DP) | 1×1×8  | 8×1×8  | 8×32×2  | 8×64×2 |
+| Total Processes        | 8      | 64     | 512     | 1024   |
+| ZeRO                   | 3      | 3      | 1       | 1      |
+| Checkpoint size        | 105 GB | 912 GB | 5.29 TB | 18 TB  |
+| Subset: 8-Process Size | 105 GB | 114 GB | 94 GB   | 161 GB |
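+
+A minimal sketch of the 4.1.6 check using Table 2's checkpoint sizes; treating 1 TB as 1000 GB, and the GB-vs-GiB convention generally, are assumptions that must match the official accounting:
+```python
+CHECKPOINT_SIZE_GB = {'8b': 105, '70b': 912, '405b': 5290, '1t': 18000}
+
+def aggregate_memory_sufficient(model: str, accelerator_memory_gb: int,
+                                accelerator_count: int) -> bool:
+    """Rule 4.1.6: accelerator memory times accelerator count must be at
+    least the model's total checkpoint size."""
+    return accelerator_memory_gb * accelerator_count >= CHECKPOINT_SIZE_GB[model]
+
+# Example: 512 simulated 80GB accelerators against the 405B model:
+# 512 * 80 = 40960 GB >= 5290 GB, so the check passes.
+```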
+
+4.1.7. **closedCheckpointParameters** -- For CLOSED submissions of this benchmark, only a small number of parameters can be modified, and those parameters are listed in the table below. Any other parameter being modified must generate a message and fail the validation.
+
+**Table: Checkpoint Workload Tunable Parameters for CLOSED**
+
+| Parameter                    | Description                                                | Default        |
+|------------------------------|------------------------------------------------------------|----------------|
+| checkpoint.checkpoint_folder | The storage directory for writing and reading checkpoints  | ./checkpoints/ |
+
+4.1.8. **openSubmissionScaling** -- For OPEN submissions of this benchmark, the total number of processes may be increased in multiples of (TP×PP) to showcase the scalability of the storage solution.
+
+**Table 3: Configuration parameters and their mutability in CLOSED and OPEN divisions**
+
+| Parameter                | Meaning                                | Default value                                 | Changeable in CLOSED | Changeable in OPEN |
+|--------------------------|----------------------------------------|-----------------------------------------------|----------------------|--------------------|
+| --ppn hostname:slotcount | Number of processes per node           | N/A                                           | YES (minimum 4)      | YES (minimum 4)    |
+| --num-processes          | Total number of processes              | Node local: 8<br>Global: the value in Table 2 | NO                   | YES                |
+| --checkpoint-folder      | The folder to save the checkpoint data | checkpoint/{workload}                         | YES                  | YES                |
+| --num-checkpoints-write  | Number of write checkpoints            | 10 or 0 (see 4.2.1)                           | NO                   | NO                 |
+| --num-checkpoints-read   | Number of read checkpoints             | 10 or 0 (see 4.2.1)                           | NO                   | NO                 |
+
+**NOTE: In the ``--ppn`` syntax above, the ``slotcount`` value means the number of processes to run per node.**
+
+4.1.9. **checkpointPathArgs** -- The arguments to `mlpstorage` that set the directory pathname where the checkpoints are written and read and the directory where the output logfiles are stored must both be set, and must be set to different values.
+
+4.1.10. **checkpointFilesystemCheck** -- The `mlpstorage` command should do a "df" command on the directory pathname where the checkpoints are written and read and another one on the directory pathname where the output logfiles are stored, and record those values in the logfile. The *submission validator* should find those entries in the run's logfile and verify that they are different filesystems (the sketch after 3.3.12 applies here as well). We don't want the submitter to, by accident, place the logfiles onto the storage system under test, since that would skew the results.
+
+4.1.11. **subsetRunValidation** -- The `mlpstorage` command must accept a parameter telling it that this is a *subset* run and add that info to the output log file. The *submission validator* must flag an error if the `subset` argument is given but the total number of accelerators is not exactly 8, or the model is "8B" (which already runs with 8 processes).
+
+## 4.2. Storage System Must Be Simultaneously R/W or _Remappable_
+
+4.2.1. **cacheFlushValidation** -- If a submitter needs to issue a cache flush operation between the write phase and the read phase of a checkpoint benchmark run, then the validator must check that ``--num-checkpoints-read=0`` was set during the write phase, that there was a pause of at most 30 seconds, and that the read phase was then started with ``--num-checkpoints-write=0`` set.
+
+4.2.2. **totalTestDuration** -- The validator must verify that the total test duration starts at the timestamp of the first checkpoint written and ends at the ending timestamp of the last checkpoint read, notably including the "remapping" time.
+
+4.2.3. **remappingTimeReporting** -- For a _remapping_ solution, the time duration between the checkpoint being completed and the earliest time that checkpoint could be read by a different host node must be reported in the `SystemDescription.yaml` file.
+
+4.2.4. **simultaneousRwSupport** -- The `SystemDescription.yaml` document must list whether the solution supports simultaneous reads and/or writes, as such:
+```
+System:
+  shared_capabilities:
+    multi_host_support: True            # False is used for local storage
+    simultaneous_write_support: False   # Are simultaneous writes by multiple hosts supported in the submitted configuration
+    simultaneous_read__support: True    # Are simultaneous reads by multiple hosts supported in the submitted configuration
+```
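+
+A minimal sketch of how the validator might read those flags; the PyYAML usage, the file path, and the assumption that the values load as YAML booleans (as in the snippet above) are all illustrative:
+```python
+import yaml
+
+with open('systems/Big_and_Fast_4000_buffered.yaml') as f:  # example system name from 2.7
+    system = yaml.safe_load(f)
+
+caps = system['System']['shared_capabilities']
+if not (caps['simultaneous_write_support'] and caps['simultaneous_read__support']):
+    # Not simultaneously R/W: the remapping rules 4.2.1 through 4.2.3 apply.
+    print('remapping rules apply')
+```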
+
diff --git a/SystemDescription_Schema.yaml b/SystemDescription_Schema.yaml
new file mode 100644
index 0000000..7d855be
--- /dev/null
+++ b/SystemDescription_Schema.yaml
@@ -0,0 +1,82 @@
+system: include('system_description',required=True)
+power: include('power_requirements',required=True)
+nodes:
+  dlio_nodes: include('node_description',required=True)
+  storage_data_nodes: include('node_description',required=True)
+  storage_metadata_nodes: include('node_description',required=False)
+---
+system_description:
+  name: str(min=1)
+  description: str(min=1)
+  storage_location: enum('remote','local','hyper-converged')
+  client_software: enum('in-box','proprietary')
+  storage_interface: enum('block','file','object')
+  required_rack_units: int(min=1)
+  shared_capabilities:
+    multi_host_support: enum('True','False')          # False is used for local storage
+    simultaneous_write_support: enum('True','False')  # Are simultaneous writes by multiple hosts supported?
+    simultaneous_read__support: enum('True','False')  # Are simultaneous reads by multiple hosts supported?
+  max_sequential_read: int(min=1,required=True)   # In GiB/s
+  max_sequential_write: int(min=1,required=True)  # In GiB/s
+  max_random_read: int(min=1,required=True)       # In GiB/s
+  max_random_write: int(min=1,required=True)      # In GiB/s
+---
+power_requirements:
+  provisioned: include('power_summary',required=True)
+  consumed: include('power_summary',required=False)
+---
+power_summary:
+  dlio_client: include('power_detail')
+  storage_data_node: include('power_detail')
+  backend_switch: include('power_detail')
+---
+power_detail:
+  quantity: int(min=1)
+  psu1_nameplate_power: int(min=1,required=True)   # in watts
+  psu2_nameplate_power: int(min=1,required=False)  # in watts
+  psu3_nameplate_power: int(min=1,required=False)  # in watts
+  psu4_nameplate_power: int(min=1,required=False)  # in watts
+  psu5_nameplate_power: int(min=1,required=False)  # in watts
+  psu6_nameplate_power: int(min=1,required=False)  # in watts
+  design_power: int(min=1)                         # in watts
+  num_active_psus: int(min=1)
+  num_passive_psus: int(min=0)
+---
+node_description:
+  quantity: int(min=1)
+  hardware: include('hardware_description')
+  networking: list(include('network_instance'),min=1)
+  operating_system: include('operating_system_description')
+  tuning:
+    # All non-default tunings for the OS need to be listed
+    mpi_configuration:
+      environment_variables:
+      version: str(min=1)  # eg: Open MPI 4.1.4
+    sysctl_parameters:
+---
+hardware_description:
+  model: str(min=1)
+  rack_units: int(min=1)
+  power_supplies: int(min=1)
+  psu_configuration: enum('active/passive','active/active')
+  psu_rating: int(min=1)
+  memory_capacity: int(min=1)       # in GB, eg: 256
+  memory_configuration: str(min=1)  # eg: 8x32GB
+  cpu_qty: int(min=1)
+  cpu_model: str(min=1)
+  cpu_cores: int(min=1)
+---
+network_instance:
+  type: enum('management','data','backend')
+  model: str(min=1)
+  speed: int(min=1)  # in Gb/s
+  qty: int(min=1)
+---
+operating_system_description:
+  name: str(min=1)
+  version: str(min=1)
+  release_date: str(min=1)
+  kernel_version: str(min=1)
+  cpu_architecture: enum('x86','arm')
diff --git a/mlpstorage/checker/README.md b/mlpstorage/checker/README.md
new file mode 100644
index 0000000..6c36b1f
--- /dev/null
+++ b/mlpstorage/checker/README.md
@@ -0,0 +1,4 @@
+# This directory contains the submission validation checker.
+
+The required reviews for this directory hierarchy are different from those for the rest of the benchmark repo;
+MLCommons' internal development group are required reviewers for any changes here.