# MLPerf Inference reference implementation for DLRMv3

## Install dependencies and build loadgen

The reference implementation has been tested on a single host with x86_64 CPUs and 8 NVIDIA H100/B200 GPUs. Install the dependencies and build LoadGen by running
```
sh setup.sh
```

## Dataset download

DLRMv3 uses a synthetic dataset specifically designed to match the model and system characteristics of large-scale sequential recommendation (a large item set and a long average sequence length per request). To generate the dataset used for both training and inference, run
```
python streaming_synthetic_data.py
```
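To give a feel for the shape of such data, here is a toy generator of streaming sequential-recommendation events. It is an illustrative sketch only, not `streaming_synthetic_data.py`: the scales and distributions (Poisson arrival counts, Zipf-skewed item popularity) are assumptions.

```python
import numpy as np

def toy_streaming_dataset(num_users=1000, num_items=100_000,
                          num_timestamps=10, mean_seq_len=50, seed=0):
    """Yield (timestamp, user_id, item_sequence) triples.

    Each user's interaction history grows across timestamps, and item
    popularity is skewed (Zipf-like) -- two hallmarks of streaming
    recommendation logs. All distributions here are illustrative.
    """
    rng = np.random.default_rng(seed)
    histories = {u: [] for u in range(num_users)}
    for t in range(num_timestamps):
        for u in range(num_users):
            # a handful of new interactions per user per timestamp
            n_new = rng.poisson(mean_seq_len / num_timestamps)
            new_items = rng.zipf(1.2, size=n_new) % num_items
            histories[u].extend(int(i) for i in new_items)
            yield t, u, list(histories[u])
```

Scaling this shape to 5 million users, a billion items, and 100 timestamps is what yields the multi-terabyte footprint described below.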
The generated dataset is about 2 TB in size and contains 5 million users interacting with a billion items over 100 timestamps.

Only 1% of the dataset is used in the inference benchmark. The sampled DLRMv3 dataset and the trained checkpoint are available at https://inference.mlcommons-storage.org/.

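One way a fixed-percentage subset like this could be drawn is seeded, deterministic sampling of request indices. This is a hypothetical sketch for intuition, not the procedure actually used to produce the hosted subset:

```python
import numpy as np

def sample_percentage(num_requests: int, percentage: float, seed: int = 0) -> np.ndarray:
    """Deterministically select `percentage` percent of request indices.

    A fixed seed makes the subset reproducible across runs and machines.
    """
    rng = np.random.default_rng(seed)
    k = int(num_requests * percentage / 100)
    return np.sort(rng.choice(num_requests, size=k, replace=False))

# e.g. 1% of a million requests -> 10,000 distinct, sorted indices
subset = sample_percentage(1_000_000, 1.0)
```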
Script to download the sampled dataset used in the inference benchmark:
```
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://inference.mlcommons-storage.org/metadata/dlrm-v3-dataset.uri
```
Script to download the 1TB trained checkpoint:
```
bash <(curl -s https://raw.githubusercontent.com/mlcommons/r2-downloader/refs/heads/main/mlc-r2-downloader.sh) https://inference.mlcommons-storage.org/metadata/dlrm-v3-checkpoint.uri
```

## Inference benchmark

```
WORLD_SIZE=8 python main.py --dataset sampled-streaming-100b
```

`WORLD_SIZE` is the number of GPUs used in the inference benchmark.

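With a billion-item embedding table, the model weights must be partitioned across the `WORLD_SIZE` GPUs. As a minimal sketch of one common approach, contiguous row-wise sharding can be computed as below; the reference implementation's actual sharding strategy may differ.

```python
def row_shard_bounds(num_rows: int, world_size: int) -> list[tuple[int, int]]:
    """Partition num_rows embedding rows into world_size contiguous shards.

    Earlier ranks receive one extra row when num_rows is not divisible
    by world_size, so shard sizes differ by at most one.
    """
    base, rem = divmod(num_rows, world_size)
    bounds, start = [], 0
    for rank in range(world_size):
        size = base + (1 if rank < rem else 0)
        bounds.append((start, start + size))
        start += size
    return bounds

# Example: a 1B-row item table over 8 GPUs -> 125M rows per GPU.
shards = row_shard_bounds(1_000_000_000, 8)
```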
```
usage: main.py [-h] [--dataset {streaming-100b,sampled-streaming-100b}] [--model-path MODEL_PATH] [--scenario-name {Server,Offline}] [--batchsize BATCHSIZE]
               [--output-trace OUTPUT_TRACE] [--data-producer-threads DATA_PRODUCER_THREADS] [--compute-eval COMPUTE_EVAL] [--find-peak-performance FIND_PEAK_PERFORMANCE]
               [--dataset-path-prefix DATASET_PATH_PREFIX] [--warmup-ratio WARMUP_RATIO] [--num-queries NUM_QUERIES] [--target-qps TARGET_QPS] [--numpy-rand-seed NUMPY_RAND_SEED]
               [--sparse-quant SPARSE_QUANT] [--dataset-percentage DATASET_PERCENTAGE]

options:
  -h, --help            show this help message and exit
  --dataset {streaming-100b,sampled-streaming-100b}
                        name of the dataset
  --model-path MODEL_PATH
                        path to the model checkpoint. Example: /home/username/ckpts/streaming_100b/89/
  --scenario-name {Server,Offline}
                        inference benchmark scenario
  --batchsize BATCHSIZE
                        batch size used in the benchmark
  --output-trace OUTPUT_TRACE
                        whether to output a trace
  --data-producer-threads DATA_PRODUCER_THREADS
                        number of threads used in the data producer
  --compute-eval COMPUTE_EVAL
                        if true, runs AccuracyOnly mode and outputs both predictions and labels for accuracy calculations
  --find-peak-performance FIND_PEAK_PERFORMANCE
                        whether to find peak performance in the benchmark
  --dataset-path-prefix DATASET_PATH_PREFIX
                        prefix to the dataset path. Example: /home/username/
  --warmup-ratio WARMUP_RATIO
                        the ratio of the dataset used to warm up the SUT
  --num-queries NUM_QUERIES
                        number of queries to run in the benchmark
  --target-qps TARGET_QPS
                        benchmark target QPS. Needs to be tuned per implementation to balance latency and throughput
  --numpy-rand-seed NUMPY_RAND_SEED
                        numpy random seed
  --sparse-quant SPARSE_QUANT
                        whether to quantize the sparse arch
  --dataset-percentage DATASET_PERCENTAGE
                        percentage of the dataset to run in the benchmark
```

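`--target-qps` matters most in the Server scenario, where MLPerf LoadGen issues queries with Poisson-distributed inter-arrival times at the target rate. The sketch below reproduces that arrival pattern for intuition when tuning the flag; it is not LoadGen's actual implementation.

```python
import numpy as np

def schedule_arrivals(target_qps: float, num_queries: int, seed: int = 123) -> np.ndarray:
    """Return cumulative arrival times (seconds) for a Poisson query stream.

    Inter-arrival gaps are exponential with mean 1/target_qps, which is
    how a Poisson arrival process at target_qps queries/second behaves.
    Bursts of closely spaced queries are expected, so the SUT must absorb
    short-term load well above the average rate.
    """
    rng = np.random.default_rng(seed)
    gaps = rng.exponential(scale=1.0 / target_qps, size=num_queries)
    return np.cumsum(gaps)

arrivals = schedule_arrivals(target_qps=100.0, num_queries=10_000)
observed_qps = len(arrivals) / arrivals[-1]
```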
## Accuracy test

Setting `run.compute_eval` runs the accuracy test and dumps prediction outputs to
`mlperf_log_accuracy.json`. To check the accuracy, run

```
python accuracy.py --path path/to/mlperf_log_accuracy.json
```
We use normalized entropy (NE), accuracy, and AUC as the metrics to evaluate model quality. For accepted submissions, all three metrics must be within 99% of the reference implementation values. The accuracy of the reference implementation, evaluated on 34,996 requests across 10 inference timestamps, is listed below:
```
NE: 86.687%
Accuracy: 69.651%
AUC: 78.663%
```
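For intuition, the three metrics can be computed from dumped predictions and binary labels roughly as below. This is an illustrative sketch, not the code in `accuracy.py`; in particular, how the reference scales each metric into a percentage is an assumption here.

```python
import numpy as np

def normalized_entropy(labels: np.ndarray, preds: np.ndarray) -> float:
    """Cross-entropy of the predictions divided by the entropy of a
    constant predictor that always outputs the empirical positive rate."""
    eps = 1e-12
    p = np.clip(preds, eps, 1 - eps)
    ce = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    base = labels.mean()
    base_ce = -(base * np.log(base) + (1 - base) * np.log(1 - base))
    return float(ce / base_ce)

def accuracy(labels: np.ndarray, preds: np.ndarray) -> float:
    """Fraction of examples where thresholding at 0.5 matches the label."""
    return float(np.mean((preds >= 0.5) == (labels == 1)))

def auc(labels: np.ndarray, preds: np.ndarray) -> float:
    """Rank-based AUC: probability that a random positive outscores a
    random negative (ties counted half), via the Mann-Whitney U statistic."""
    ranks = np.empty(len(preds))
    ranks[np.argsort(preds)] = np.arange(1, len(preds) + 1)
    for v in np.unique(preds):  # average ranks over tied scores
        mask = preds == v
        ranks[mask] = ranks[mask].mean()
    n_pos = int((labels == 1).sum())
    n_neg = len(labels) - n_pos
    return float((ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))
```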