---
title: Run the benchmark
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

In this section, you will run the benchmark and inspect the results.

## Build PyTorch

You will use a specific commit of the `Tool-Solutions` repository to set up a Docker container with PyTorch. The repository includes releases of PyTorch that enhance the performance of ML frameworks on Arm.

```bash
cd $HOME
git clone https://github.com/ARM-software/Tool-Solutions.git
cd $HOME/Tool-Solutions/
git checkout f606cb6276be38bbb264b5ea64809c34837959c4
```
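
You can optionally verify that you are on the expected commit before building; the printed hash should match the one checked out above:

```bash
# Print the full hash of the current HEAD commit
git rev-parse HEAD
```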

The `build.sh` script builds a PyTorch wheel and a Docker image containing the wheel and its dependencies. It then runs the MLPerf container used for the benchmark in the next section. The script takes around 20 minutes to finish.

```bash
cd ML-Frameworks/pytorch-aarch64/
./build.sh
```
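
When `build.sh` completes, you can use standard Docker commands as a sanity check that the image was built and that the MLPerf container is running. The exact image and container names depend on the `Tool-Solutions` version, so treat the listings as a rough check rather than an exact match:

```bash
# List local images; a PyTorch image produced by build.sh should appear
docker images

# List running containers; the MLPerf container started by build.sh should appear
docker ps
```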

You now have everything set up to analyze the performance. Proceed to the next section to run the benchmark and inspect the results.

## Run the benchmark

A repository has been set up for the next steps. This collection of scripts streamlines the process of building and running the DLRM (Deep Learning Recommendation Model) benchmark from the MLPerf suite inside a Docker container, tailored for Arm-based systems.

Start by cloning it:

```bash
cd $HOME
git clone https://github.com/ArmDeveloperEcosystem/dlrm-mlperf-lp.git
```

The main script is `run_dlrm_benchmark.sh`. At a glance, it automates the full workflow of the MLPerf DLRM benchmark by performing the following steps:

* Initializes and configures MLPerf repositories within the container.
* Applies necessary patches (from `mlperf_patches/`) and compiles the MLPerf codebase inside the container.
* Converts pretrained weights into a usable model format.
* Performs INT8 calibration if needed.
* Executes the offline benchmark test, generating large-scale binary data during runtime.
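
If you would like to review these steps before launching the run, you can skim the script locally (the path assumes the clone location used above):

```bash
# Page through the benchmark driver script
less $HOME/dlrm-mlperf-lp/run_dlrm_benchmark.sh
```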

```bash
cd dlrm-mlperf-lp
./run_dlrm_benchmark.sh int8
```

The script can take an hour or more to run.
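
Because the run is long, you might prefer to detach it from your terminal so that a dropped SSH connection does not interrupt it. A minimal sketch using `nohup` is shown below (the log file name is an arbitrary choice); use it instead of the foreground command above if you prefer:

```bash
cd dlrm-mlperf-lp
# Run the benchmark in the background and capture its output in a log file
nohup ./run_dlrm_benchmark.sh int8 > dlrm_benchmark.log 2>&1 &
# Follow progress; press Ctrl+C to stop watching (the benchmark keeps running)
tail -f dlrm_benchmark.log
```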

{{% notice Note %}}

To run the `fp32` offline test, it's recommended to use the pre-generated binary data files from the `int8` test. You will need a CSP instance with enough RAM; for this purpose, the AWS `r8g.24xlarge` is recommended. After running the `int8` test, save the files in the `model` and `data` directories, and copy them to the instance intended for the `fp32` benchmark.
{{% /notice %}}
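
As a hedged sketch of the copy step, you could transfer the saved directories to the `fp32` instance with `rsync` over SSH. The user name, host name, and paths below are placeholders; adjust them to wherever your `int8` run created the `model` and `data` directories and to your own target instance:

```bash
# Replace the source paths and user@fp32-instance with your own values
rsync -avP model data user@fp32-instance:~/dlrm-mlperf-lp/
```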

## Understanding the results

As a final step, take a look at the results, which are written to a text file.

The DLRM model predicts Click-Through Rate (CTR), a fundamental task in online advertising, recommendation systems, and search engines. Essentially, the model estimates the probability that a user will click on a given ad, product recommendation, or search result. The higher the predicted probability, the more likely the item is to be clicked. In a server context, the goal is to produce these predictions at high throughput.

```bash
cat $HOME/results/int8/mlperf_log_summary.txt
```

Your output should contain a `Samples per second` value, where each sample is the predicted probability of a user clicking a certain ad.

```output
================================================
MLPerf Results Summary
================================================
SUT name : PyFastSUT
Scenario : Offline
Mode : PerformanceOnly
Samples per second: 1434.8
Result is : VALID
  Min duration satisfied : Yes
  Min queries satisfied : Yes
  Early stopping satisfied: Yes

================================================
Additional Stats
================================================
Min latency (ns) : 124022373
Max latency (ns) : 883187615166
Mean latency (ns) : 442524059715
50.00 percentile latency (ns) : 442808926434
90.00 percentile latency (ns) : 794977004363
95.00 percentile latency (ns) : 839019402197
97.00 percentile latency (ns) : 856679847578
99.00 percentile latency (ns) : 874336993877
99.90 percentile latency (ns) : 882255616119

================================================
Test Parameters Used
================================================
samples_per_query : 1267200
target_qps : 1920
target_latency (ns): 0
max_async_queries : 1
min_duration (ms): 600000
max_duration (ms): 0
min_query_count : 1
max_query_count : 0
qsl_rng_seed : 6023615788873153749
sample_index_rng_seed : 15036839855038426416
schedule_rng_seed : 9933818062894767841
accuracy_log_rng_seed : 0
accuracy_log_probability : 0
accuracy_log_sampling_target : 0
print_timestamps : 0
performance_issue_unique : 0
performance_issue_same : 0
performance_issue_same_index : 0
performance_sample_count : 204800
```
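
A quick cross-check of the numbers above: in the Offline scenario, all `samples_per_query` samples are issued at once, so the total run time is roughly `samples_per_query` divided by the measured throughput, 1267200 / 1434.8 ≈ 883 seconds, which lines up with the reported max latency of 883187615166 ns (about 883 s). If you only want the headline figures, `grep` pulls them out of the summary file:

```bash
# Print only the throughput and validity lines from the summary
grep -E "Samples per second|Result is" $HOME/results/int8/mlperf_log_summary.txt
```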

By successfully running the benchmark, you have gained practical experience in evaluating large-scale AI recommendation systems in a reproducible and efficient manner, an essential skill for deploying and optimizing AI workloads on modern platforms.