
Commit c527664
committed

Update DLRM LP

- Remove copying files section
- Add patches for hosting
- Combine scripts
- Clarify metrics

1 parent ef04c3c

File tree

7 files changed: +1191 −109 lines changed


content/learning-paths/servers-and-cloud-computing/dlrm/1-overview.md

Lines changed: 15 additions & 8 deletions
@@ -22,7 +22,6 @@ Before you can run the benchmark, you will need an Arm-based Cloud Service Provi
 | --------------------- | -------------- |
 | Google Cloud Platform | c4a-highmem-72 |
 | Amazon Web Services | r8g.16xlarge |
-| Microsoft Azure | TODO |
 
 ### Verify Python installation
 Make sure Python is installed by running the following and making sure a version is printed.
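The verification command itself falls outside this hunk, but the `Python 3.12.6` context line in the next hunk suggests a plain version check. As a sketch (not part of the diff):

```bash
# Print the installed Python version; any printed version confirms the install
python3 --version
```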
@@ -37,27 +36,35 @@ Python 3.12.6
 
 ## Install Docker
 
-```bash
-sudo apt-get update
-sudo apt-get install ca-certificates curl docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin make -y
-```
+Start by adding the official Docker GPG key to your system’s APT keyrings directory:
 
 ```bash
 sudo install -m 0755 -d /etc/apt/keyrings
 sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
 sudo chmod a+r /etc/apt/keyrings/docker.asc
 ```
 
+Next, install Docker and some additional dependencies:
+
+```bash
+sudo apt-get update
+sudo apt-get install ca-certificates curl docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin make -y
+```
+
+Finally, add the Docker apt repository to finalize the installation:
+
 ```bash
 echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
 $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
 ```
 
-{{ % notice Note % }}
-If you run into permission issues with Docker, try running the following
+{{% notice Note %}}
+If you run into permission issues with Docker, try running the following:
 
 ```bash
 sudo usermod -aG docker $USER
 sudo chmod 666 /var/run/docker.sock
 ```
-{{ % /notice % }}
+{{% /notice %}}
+
+With your development environment set up, you can move on to download the model.
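As a quick aside (not part of this commit), you could sanity-check the Docker install before moving on; a minimal sketch:

```bash
# Confirm the client and daemon work end to end
docker --version              # client version string
docker run --rm hello-world   # pulls a tiny test image and runs it once
```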

content/learning-paths/servers-and-cloud-computing/dlrm/2-download-model.md

Lines changed: 3 additions & 6 deletions
@@ -28,7 +28,7 @@ rclone v1.69.1 has successfully installed.
 Now run "rclone config" for setup. Check https://rclone.org/docs/ for more details.
 ```
 
-Configure the credentials as instructed.
+Configure the following credentials for rclone:
 
 ```bash
 rclone config create mlc-inference s3 provider=Cloudflare \
@@ -38,17 +38,14 @@ rclone config create mlc-inference s3 provider=Cloudflare \
 ```
 
 You will now download the data and model weights. This process takes an hour or more depending on your internet connection.
+
 ```bash
 rclone copy mlc-inference:mlcommons-inference-wg-public/dlrm_preprocessed $HOME/data -P
 rclone copy mlc-inference:mlcommons-inference-wg-public/model_weights $HOME/model/model_weights -P
 ```
 
 Once it finishes, you should see that the `model` and `data` directories are populated.
 
-* Overview of Dataset Used in MLPerf DLRM
-* Steps to Download and Prepare the Data
-* Preprocessing Data for Training and Inference
-
 ## Build DLRM image
 
 You will use a branch of the `Tool-Solutions` repository. This branch includes releases of PyTorch which enhance the performance of ML frameworks.
@@ -60,7 +57,7 @@ cd $HOME/Tool-Solutions/
 git checkout ${1:-"pytorch-aarch64--r24.12"}
 ```
 
-A setup script runs which installs docker and builds a PyTorch image for a specific commit hash. Finally, it runs the MLPerf container which is used for the benchmark in the next section. This script takes around 20 minutes to finish.
+The `build.sh` script builds a wheel and a Docker image containing PyTorch and its dependencies. It then runs the MLPerf container, which is used for the benchmark in the next section.
 
 ```bash
 cd ML-Frameworks/pytorch-aarch64/
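As a hedged aside, once the rclone copies finish you could confirm the directories are populated before building the image; the exact file names depend on the MLCommons buckets, so treat this as a sketch:

```bash
# Rough size check: the preprocessed data and the weights are both large
du -sh $HOME/data $HOME/model/model_weights
# List a few entries to confirm the copies actually landed
ls $HOME/data | head
ls $HOME/model/model_weights | head
```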

content/learning-paths/servers-and-cloud-computing/dlrm/3-run-benchmark.md

Lines changed: 68 additions & 91 deletions
@@ -6,37 +6,46 @@ weight: 5
 layout: learningpathall
 ---
 
-The final step is to run the actual benchmark.
+The final step is to run the benchmark.
+
+## Download patches
+
+Start by downloading the patches, which will be applied during setup.
+
+```bash
+wget -r --no-parent https://github.com/ArmDeveloperEcosystem/arm-learning-paths/tree/main/content/learning-paths/servers-and-cloud-computing/dlrm/mlpef_patches -P $HOME/mlperf_patches
+```
 
 ## Benchmark script
 
-You will now create a script which uses the Docker container to run the benchmark. Create a new file called `run_dlrm_benchmark.sh`. Paste the code below.
+You will now create a script that automates the setup, configuration, and execution of the MLPerf benchmark for DLRM (the Deep Learning Recommendation Model) inside a Docker container. It simplifies the process by handling dependency installation, model preparation, and benchmarking in a single run. Create a new file called `run_dlrm_benchmark.sh` and paste in the code below.
 
 ```bash
 #!/bin/bash
 
 set -ex
 yellow="\e[33m"
 reset="\e[0m"
+
 data_type=${1:-"int8"}
+
 echo -e "${yellow}Data type chosen for the setup is $data_type${reset}"
 
-# setup environment variables for the dlrm container
+# Set up directories
 data_dir=$HOME/data/
 model_dir=$HOME/model/
 results_dir=$HOME/results/
 dlrm_container="benchmark_dlrm"
 
-# Create results directory
 mkdir -p $results_dir/$data_type
 
 ###### Run the dlrm container and setup MLPerf #######
-# Check if the container exists
+
 echo -e "${yellow}Checking if the container '$dlrm_container' exists...${reset}"
 container_exists=$(docker ps -aqf "name=^$dlrm_container$")
 
 if [ -n "$container_exists" ]; then
-    echo "${yellow}Container '$dlrm_container' already exists. Will not create a new one. ${reset}"
+    echo "${yellow}Container '$dlrm_container' already exists.${reset}"
 else
     echo "Creating a new '$dlrm_container' container..."
     docker run -td --shm-size=200G --privileged \
@@ -45,125 +54,94 @@ else
         -v $results_dir:$results_dir \
         -e DATA_DIR=$data_dir \
         -e MODEL_DIR=$model_dir \
-        -e CONDA_PREFIX=/opt/conda \
-        -e NUM_SOCKETS="1" \
-        -e CPUS_PER_SOCKET=$(nproc) \
-        -e CPUS_PER_PROCESS=$(nproc) \
-        -e CPUS_PER_INSTANCE="1" \
-        -e CPUS_FOR_LOADGEN="1" \
-        -e BATCH_SIZE="400" \
         -e PATH=/opt/conda/bin:$PATH \
         --name=$dlrm_container \
         toolsolutions-pytorch:latest
 fi
 
-###### Build MLPerf & Dependencies #######
-# Copy MLPerf build script to the benchmark_dlrm container
-docker cp ~/dlrm_docker_setup/build_mlperf.sh $dlrm_container:$HOME/
-
-# Copy the patches
-docker cp ~/dlrm_docker_setup/mlperf_patches $dlrm_container:$HOME/
-
-echo -e "${yellow}Setting up MLPerf benchmarking inside the container...${reset}"
-docker exec -it $dlrm_container bash -c ". $HOME/build_mlperf.sh $data_type"
-
-###### Dump the model #######
+echo -e "${yellow}Setting up MLPerf inside the container...${reset}"
+docker cp $HOME/mlperf_patches $dlrm_container:$HOME/
+docker exec -it $dlrm_container bash -c "
+    set -ex
+    sudo apt update && sudo apt install -y \
+        software-properties-common lsb-release scons \
+        build-essential libtool autoconf unzip git vim wget \
+        numactl cmake gcc-12 g++-12 python3-pip python-is-python3
+    sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 12 --slave /usr/bin/g++ g++ /usr/bin/g++-12
+
+    if [ ! -d \"/opt/conda\" ]; then
+        wget -O \"$HOME/miniconda.sh\" https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-aarch64.sh
+        chmod +x \"$HOME/miniconda.sh\"
+        sudo bash \"$HOME/miniconda.sh\" -b -p /opt/conda
+        rm \"$HOME/miniconda.sh\"
+    fi
+    export PATH=\"/opt/conda/bin:$PATH\"
+    /opt/conda/bin/conda install -y python=3.10.12
+    /opt/conda/bin/conda install -y -c conda-forge cmake gperftools numpy==1.23.0 ninja pyyaml setuptools
+
+    git clone --recurse-submodules https://github.com/mlcommons/inference.git inference || (cd inference ; git pull)
+    cd inference && git submodule update --init --recursive && cd loadgen
+    CFLAGS=\"-std=c++14\" python setup.py bdist_wheel
+    pip install dist/*.whl
+
+    rm -rf inference_results_v4.0
+    git clone https://github.com/mlcommons/inference_results_v4.0.git
+    cd inference_results_v4.0 && git checkout ceef1ea
+
+    if [ \"$data_type\" = \"fp32\" ]; then
+        git apply $HOME/mlperf_patches/arm_fp32.patch
+    else
+        git apply $HOME/mlperf_patches/arm_int8.patch
+    fi
+"
+
+echo -e "${yellow}Checking for dumped FP32 model...${reset}"
 dumped_fp32_model="dlrm-multihot-pytorch.pt"
 int8_model="aarch64_dlrm_int8.pt"
 dlrm_test_path="$HOME/inference_results_v4.0/closed/Intel/code/dlrm-v2-99.9/pytorch-cpu-int8"
 
-# Check if FP32 model is already dumped
-if [ -f "$HOME/model/$dumped_fp32_model" ]; then
-    echo -e "${yellow}File '$dumped_fp32_model' exists. Skipping model dumping step.${reset}"
-else
-    echo -e "${yellow}File '$dumped_fp32_model' does not exist. Dumping the model weights...${reset}"
-    docker cp $HOME/dlrm_docker_setup/requirements.txt $dlrm_container:$HOME
-    docker exec -it "$dlrm_container" bash -c "pip install -r requirements.txt ; cd $dlrm_test_path && python python/dump_torch_model.py --model-path=$model_dir/model_weights --dataset-path=$data_dir"
+if [ ! -f "$HOME/model/$dumped_fp32_model" ]; then
+    echo -e "${yellow}Dumping model weights...${reset}"
+    docker exec -it "$dlrm_container" bash -c "
+        pip install --extra-index-url https://download.pytorch.org/whl/nightly/cpu tensordict==0.1.2 torchsnapshot==0.1.0 fbgemm_gpu==2025.1.22+cpu torchrec==1.1.0.dev20250127+cpu
+    "
+    docker exec -it "$dlrm_container" bash -c "
+        cd $dlrm_test_path && python python/dump_torch_model.py --model-path=$model_dir/model_weights --dataset-path=$data_dir
+    "
 fi
 
-###### Calibrate the model #######
-# In the case of INT8, calibrate the model if not already calibrated.
 echo -e "${yellow}Checking if INT8 model calibration is required...${reset}"
-
 if [ "$data_type" == "int8" ] && [ ! -f "$HOME/model/$int8_model" ]; then
-    echo -e "${yellow}File '$int8_model' does not exist. Running calibration...${reset}"
-    # the calibration will create aarch64_dlrm_int8.pt in the $HOME/model directory.
+    echo -e "${yellow}Running INT8 calibration...${reset}"
     docker exec -it "$dlrm_container" bash -c "cd $dlrm_test_path && ./run_calibration.sh"
-else
-    echo -e "${yellow}Calibration step is not needed.${reset}"
 fi
 
-###### Run the test #######
-# Run the offline test
 echo -e "${yellow}Running offline test...${reset}"
 docker exec -it "$dlrm_container" bash -c "cd $dlrm_test_path && bash run_main.sh offline $data_type"
 
-# Copy results to the host machine
 echo -e "${yellow}Copying results to host...${reset}"
 docker exec -it "$dlrm_container" bash -c "cd $dlrm_test_path && cp -r output/pytorch-cpu/dlrm/Offline/performance/run_1/* $results_dir/$data_type/"
 
-# Display the MLPerf summary results
-echo -e "${yellow}Displaying MLPerf results...${reset}"
 cat $results_dir/$data_type/mlperf_log_summary.txt
-
 ```
 
-At a glance, these are the steps it goes through:
-
-- Sets up MLPerf repositories within the container.
-- Dumps the model from existing model weights if not already available.
-- Calibrates the INT8 model from the dumped model if it has not been previously generated.
-- Executes the offline benchmark test, generating terabyte-scale binary data files during the process.
-
-Run the offline test with the `int8` datatype. You can also specify the argument `fp32` to build for the floating point datatype.
+With the script ready, it's time to run the benchmark:
 
 ```bash
-cd $HOME/dlrm_docker_setup
-./run_dlrm_benchmark.sh int8
-```
-
-## Save output files
-
-You may want to save the final model and data files to run on smaller servers. You can use `scp` to achieve this.
-
-From your long-term storage machine, run the following command. You need to update the parameters before running.
-
-```
-scp -i <key-pair> <username>@<ipaddress>:/remote/path/to/file $HOME/model/int8/
-```
-where `key-pair` is the key-pair used for the larger instance, `username` and `ipaddress` the corresponding access points, and the two paths are the source and destination paths respectively.
-
-Save the following files for long-term storage.
-
-```console
-$HOME/model/aarch64_dlrm_int8.pt
-$HOME/model/dlrm-multihot-pytorch.pt
-$HOME/data/terabyte_processed_test_v2_dense.bin
-$HOME/data/terabyte_processed_test_v2_label_sparse.bin
-```
-
-To run the INT8 model, an instance with 250 GB of RAM and 500 GB of disk space is enough. For example, the following instance types:
-
-| CSP | Instance type |
-| --------------------- | -------------- |
-| Google Cloud Platform | c4a-highmem-32 |
-| Amazon Web Services | r8g.8xlarge |
-| Microsoft Azure | TODO |
-
-For example, you can re-run the offline `int8` benchmark by cloning the repository to the smaller instance and the following command.
-
-```bash
-./run_main.sh offline int8
+./run_dlrm_benchmark.sh
 ```
 
 ## Understanding the results
 
 As a final step, have a look at the results generated in a text file.
+
+The DLRM model optimizes Click-Through Rate (CTR) prediction, a fundamental task in online advertising, recommendation systems, and search engines. Essentially, the model estimates the probability that a user will click on a given ad, product recommendation, or search result. The higher the predicted probability, the more likely the item is to be clicked. In a server context, the goal is to sustain a high throughput of these predictions.
+
 ```bash
 cat $HOME/results/int8/mlperf_log_summary.txt
 ```
 
-It should look something like this. Note the ....
+Your output should contain a `Samples per second` value, where each sample gives the probability of a user clicking a certain ad.
 
 ```output
 ================================================
@@ -172,7 +150,7 @@ MLPerf Results Summary
 SUT name : PyFastSUT
 Scenario : Offline
 Mode : PerformanceOnly
-Samples per second: 1434.8 # Each sample tells probability of the user clicking a certain ad. Can be used by Amazon to pick the top 5 ads to recommend to a user
+Samples per second: 1434.8
 Result is : VALID
 Min duration satisfied : Yes
 Min queries satisfied : Yes
@@ -214,4 +192,3 @@ performance_issue_same : 0
 performance_issue_same_index : 0
 performance_sample_count : 204800
 ```
-
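Since the summary is plain text, a small sketch (not part of the commit) shows how the headline number could be pulled out for comparison across runs; the path assumes the `int8` run above:

```bash
# Extract the offline throughput from the MLPerf summary
grep "Samples per second" $HOME/results/int8/mlperf_log_summary.txt
# Confirm the run was valid before trusting the number
grep "Result is" $HOME/results/int8/mlperf_log_summary.txt
```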

content/learning-paths/servers-and-cloud-computing/dlrm/_index.md

Lines changed: 4 additions & 4 deletions
@@ -3,11 +3,10 @@ title: MLPerf Benchmarking on Arm Neoverse V2
 
 minutes_to_complete: 10
 
-who_is_this_for: This is an introductory topic for software developers who want to set up a pipeline in the cloud for recommendation models. You will . Then, you’ll build and run the benchmark using MLPerf, analyzing key performance metrics along the way.
-
+who_is_this_for: This is an introductory topic for software developers who want to set up a pipeline in the cloud for recommendation models. You will build and run the benchmark using MLPerf and PyTorch.
 
 learning_objectives:
-    - build the Deep Learning Recommendation Model (DLRM)
+    - build the Deep Learning Recommendation Model (DLRM) using a Docker image
     - run a modified performant DLRMv2 benchmark and inspect the results
 
 prerequisites:
@@ -22,9 +21,10 @@ armips:
     - Neoverse
 tools_software_languages:
     - Docker
-    - TODO
+    - MLPerf
 operatingsystems:
     - Linux
+cloud_service_providers: AWS
 
 further_reading:
     - resource:
