This repository was archived by the owner on May 29, 2025. It is now read-only.

Commit f6f74fb

Cleaner argument handling & nlp/common/ folder (#16)
Moving the arguments into dataclasses (available since Python 3.7) lets us group related arguments, such as ModelArguments and SageMakerArguments. This consolidates the SageMaker scripts into a single file and makes the arguments simpler to pass between functions. Several files move to common/, so users need to set PYTHONPATH=/path/to/deep-learning-models/models/nlp. The commit also sets PYTHONPATH to /opt/ml/... inside the SageMaker container so that those jobs run, and adds support for logging hyperparameters in TensorBoard.
1 parent baae6c0 commit f6f74fb
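
The grouping idea in the commit message can be sketched with the standard library alone. This is an illustration, not the contents of `common/arguments.py`: the field names echo flags shown in the README, while the defaults and the helper `parse_into_dataclasses` are hypothetical. The commit itself relies on `HfArgumentParser` from `transformers` for this step.

```python
import argparse
import dataclasses
from dataclasses import dataclass


@dataclass
class ModelArguments:
    # Hypothetical fields and defaults, chosen to match README flags.
    model_type: str = "albert"
    model_size: str = "base"


@dataclass
class SageMakerArguments:
    # Hypothetical fields and defaults, chosen to match README flags.
    instance_type: str = "ml.p3dn.24xlarge"
    instance_count: int = 1


def parse_into_dataclasses(dataclass_types, argv):
    """Build one parser from several dataclasses, then split the result per group."""
    parser = argparse.ArgumentParser()
    for dc in dataclass_types:
        for field in dataclasses.fields(dc):
            # Each dataclass field becomes a --flag with the field's type and default.
            parser.add_argument(f"--{field.name}", type=field.type, default=field.default)
    namespace = vars(parser.parse_args(argv))
    # Rebuild one dataclass instance per group from the flat namespace.
    return tuple(
        dc(**{f.name: namespace[f.name] for f in dataclasses.fields(dc)})
        for dc in dataclass_types
    )


model_args, sm_args = parse_into_dataclasses(
    (ModelArguments, SageMakerArguments),
    ["--model_size=large", "--instance_count=2"],
)
```

Each group can then be passed to a function as a single object, instead of threading a flat namespace through every call.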

17 files changed: 459 additions, 362 deletions

models/nlp/albert/README.md

Lines changed: 35 additions & 2 deletions
@@ -20,7 +20,9 @@ Language models help AWS customers to improve search results, text classificatio
 3. Create an Amazon Elastic Container Registry (ECR) repository. Then build a Docker image from `docker/ngc_sagemaker.Dockerfile` and push it to ECR.

 ```bash
-export IMAGE=${ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com/${REPO}:ngc_tf21_sagemaker
+export ACCOUNT_ID=
+export REPO=
+export IMAGE=${ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com/${REPO}:ngc_tf210_sagemaker
 docker build -t ${IMAGE} -f docker/ngc_sagemaker.Dockerfile .
 $(aws ecr get-login --no-include-email)
 docker push ${IMAGE}
@@ -39,8 +41,13 @@ export SAGEMAKER_SECURITY_GROUP_IDS=sg-123,sg-456
 5. Launch the SageMaker job.

 ```bash
-python sagemaker_pretraining.py \
+# Add the main folder to your PYTHONPATH
+export PYTHONPATH=$PYTHONPATH:/path/to/deep-learning-models/models/nlp
+
+python launch_sagemaker.py \
     --source_dir=. \
+    --entry_point=run_pretraining.py \
+    --sm_job_name=albert-pretrain \
     --instance_type=ml.p3dn.24xlarge \
     --instance_count=1 \
     --load_from=scratch \
@@ -52,9 +59,35 @@ python sagemaker_pretraining.py \
     --total_steps=125000 \
     --learning_rate=0.00176 \
     --optimizer=lamb \
+    --log_frequency=10 \
     --name=myfirstjob
 ```

+6. Launch a SageMaker finetuning job.
+
+```bash
+python launch_sagemaker.py \
+    --source_dir=. \
+    --entry_point=run_squad.py \
+    --sm_job_name=albert-squad \
+    --instance_type=ml.p3dn.24xlarge \
+    --instance_count=1 \
+    --load_from=scratch \
+    --model_type=albert \
+    --model_size=base \
+    --batch_size=6 \
+    --total_steps=8144 \
+    --warmup_steps=814 \
+    --learning_rate=3e-5 \
+    --task_name=squadv2
+```
+
+7. Enter the Docker container to debug and edit code.
+
+```bash
+docker run -it -v=/fsx:/fsx --gpus=all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --rm ${IMAGE} /bin/bash
+```
+
 <!-- ### Training results

 These will be posted shortly. -->

models/nlp/albert/arguments.py

Lines changed: 0 additions & 123 deletions
This file was deleted.
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
+import argparse
+import dataclasses
+
+from transformers import HfArgumentParser
+
+from common.arguments import (
+    DataTrainingArguments,
+    LoggingArguments,
+    ModelArguments,
+    SageMakerArguments,
+    TrainingArguments,
+)
+from common.sagemaker_utils import launch_sagemaker_job
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser = HfArgumentParser(
+        (
+            ModelArguments,
+            DataTrainingArguments,
+            TrainingArguments,
+            LoggingArguments,
+            SageMakerArguments,
+        )
+    )
+    model_args, data_args, train_args, log_args, sm_args = parser.parse_args_into_dataclasses()
+
+    hyperparameters = dict()
+    for args in [model_args, data_args, train_args, log_args]:
+        for key, value in dataclasses.asdict(args).items():
+            if value is not None:
+                hyperparameters[key] = value
+    hyperparameters["fsx_prefix"] = "/opt/ml/input/data/training"
+
+    instance_abbr = {
+        "ml.p3dn.24xlarge": "p3dn",
+        "ml.p3.16xlarge": "p316",
+        "ml.g4dn.12xlarge": "g4dn",
+    }[sm_args.instance_type]
+    job_name = f"{sm_args.sm_job_name}-{sm_args.instance_count}x{instance_abbr}"
+
+    launch_sagemaker_job(
+        hyperparameters=hyperparameters,
+        job_name=job_name,
+        source_dir=sm_args.source_dir,
+        entry_point=sm_args.entry_point,
+        instance_type=sm_args.instance_type,
+        instance_count=sm_args.instance_count,
+        role=sm_args.role,
+        image_name=sm_args.image_name,
+        fsx_id=sm_args.fsx_id,
+        subnet_ids=sm_args.subnet_ids,
+        security_group_ids=sm_args.security_group_ids,
+    )
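
The two pieces of logic in the launcher above, flattening the argument groups into one hyperparameter dict and deriving the job name, can be tried in isolation. The following is a sketch under assumptions: `ModelArgs` and `TrainArgs` are hypothetical stand-ins for the real classes in `common/arguments.py`, and the explicit `ValueError` is an addition, since the dict lookup in the diff raises a bare `KeyError` for an unlisted instance type.

```python
import dataclasses
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelArgs:
    # Hypothetical stand-in for ModelArguments.
    model_type: str = "albert"
    load_from: Optional[str] = None


@dataclass
class TrainArgs:
    # Hypothetical stand-in for TrainingArguments.
    total_steps: int = 125000
    warmup_steps: Optional[int] = None


def collect_hyperparameters(*arg_groups):
    """Merge dataclass instances into one flat dict, skipping unset (None) values."""
    hyperparameters = {}
    for args in arg_groups:
        for key, value in dataclasses.asdict(args).items():
            if value is not None:
                hyperparameters[key] = value
    return hyperparameters


# Abbreviations copied from the diff above.
INSTANCE_ABBR = {
    "ml.p3dn.24xlarge": "p3dn",
    "ml.p3.16xlarge": "p316",
    "ml.g4dn.12xlarge": "g4dn",
}


def build_job_name(sm_job_name, instance_type, instance_count):
    """Compose '<name>-<count>x<abbr>', failing clearly on unknown instance types."""
    if instance_type not in INSTANCE_ABBR:
        raise ValueError(f"No abbreviation defined for instance type {instance_type!r}")
    return f"{sm_job_name}-{instance_count}x{INSTANCE_ABBR[instance_type]}"


hyperparameters = collect_hyperparameters(ModelArgs(), TrainArgs())
job_name = build_job_name("albert-pretrain", "ml.p3dn.24xlarge", 1)
```

Because the groups are merged into a single dict, a field name shared by two groups would keep only the value from the later group, so field names need to be unique across the dataclasses.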
