Hi,
I'm training a YOLOv5 model on SageMaker. I've created an Experiment and a Trial for the training job, but training metrics such as precision, recall, and mAP are not being recorded in SageMaker.
I followed a process similar to https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-experiments/mnist-handwritten-digits-classification-experiment/mnist-handwritten-digits-classification-experiment.ipynb
Is it a problem with the IAM role or something like that?
I'm triggering the training process using 'Estimator' as shown below.
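import sagemaker
from sagemaker.estimator import Estimator
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial

# Create the experiment that groups the training runs.
yolov5_experiment = Experiment.create(
    experiment_name=f"yolov5-training-job-{timenow}",
    description="yolov5n model training",
    sagemaker_boto_client=sm,
)
yolov5_training_job_name = f'yolov5-training-job-{timenow}'
trial_name = f"yolov5-training-job-{timenow}"
# Create the trial that the training job is attached to.
yolov5_trial = Trial.create(
    trial_name=trial_name,
    experiment_name=yolov5_experiment.experiment_name,
    sagemaker_boto_client=sm,
)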
estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type='ml.m4.xlarge',
    # instance_type='local',
    input_mode='File',
    output_path=outpath,
    base_job_name='yolov5',
    sagemaker_session=sagemaker.Session(sagemaker_client=sm),
    metric_definitions=[
        {'Name': 'metrics/mAP_0.5', "Regex": "metrics/mAP_0.5: (.*?);"},
        {'Name': 'metrics/mAP_0.5:0.95', "Regex": "metrics/mAP_0.5:0.95: (.*?);"},
        {'Name': 'metrics/recall', "Regex": "metrics/recall: (.*?);"},
        {'Name': 'metrics/precision', "Regex": "metrics/precision: (.*?);"},
        {'Name': 'train/box_loss', "Regex": "train/box_loss: (.*?);"},
        {'Name': 'train/cls_loss', "Regex": "train/cls_loss: (.*?);"},
        {'Name': 'train/obj_loss', "Regex": "train/obj_loss: (.*?);"},
        {'Name': 'val/cls_loss', "Regex": "val/cls_loss: (.*?);"},
        {'Name': 'val/obj_loss', "Regex": "val/obj_loss: (.*?);"},
        {'Name': 'val/box_loss', "Regex": "val/box_loss: (.*?);"},
        {'Name': 'Epoch', "Regex": "Epoch: (.*?);"}
    ],
    enable_sagemaker_metrics=True,
)
estimator.fit(
    inputs,
    job_name=yolov5_training_job_name,
    experiment_config={
        "ExperimentName": yolov5_experiment.experiment_name,
        "TrialName": yolov5_trial.trial_name,
        "TrialComponentDisplayName": "Training",
    },
    wait=True,
)
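
One way to rule out the regexes themselves is to run them locally against a line copied from the training log. The following is only a minimal sketch: it uses Python's `re` module as a stand-in for SageMaker's matcher (which may behave differently), the sample log line is made up for illustration and is not real YOLOv5 output, and `extract_metrics` is a hypothetical helper, not part of any SageMaker API.

```python
import re

# A subset of the metric definitions passed to the Estimator.
metric_definitions = [
    {"Name": "metrics/precision", "Regex": r"metrics/precision: (.*?);"},
    {"Name": "metrics/recall", "Regex": r"metrics/recall: (.*?);"},
    {"Name": "Epoch", "Regex": r"Epoch: (.*?);"},
]

# Hypothetical log line; replace it with a line copied from the real job logs.
sample_log_line = "Epoch: 3; metrics/precision: 0.712; metrics/recall: 0.655;"


def extract_metrics(line, definitions):
    """Return {metric name: float value} for every definition whose regex matches."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], line)
        if match:
            # The first capture group must parse as a number to be usable as a metric.
            found[definition["Name"]] = float(match.group(1))
    return found


result = extract_metrics(sample_log_line, metric_definitions)
print(result)
```

If a metric name is missing from the result for a real log line, the training script is probably not printing that metric in the `name: value;` form the regex expects.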