Training metrics not being recorded in Sagemaker Experiments #169

@santoshmedisetty

Hi,

I'm training a YOLOv5 model on SageMaker. I've created an Experiment and a Trial for the training job, but training metrics such as precision, recall, and mAP are not being recorded in SageMaker Experiments.

I followed a process similar to this example notebook: https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-experiments/mnist-handwritten-digits-classification-experiment/mnist-handwritten-digits-classification-experiment.ipynb

Is it a problem with the IAM role or something like that?

I start the training job with an Estimator, as shown below.

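import sagemaker
from sagemaker.estimator import Estimator
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial

# `sm` (the SageMaker boto3 client) and `timenow` are defined elsewhere (not shown).
yolov5_experiment = Experiment.create(
    experiment_name=f"yolov5-training-job-{timenow}",
    description="yolov5n model training",
    sagemaker_boto_client=sm,
)

yolov5_training_job_name = f'yolov5-training-job-{timenow}'

trial_name = f"yolov5-training-job-{timenow}"
yolov5_trial = Trial.create(
    trial_name=trial_name,
    experiment_name=yolov5_experiment.experiment_name,
    sagemaker_boto_client=sm,
)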

estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type='ml.m4.xlarge',
    # instance_type='local',
    input_mode='File',
    output_path=outpath,
    base_job_name='yolov5',
    sagemaker_session=sagemaker.Session(sagemaker_client=sm),
    metric_definitions=[
        {'Name': 'metrics/mAP_0.5', "Regex": "metrics/mAP_0.5: (.*?);"},
        {'Name': 'metrics/mAP_0.5:0.95', "Regex": "metrics/mAP_0.5:0.95: (.*?);"},
        {'Name': 'metrics/recall', "Regex": "metrics/recall: (.*?);"},
        {'Name': 'metrics/precision', "Regex": "metrics/precision: (.*?);"},
        {'Name': 'train/box_loss', "Regex": "train/box_loss: (.*?);"},
        {'Name': 'train/cls_loss', "Regex": "train/cls_loss: (.*?);"},
        {'Name': 'train/obj_loss', "Regex": "train/obj_loss: (.*?);"},
        {'Name': 'val/cls_loss', "Regex": "val/cls_loss: (.*?);"},
        {'Name': 'val/obj_loss', "Regex": "val/obj_loss: (.*?);"},
        {'Name': 'val/box_loss', "Regex": "val/box_loss: (.*?);"},
        {'Name': 'Epoch', "Regex": "Epoch: (.*?);"},
    ],
    enable_sagemaker_metrics=True,
)

estimator.fit(
    inputs,
    job_name=yolov5_training_job_name,
    experiment_config={
        "ExperimentName": yolov5_experiment.experiment_name,
        "TrialName": yolov5_trial.trial_name,
        "TrialComponentDisplayName": "Training",
    },
    wait=True,
)
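
SageMaker picks up metrics by applying each regex in metric_definitions to the training container's log output, so one thing worth ruling out before looking at IAM is whether the patterns match what the training script actually prints. The following is a minimal local sketch of that check using Python's re module. The sample log line and the extract_metrics helper are hypothetical (the real YOLOv5 output may be formatted differently), the dots in the patterns are escaped here as a precaution, and SageMaker's own matching may not behave identically to Python's re.

```python
import re

# Hypothetical log line: the real training output may look different.
sample_log = (
    "Epoch: 3; metrics/precision: 0.712; metrics/recall: 0.655; "
    "metrics/mAP_0.5: 0.689; metrics/mAP_0.5:0.95: 0.412;"
)

# A subset of the metric definitions, with literal dots escaped.
metric_definitions = [
    {"Name": "metrics/mAP_0.5", "Regex": r"metrics/mAP_0\.5: (.*?);"},
    {"Name": "metrics/mAP_0.5:0.95", "Regex": r"metrics/mAP_0\.5:0\.95: (.*?);"},
    {"Name": "metrics/recall", "Regex": r"metrics/recall: (.*?);"},
    {"Name": "metrics/precision", "Regex": r"metrics/precision: (.*?);"},
    {"Name": "Epoch", "Regex": r"Epoch: (.*?);"},
]


def extract_metrics(log_line, definitions):
    """Return {metric name: float} for every definition whose regex matches."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], log_line)
        if match is None:
            continue  # this pattern does not appear in the log line
        try:
            found[definition["Name"]] = float(match.group(1))
        except ValueError:
            pass  # captured text is not numeric, so it cannot be a metric value
    return found


print(extract_metrics(sample_log, metric_definitions))
```

If a metric is missing from the printed dictionary when this is run against a line copied from the real job logs, the regex or the log format is the more likely culprit than the IAM role.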
