Description
While profiling, the checkpoints are stored as expected, but the intermediate latency report files generated by perf_analyzer are not found, which causes the error below.
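For context, the FileNotFoundError in the traceback at the end names 'bge-small-en-v1.5-onnx-results.csv', which is a relative path, so it is resolved against the container's current working directory rather than any of the mounted volumes. Below is a minimal Python sketch of that mechanism; the dict key and filename are copied from the log, but the surrounding code is a simplification for illustration, not model-analyzer's actual implementation.

import os

# Filename copied verbatim from the FileNotFoundError in the traceback;
# note that it is a relative path, not an absolute one.
perf_config = {"latency-report-file": "bge-small-en-v1.5-onnx-results.csv"}

# open() resolves a relative path against the process's current working
# directory, so the read below looks in the CWD, not in /results or
# /checkpoints.
print(os.path.abspath(perf_config["latency-report-file"]))
try:
    with open(perf_config["latency-report-file"], mode="r") as f:
        print(f.read())
except FileNotFoundError as e:
    # Reproduces the error from the log whenever the CSV was never
    # written to (or was written somewhere other than) the CWD.
    print(e)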
This is my Job.yaml file:
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ .Release.Name }}-model-analyzer
  labels:
    app: {{ .Release.Name }}-model-analyzer
    chart: {{ .Chart.Name }}-{{ .Chart.Version | replace "+" "_" }}
    release: {{ .Release.Name }}
spec:
  backoffLimit: 5
  activeDeadlineSeconds: {{ .Values.jobTimeout }}
  template:
    spec:
      shareProcessNamespace: true
      restartPolicy: OnFailure
      volumes:
        - hostPath:
            path: /s/k3/
            type: Directory
          name: scratch
        - configMap:
            name: analyzer-config
          name: config
      terminationGracePeriodSeconds: 1800
      containers:
        - name: analyzer
          image: {{ .Values.images.analyzer.image }}
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
          command: ["/bin/bash", "-c"]
          args: [
            "model-analyzer profile
            --model-repository /models
            --output-model-repository /output_models/output
            --checkpoint-directory /checkpoints/
            --triton-launch-mode local -f /config/config.yaml
            && model-analyzer analyze -e /results/
            --checkpoint-directory /checkpoints/
            -f /config/config.yaml
            && model-analyzer report -e /results/
            --checkpoint-directory /checkpoints/
            -f /config/config.yaml"]
          volumeMounts:
            - name: scratch
              mountPath: /results
              subPath: results
            - name: scratch
              mountPath: /models
              subPath: models
            - name: scratch
              mountPath: /output_models
              subPath: output-models
            - name: scratch
              mountPath: /checkpoints
              subPath: checkpoints
            - name: config
              mountPath: /config
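As a quick sanity check inside the analyzer container (not part of the chart, just a hypothetical one-off script), the paths from the volumeMounts above can be probed for existence and writability, since a non-writable location would prevent perf_analyzer from creating its intermediate CSV:

import os

# Paths copied from the volumeMounts above.
mounts = ["/results", "/models", "/output_models", "/checkpoints", "/config"]

for path in mounts:
    exists = os.path.isdir(path)
    writable = exists and os.access(path, os.W_OK)
    print(f"{path}: exists={exists}, writable={writable}")

# The Matplotlib warning in the log below hints that the process's default
# HOME/working directory may not be writable either, which matters if the
# latency report CSV is written via a relative path.
cwd = os.getcwd()
print(f"cwd {cwd}: writable={os.access(cwd, os.W_OK)}")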
The error log is attached below:
Matplotlib created a temporary cache directory at /tmp/matplotlib-2qa7cu2g because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
[Model Analyzer] Initializing GPUDevice handles
[Model Analyzer] Using GPU 0 Tesla T4 with UUID GPU-d3e4feb1-072a-c3ab-86b7-fb577104ef77
[Model Analyzer] WARNING: Overriding the output model repo path "/output_models/output"
[Model Analyzer] Starting a local Triton Server
[Model Analyzer] Loaded checkpoint from file /checkpoints/2.ckpt
[Model Analyzer] GPU devices match checkpoint - skipping server metric acquisition
[Model Analyzer]
[Model Analyzer] Starting Optuna mode search to find optimal configs
[Model Analyzer]
[I 2024-10-08 14:29:18,355] A new study created in memory with name: bge-small-en-v1.5-onnx
[Model Analyzer] Measuring default configuration to establish a baseline measurement
[Model Analyzer] Creating model config: bge-small-en-v1.5-onnx_config_default
[Model Analyzer]
[Model Analyzer] Profiling bge-small-en-v1.5-onnx_config_default: concurrency=16
[Model Analyzer] Saved checkpoint to /checkpoints/3.ckpt
Traceback (most recent call last):
  File "/usr/local/bin/model-analyzer", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/entrypoint.py", line 278, in main
    analyzer.profile(
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/analyzer.py", line 131, in profile
    self._profile_models()
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/analyzer.py", line 251, in _profile_models
    self._model_manager.run_models(models=[model])
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/model_manager.py", line 154, in run_models
    measurement = self._metrics_manager.execute_run_config(run_config)
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/record/metrics_manager.py", line 245, in execute_run_config
    measurement = self.profile_models(run_config)
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/record/metrics_manager.py", line 278, in profile_models
    perf_analyzer_metrics, model_gpu_metrics = self._run_perf_analyzer(
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/record/metrics_manager.py", line 618, in _run_perf_analyzer
    status = perf_analyzer.run(metrics_to_gather, env=perf_analyzer_env)
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/perf_analyzer/perf_analyzer.py", line 235, in run
    self._parse_outputs(metrics)
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/perf_analyzer/perf_analyzer.py", line 535, in _parse_outputs
    self._parse_generic_outputs(metrics)
  File "/usr/local/lib/python3.10/dist-packages/model_analyzer/perf_analyzer/perf_analyzer.py", line 551, in _parse_generic_outputs
    with open(perf_config["latency-report-file"], mode="r") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'bge-small-en-v1.5-onnx-results.csv'