-
Notifications
You must be signed in to change notification settings - Fork 2
Description
When logging a registered model via mlflow.pytorch.log_model, MLflow tries to re-log metrics for the model (log_model_metrics_for_step) at the given step (default step=0), and if those metrics have already been logged at that step (which training.py does) then the new metrics will have the same key, ("key", value, timestamp, step, is_nan, run_uuid), causing a unique key violation in the sqlite3 backend.
I am not exactly sure why this wasn't an issue when the e2e test was originally created and merged into main (the original version of the e2e test still logged train/loss and eval/loss metrics including at step 0) and why it came up when working on the metrics tracker features (possibly due to using an updated version of the mlflow library). Also not sure if this is just an issue with a sqlite3 backend or if other backends are affected.
This seems like a bug with the mlflow library, but have not been able to put together a simple example to reproduce the issue. For now have been able to avoid the issue by passing step=num_steps + 1 as an argument to log_model which is guaranteed not to have any associated metrics, but this is just a hack that prevents any metrics from being associated with the logged model.
tests/end_to_end/training.py:171: in train
persister.save_model_to_registry(predictive_model, registered_model_name, model_inputs=sample_inputs)
simplexity/persistence/mlflow_persister.py:289: in save_model_to_registry
model_info = mlflow_pytorch.log_model(**log_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/mlflow/pytorch/__init__.py:288: in log_model
return Model.log(
.venv/lib/python3.12/site-packages/mlflow/models/model.py:1195: in log
log_model_metrics_for_step(
.venv/lib/python3.12/site-packages/mlflow/models/model.py:1139: in log_model_metrics_for_step
client.log_batch(run_id=run_id, metrics=metrics_for_step)
.venv/lib/python3.12/site-packages/mlflow/tracking/client.py:2455: in log_batch
return self._tracking_client.log_batch(
.venv/lib/python3.12/site-packages/mlflow/telemetry/track.py:30: in wrapper
result = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/mlflow/tracking/_tracking_service/client.py:581: in log_batch
self.store.log_batch(run_id=run_id, metrics=metrics_batch, params=[], tags=[])
.venv/lib/python3.12/site-packages/mlflow/store/tracking/sqlalchemy_store.py:1601: in log_batch
raise e
.venv/lib/python3.12/site-packages/mlflow/store/tracking/sqlalchemy_store.py:1597: in log_batch
self._log_metrics(run_id, metrics)
.venv/lib/python3.12/site-packages/mlflow/store/tracking/sqlalchemy_store.py:839: in _log_metrics
with self.ManagedSessionMaker() as session:
../../miniconda3/lib/python3.12/contextlib.py:158: in __exit__
self.gen.throw(value)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
@contextmanager
def make_managed_session():
"""Provide a transactional scope around a series of operations."""
with SessionMaker() as session:
try:
if db_type == SQLITE:
session.execute(sql.text("PRAGMA foreign_keys = ON;"))
session.execute(sql.text("PRAGMA busy_timeout = 20000;"))
session.execute(sql.text("PRAGMA case_sensitive_like = true;"))
yield session
session.commit()
except MlflowException:
session.rollback()
raise
except sqlalchemy.exc.OperationalError as e:
session.rollback()
_logger.exception(
"SQLAlchemy database error. The following exception is caught.\n%s",
e,
)
raise MlflowException(message=e, error_code=TEMPORARILY_UNAVAILABLE) from e
except sqlalchemy.exc.SQLAlchemyError as e:
session.rollback()
> raise MlflowException(message=e, error_code=BAD_REQUEST) from e
E mlflow.exceptions.MlflowException: (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely)
E (sqlite3.IntegrityError) UNIQUE constraint failed: metrics.key, metrics.timestamp, metrics.step, metrics.run_uuid, metrics.value, metrics.is_nan
E [SQL: INSERT INTO metrics ("key", value, timestamp, step, is_nan, run_uuid) VALUES (?, ?, ?, ?, ?, ?)]
E [parameters: ('loss/progress_to_optimal', 0.0, 1764723434073, 0, 1, 'ce154c44322544d8bb5e5a3b100f129e')]
E (Background on this error at: https://sqlalche.me/e/20/gkpj)
.venv/lib/python3.12/site-packages/mlflow/store/db/utils.py:194: MlflowException