Skip to content

MLflow error when attempting to log model at same step with logged metrics #125

@ealt

Description

@ealt

When logging a registered model via mlflow.pytorch.log_model, MLflow tries to re-log metrics for the model (log_model_metrics_for_step) at the given step (default step=0), and if those metrics have already been logged at that step (which training.py does) then the new metrics will have the same key, ("key", value, timestamp, step, is_nan, run_uuid), causing a unique key violation in the sqlite3 backend.

I am not exactly sure why this wasn't an issue when the e2e test was originally created and merged into main (the original version of the e2e test still logged train/loss and eval/loss metrics including at step 0) and why it came up when working on the metrics tracker features (possibly due to using an updated version of the mlflow library). Also not sure if this is just an issue with a sqlite3 backend or if other backends are affected.

This seems like a bug with the mlflow library, but have not been able to put together a simple example to reproduce the issue. For now have been able to avoid the issue by passing step=num_steps + 1 as an argument to log_model which is guaranteed not to have any associated metrics, but this is just a hack that prevents any metrics from being associated with the logged model.

tests/end_to_end/training.py:171: in train
    persister.save_model_to_registry(predictive_model, registered_model_name, model_inputs=sample_inputs)
simplexity/persistence/mlflow_persister.py:289: in save_model_to_registry
    model_info = mlflow_pytorch.log_model(**log_kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/mlflow/pytorch/__init__.py:288: in log_model
    return Model.log(
.venv/lib/python3.12/site-packages/mlflow/models/model.py:1195: in log
    log_model_metrics_for_step(
.venv/lib/python3.12/site-packages/mlflow/models/model.py:1139: in log_model_metrics_for_step
    client.log_batch(run_id=run_id, metrics=metrics_for_step)
.venv/lib/python3.12/site-packages/mlflow/tracking/client.py:2455: in log_batch
    return self._tracking_client.log_batch(
.venv/lib/python3.12/site-packages/mlflow/telemetry/track.py:30: in wrapper
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/mlflow/tracking/_tracking_service/client.py:581: in log_batch
    self.store.log_batch(run_id=run_id, metrics=metrics_batch, params=[], tags=[])
.venv/lib/python3.12/site-packages/mlflow/store/tracking/sqlalchemy_store.py:1601: in log_batch
    raise e
.venv/lib/python3.12/site-packages/mlflow/store/tracking/sqlalchemy_store.py:1597: in log_batch
    self._log_metrics(run_id, metrics)
.venv/lib/python3.12/site-packages/mlflow/store/tracking/sqlalchemy_store.py:839: in _log_metrics
    with self.ManagedSessionMaker() as session:
../../miniconda3/lib/python3.12/contextlib.py:158: in __exit__
    self.gen.throw(value)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    @contextmanager
    def make_managed_session():
        """Provide a transactional scope around a series of operations."""
        with SessionMaker() as session:
            try:
                if db_type == SQLITE:
                    session.execute(sql.text("PRAGMA foreign_keys = ON;"))
                    session.execute(sql.text("PRAGMA busy_timeout = 20000;"))
                    session.execute(sql.text("PRAGMA case_sensitive_like = true;"))
                yield session
                session.commit()
            except MlflowException:
                session.rollback()
                raise
            except sqlalchemy.exc.OperationalError as e:
                session.rollback()
                _logger.exception(
                    "SQLAlchemy database error. The following exception is caught.\n%s",
                    e,
                )
                raise MlflowException(message=e, error_code=TEMPORARILY_UNAVAILABLE) from e
            except sqlalchemy.exc.SQLAlchemyError as e:
                session.rollback()
>               raise MlflowException(message=e, error_code=BAD_REQUEST) from e
E               mlflow.exceptions.MlflowException: (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely)
E               (sqlite3.IntegrityError) UNIQUE constraint failed: metrics.key, metrics.timestamp, metrics.step, metrics.run_uuid, metrics.value, metrics.is_nan
E               [SQL: INSERT INTO metrics ("key", value, timestamp, step, is_nan, run_uuid) VALUES (?, ?, ?, ?, ?, ?)]
E               [parameters: ('loss/progress_to_optimal', 0.0, 1764723434073, 0, 1, 'ce154c44322544d8bb5e5a3b100f129e')]
E               (Background on this error at: https://sqlalche.me/e/20/gkpj)

.venv/lib/python3.12/site-packages/mlflow/store/db/utils.py:194: MlflowException

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions