MLFlowLogger assumes cwd is writable #20641

@yxtay

Bug description

I'm facing a bug when using MLFlowLogger with Lightning on Databricks.

Due to how Databricks manages git source workflow tasks, the working directory of the code may not be writable.

The offending code is here. I think it is unnecessary to pass the prefix, suffix and dir arguments to TemporaryDirectory; without them, tmp_dir is created safely under the system temporary directory (typically /tmp).

with tempfile.TemporaryDirectory(prefix="test", suffix="test", dir=os.getcwd()) as tmp_dir:
    # Log the metadata
    with open(f"{tmp_dir}/metadata.yaml", "w") as tmp_file_metadata:
        yaml.dump(metadata, tmp_file_metadata, default_flow_style=False)
    # Log the aliases
    with open(f"{tmp_dir}/aliases.txt", "w") as tmp_file_aliases:
        tmp_file_aliases.write(str(aliases))
    # Log the metadata and aliases
    self.experiment.log_artifacts(self._run_id, tmp_dir, artifact_path)
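
For illustration, here is a minimal standard-library sketch of what the proposed change might look like. It is not the actual patch: `log_checkpoint_sidecars` and the `log_artifacts` callback are hypothetical names standing in for the logger method and for `self.experiment.log_artifacts(...)`, and the metadata is written as pre-rendered text so the sketch does not need `yaml`.

```python
import os
import tempfile


def log_checkpoint_sidecars(metadata_text, aliases, log_artifacts):
    """Write the sidecar files to a temp dir and pass that dir to `log_artifacts`.

    `log_artifacts` is a hypothetical callback; in the logger it would be
    self.experiment.log_artifacts(self._run_id, tmp_dir, artifact_path).
    """
    # No prefix/suffix/dir: the directory is created under tempfile.gettempdir(),
    # which checks TMPDIR, TEMP and TMP before platform defaults such as /tmp
    # (and only falls back to the current directory as a last resort).
    with tempfile.TemporaryDirectory() as tmp_dir:
        with open(os.path.join(tmp_dir, "metadata.yaml"), "w") as tmp_file_metadata:
            tmp_file_metadata.write(metadata_text)
        with open(os.path.join(tmp_dir, "aliases.txt"), "w") as tmp_file_aliases:
            tmp_file_aliases.write(str(aliases))
        log_artifacts(tmp_dir)
        return tmp_dir


# Usage: record what the callback sees instead of talking to MLflow.
seen = []
tmp = log_checkpoint_sidecars(
    "score: 0.1\n", ["latest"], lambda d: seen.append(sorted(os.listdir(d)))
)
```

Because the directory location then comes from `tempfile.gettempdir()`, users on locked-down platforms could also redirect it with the `TMPDIR` environment variable instead of depending on the working directory.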

The fix is quite simple. If it is ok, I will proceed with a PR.

What version are you seeing the problem on?

v2.5

How to reproduce the bug

I basically used the quick start and added MLFlowLogger.

# example.py

import os

import lightning as L
import torch
import torch.nn.functional as F
from torch import nn
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import MNIST


class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, x):
        return self.l1(x)


class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))

    def forward(self, x):
        return self.l1(x)


class LitAutoEncoder(L.LightningModule):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def training_step(self, batch, batch_idx):
        # training_step defines the train loop.
        x, _ = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer


dataset = MNIST(os.getcwd(), transform=transforms.ToTensor())
train_loader = DataLoader(dataset)
# model
autoencoder = LitAutoEncoder(Encoder(), Decoder())

# train model
mlflow_logger = L.pytorch.loggers.mlflow.MLFlowLogger(log_model=True)
trainer = L.Trainer(logger=mlflow_logger, max_steps=50)
trainer.fit(model=autoencoder, train_dataloaders=train_loader)
# download.py

import os

from torchvision.datasets import MNIST

MNIST(os.getcwd(), download=True)
# Dockerfile

FROM python:3.13 AS base

# set up user
ARG USER=user \
    UID=1000
RUN useradd --shell /bin/false --uid ${UID} ${USER}

# set up environment
ARG APP_HOME=/work/app

WORKDIR ${APP_HOME}

# set up python
RUN pip install lightning mlflow-skinny torchvision \
    && pip list

# set up project
COPY example.py download.py ./
RUN python download.py \
    && mkdir mlruns \
    && chown ${USER}:${USER} mlruns

USER ${USER}
CMD ["python", "example.py"]

Error messages and logs

Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/lightning/pytorch/trainer/call.py", line 47, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.13/site-packages/lightning/pytorch/trainer/trainer.py", line 575, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
    ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/lightning/pytorch/trainer/trainer.py", line 995, in _run
    call._call_teardown_hook(self)
    ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
  File "/usr/local/lib/python3.13/site-packages/lightning/pytorch/trainer/call.py", line 143, in _call_teardown_hook
    logger.finalize("success")
    ~~~~~~~~~~~~~~~^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/lightning_utilities/core/rank_zero.py", line 42, in wrapped_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.13/site-packages/lightning/pytorch/loggers/mlflow.py", line 287, in finalize
    self._scan_and_log_checkpoints(self._checkpoint_callback)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/lightning/pytorch/loggers/mlflow.py", line 370, in _scan_and_log_checkpoints
    with tempfile.TemporaryDirectory(prefix="test", suffix="test", dir=os.getcwd()) as tmp_dir:
         ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/tempfile.py", line 882, in __init__
    self.name = mkdtemp(suffix, prefix, dir)
                ~~~~~~~^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/tempfile.py", line 384, in mkdtemp
    _os.mkdir(file, 0o700)
    ~~~~~~~~~^^^^^^^^^^^^^
PermissionError: [Errno 13] Permission denied: '/work/app/test1ewl_8q8test'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/work/app/example.py", line 58, in <module>
    trainer.fit(model=autoencoder, train_dataloaders=train_loader)
    ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/lightning/pytorch/trainer/trainer.py", line 539, in fit
    call._call_and_handle_interrupt(
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/usr/local/lib/python3.13/site-packages/lightning/pytorch/trainer/call.py", line 67, in _call_and_handle_interrupt
    _interrupt(trainer, exception)
    ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/lightning/pytorch/trainer/call.py", line 81, in _interrupt
    logger.finalize("failed")
    ~~~~~~~~~~~~~~~^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/lightning_utilities/core/rank_zero.py", line 42, in wrapped_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.13/site-packages/lightning/pytorch/loggers/mlflow.py", line 287, in finalize
    self._scan_and_log_checkpoints(self._checkpoint_callback)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/lightning/pytorch/loggers/mlflow.py", line 370, in _scan_and_log_checkpoints
    with tempfile.TemporaryDirectory(prefix="test", suffix="test", dir=os.getcwd()) as tmp_dir:
         ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/tempfile.py", line 882, in __init__
    self.name = mkdtemp(suffix, prefix, dir)
                ~~~~~~~^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/tempfile.py", line 384, in mkdtemp
    _os.mkdir(file, 0o700)
    ~~~~~~~~~^^^^^^^^^^^^^
PermissionError: [Errno 13] Permission denied: '/work/app/testqqet_lo6test'
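
The failing call can likely be isolated without Lightning or MLflow. The sketch below assumes a POSIX system and a non-root user (root bypasses directory permission bits); `can_create_tempdir_in` is a hypothetical helper, and the read-only directory is only a stand-in for `/work/app` in the Dockerfile above.

```python
import os
import stat
import tempfile


def can_create_tempdir_in(path):
    """Return True if the call from the traceback succeeds with dir=path."""
    try:
        with tempfile.TemporaryDirectory(prefix="test", suffix="test", dir=path):
            return True
    except PermissionError:
        return False


# Stand-in for a working directory the current user cannot write to.
parent = tempfile.mkdtemp()
read_only = os.path.join(parent, "app")
os.mkdir(read_only)
os.chmod(read_only, stat.S_IRUSR | stat.S_IXUSR)  # r-x for the owner, no write bit

try:
    in_read_only = can_create_tempdir_in(read_only)
    in_system_tmp = can_create_tempdir_in(tempfile.gettempdir())
finally:
    # Restore write permission so the scratch directories can be removed.
    os.chmod(read_only, stat.S_IRWXU)
    os.rmdir(read_only)
    os.rmdir(parent)

print(in_read_only, in_system_tmp)
```

If the first value is False while the second is True, the problem is the `dir=os.getcwd()` argument rather than anything specific to MLflow.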

Environment

Current environment
  • CUDA:
    - GPU: None
    - available: False
    - version: None
  • Lightning:
    - lightning: 2.5.0.post0
    - lightning-utilities: 0.14.0
    - pytorch-lightning: 2.5.0.post0
    - torch: 2.6.0
    - torchmetrics: 1.6.3
    - torchvision: 0.21.0
  • Packages:
    - aiohappyeyeballs: 2.6.1
    - aiohttp: 3.11.13
    - aiosignal: 1.3.2
    - annotated-types: 0.7.0
    - anyio: 4.8.0
    - attrs: 25.3.0
    - autocommand: 2.2.2
    - backports.tarfile: 1.2.0
    - cachetools: 5.5.2
    - certifi: 2025.1.31
    - charset-normalizer: 3.4.1
    - click: 8.1.8
    - cloudpickle: 3.1.1
    - databricks-sdk: 0.46.0
    - deprecated: 1.2.18
    - fastapi: 0.115.11
    - filelock: 3.18.0
    - frozenlist: 1.5.0
    - fsspec: 2025.3.0
    - gitdb: 4.0.12
    - gitpython: 3.1.44
    - google-auth: 2.38.0
    - h11: 0.14.0
    - idna: 3.10
    - importlib-metadata: 8.6.1
    - inflect: 7.3.1
    - jaraco.collections: 5.1.0
    - jaraco.context: 5.3.0
    - jaraco.functools: 4.0.1
    - jaraco.text: 3.12.1
    - jinja2: 3.1.6
    - lightning: 2.5.0.post0
    - lightning-utilities: 0.14.0
    - markupsafe: 3.0.2
    - mlflow-skinny: 2.21.0
    - more-itertools: 10.3.0
    - mpmath: 1.3.0
    - multidict: 6.1.0
    - networkx: 3.4.2
    - numpy: 2.2.3
    - opentelemetry-api: 1.31.0
    - opentelemetry-sdk: 1.31.0
    - opentelemetry-semantic-conventions: 0.52b0
    - packaging: 24.2
    - pillow: 11.1.0
    - pip: 24.3.1
    - platformdirs: 4.2.2
    - propcache: 0.3.0
    - protobuf: 5.29.3
    - pyasn1: 0.6.1
    - pyasn1-modules: 0.4.1
    - pydantic: 2.10.6
    - pydantic-core: 2.27.2
    - pytorch-lightning: 2.5.0.post0
    - pyyaml: 6.0.2
    - requests: 2.32.3
    - rsa: 4.9
    - setuptools: 76.0.0
    - smmap: 5.0.2
    - sniffio: 1.3.1
    - sqlparse: 0.5.3
    - starlette: 0.46.1
    - sympy: 1.13.1
    - tomli: 2.0.1
    - torch: 2.6.0
    - torchmetrics: 1.6.3
    - torchvision: 0.21.0
    - tqdm: 4.67.1
    - typeguard: 4.3.0
    - typing-extensions: 4.12.2
    - urllib3: 2.3.0
    - uvicorn: 0.34.0
    - wheel: 0.43.0
    - wrapt: 1.17.2
    - yarl: 1.18.3
    - zipp: 3.21.0
  • System:
    - OS: Linux
    - architecture:
        - 64bit
        - ELF
    - processor:
    - python: 3.13.2
    - release: 6.11.3-200.fc40.aarch64
    - version: #1 SMP PREEMPT_DYNAMIC Thu Oct 10 22:53:48 UTC 2024

More info

No response


Labels: bug, needs triage, ver: 2.5.x
