Closed
Labels
bug (Something isn't working), needs triage (Waiting to be triaged by maintainers), ver: 2.5.x
Description
Bug description
I'm hitting a bug when using MLFlowLogger with Lightning on Databricks.
Because of how Databricks runs workflow tasks from a git source, the working directory of the code may not be writable.
The offending code is quoted below. I think it is unnecessary to pass the prefix, suffix and dir arguments to TemporaryDirectory; without them, tmp_dir will be safely created under the /tmp path.
pytorch-lightning/src/lightning/pytorch/loggers/mlflow.py
Lines 372 to 382 in 1278308
with tempfile.TemporaryDirectory(prefix="test", suffix="test", dir=os.getcwd()) as tmp_dir:
    # Log the metadata
    with open(f"{tmp_dir}/metadata.yaml", "w") as tmp_file_metadata:
        yaml.dump(metadata, tmp_file_metadata, default_flow_style=False)
    # Log the aliases
    with open(f"{tmp_dir}/aliases.txt", "w") as tmp_file_aliases:
        tmp_file_aliases.write(str(aliases))
    # Log the metadata and aliases
    self.experiment.log_artifacts(self._run_id, tmp_dir, artifact_path)
The fix is quite simple. If it is ok, I will proceed with a PR.
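A minimal sketch of the proposed change, assuming the only modification is dropping the prefix, suffix and dir arguments. The helper name stage_and_log and its log_artifacts callback are hypothetical stand-ins for the quoted block and for self.experiment.log_artifacts; this is an illustration, not the actual patch.

```python
import os
import tempfile


def stage_and_log(files, log_artifacts):
    """Write `files` (name -> text) into a temporary directory and log it.

    Hypothetical stand-in for the block quoted above; `log_artifacts` plays
    the role of `self.experiment.log_artifacts`.
    """
    # No prefix/suffix/dir: tempfile uses its default location
    # (tempfile.gettempdir(), typically /tmp on Linux) instead of os.getcwd().
    with tempfile.TemporaryDirectory() as tmp_dir:
        for name, text in files.items():
            with open(os.path.join(tmp_dir, name), "w") as handle:
                handle.write(text)
        log_artifacts(tmp_dir)
    return tmp_dir


# Usage: record which files the callback sees while the directory exists.
seen = []
path = stage_and_log(
    {"metadata.yaml": "score: 0.1\n", "aliases.txt": "['latest']"},
    lambda directory: seen.append(sorted(os.listdir(directory))),
)
print(path, seen)
```

With this shape the staging directory no longer depends on the working directory being writable, and it is still removed when the with block exits.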
What version are you seeing the problem on?
v2.5
How to reproduce the bug
I basically used the quick start and added MLFlowLogger.
# example.py
import os

import lightning as L
import torch
import torch.nn.functional as F
from torch import nn
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import MNIST


class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, x):
        return self.l1(x)


class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))

    def forward(self, x):
        return self.l1(x)


class LitAutoEncoder(L.LightningModule):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def training_step(self, batch, batch_idx):
        # training_step defines the train loop.
        x, _ = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer


dataset = MNIST(os.getcwd(), transform=transforms.ToTensor())
train_loader = DataLoader(dataset)

# model
autoencoder = LitAutoEncoder(Encoder(), Decoder())

# train model
mlflow_logger = L.pytorch.loggers.mlflow.MLFlowLogger(log_model=True)
trainer = L.Trainer(logger=mlflow_logger, max_steps=50)
trainer.fit(model=autoencoder, train_dataloaders=train_loader)
# download.py
import os
from torchvision.datasets import MNIST
MNIST(os.getcwd(), download=True)
# Dockerfile
FROM python:3.13 AS base

# set up user
ARG USER=user \
    UID=1000
RUN useradd --shell /bin/false --uid ${UID} ${USER}

# set up environment
ARG APP_HOME=/work/app
WORKDIR ${APP_HOME}

# set up python
RUN pip install lightning mlflow-skinny torchvision \
    && pip list

# set up project
COPY example.py download.py ./
RUN python download.py \
    && mkdir mlruns \
    && chown ${USER}:${USER} mlruns

USER ${USER}
CMD ["python", "example.py"]
Error messages and logs
Traceback (most recent call last):
File "/usr/local/lib/python3.13/site-packages/lightning/pytorch/trainer/call.py", line 47, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/usr/local/lib/python3.13/site-packages/lightning/pytorch/trainer/trainer.py", line 575, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.13/site-packages/lightning/pytorch/trainer/trainer.py", line 995, in _run
call._call_teardown_hook(self)
~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
File "/usr/local/lib/python3.13/site-packages/lightning/pytorch/trainer/call.py", line 143, in _call_teardown_hook
logger.finalize("success")
~~~~~~~~~~~~~~~^^^^^^^^^^^
File "/usr/local/lib/python3.13/site-packages/lightning_utilities/core/rank_zero.py", line 42, in wrapped_fn
return fn(*args, **kwargs)
File "/usr/local/lib/python3.13/site-packages/lightning/pytorch/loggers/mlflow.py", line 287, in finalize
self._scan_and_log_checkpoints(self._checkpoint_callback)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.13/site-packages/lightning/pytorch/loggers/mlflow.py", line 370, in _scan_and_log_checkpoints
with tempfile.TemporaryDirectory(prefix="test", suffix="test", dir=os.getcwd()) as tmp_dir:
~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.13/tempfile.py", line 882, in __init__
self.name = mkdtemp(suffix, prefix, dir)
~~~~~~~^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.13/tempfile.py", line 384, in mkdtemp
_os.mkdir(file, 0o700)
~~~~~~~~~^^^^^^^^^^^^^
PermissionError: [Errno 13] Permission denied: '/work/app/test1ewl_8q8test'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/work/app/example.py", line 58, in <module>
trainer.fit(model=autoencoder, train_dataloaders=train_loader)
~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.13/site-packages/lightning/pytorch/trainer/trainer.py", line 539, in fit
call._call_and_handle_interrupt(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/usr/local/lib/python3.13/site-packages/lightning/pytorch/trainer/call.py", line 67, in _call_and_handle_interrupt
_interrupt(trainer, exception)
~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.13/site-packages/lightning/pytorch/trainer/call.py", line 81, in _interrupt
logger.finalize("failed")
~~~~~~~~~~~~~~~^^^^^^^^^^
File "/usr/local/lib/python3.13/site-packages/lightning_utilities/core/rank_zero.py", line 42, in wrapped_fn
return fn(*args, **kwargs)
File "/usr/local/lib/python3.13/site-packages/lightning/pytorch/loggers/mlflow.py", line 287, in finalize
self._scan_and_log_checkpoints(self._checkpoint_callback)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.13/site-packages/lightning/pytorch/loggers/mlflow.py", line 370, in _scan_and_log_checkpoints
with tempfile.TemporaryDirectory(prefix="test", suffix="test", dir=os.getcwd()) as tmp_dir:
~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.13/tempfile.py", line 882, in __init__
self.name = mkdtemp(suffix, prefix, dir)
~~~~~~~^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.13/tempfile.py", line 384, in mkdtemp
_os.mkdir(file, 0o700)
~~~~~~~~~^^^^^^^^^^^^^
PermissionError: [Errno 13] Permission denied: '/work/app/testqqet_lo6test'
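A quick way to confirm the cause in an environment like this one is to compare the working directory with the default temporary directory. The sketch below uses only the standard library; the function name describe_temp_locations is hypothetical. In the Docker setup above, where only mlruns is owned by the non-root user, cwd_writable would be expected to be False for /work/app.

```python
import os
import tempfile


def describe_temp_locations(cwd=None):
    """Report whether the working directory and the default temp dir are writable."""
    cwd = cwd or os.getcwd()
    default_tmp = tempfile.gettempdir()
    # Creating an entry in a directory needs write and search (execute) permission.
    needed = os.W_OK | os.X_OK
    return {
        "cwd": cwd,
        "cwd_writable": os.access(cwd, needed),
        "default_tmp": default_tmp,
        "default_tmp_writable": os.access(default_tmp, needed),
    }


for key, value in describe_temp_locations().items():
    print(f"{key}: {value}")
```

If cwd_writable is False while default_tmp_writable is True, the PermissionError above comes from the dir=os.getcwd() argument rather than from MLflow itself.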
Environment
Current environment
- CUDA:
  - GPU: None
  - available: False
  - version: None
- Lightning:
  - lightning: 2.5.0.post0
  - lightning-utilities: 0.14.0
  - pytorch-lightning: 2.5.0.post0
  - torch: 2.6.0
  - torchmetrics: 1.6.3
  - torchvision: 0.21.0
- Packages:
  - aiohappyeyeballs: 2.6.1
  - aiohttp: 3.11.13
  - aiosignal: 1.3.2
  - annotated-types: 0.7.0
  - anyio: 4.8.0
  - attrs: 25.3.0
  - autocommand: 2.2.2
  - backports.tarfile: 1.2.0
  - cachetools: 5.5.2
  - certifi: 2025.1.31
  - charset-normalizer: 3.4.1
  - click: 8.1.8
  - cloudpickle: 3.1.1
  - databricks-sdk: 0.46.0
  - deprecated: 1.2.18
  - fastapi: 0.115.11
  - filelock: 3.18.0
  - frozenlist: 1.5.0
  - fsspec: 2025.3.0
  - gitdb: 4.0.12
  - gitpython: 3.1.44
  - google-auth: 2.38.0
  - h11: 0.14.0
  - idna: 3.10
  - importlib-metadata: 8.6.1
  - inflect: 7.3.1
  - jaraco.collections: 5.1.0
  - jaraco.context: 5.3.0
  - jaraco.functools: 4.0.1
  - jaraco.text: 3.12.1
  - jinja2: 3.1.6
  - lightning: 2.5.0.post0
  - lightning-utilities: 0.14.0
  - markupsafe: 3.0.2
  - mlflow-skinny: 2.21.0
  - more-itertools: 10.3.0
  - mpmath: 1.3.0
  - multidict: 6.1.0
  - networkx: 3.4.2
  - numpy: 2.2.3
  - opentelemetry-api: 1.31.0
  - opentelemetry-sdk: 1.31.0
  - opentelemetry-semantic-conventions: 0.52b0
  - packaging: 24.2
  - pillow: 11.1.0
  - pip: 24.3.1
  - platformdirs: 4.2.2
  - propcache: 0.3.0
  - protobuf: 5.29.3
  - pyasn1: 0.6.1
  - pyasn1-modules: 0.4.1
  - pydantic: 2.10.6
  - pydantic-core: 2.27.2
  - pytorch-lightning: 2.5.0.post0
  - pyyaml: 6.0.2
  - requests: 2.32.3
  - rsa: 4.9
  - setuptools: 76.0.0
  - smmap: 5.0.2
  - sniffio: 1.3.1
  - sqlparse: 0.5.3
  - starlette: 0.46.1
  - sympy: 1.13.1
  - tomli: 2.0.1
  - torch: 2.6.0
  - torchmetrics: 1.6.3
  - torchvision: 0.21.0
  - tqdm: 4.67.1
  - typeguard: 4.3.0
  - typing-extensions: 4.12.2
  - urllib3: 2.3.0
  - uvicorn: 0.34.0
  - wheel: 0.43.0
  - wrapt: 1.17.2
  - yarl: 1.18.3
  - zipp: 3.21.0
- System:
  - OS: Linux
  - architecture:
    - 64bit
    - ELF
  - processor:
  - python: 3.13.2
  - release: 6.11.3-200.fc40.aarch64
  - version: #1 SMP PREEMPT_DYNAMIC Thu Oct 10 22:53:48 UTC 2024
More info
No response