### Bug description
I was playing around with the `LightningCLI` and found out that it still works even if `config.yaml` contains invalid data types. For example, `max_epochs` for `Trainer` should be an `int`, yet fitting still succeeds with a `str` in the `.yaml`. In the MWE below, `config.yaml` contains a `str` for both `seed_everything` and `max_epochs`. This is also evident when reading back the `config.yaml` file:
```python
import yaml

with open('config.yaml', 'r') as fhand:
    # safe_load: plain yaml.load() without a Loader argument raises in PyYAML 6
    data = yaml.safe_load(fhand)

print(data)
# {'seed_everything': '1042', 'trainer': {'max_epochs': '2'}}
```
**Note:** I am not sure if this is really a bug, since `LightningCLI` might convert the given values to the correct types based on the type hints. However, I couldn't confirm whether that is actually the case.
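For context, part of this behavior comes from YAML itself: quoted scalars always load as `str`, while unquoted numerals load as `int`. A minimal sketch with PyYAML (which Lightning already depends on) showing the difference:

```python
import yaml

# Quoted scalars are parsed as str, unquoted numerals as int.
doc = """
seed_everything: "1042"   # quoted   -> str
max_epochs: 2             # unquoted -> int
"""
data = yaml.safe_load(doc)

print(type(data["seed_everything"]))  # <class 'str'>
print(type(data["max_epochs"]))       # <class 'int'>
```

So the quoted values in the MWE really do arrive at the CLI as strings; the question is whether the CLI should coerce or reject them.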
### What version are you seeing the problem on?
v2.4
### How to reproduce the bug
```python
# main.py
from lightning.pytorch.cli import LightningCLI

# simple demo classes for your convenience
from lightning.pytorch.demos.boring_classes import DemoModel, BoringDataModule


def cli_main():
    cli = LightningCLI(DemoModel, BoringDataModule)
    # note: don't call fit!!


# note: it is good practice to implement the CLI in a function and call it in the main if block
if __name__ == "__main__":
    cli_main()
```
```yaml
# config.yaml
seed_everything: "1042"
trainer:
  max_epochs: "2"
```
Now from the CLI:
```shell
python main.py fit --config=config.yaml
```
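For comparison, the strict behavior I expected would reject the config up front instead of running. A stdlib-only sketch of such a check (purely illustrative — this is not how `LightningCLI`/`jsonargparse` validate; `check_types` and `EXPECTED` are hypothetical names):

```python
# Illustrative strict type check: reject config values whose runtime type
# does not match the annotated type. Hypothetical helper, not a Lightning API.
def check_types(config: dict, expected: dict) -> None:
    for key, typ in expected.items():
        value = config[key]
        if not isinstance(value, typ):
            raise TypeError(
                f"{key!r} should be {typ.__name__}, "
                f"got {type(value).__name__}: {value!r}"
            )


EXPECTED = {"seed_everything": int, "max_epochs": int}

# Values as they come out of the MWE's config.yaml (both str).
config = {"seed_everything": "1042", "max_epochs": "2"}

try:
    check_types(config, EXPECTED)
except TypeError as err:
    print(err)  # 'seed_everything' should be int, got str: '1042'
```

With a check like this, the quoted values would fail fast rather than silently succeeding.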
### Error messages and logs
```
(aidsorb) [ansar@mofinium ligthning_bug]$ python main.py fit --config=config.yaml
Seed set to 1042
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/home/ansar/venvir/aidsorb/lib64/python3.11/site-packages/lightning/pytorch/trainer/configuration_validator.py:68: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
[rank: 1] Seed set to 1042
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]

  | Name | Type   | Params | Mode
0 | l1   | Linear | 330    | train

330       Trainable params
0         Non-trainable params
330       Total params
0.001     Total estimated model params size (MB)
1         Modules in train mode
0         Modules in eval mode
/home/ansar/venvir/aidsorb/lib64/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument to `num_workers=9` in the `DataLoader` to improve performance.
/home/ansar/venvir/aidsorb/lib64/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py:298: The number of training batches (32) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
Epoch 1: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:00<00:00, 1100.86it/s, v_num=3]
`Trainer.fit` stopped: `max_epochs=2` reached.
Epoch 1: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:00<00:00, 1045.92it/s, v_num
```
### Environment
<details>
<summary>Current environment</summary>
* CUDA:
- GPU:
- Quadro RTX 4000
- Quadro RTX 4000
- available: True
- version: 12.1
* Lightning:
- lightning: 2.4.0
- lightning-utilities: 0.11.7
- pytorch-lightning: 2.4.0
- torch: 2.4.1
- torchmetrics: 1.4.3
- torchvision: 0.19.1
* Packages:
- absl-py: 2.1.0
- aidsorb: 1.0.0
- aiohappyeyeballs: 2.4.3
- aiohttp: 3.10.9
- aiosignal: 1.3.1
- ase: 3.23.0
- attrs: 24.2.0
- contourpy: 1.3.0
- cycler: 0.12.1
- docstring-parser: 0.16
- filelock: 3.16.1
- fire: 0.7.0
- fonttools: 4.54.1
- frozenlist: 1.4.1
- fsspec: 2024.9.0
- grpcio: 1.66.2
- idna: 3.10
- importlib-resources: 6.4.5
- jinja2: 3.1.4
- jsonargparse: 4.33.2
- kiwisolver: 1.4.7
- lightning: 2.4.0
- lightning-utilities: 0.11.7
- markdown: 3.7
- markupsafe: 3.0.1
- matplotlib: 3.9.2
- mpmath: 1.3.0
- multidict: 6.1.0
- networkx: 3.3
- numpy: 1.26.4
- nvidia-cublas-cu12: 12.1.3.1
- nvidia-cuda-cupti-cu12: 12.1.105
- nvidia-cuda-nvrtc-cu12: 12.1.105
- nvidia-cuda-runtime-cu12: 12.1.105
- nvidia-cudnn-cu12: 9.1.0.70
- nvidia-cufft-cu12: 11.0.2.54
- nvidia-curand-cu12: 10.3.2.106
- nvidia-cusolver-cu12: 11.4.5.107
- nvidia-cusparse-cu12: 12.1.0.106
- nvidia-nccl-cu12: 2.20.5
- nvidia-nvjitlink-cu12: 12.6.77
- nvidia-nvtx-cu12: 12.1.105
- packaging: 24.1
- pandas: 2.2.3
- pillow: 10.4.0
- pip: 24.2
- plotly: 5.24.1
- propcache: 0.2.0
- protobuf: 5.28.2
- pyparsing: 3.1.4
- python-dateutil: 2.9.0.post0
- pytorch-lightning: 2.4.0
- pytz: 2024.2
- pyyaml: 6.0.2
- scipy: 1.14.1
- setuptools: 65.5.1
- six: 1.16.0
- sympy: 1.13.3
- tenacity: 9.0.0
- tensorboard: 2.18.0
- tensorboard-data-server: 0.7.2
- termcolor: 2.5.0
- torch: 2.4.1
- torchmetrics: 1.4.3
- torchvision: 0.19.1
- tqdm: 4.66.5
- triton: 3.0.0
- typeshed-client: 2.7.0
- typing-extensions: 4.12.2
- tzdata: 2024.2
- werkzeug: 3.0.4
- yarl: 1.14.0
* System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.11.7
- release: 5.14.0-427.16.1.el9_4.x86_64
- version: #1 SMP PREEMPT_DYNAMIC Wed May 8 17:48:14 UTC 2024
</details>
### More info
_No response_