
Commit c8799c1

mauvilsa, SkafteNicki, and Borda authored and committed
Fix LightningCLI loading of hyperparameters from ckpt_path failing for subclass model mode (#21246)
* Fix LightningCLI loading of hyperparameters from ckpt_path failing for subclass model mode
* Changelog pull number
* Update src/lightning/pytorch/cli.py

Co-authored-by: Nicki Skafte Detlefsen <[email protected]>

---------

Co-authored-by: Nicki Skafte Detlefsen <[email protected]>
Co-authored-by: Jirka Borovec <[email protected]>
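For context, the fix targets `LightningCLI` runs configured in subclass model mode, where a checkpoint passed via `--ckpt_path` must have its hyperparameters loaded back. The snippet below is a minimal sketch of such a setup, not code from this commit; the `MyModel` subclass, its `hidden_dim` argument, and the command lines are illustrative assumptions.

```python
# Minimal sketch (assumptions noted above) of a LightningCLI in subclass model mode.
from lightning.pytorch.cli import LightningCLI
from lightning.pytorch.demos.boring_classes import BoringDataModule, BoringModel


class MyModel(BoringModel):  # hypothetical subclass selected via `--model=MyModel`
    def __init__(self, hidden_dim: int = 32):
        super().__init__()
        self.save_hyperparameters()  # stored in the checkpoint under "hyper_parameters"


def cli_main():
    # subclass_mode_model=True lets the CLI instantiate any BoringModel subclass;
    # with `--ckpt_path`, hyperparameters should be recovered from the checkpoint.
    LightningCLI(BoringModel, BoringDataModule, subclass_mode_model=True)


if __name__ == "__main__":
    cli_main()
```

A typical sequence would be `python main.py fit --model=MyModel` followed by `python main.py test --model=MyModel --ckpt_path=path/to/last.ckpt`, which is the kind of invocation where the hyperparameter loading previously failed.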

File tree

710 files changed: +138528 -0 lines changed


.actions/assistant.py

Lines changed: 485 additions & 0 deletions (large diff not rendered by default)

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
#!/bin/bash

# Run this script from the project root.
URL="https://pl-public-data.s3.amazonaws.com/legacy/checkpoints.zip"
mkdir -p tests/legacy
# wget is simpler but does not work on Windows
python -c "from urllib.request import urlretrieve; urlretrieve('$URL', 'tests/legacy/checkpoints.zip')"
ls -l tests/legacy/

unzip -o tests/legacy/checkpoints.zip -d tests/legacy/
ls -l tests/legacy/checkpoints/

.actions/requirements.txt

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
jsonargparse
requests
packaging

.azure/README.md

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
# Creating a GPU self-hosted agent pool

## Prepare the machine

This is a slightly modified version of the script from
https://docs.microsoft.com/en-us/azure/devops/pipelines/agents/docker

```bash
apt-get update
apt-get install -y --no-install-recommends \
  ca-certificates \
  curl \
  jq \
  git \
  iputils-ping \
  libcurl4 \
  libunwind8 \
  netcat \
  libssl1.0

curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
mkdir /azp
```

## Starting the agents

```bash
export TARGETARCH=linux-x64
export AZP_URL="https://dev.azure.com/Lightning-AI"
export AZP_TOKEN="xxxxxxxxxxxxxxxxxxxxxxxxxx"
export AZP_POOL="lit-rtx-3090"

for i in {0..7..2}
do
  nohup bash .azure/start.sh \
    "AZP_AGENT_NAME=litGPU-YX_$i,$((i+1))" \
    "CUDA_VISIBLE_DEVICES=$i,$((i+1))" \
    > "agent-$i.log" &
done
```

## Check running agents

```bash
ps aux | grep start.sh
```

# Machine maintenance

Since most of our jobs/checks run in a Docker container, the OS/machine can become polluted and fail with errors such as:

```
No space left on device : '/azp/agent-litGPU-21_0,1/_diag/pages/8bb191f4-a8c2-419a-8788-66e3f0522bea_1.log'
```

In such cases, you need to log in to the machine and run `docker system prune`.

## Automated ways

Let's explore adding a cron job for periodically removing all Docker caches:

1. Open your user's crontab for editing: `crontab -e`
1. Schedule/add the command with the `--force` flag to force pruning without interactive confirmation:
   ```bash
   # every day at 2:00 AM clean docker caches
   0 2 * * * docker system prune --force
   ```
1. Verify the entry: `crontab -l`

Note: You may need to add yourself to the Docker group by running `sudo usermod -aG docker <your_username>` to have permission to execute this command without needing `sudo` and entering the password.

.azure/gpu-benchmarks.yml

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
# Python package
# Create and test a Python package on multiple Python versions.
# Add steps that analyze code, save the dist with the build record, publish to a PyPI-compatible index, and more:
# https://docs.microsoft.com/azure/devops/pipelines/languages/python

trigger:
  tags:
    include: ["*"]
  branches:
    include:
      - "master"
      - "release/*"
      - "refs/tags/*"

pr:
  branches:
    include:
      - "master"
      - "release/*"
  paths:
    include:
      - ".azure/gpu-benchmarks.yml"
      - "requirements/fabric/**"
      - "requirements/pytorch/**"
      - "src/lightning/fabric/**"
      - "src/lightning/pytorch/**"
      - "tests/parity_fabric/**"
      - "tests/parity_pytorch/**"
    exclude:
      - "requirements/*/docs.txt"
      - "*.md"
      - "**/*.md"

schedules:
  - cron: "0 0 * * *" # At the end of every day
    displayName: Daily midnight benchmark
    branches:
      include:
        - "master"

jobs:
  - job: benchmarks
    timeoutInMinutes: "90"
    cancelTimeoutInMinutes: "2"
    pool: lit-rtx-3090
    variables:
      DEVICES: $( python -c 'print("$(Agent.Name)".split("_")[-1])' )
    container:
      image: "pytorchlightning/pytorch_lightning:base-cuda12.6.3-py3.12-torch2.8"
      options: "--gpus=all --shm-size=32g"
    strategy:
      matrix:
        "pkg: Fabric":
          PACKAGE_NAME: "fabric"
        "pkg: Pytorch":
          PACKAGE_NAME: "pytorch"
    workspace:
      clean: all

    steps:
      - bash: |
          echo "##vso[task.setvariable variable=CUDA_VISIBLE_DEVICES]$(DEVICES)"
          cuda_ver=$(python -c "import torch ; print(''.join(map(str, torch.version.cuda.split('.')[:2])))")
          echo "##vso[task.setvariable variable=TORCH_URL]https://download.pytorch.org/whl/cu${cuda_ver}/torch_stable.html"
        displayName: "set env. vars"

      - bash: |
          echo $CUDA_VISIBLE_DEVICES
          echo $TORCH_URL
          whereis nvidia
          nvidia-smi
          which python && which pip
          python --version
          pip --version
          pip list
        displayName: "Image info & NVIDIA"

      - bash: |
          pip install -U -q -r .actions/requirements.txt
          python .actions/assistant.py copy_replace_imports --source_dir="./tests" \
            --source_import="lightning.fabric,lightning.pytorch" \
            --target_import="lightning_fabric,pytorch_lightning"
        displayName: "Adjust tests"

      - bash: pip install -e .[dev] --find-links ${TORCH_URL}
        env:
          FREEZE_REQUIREMENTS: "1"
        displayName: "Install package"

      - bash: |
          set -e
          python requirements/collect_env_details.py
          python -c "import torch ; mgpu = torch.cuda.device_count() ; assert mgpu == 2, f'GPU: {mgpu}'"
        displayName: "Env details"

      - bash: python -m pytest parity_$(PACKAGE_NAME) -v --durations=0
        env:
          PL_RUNNING_BENCHMARKS: "1"
          RUN_ONLY_CUDA_TESTS: "1"
        workingDirectory: tests/
        displayName: "Testing: benchmarks"

      - bash: |
          bash run_standalone_tasks.sh cpu
          bash run_standalone_tasks.sh cuda
        workingDirectory: tests/parity_fabric
        # without succeeded this could run even if the job has already failed
        condition: and(succeeded(), eq(variables['PACKAGE_NAME'], 'fabric'))
        env:
          RUN_ONLY_CUDA_TESTS: "1"
          PL_RUN_STANDALONE_TESTS: "1"
        displayName: "Testing: fabric standalone tasks"
        timeoutInMinutes: "10"
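The `DEVICES` variable above relies on the agent naming convention from `.azure/README.md` (`AZP_AGENT_NAME=litGPU-YX_$i,$((i+1))`): the GPU pair is taken from the text after the last underscore in the agent name and then exported as `CUDA_VISIBLE_DEVICES`. A minimal sketch of that parsing, assuming agent names of that shape:

```python
# Sketch of what the DEVICES expression resolves to for an agent named as in the README.
def devices_from_agent_name(agent_name: str) -> str:
    """Return the GPU index pair encoded after the last underscore, e.g. '0,1'."""
    return agent_name.split("_")[-1]


assert devices_from_agent_name("litGPU-YX_0,1") == "0,1"
assert devices_from_agent_name("litGPU-YX_6,7") == "6,7"
# The first pipeline step exports this value as CUDA_VISIBLE_DEVICES inside the container.
```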

.azure/gpu-tests-fabric.yml

Lines changed: 189 additions & 0 deletions
@@ -0,0 +1,189 @@
# Python package
# Create and test a Python package on multiple Python versions.
# Add steps that analyze code, save the dist with the build record, publish to a PyPI-compatible index, and more:
# https://docs.microsoft.com/azure/devops/pipelines/languages/python

trigger:
  tags:
    include: ["*"]
  branches:
    include:
      - "master"
      - "release/*"
      - "refs/tags/*"

pr:
  branches:
    include:
      - "master"
      - "release/*"
  paths:
    include:
      - ".actions/*"
      - ".azure/gpu-tests-fabric.yml"
      - "examples/fabric/**"
      - "examples/run_fabric_examples.sh"
      - "tests/run_standalone_*.sh"
      - "requirements/fabric/**"
      - "src/lightning/__init__.py"
      - "src/lightning/__setup__.py"
      - "src/lightning/__version__.py"
      - "src/lightning/fabric/**"
      - "src/lightning_fabric/*"
      - "tests/tests_fabric/**"
      - "pyproject.toml" # includes pytest config
    exclude:
      - "requirements/*/docs.txt"
      - "*.md"
      - "**/*.md"

jobs:
  - job: testing
    # how long to run the job before automatically cancelling
    timeoutInMinutes: "20"
    # how much time to give 'run always even if cancelled tasks' before stopping them
    cancelTimeoutInMinutes: "2"
    pool: lit-rtx-3090
    variables:
      DEVICES: $( python -c 'print("$(Agent.Name)".split("_")[-1])' )
      FREEZE_REQUIREMENTS: "1"
      PIP_CACHE_DIR: "/var/tmp/pip"
      RUN_ONLY_CUDA_TESTS: "1"
    container:
      image: $(image)
      # default shm size is 64m. Increase it to avoid:
      # 'Error while creating shared memory: unhandled system error, NCCL version 2.7.8'
      options: "--gpus=all --shm-size=2gb -v /var/tmp:/var/tmp"
    strategy:
      matrix:
        "Fabric | oldest":
          image: "pytorchlightning/pytorch_lightning:base-cuda12.1.1-py3.10-torch2.1"
          PACKAGE_NAME: "fabric"
        "Fabric | latest":
          image: "pytorchlightning/pytorch_lightning:base-cuda12.6.3-py3.12-torch2.8"
          PACKAGE_NAME: "fabric"
        #"Fabric | future":
        #  image: "pytorchlightning/pytorch_lightning:base-cuda12.6.3-py3.12-torch2.7"
        #  PACKAGE_NAME: "fabric"
        "Lightning | latest":
          image: "pytorchlightning/pytorch_lightning:base-cuda12.6.3-py3.12-torch2.8"
          PACKAGE_NAME: "lightning"
    workspace:
      clean: all
    steps:
      - bash: |
          echo "##vso[task.setvariable variable=CUDA_VISIBLE_DEVICES]$(DEVICES)"
          cuda_ver=$(python -c "import torch ; print(''.join(map(str, torch.version.cuda.split('.')[:2])))")
          echo "##vso[task.setvariable variable=CUDA_VERSION_MM]$cuda_ver"
          echo "##vso[task.setvariable variable=TORCH_URL]https://download.pytorch.org/whl/cu${cuda_ver}/torch_stable.html"
          scope=$(python -c 'n = "$(PACKAGE_NAME)" ; print(dict(fabric="lightning_fabric").get(n, n))')
          echo "##vso[task.setvariable variable=COVERAGE_SOURCE]$scope"
        displayName: "set env. vars"
      - bash: |
          echo "##vso[task.setvariable variable=TORCH_URL]https://download.pytorch.org/whl/test/cu${CUDA_VERSION_MM}"
        condition: endsWith(variables['Agent.JobName'], 'future')
        displayName: "extend env. vars 4 future"

      - bash: |
          echo $(DEVICES)
          echo $CUDA_VISIBLE_DEVICES
          echo $CUDA_VERSION_MM
          echo $TORCH_URL
          echo $COVERAGE_SOURCE
          whereis nvidia
          nvidia-smi
          which python && which pip
          python --version
          pip --version
          pip list
        displayName: "Image info & NVIDIA"

      - bash: |
          set -ex
          pip install "cython<3.0" wheel # for compatibility
          pip install -U "lightning-utilities[cli]"
          cd requirements/fabric
          # replace range by pin minimal requirements
          python -m lightning_utilities.cli requirements set-oldest --req_files "['base.txt', 'strategies.txt']"
          # drop deepspeed since it is not supported by our minimal Torch requirements
          python -m lightning_utilities.cli requirements prune-pkgs --packages deepspeed --req_files strategies.txt
          # uninstall deepspeed since some older docker images have it pre-installed
          pip uninstall -y deepspeed
        condition: contains(variables['Agent.JobName'], 'oldest')
        displayName: "setting oldest dependencies"

      - bash: |
          PYTORCH_VERSION=$(python -c "import torch; print(torch.__version__.split('+')[0])")
          pip install -q wget packaging
          python -m wget https://raw.githubusercontent.com/Lightning-AI/utilities/main/scripts/adjust-torch-versions.py
          for fpath in `ls requirements/**/*.txt`; do \
            python ./adjust-torch-versions.py $fpath ${PYTORCH_VERSION}; \
          done
        displayName: "Adjust dependencies"

      - bash: |
          pip install -U -q -r .actions/requirements.txt
          python .actions/assistant.py copy_replace_imports --source_dir="./tests/tests_fabric" \
            --source_import="lightning.fabric" \
            --target_import="lightning_fabric"
          python .actions/assistant.py copy_replace_imports --source_dir="./examples/fabric" \
            --source_import="lightning.fabric" \
            --target_import="lightning_fabric"
        # without succeeded this could run even if the job has already failed
        condition: and(succeeded(), eq(variables['PACKAGE_NAME'], 'fabric'))
        displayName: "Adjust tests & examples"

      - bash: |
          set -e
          extra=$(python -c "print({'lightning': 'fabric-'}.get('$(PACKAGE_NAME)', ''))")
          pip install -e ".[${extra}dev]" -U --upgrade-strategy=eager --extra-index-url="${TORCH_URL}"
        displayName: "Install package & dependencies"

      - bash: |
          set -e
          python requirements/collect_env_details.py
          python -c "import torch ; mgpu = torch.cuda.device_count() ; assert mgpu == 2, f'GPU: {mgpu}'"
          python requirements/pytorch/check-avail-extras.py
          python -c "import bitsandbytes"
        displayName: "Env details"

      - bash: python -m pytest lightning_fabric
        workingDirectory: src
        # without succeeded this could run even if the job has already failed
        condition: and(succeeded(), eq(variables['PACKAGE_NAME'], 'fabric'))
        displayName: "Testing: Fabric doctests"

      - bash: python -m coverage run --source ${COVERAGE_SOURCE} -m pytest tests_fabric/ -v --durations=50
        workingDirectory: tests/
        displayName: "Testing: fabric standard"
        timeoutInMinutes: "10"

      - bash: |
          wget https://raw.githubusercontent.com/Lightning-AI/utilities/main/scripts/run_standalone_tests.sh
          bash ./run_standalone_tests.sh "tests_fabric"
        workingDirectory: tests/
        env:
          PL_RUN_STANDALONE_TESTS: "1"
        displayName: "Testing: fabric standalone"
        timeoutInMinutes: "10"

      - bash: |
          python -m coverage report
          python -m coverage xml
          python -m coverage html

          # https://docs.codecov.com/docs/codecov-uploader
          curl -Os https://uploader.codecov.io/latest/linux/codecov
          chmod +x codecov
          ./codecov --token=$(CODECOV_TOKEN) --commit=$(Build.SourceVersion) \
            --flags=gpu,pytest,${COVERAGE_SOURCE} --name="GPU-coverage" --env=linux,azure
          ls -l
        workingDirectory: tests/
        displayName: "Statistics"

      - script: |
          set -e
          bash run_fabric_examples.sh --accelerator=cuda --devices=1
          bash run_fabric_examples.sh --accelerator=cuda --devices=2 --strategy ddp
        workingDirectory: examples/
        displayName: "Testing: fabric examples"
