
Commit 8705265

Authored by amorehead and committed with co-authors pre-commit-ci[bot], Borda, and lantiga.

Seed NumPy using `np.random.SeedSequence()` in `pl_worker_init_function()` to robustly seed NumPy-dependent dataloader workers (#20369)

* Update seed.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Update seed.py
* Update seed.py
* Update seed.py

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <[email protected]>
Co-authored-by: Luca Antiga <[email protected]>

0 parents, commit 8705265 (root commit)
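To illustrate why `np.random.SeedSequence` is a robust choice for per-worker seeding, here is a minimal sketch of the idea. This is not Lightning's actual `pl_worker_init_function`; the helper name and signature are hypothetical:

```python
import numpy as np

def seed_numpy_for_worker(base_seed: int, worker_id: int) -> np.random.Generator:
    # SeedSequence mixes its entropy inputs, so nearby (base_seed, worker_id)
    # pairs still yield statistically independent, non-overlapping streams,
    # unlike naive schemes such as np.random.seed(base_seed + worker_id).
    ss = np.random.SeedSequence([base_seed, worker_id])
    return np.random.default_rng(ss)
```

Two workers with the same base seed get different streams, while re-seeding the same worker reproduces its stream exactly.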

File tree

704 files changed: +132496 additions, -0 deletions

.actions/assistant.py

Lines changed: 488 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 11 additions & 0 deletions
```bash
#!/bin/bash

# Run this script from the project root.
URL="https://pl-public-data.s3.amazonaws.com/legacy/checkpoints.zip"
mkdir -p tests/legacy
# wget is simpler but does not work on Windows
python -c "from urllib.request import urlretrieve; urlretrieve('$URL', 'tests/legacy/checkpoints.zip')"
ls -l tests/legacy/

unzip -o tests/legacy/checkpoints.zip -d tests/legacy/
ls -l tests/legacy/checkpoints/
```

.actions/requirements.txt

Lines changed: 3 additions & 0 deletions
```
jsonargparse >=4.16.0, <4.28.0
requests
packaging
```

.azure/README.md

Lines changed: 70 additions & 0 deletions
# Creating a GPU self-hosted agent pool

## Prepare the machine

This is a slightly modified version of the script from
https://docs.microsoft.com/en-us/azure/devops/pipelines/agents/docker

```bash
apt-get update
apt-get install -y --no-install-recommends \
    ca-certificates \
    curl \
    jq \
    git \
    iputils-ping \
    libcurl4 \
    libunwind8 \
    netcat \
    libssl1.0

curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
mkdir /azp
```

## Starting the agents

```bash
export TARGETARCH=linux-x64
export AZP_URL="https://dev.azure.com/Lightning-AI"
export AZP_TOKEN="xxxxxxxxxxxxxxxxxxxxxxxxxx"
export AZP_POOL="lit-rtx-3090"

for i in {0..7..2}
do
    nohup bash .azure/start.sh \
        "AZP_AGENT_NAME=litGPU-YX_$i,$((i+1))" \
        "CUDA_VISIBLE_DEVICES=$i,$((i+1))" \
        > "agent-$i.log" &
done
```

## Check running agents

```bash
ps aux | grep start.sh
```

# Machine maintenance

Since most of our jobs/checks run in Docker containers, the OS/machine can become polluted and fail with errors such as:

```
No space left on device : '/azp/agent-litGPU-21_0,1/_diag/pages/8bb191f4-a8c2-419a-8788-66e3f0522bea_1.log'
```

In such cases, log in to the machine and run `docker system prune`.

## Automated ways

Let's explore adding a cron job for periodically removing all Docker caches:

1. Open your user's crontab for editing: `crontab -e`
1. Schedule/add the command with the `--force` flag to force pruning without interactive confirmation:
   ```bash
   # every day at 2:00 AM clean docker caches
   0 2 * * * docker system prune --force
   ```
1. Verify the entry: `crontab -l`

Note: You may need to add yourself to the Docker group by running `sudo usermod -aG docker <your_username>` so you can run this command without `sudo` and a password prompt.
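The start loop above assigns each agent a pair of adjacent GPUs. As a small sketch of the pairing it produces (the helper name is hypothetical, for illustration only):

```python
def gpu_pairs(num_gpus: int = 8) -> list[tuple[int, int]]:
    # Mirrors `for i in {0..7..2}`: step through GPU indices two at a time,
    # so an 8-GPU machine hosts four agents, each pinned to two GPUs via
    # CUDA_VISIBLE_DEVICES=$i,$((i+1)).
    return [(i, i + 1) for i in range(0, num_gpus, 2)]
```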

.azure/gpu-benchmarks.yml

Lines changed: 110 additions & 0 deletions
```yaml
# Python package
# Create and test a Python package on multiple Python versions.
# Add steps that analyze code, save the dist with the build record, publish to a PyPI-compatible index, and more:
# https://docs.microsoft.com/azure/devops/pipelines/languages/python

trigger:
  tags:
    include: ["*"]
  branches:
    include:
      - "master"
      - "release/*"
      - "refs/tags/*"

pr:
  branches:
    include:
      - "master"
      - "release/*"
  paths:
    include:
      - ".azure/gpu-benchmarks.yml"
      - "requirements/fabric/**"
      - "requirements/pytorch/**"
      - "src/lightning/fabric/**"
      - "src/lightning/pytorch/**"
      - "tests/parity_fabric/**"
      - "tests/parity_pytorch/**"
    exclude:
      - "requirements/*/docs.txt"
      - "*.md"
      - "**/*.md"

schedules:
  - cron: "0 0 * * *" # At the end of every day
    displayName: Daily midnight benchmark
    branches:
      include:
        - "master"

jobs:
  - job: benchmarks
    timeoutInMinutes: "90"
    cancelTimeoutInMinutes: "2"
    pool: lit-rtx-3090
    variables:
      DEVICES: $( python -c 'print("$(Agent.Name)".split("_")[-1])' )
    container:
      image: "pytorchlightning/pytorch_lightning:base-cuda-py3.12-torch2.5-cuda12.1.0"
      options: "--gpus=all --shm-size=32g"
    strategy:
      matrix:
        "pkg: Fabric":
          PACKAGE_NAME: "fabric"
        "pkg: Pytorch":
          PACKAGE_NAME: "pytorch"
    workspace:
      clean: all

    steps:
      - bash: |
          echo "##vso[task.setvariable variable=CUDA_VISIBLE_DEVICES]$(DEVICES)"
          cuda_ver=$(python -c "import torch ; print(''.join(map(str, torch.version.cuda.split('.')[:2])))")
          echo "##vso[task.setvariable variable=TORCH_URL]https://download.pytorch.org/whl/cu${cuda_ver}/torch_stable.html"
        displayName: "set env. vars"

      - bash: |
          echo $CUDA_VISIBLE_DEVICES
          echo $TORCH_URL
          whereis nvidia
          nvidia-smi
          which python && which pip
          python --version
          pip --version
          pip list
        displayName: "Image info & NVIDIA"

      - bash: pip install -e .[dev] --find-links ${TORCH_URL}
        env:
          FREEZE_REQUIREMENTS: "1"
        displayName: "Install package"

      - bash: |
          set -e
          python requirements/collect_env_details.py
          python -c "import torch ; mgpu = torch.cuda.device_count() ; assert mgpu == 2, f'GPU: {mgpu}'"
        displayName: "Env details"

      - bash: |
          pip install -q -r .actions/requirements.txt
          python .actions/assistant.py copy_replace_imports --source_dir="./tests" \
            --source_import="lightning.fabric,lightning.pytorch" \
            --target_import="lightning_fabric,pytorch_lightning"
        displayName: "Adjust tests"

      - bash: python -m pytest parity_$(PACKAGE_NAME) -v --durations=0
        env:
          PL_RUNNING_BENCHMARKS: "1"
          PL_RUN_CUDA_TESTS: "1"
        workingDirectory: tests/
        displayName: "Testing: benchmarks"

      - bash: bash run_standalone_tasks.sh
        workingDirectory: tests/parity_fabric
        # without succeeded this could run even if the job has already failed
        condition: and(succeeded(), eq(variables['PACKAGE_NAME'], 'fabric'))
        env:
          PL_RUN_CUDA_TESTS: "1"
        displayName: "Testing: fabric standalone tasks"
        timeoutInMinutes: "10"
```
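The `DEVICES` variable above extracts the GPU list from the agent name at runtime. In plain Python the inline expression amounts to the following (the helper name is hypothetical):

```python
def devices_from_agent_name(agent_name: str) -> str:
    # Agents are named like "litGPU-YX_0,1" (see .azure/README.md), so the
    # text after the last underscore is the comma-separated GPU list that
    # the pipeline exports as CUDA_VISIBLE_DEVICES.
    return agent_name.split("_")[-1]
```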

.azure/gpu-tests-fabric.yml

Lines changed: 168 additions & 0 deletions
```yaml
# Python package
# Create and test a Python package on multiple Python versions.
# Add steps that analyze code, save the dist with the build record, publish to a PyPI-compatible index, and more:
# https://docs.microsoft.com/azure/devops/pipelines/languages/python

trigger:
  tags:
    include: ["*"]
  branches:
    include:
      - "master"
      - "release/*"
      - "refs/tags/*"

pr:
  branches:
    include:
      - "master"
      - "release/*"
  paths:
    include:
      - ".actions/*"
      - ".azure/gpu-tests-fabric.yml"
      - "examples/fabric/**"
      - "examples/run_fabric_examples.sh"
      - "tests/run_standalone_*.sh"
      - "requirements/fabric/**"
      - "src/lightning/__init__.py"
      - "src/lightning/__setup__.py"
      - "src/lightning/__version__.py"
      - "src/lightning/fabric/**"
      - "src/lightning_fabric/*"
      - "tests/tests_fabric/**"
      - "pyproject.toml" # includes pytest config
    exclude:
      - "requirements/*/docs.txt"
      - "*.md"
      - "**/*.md"

jobs:
  - job: testing
    # how long to run the job before automatically cancelling
    timeoutInMinutes: "20"
    # how much time to give 'run always even if cancelled tasks' before stopping them
    cancelTimeoutInMinutes: "2"
    pool: lit-rtx-3090
    variables:
      DEVICES: $( python -c 'print("$(Agent.Name)".split("_")[-1])' )
      FREEZE_REQUIREMENTS: "1"
      PIP_CACHE_DIR: "/var/tmp/pip"
      PL_RUN_CUDA_TESTS: "1"
    container:
      image: $(image)
      # default shm size is 64m. Increase it to avoid:
      # 'Error while creating shared memory: unhandled system error, NCCL version 2.7.8'
      options: "--gpus=all --shm-size=2gb -v /var/tmp:/var/tmp"
    strategy:
      matrix:
        "Fabric | latest":
          image: "pytorchlightning/pytorch_lightning:base-cuda-py3.11-torch2.3-cuda12.1.0"
          PACKAGE_NAME: "fabric"
        "Lightning | latest":
          image: "pytorchlightning/pytorch_lightning:base-cuda-py3.12-torch2.5-cuda12.1.0"
          PACKAGE_NAME: "lightning"
    workspace:
      clean: all
    steps:
      - bash: |
          echo "##vso[task.setvariable variable=CUDA_VISIBLE_DEVICES]$(DEVICES)"
          cuda_ver=$(python -c "import torch ; print(''.join(map(str, torch.version.cuda.split('.')[:2])))")
          echo "##vso[task.setvariable variable=CUDA_VERSION_MM]$cuda_ver"
          echo "##vso[task.setvariable variable=TORCH_URL]https://download.pytorch.org/whl/cu${cuda_ver}/torch_stable.html"
          scope=$(python -c 'n = "$(PACKAGE_NAME)" ; print(dict(fabric="lightning_fabric").get(n, n))')
          echo "##vso[task.setvariable variable=COVERAGE_SOURCE]$scope"
          python_ver=$(python -c "import sys; print(f'{sys.version_info.major}{sys.version_info.minor}')")
          echo "##vso[task.setvariable variable=PYTHON_VERSION_MM]$python_ver"
        displayName: "set env. vars"
      - bash: |
          echo "##vso[task.setvariable variable=TORCH_URL]https://download.pytorch.org/whl/test/cu${CUDA_VERSION_MM}"
          echo "##vso[task.setvariable variable=TORCHVISION_URL]https://download.pytorch.org/whl/test/cu124/torchvision-0.19.0%2Bcu124-cp${PYTHON_VERSION_MM}-cp${PYTHON_VERSION_MM}-linux_x86_64.whl"
        condition: endsWith(variables['Agent.JobName'], 'future')
        displayName: "set env. vars 4 future"

      - bash: |
          echo $(DEVICES)
          echo $CUDA_VISIBLE_DEVICES
          echo $CUDA_VERSION_MM
          echo $TORCH_URL
          echo $COVERAGE_SOURCE
          whereis nvidia
          nvidia-smi
          which python && which pip
          python --version
          pip --version
          pip list
        displayName: "Image info & NVIDIA"

      - bash: |
          PYTORCH_VERSION=$(python -c "import torch; print(torch.__version__.split('+')[0])")
          pip install -q wget packaging
          python -m wget https://raw.githubusercontent.com/Lightning-AI/utilities/main/scripts/adjust-torch-versions.py
          for fpath in `ls requirements/**/*.txt`; do \
            python ./adjust-torch-versions.py $fpath ${PYTORCH_VERSION}; \
          done
        displayName: "Adjust dependencies"

      - bash: |
          extra=$(python -c "print({'lightning': 'fabric-'}.get('$(PACKAGE_NAME)', ''))")
          pip install -e ".[${extra}dev]" pytest-timeout -U --find-links="${TORCH_URL}" --find-links="${TORCHVISION_URL}"
        displayName: "Install package & dependencies"

      - bash: |
          set -e
          python requirements/collect_env_details.py
          python -c "import torch ; mgpu = torch.cuda.device_count() ; assert mgpu == 2, f'GPU: {mgpu}'"
          python -c "import bitsandbytes"
        displayName: "Env details"

      - bash: python -m pytest lightning_fabric
        workingDirectory: src
        # without succeeded this could run even if the job has already failed
        condition: and(succeeded(), eq(variables['PACKAGE_NAME'], 'fabric'))
        displayName: "Testing: Fabric doctests"

      - bash: |
          pip install -q -r .actions/requirements.txt
          python .actions/assistant.py copy_replace_imports --source_dir="./tests/tests_fabric" \
            --source_import="lightning.fabric" \
            --target_import="lightning_fabric"
          python .actions/assistant.py copy_replace_imports --source_dir="./examples/fabric" \
            --source_import="lightning.fabric" \
            --target_import="lightning_fabric"
        # without succeeded this could run even if the job has already failed
        condition: and(succeeded(), eq(variables['PACKAGE_NAME'], 'fabric'))
        displayName: "Adjust tests & examples"

      - bash: python -m coverage run --source ${COVERAGE_SOURCE} -m pytest tests_fabric/ -v --durations=50
        workingDirectory: tests/
        displayName: "Testing: fabric standard"
        timeoutInMinutes: "10"

      - bash: bash ./run_standalone_tests.sh "tests_fabric"
        workingDirectory: tests/
        env:
          PL_STANDALONE_TESTS_SOURCE: $(COVERAGE_SOURCE)
        displayName: "Testing: fabric standalone"
        timeoutInMinutes: "10"

      - bash: |
          python -m coverage report
          python -m coverage xml
          python -m coverage html

          # https://docs.codecov.com/docs/codecov-uploader
          curl -Os https://uploader.codecov.io/latest/linux/codecov
          chmod +x codecov
          ./codecov --token=$(CODECOV_TOKEN) --commit=$(Build.SourceVersion) \
            --flags=gpu,pytest,${COVERAGE_SOURCE} --name="GPU-coverage" --env=linux,azure
          ls -l
        workingDirectory: tests/
        displayName: "Statistics"

      - script: |
          set -e
          bash run_fabric_examples.sh --accelerator=cuda --devices=1
          bash run_fabric_examples.sh --accelerator=cuda --devices=2 --strategy ddp
        workingDirectory: examples/
        displayName: "Testing: fabric examples"
```