🐛 Bug
Found a discrepancy between a run continued after checkpointing and a run restored from that checkpoint.
Observation:
The training-batch / validation-loop ordering upon checkpoint restoration is not the same as in the original run after checkpoint saving. There are still the same number of train steps, but the validation loops are interleaved one step later, which can cause the restored run to end up with one fewer validation loop (see colab).
Assumption / expectation:
Zero difference between a training run continued past a checkpoint and a run restored from said checkpoint.
Investigation so far:
I'm new to some of this Lightning code, but IIUC:
Key points:

1. `TrainingEpochLoop`'s `self.batch_progress.increment_completed()` is called after the `on_train_batch_end` hooks, the latter kicking off checkpoint saving.
2. Upon restoring, `TrainingEpochLoop.batch_progress.current.reset_on_restart()` will reset the `ready` count back to `completed`.
3. Yet the `global_step`, which refers to `TrainingEpochLoop.batch_loop.optimizer_loop.optim_progress.optimizer.step.total.completed`, has already been `increment_completed()` (called within `TrainingEpochLoop.batch_loop.run`), and thus upon restoring, `..optimizer.step.total.ready` is set to an up-to-date `optimizer.step.total.completed`, out of sync with the above.
4. [simplification] In "`val_check_interval` mode", validation is triggered when `TrainingEpochLoop.batch_progress.current.ready % val_check_interval == 0` (through `TrainingEpochLoop.on_advance_end` -> `TrainingEpochLoop._should_check_val_fx`).
5. Combining the three points above with the trigger in 4: the same `batch_progress.current` `ready`/`completed` counters in the continued and restored runs end up aligned with different `global_step`s, and hence validation triggers at different `global_step`s (see the toy simulation right after this list).
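To make the off-by-one concrete, here is a toy pure-Python simulation of the bookkeeping in points 1-4. This is not Lightning code: the variables merely mimic `batch_progress.current.ready`/`completed` and the optimizer step counter behind `global_step`, and the save/restore step is my simplified reading of `reset_on_restart()`.

```python
VAL_CHECK_INTERVAL = 3


def run(n_steps, restore_after=None):
    """Simulate the loop counters; optionally save/restore after `restore_after` steps."""
    ready = completed = opt_completed = 0
    val_triggers = []  # global_step at which validation fires
    restored = False
    while opt_completed < n_steps:
        ready += 1          # batch starts: batch_progress.current.ready
        opt_completed += 1  # optimizer.step completed inside batch_loop.run
        if restore_after is not None and opt_completed == restore_after and not restored:
            # the checkpoint is saved in on_train_batch_end, i.e. *before*
            # batch_progress.increment_completed() runs for this batch ...
            saved_completed, saved_opt = completed, opt_completed
            # ... so on restore, reset_on_restart() pulls `ready` back to the
            # stale `completed`, while the optimizer counters come back up to date
            ready, completed, opt_completed = saved_completed, saved_completed, saved_opt
            restored = True
            continue
        completed += 1      # batch_progress.increment_completed()
        if ready % VAL_CHECK_INTERVAL == 0:  # simplified _should_check_val_fx
            val_triggers.append(opt_completed)
    return val_triggers


print(run(9))                   # continued run: validations at global_steps [3, 6, 9]
print(run(9, restore_after=5))  # restored run:  [3, 7] -- shifted by one, last one lost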
Another observation:
The following if statement seems to allow for a zero-difference restart, except that, just like 4. above, `_should_check_val_fx` wouldn't trigger where it did in the original run on the checkpointing step (although there it is called in `on_advance_end`). Not sure if the original intention of this snippet included the current scope.
```python
class TrainingEpochLoop(loops.Loop[_OUTPUTS_TYPE]):
    ...

    def advance(self, data_fetcher: AbstractDataFetcher) -> None:  # type: ignore[override]
        ...
        if self.restarting and self._should_check_val_fx(self.batch_idx, self.batch_progress.is_last_batch):
            # skip training and run validation in `on_advance_end`
            return
```
PRs relevant to this line:
Potential impact:
Assuming this is not too worrisome for the more default Lightning use cases:

- with `val_check_interval` >> 3 (colab example = 3), or with that turned off, relying instead on `check_val_every_n_epoch`

However, in theory it can influence all of the following:

- no 1:1 deterministic reproducibility
- affects the latest/best validation loss
- affects any code flow / decision making based on that
- causes a "different usage order" of RNGs (<- how I initially caught the issue: even with correctly restored RNG states, if both validation and training steps use one, they'll each end up with different random numbers compared to the continued run; see the sketch right after this list)
- other
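A minimal illustration of the RNG point above (plain PyTorch, independent of Lightning): with identical seeds, changing only the train/val interleaving hands each kind of step different numbers from the shared generator.

```python
import torch


def draw(order):
    # draw one number per step from a freshly seeded shared generator;
    # `order` is a hypothetical schedule of train ("T") and val ("V") steps
    torch.manual_seed(0)
    return [(kind, round(torch.rand(1).item(), 4)) for kind in order]


print(draw(["T", "T", "V", "T"]))  # continued run: validation after the 2nd train step
print(draw(["T", "T", "T", "V"]))  # restored run: validation interleaved one step later
# the overall stream is identical, but train and val steps each receive
# different values once the interleaving shifts
```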
To Reproduce
Customized Google Colab `bug_report_model.ipynb`, with the same observation on BoringModel.
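For reference, a rough standalone sketch along the lines of the colab. This is a reconstruction under assumptions, not the notebook itself: `TinyModel`, `RandomDataset`, the step counts, and the checkpoint filename are mine; only the `Trainer`/`ModelCheckpoint` arguments and the `ckpt_path=` resume mechanism are standard Lightning API.

```python
import os

import pytorch_lightning as pl
import torch
from pytorch_lightning.callbacks import ModelCheckpoint
from torch.utils.data import DataLoader, Dataset


class RandomDataset(Dataset):
    def __init__(self, size=32, length=64):
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)
        self.val_steps_seen = []  # global_step at each validation batch

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def validation_step(self, batch, batch_idx):
        self.val_steps_seen.append(self.trainer.global_step)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


def loaders():
    return DataLoader(RandomDataset(), batch_size=4), DataLoader(RandomDataset(), batch_size=4)


common = dict(max_steps=12, val_check_interval=3, limit_val_batches=1,
              num_sanity_val_steps=0, enable_progress_bar=False)

# run 1: train straight through, checkpointing every 6 steps
pl.seed_everything(1)
model = TinyModel()
ckpt_cb = ModelCheckpoint(dirpath="ckpts", filename="{step}",
                          every_n_train_steps=6, save_top_k=-1)
pl.Trainer(callbacks=[ckpt_cb], **common).fit(model, *loaders())
print("continued run, validations at global_steps:", model.val_steps_seen)

# run 2: restore from the step-6 checkpoint and train to the same max_steps
# (assumes the auto-formatted filename "step=6.ckpt"; adjust if it differs)
pl.seed_everything(1)
model2 = TinyModel()
pl.Trainer(**common).fit(model2, *loaders(), ckpt_path=os.path.join("ckpts", "step=6.ckpt"))
print("restored run, validations at global_steps:", model2.val_steps_seen)
# per the report, the restored run's validations land one global_step later,
# and the run can end up with one fewer validation loop overall
```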
Expected behavior
Zero difference between a training run continued past a checkpoint and a run restored from said checkpoint.
Environment
Note:
- The below is from the original investigation in our own code base, with pytorch-lightning `v1.6.4`.
- The environment details from the BoringModel reproduction are listed in the colab, with pytorch-lightning `v1.7.4`.
- I also browsed through the `master` branch over the last weeks and the relevant code seems unchanged.
Details
- CUDA:
- GPU:
- NVIDIA RTX A4000
- NVIDIA RTX A4000
- NVIDIA RTX A4000
- NVIDIA RTX A4000
- available: True
- version: 11.0
- Lightning:
- efficientnet-pytorch: 0.7.1
- pytorch-lightning: 1.6.4
- torch: 1.11.0.post1103
- torchmetrics: 0.7.0
- torchvision: 0.12.0a1110.post1103
- Packages:
- absl-py: 0.15.0
- adal: 1.2.7
- adlfs: 2021.10.0
- aiohttp: 3.7.4
- applicationinsights: 0.11.10
- argcomplete: 1.12.3
- async-timeout: 3.0.1
- attrdict: 2.0.0
- attrs: 21.1.0
- av: 8.0.3
- azure-cli-core: 2.38.0
- azure-cli-telemetry: 1.0.6
- azure-common: 1.1.27
- azure-core: 1.20.0
- azure-datalake-store: 0.0.52
- azure-identity: 1.10.0
- azure-keyvault-secrets: 4.2.0
- azure-mgmt-core: 1.2.2
- azure-storage-blob: 12.11.0
- backcall: 0.2.0
- backoff: 1.10.0
- bcrypt: 3.2.0
- cachetools: 4.2.2
- certifi: 2020.12.5
- cffi: 1.14.5
- chardet: 3.0.4
- charset-normalizer: 2.0.12
- click: 7.1.2
- confluent-kafka: 1.7.0
- cryptography: 3.4.8
- cycler: 0.10.0
- datadog: 0.44.0
- decorator: 5.0.7
- deepdiff: 5.5.0
- deltalake: 0.5.8
- docker-pycreds: 0.4.0
- efficientnet-pytorch: 0.7.1
- einops: 0.4.1
- filelock: 3.7.1
- fonttools: 4.37.1
- frozendict: 2.3.2
- fsspec: 2022.1.0
- gitdb: 4.0.7
- gitpython: 3.1.14
- google-auth: 1.30.0
- google-auth-oauthlib: 0.4.4
- grpcio: 1.37.1
- htmlmin: 0.1.12
- humanfriendly: 10.0
- idna: 2.10
- imagehash: 4.2.1
- inplace-abn: 1.1.0a1110.post1103
- ipdb: 0.13.9
- ipython: 7.23.1
- isodate: 0.6.0
- jedi: 0.18.0
- jinja2: 3.1.2
- jmespath: 0.10.0
- joblib: 1.0.1
- kafka-python: 2.0.2
- kiwisolver: 1.3.1
- knack: 0.9.0
- markdown: 3.3.4
- markupsafe: 2.0.1
- matplotlib: 3.5.3
- matplotlib-inline: 0.1.2
- methodtools: 0.1.2
- missingno: 0.5.0
- msal: 1.16.0
- msal-extensions: 0.3.0
- msrest: 0.6.21
- msrestazure: 0.6.4
- multidict: 5.1.0
- multimethod: 1.6
- networkx: 2.5.1
- numpy: 1.22.4
- oauthlib: 3.1.0
- opencv-python: 4.4.0.44
- ordered-set: 4.0.2
- packaging: 21.3
- pandas: 1.4.3
- pandas-profiling: 3.1.0
- paramiko: 2.7.2
- parso: 0.8.2
- pathtools: 0.1.2
- pexpect: 4.8.0
- phik: 0.12.0
- pickleshare: 0.7.5
- pillow: 9.2.0
- pip: 22.0.3
- pkginfo: 1.7.0
- polyline: 1.4.0
- portalocker: 1.7.1
- prometheus-client: 0.8.0
- promise: 2.3
- prompt-toolkit: 2.0.10
- protobuf: 3.15.8
- psutil: 5.9.1
- psycopg2: 2.8.3
- ptyprocess: 0.7.0
- py: 1.10.0
- py3nvml: 0.2.7
- pyarrow: 9.0.0
- pyasn1: 0.4.8
- pyasn1-modules: 0.2.8
- pycparser: 2.20
- pydantic: 1.8.2
- pydeprecate: 0.3.1
- pygame: 2.1.2
- pygments: 2.9.0
- pyjwt: 1.7.1
- pynacl: 1.4.0
- pyntcloud: 0.1.6
- pyopenssl: 20.0.1
- pyparsing: 2.4.7
- pyquaternion: 0.9.9
- pysocks: 1.7.1
- python-dateutil: 2.8.2
- python-json-logger: 2.0.2
- pytorch-lightning: 1.6.4
- pytz: 2022.1
- pywavelets: 1.1.1
- pyyaml: 6.0
- qrcode: 6.1
- requests: 2.27.1
- requests-oauthlib: 1.3.0
- retry: 0.9.2
- rsa: 4.7.2
- runai: 0.3.0
- scipy: 1.6.2
- seaborn: 0.11.2
- semver: 2.13.0
- sentry-sdk: 1.9.4
- setproctitle: 1.2.2
- setuptools: 59.5.0
- shapely: 1.8.0
- shortuuid: 1.0.1
- simplejpeg: 1.4.1
- six: 1.16.0
- slackclient: 2.9.4
- smmap: 4.0.0
- sqlalchemy: 1.3.24
- tabulate: 0.8.9
- tangled-up-in-unicode: 0.1.0
- tensorboard: 2.6.0
- tensorboard-data-server: 0.6.1
- tensorboard-plugin-wit: 1.8.0
- timm: 0.4.5
- toml: 0.10.2
- torch: 1.11.0.post1103
- torchmetrics: 0.7.0
- torchvision: 0.12.0a1110.post1103
- tqdm: 4.60.0
- traitlets: 5.3.0
- transforms3d: 0.3.1
- typing-extensions: 4.1.1
- urllib3: 1.26.11
- visions: 0.7.4
- wandb: 0.12.14
- wcwidth: 0.2.5
- werkzeug: 1.0.1
- wheel: 0.36.2
- wirerope: 0.3.1
- wrapt: 1.14.1
- xmltodict: 0.12.0
- xxhash: 1.4.1
- yarl: 1.6.3
- System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.8.12
- version: #138~18.04.1-Ubuntu SMP Fri Jun 24 14:14:03 UTC 2022
Additional context
cc @awaelchli @ananthsub @ninginthecloud @rohitgr7 @otaj @carmocca @justusschock