This repository was archived by the owner on Oct 4, 2024. It is now read-only.

OSError: Request Entity Too Large #31

@awaelchli


🐛 Bug

When train_work is about to stop after training finishes, persisting its artifacts fails with: OSError: [Errno 5] An error occurred (413) when calling the PutObject operation: Request Entity Too Large.

To Reproduce

Steps to reproduce the behavior:

lightning run app app.py --cloud  --name quick-start-3

Code sample

The app.py from this repo.

Error and logs

[root.train_work] Epoch 9: 100% 12/12 [00:00<00:00, 16.72it/s, v_num=0]
[root.train_work] Traceback (most recent call last):
[root.train_work]   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.8/site-packages/s3fs/core.py", line 112, in _error_wrapper
[root.train_work]     return await func(*args, **kwargs)
[root.train_work]   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.8/site-packages/aiobotocore/client.py", line 358, in _make_api_call
[root.train_work]     raise error_class(parsed_response, operation_name)
[root.train_work] botocore.exceptions.ClientError: An error occurred (413) when calling the PutObject operation: Request Entity Too Large
[root.train_work] The above exception was the direct cause of the following exception:
[root.train_work] Traceback (most recent call last):
[root.train_work]   File "/home/zeus/miniconda3/envs/cloudspace/bin/lightning-cloud-launcher", line 8, in <module>
[root.train_work]     sys.exit(main())
[root.train_work]   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
[root.train_work]     return self.main(*args, **kwargs)
[root.train_work]   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.8/site-packages/click/core.py", line 1055, in main
[root.train_work]     rv = self.invoke(ctx)
[root.train_work]   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
[root.train_work]     return _process_result(sub_ctx.command.invoke(sub_ctx))
[root.train_work]   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
[root.train_work]     return _process_result(sub_ctx.command.invoke(sub_ctx))
[root.train_work]   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
[root.train_work]     return ctx.invoke(self.callback, **ctx.params)
[root.train_work]   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.8/site-packages/click/core.py", line 760, in invoke
[root.train_work]     return __callback(*args, **kwargs)
[root.train_work]   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.8/site-packages/lightning_launcher/cli/__main__.py", line 87, in run_work
[root.train_work]     run_lightning_work(
[root.train_work]   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.8/site-packages/lightning_launcher/utils.py", line 51, in wrapper
[root.train_work]     res = func(*args, **kwargs)
[root.train_work]   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.8/site-packages/lightning_launcher/utils.py", line 77, in wrapper
[root.train_work]     res = func(*args, **kwargs)
[root.train_work]   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.8/site-packages/lightning_launcher/launcher.py", line 181, in run_lightning_work
[root.train_work]     WorkRunner(
[root.train_work]   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.8/site-packages/lightning/app/utilities/proxies.py", line 437, in __call__
[root.train_work]     raise e
[root.train_work]   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.8/site-packages/lightning/app/utilities/proxies.py", line 418, in __call__
[root.train_work]     self.run_once()
[root.train_work]   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.8/site-packages/lightning/app/utilities/proxies.py", line 582, in run_once
[root.train_work]     persist_artifacts(work=self.work)
[root.train_work]   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.8/site-packages/lightning/app/utilities/proxies.py", line 722, in persist_artifacts
[root.train_work]     _copy_files(artifact_path, destination_path)
[root.train_work]   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.8/site-packages/lightning/app/storage/copier.py", line 152, in _copy_files
[root.train_work]     fs.put(str(source_path), str(destination_path))
[root.train_work]   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.8/site-packages/fsspec/asyn.py", line 113, in wrapper
[root.train_work]     return sync(self.loop, func, *args, **kwargs)
[root.train_work]   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.8/site-packages/fsspec/asyn.py", line 98, in sync
[root.train_work]     raise return_result
[root.train_work]   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.8/site-packages/fsspec/asyn.py", line 53, in _runner
[root.train_work]     result[0] = await coro
[root.train_work]   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.8/site-packages/fsspec/asyn.py", line 523, in _put
[root.train_work]     return await _run_coros_in_chunks(
[root.train_work]   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.8/site-packages/fsspec/asyn.py", line 269, in _run_coros_in_chunks
[root.train_work]     await asyncio.gather(*chunk, return_exceptions=return_exceptions),
[root.train_work]   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.8/asyncio/tasks.py", line 455, in wait_for
[root.train_work]     return await fut
[root.train_work]   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.8/site-packages/s3fs/core.py", line 1073, in _put_file
[root.train_work]     await self._call_s3(
[root.train_work]   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.8/site-packages/s3fs/core.py", line 339, in _call_s3
[root.train_work]     return await _error_wrapper(
[root.train_work]   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.8/site-packages/s3fs/core.py", line 139, in _error_wrapper
[root.train_work]     raise err
[root.train_work] OSError: [Errno 5] An error occurred (413) when calling the PutObject operation: Request Entity Too Large
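
The traceback shows the failure inside persist_artifacts, at the point where fs.put uploads a file with a single PutObject request. A 413 response generally means the request body exceeded a limit on the receiving side, so one way to narrow this down is to check how large the files being uploaded are. The snippet below is a minimal debugging sketch, not part of Lightning: `largest_files` is a hypothetical helper name, and the directory to inspect (whatever artifact path the work persists) is an assumption you need to fill in yourself.

```python
import os


def largest_files(root, top=5):
    """Return up to `top` (size_in_bytes, path) pairs found under `root`, largest first.

    Hypothetical debugging helper: it only inspects the local filesystem
    and does not call any Lightning, fsspec, or s3fs API.
    """
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):  # skip broken symlinks and special files
                sizes.append((os.path.getsize(path), path))
    # Tuples sort by size first, so reverse order puts the largest file first.
    sizes.sort(reverse=True)
    return sizes[:top]
```

Calling `largest_files(some_dir)` at the end of the work's run method, with `some_dir` set to the directory whose contents get persisted, and printing the result would show which file is the likely candidate for the rejected upload.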

Environment

  • PyTorch Version (e.g., 1.0): 2.0.0
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source): -
  • Python version: 3.10
  • CUDA/cuDNN version: -
  • GPU models and configuration: -
  • Any other relevant information: Lightning 2.0

Additional context

Found while running #30


Labels: bug, help wanted
