Bug report
Bug description:
Hello,
I am currently debugging this issue.
I have noticed that the bug can be reproduced when the problematic file is truncated to 9 GiB, but it does not happen when the file is truncated to 8 GiB.
The problem seems to be that the offset of the next member is computed incorrectly. It seems to point 512 B past the correct TAR header, which, in this case, lands in the data for the extended attributes, such as 30 mtime=1752348[...].
One of the differences seems to be this code part, which is not hit for the working case:
Lines 1562 to 1569 in 47b01da:

    if "size" in pax_headers:
        # If the extended header replaces the size field,
        # we need to recalculate the offset where the next
        # header starts.
        offset = next.offset_data
        if next.isreg() or next.type not in SUPPORTED_TYPES:
            offset += next._block(next.size)
        tarfile.offset = offset

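To make the offset arithmetic concrete, here is a minimal standalone sketch of what that snippet computes. The helper names `round_up_to_block` and `next_header_offset` and the value of `offset_data` are made up for illustration; the two sizes are the `size` and `GNU.sparse.realsize` values from the failing archive described further down. It models the logic only and is not the tarfile implementation.

```python
BLOCKSIZE = 512  # standard tar block size


def round_up_to_block(count):
    """Round a byte count up to a whole number of 512-byte blocks."""
    blocks, remainder = divmod(count, BLOCKSIZE)
    if remainder:
        blocks += 1
    return blocks * BLOCKSIZE


def next_header_offset(offset_data, size):
    """Start of this member's data plus its block-padded size."""
    return offset_data + round_up_to_block(size)


# Hypothetical data offset, chosen only for illustration.
offset_data = 1536

with_pax_size = next_header_offset(offset_data, 9602318848)
with_realsize = next_header_offset(offset_data, 9663676416)

# The two candidate positions for the next header are far apart.
print(with_realsize - with_pax_size)
```

Whichever value ends up in `size` therefore decides where the reader looks for the next header.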
While looking into the code just above that snippet, i.e., into _apply_pax_info, I noticed that there is no defined order in which the size is applied, even though it can be set multiple times!
Lines 1615 to 1634 in 47b01da:

    def _apply_pax_info(self, pax_headers, encoding, errors):
        """Replace fields with supplemental information from a previous
           pax extended or global header.
        """
        for keyword, value in pax_headers.items():
            if keyword == "GNU.sparse.name":
                setattr(self, "path", value)
            elif keyword == "GNU.sparse.size":
                setattr(self, "size", int(value))
            elif keyword == "GNU.sparse.realsize":
                setattr(self, "size", int(value))
            elif keyword in PAX_FIELDS:
                if keyword in PAX_NUMBER_FIELDS:
                    try:
                        value = PAX_NUMBER_FIELDS[keyword](value)
                    except ValueError:
                        value = 0
                if keyword == "path":
                    value = value.rstrip("/")
                setattr(self, keyword, value)

In the non-working case, the PAX headers look like this:
{'GNU.sparse.major': '1',
'GNU.sparse.minor': '0',
'GNU.sparse.name': 'userdata',
'GNU.sparse.realsize': '9663676416',
'atime': '1752349406.975921575',
'ctime': '1752349534.57652562',
'mtime': '1752349534.57652562',
'size': '9602318848'}
I.e., the size member is first set to GNU.sparse.realsize and then to size. The debug output looks like this:
[_apply_pax_info] SET SIZE to: 9663676416 from key: GNU.sparse.realsize
[_apply_pax_info] SET SIZE to: 9602318848 from key: size
[_apply_pax_info] SET key to: 1752349534.5765257 from key: mtime
Is it specified that the PAX headers must always appear in this order? Otherwise, one might just as well encounter them like this:
{'atime': '1752349406.975921575',
'ctime': '1752349534.57652562',
'mtime': '1752349534.57652562',
'size': '9602318848',
'GNU.sparse.major': '1',
'GNU.sparse.minor': '0',
'GNU.sparse.name': 'userdata',
'GNU.sparse.realsize': '9663676416'}
and either one of these orders would be a bug.
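The order dependence can be shown with a simplified model of the loop in _apply_pax_info. The function `final_size` is a hypothetical stand-in that keeps only the size-related branches; it is a sketch of the "last write wins" behavior, not the real method.

```python
def final_size(pax_headers):
    """Simplified model: every size-like keyword overwrites `size`."""
    size = 0
    for keyword, value in pax_headers.items():
        if keyword in ("GNU.sparse.size", "GNU.sparse.realsize", "size"):
            size = int(value)  # last write wins
    return size


# Same keywords and values, different iteration order.
sparse_first = {"GNU.sparse.realsize": "9663676416", "size": "9602318848"}
size_first = {"size": "9602318848", "GNU.sparse.realsize": "9663676416"}

print(final_size(sparse_first))  # 9602318848
print(final_size(size_first))    # 9663676416
```

Since dictionaries iterate in insertion order, the resulting size depends purely on the order in which the keywords were stored.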
The working case does not have this ambiguity:
{'GNU.sparse.major': '1',
'GNU.sparse.minor': '0',
'GNU.sparse.name': 'userdata',
'GNU.sparse.realsize': '8589934592',
'atime': '1752349538.445543898',
'ctime': '1752351104.53673501',
'mtime': '1752351104.53673501'}
The debug output looks like this:
[_apply_pax_info] SET SIZE to: 8589934592 from key: GNU.sparse.realsize
[_apply_pax_info] SET key to: 1752351104.536735 from key: mtime
I.e., even if there is no ordering problem, TarInfo.size already has different semantics in the two cases: in one it contains GNU.sparse.realsize, and in the other it contains [PAXHeader.]size.
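As an illustration of what keeping the two meanings apart could look like, here is a rough sketch that resolves both values independently of keyword order. The function `resolve_sizes` and its `header_size` parameter (the size taken from the regular TAR header) are hypothetical, and the sketch assumes that size describes the bytes stored in the archive while GNU.sparse.realsize describes the expanded file. It is not how tarfile currently behaves and is not meant as a patch.

```python
def resolve_sizes(pax_headers, header_size):
    """Return (stored_size, real_size) regardless of keyword order.

    Assumption: "size" is the number of bytes stored in the archive and
    "GNU.sparse.realsize" is the size of the expanded sparse file.
    """
    stored = int(pax_headers.get("size", header_size))
    real = int(pax_headers.get("GNU.sparse.realsize", stored))
    return stored, real


sparse_first = {"GNU.sparse.realsize": "9663676416", "size": "9602318848"}
size_first = {"size": "9602318848", "GNU.sparse.realsize": "9663676416"}

# header_size is a placeholder here; it is only used when "size" is absent.
print(resolve_sizes(sparse_first, 0) == resolve_sizes(size_first, 0))  # True
```

With the two values separated like this, the offset calculation could use the stored size while the reported file size uses the real size.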
CPython versions tested on:
CPython main branch
Operating systems tested on:
Linux