Commit 0e16bf4

enhancement: apply tar filters when using python 3.12 or above (#3124)
### Summary

Applies tar filters when using Python 3.12 or above. Extraction filters were added to the [Python `tarfile` library in 3.12](https://docs.python.org/3/library/tarfile.html#extraction-filters) and guard against malicious content being extracted from `.tar.gz` files.

### Testing

Added a smoke test. If it passes for all Python versions, we're good.
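
As a rough illustration of the kind of content the filter guards against, the sketch below builds an archive whose single member is named `../escaped.txt` and extracts it with `tarfile.tar_filter`. The helper names, the member name, and the file names are hypothetical and not part of this commit; `tar_filter`, `FilterError`, and the `filter` argument to `extractall` come from the standard library, and the `hasattr` guard covers interpreters that lack them.

```python
import io
import os
import tarfile
import tempfile
from typing import Tuple


def build_traversal_archive(archive_path: str) -> None:
    """Write a .tar.gz whose only member tries to escape the destination directory."""
    payload = b"malicious"
    info = tarfile.TarInfo(name="../escaped.txt")  # hypothetical hostile member name
    info.size = len(payload)
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.addfile(info, io.BytesIO(payload))


def try_extract(archive_path: str, dest: str) -> str:
    """Extract with tarfile.tar_filter and report what happened to the member."""
    if not hasattr(tarfile, "tar_filter"):
        # Interpreter without extraction filters: skip extraction entirely.
        return "unavailable"
    with tarfile.open(archive_path, "r:gz") as tar:
        try:
            tar.extractall(path=dest, filter=tarfile.tar_filter)
        except tarfile.FilterError:
            return "blocked"
    return "extracted"


def demo_traversal_blocked() -> Tuple[str, bool]:
    """Return the extraction outcome and whether a file landed outside `dest`."""
    with tempfile.TemporaryDirectory() as tmp:
        archive = os.path.join(tmp, "hostile.tar.gz")
        dest = os.path.join(tmp, "out")
        os.makedirs(dest)
        build_traversal_archive(archive)
        outcome = try_extract(archive, dest)
        escaped = os.path.exists(os.path.join(tmp, "escaped.txt"))
    return outcome, escaped
```

On an interpreter that provides extraction filters, `demo_traversal_blocked()` should report the member as blocked, with nothing written outside the destination directory.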
1 parent fdb2737 commit 0e16bf4

3 files changed: 29 additions and 0 deletions

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -2,6 +2,8 @@
 
 ### Enhancements
 
+* **Filtering for tar extraction** Adds tar filtering to the compression module for connectors to avoid decompressing malicious content in `.tar.gz` files. This was added to the Python `tarfile` lib in Python 3.12. The change only applies when using Python 3.12 and above.
+
 ### Features
 
 ### Fixes

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+import os
+import tarfile
+
+from unstructured.ingest.utils.compression import uncompress_tar_file
+
+
+def test_uncompress_tar_file(tmpdir):
+    tar_filename = os.path.join(tmpdir, "test.tar")
+    filename = "example-docs/fake-text.txt"
+
+    with tarfile.open(tar_filename, "w:gz") as tar:
+        tar.add(filename, arcname=os.path.basename(filename))
+
+    path = uncompress_tar_file(tar_filename, path=tmpdir.dirname)
+    assert path == tmpdir.dirname
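
The smoke test above asserts only the returned path. A stricter check could also compare the extracted bytes with the original file. The sketch below shows one way to do that using plain `tarfile` calls as a stand-in for `uncompress_tar_file`, so it runs without the `unstructured` package; the function name and file names are hypothetical and not part of this commit.

```python
import os
import tarfile
import tempfile


def roundtrip_extracts_file(contents: bytes = b"hello tar") -> bool:
    """Archive one file, extract it again, and compare the extracted bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "fake-text.txt")
        with open(source, "wb") as f:
            f.write(contents)

        archive = os.path.join(tmp, "test.tar.gz")
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(source, arcname=os.path.basename(source))

        dest = os.path.join(tmp, "out")
        with tarfile.open(archive, "r:gz") as tar:
            # Stand-in for uncompress_tar_file: use the filter when it exists,
            # otherwise fall back to an unfiltered extraction.
            if hasattr(tarfile, "tar_filter"):
                tar.extractall(path=dest, filter=tarfile.tar_filter)
            else:
                tar.extractall(path=dest)

        with open(os.path.join(dest, "fake-text.txt"), "rb") as f:
            return f.read() == contents
```

Checking contents rather than just the path would also catch a filter that silently skipped a legitimate member.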

unstructured/ingest/utils/compression.py

Lines changed: 12 additions & 0 deletions
@@ -1,5 +1,6 @@
 import copy
 import os
+import sys
 import tarfile
 import zipfile
 from dataclasses import dataclass
@@ -63,6 +64,17 @@ def uncompress_tar_file(tar_filename: str, path: Optional[str] = None) -> str:
     path = path if path else os.path.join(head, f"{tail}-tar-uncompressed")
     logger.info(f"extracting tar {tar_filename} -> {path}")
     with tarfile.open(tar_filename, "r:gz") as tfile:
+        # NOTE(robinson): Mitigate against malicious content being extracted from the tar file.
+        # This was added in Python 3.12
+        # Ref: https://docs.python.org/3/library/tarfile.html#extraction-filters
+        if sys.version_info >= (3, 12):
+            tfile.extraction_filter = tarfile.tar_filter
+        else:
+            logger.warning(
+                "Extraction filtering for tar files is available for Python 3.12 and above. "
+                "Consider upgrading your Python version to improve security. "
+                "See https://docs.python.org/3/library/tarfile.html#extraction-filters"
+            )
         tfile.extractall(path=path)
     return path
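
The change above gates the filter on `sys.version_info`. The linked `tarfile` documentation notes that extraction filters may also be backported to older Python versions as security updates, and suggests detecting the feature with `hasattr` instead of a version comparison. A minimal sketch of that alternative follows; the function names and file names are hypothetical and this is not what the commit implements.

```python
import os
import tarfile
import tempfile
from typing import Tuple


def apply_tar_filter_if_available(tfile: tarfile.TarFile) -> bool:
    """Set tarfile.tar_filter on an open archive when the interpreter provides it."""
    tar_filter = getattr(tarfile, "tar_filter", None)
    if tar_filter is None:
        return False
    tfile.extraction_filter = tar_filter
    return True


def demo_feature_detection() -> Tuple[bool, bool]:
    """Return (filter applied, file extracted) for a small benign archive."""
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "fake-text.txt")
        with open(source, "w") as f:
            f.write("hello")

        archive = os.path.join(tmp, "test.tar.gz")
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(source, arcname="fake-text.txt")

        dest = os.path.join(tmp, "out")
        with tarfile.open(archive, "r:gz") as tfile:
            applied = apply_tar_filter_if_available(tfile)
            tfile.extractall(path=dest)

        extracted = os.path.exists(os.path.join(dest, "fake-text.txt"))
    return applied, extracted
```

With this approach, a patched older interpreter would get the filter too, and the warning branch would only be needed when `apply_tar_filter_if_available` returns `False`.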
