Security: Fix path traversal and symlink attacks in archive extraction#1225
Security: Fix path traversal and symlink attacks in archive extraction#1225
Conversation
…ialization Co-authored-by: bact <128572+bact@users.noreply.github.com>
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Co-authored-by: bact <128572+bact@users.noreply.github.com>
|
@copilot address the remaining symlink bug or if it is not possible, address that clearly in code comments. implements all the suggestions |
There was a problem hiding this comment.
Pull request overview
This PR introduces security hardening around corpus downloads and archive extraction, and adds documentation describing identified risks and mitigations (path traversal, insecure deserialization, SSL verification, MD5 usage).
Changes:
- Added
_is_within_directory(),_safe_extract_tar(), and_safe_extract_zip()inpythainlp.corpus.coreand wired them into the corpusdownload()flow instead of callingextractall()directly. - Updated
thai2fit.pyto use a context manager for pickle loading and added explicit security warnings about deserialization from corpus files. - Introduced dedicated security documentation (
SECURITY_NOTES.md,SECURITY_ANALYSIS_SUMMARY.md) and new unit tests intests/core/test_security.pyfor path traversal protection and archive extraction behavior.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
pythainlp/corpus/core.py |
Adds safe archive extraction helpers, documents SSL verification and MD5-based integrity checks, and uses the new helpers in download() to mitigate straightforward ..-style path traversal. |
pythainlp/generate/thai2fit.py |
Clarifies the security implications of pickle.load() on corpus data, uses a context manager, and explicitly acknowledges the Bandit S301 finding. |
tests/core/test_security.py |
Introduces unit tests for _is_within_directory() and for safe/unsafe tar and zip extraction; does not yet cover symlink-based traversal and is not wired into tests/core/__init__.py’s aggregated suite. |
SECURITY_NOTES.md |
Documents the path traversal, insecure deserialization, SSL verification, and MD5 issues with CWE references and describes the implemented fixes/mitigations. |
SECURITY_ANALYSIS_SUMMARY.md |
Provides an executive and technical summary of the security findings, fixes, and testing, though it currently overstates that all path traversal vectors are fully fixed. |
Comments suppressed due to low confidence (1)
pythainlp/corpus/core.py:561
- This import of module tarfile is redundant, as it was previously imported on line 13.
import tarfile
- Add symlink target validation for tar and zip extraction - Use tarfile.data_filter for Python 3.12+ with fallback for 3.9-3.11 - Validate both file paths and symlink targets to prevent escape - Improve _is_within_directory to properly handle path separators - Add comprehensive tests for symlink attacks - Update security documentation with symlink attack scenarios Co-authored-by: bact <128572+bact@users.noreply.github.com>
Addressed the symlink vulnerability in commit 337ac1d. The fix includes: For Python 3.12+:
For Python 3.9-3.11:
Security improvements:
The implementation properly addresses the symlink bug while maintaining backward compatibility with Python 3.9+. |
|
@copilot apply changes based on the comments in this thread |
…undant imports, update docs - Add test_security to test_packages list in tests/core/__init__.py - Remove redundant tarfile and zipfile imports (already imported at top) - Update SECURITY_ANALYSIS_SUMMARY.md to clarify symlink protection is implemented - All tests passing including security tests now in main suite Co-authored-by: bact <128572+bact@users.noreply.github.com>
Applied all code review feedback in commit a24bd75: Comment 2729047939 (symlink validation): Already addressed in commit 337ac1d - both tar and zip extraction now validate symlink targets for Python 3.12+ (via Comment 2729047981 (symlink tests): Already present in commit 337ac1d - Comment 2729048016 (test suite integration): ✅ Fixed - Added Comment 2729048056 (documentation): ✅ Fixed - Updated SECURITY_ANALYSIS_SUMMARY.md to clarify symlink protection IS implemented with details on Python 3.12+ and 3.9-3.11 approaches. Comment 2729048091 (redundant imports): ✅ Fixed - Removed redundant All 7 security tests passing, including symlink attack validation. |
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
|




What does this changes
Fixes path traversal vulnerability (CWE-22) in corpus archive extraction including symlink-based attacks, and documents pickle deserialization security considerations (CWE-502).
What was wrong
Path Traversal (High Severity)
tar.extractall()andzipfile.extractall()called without path validation../../../.ssh/authorized_keyswould escape target directoryUndocumented Security Risks
Code Quality Issues
How this fixes it
Archive Extraction
_is_within_directory()to validate extraction paths with proper path separator handling_safe_extract_tar()and_safe_extract_zip()wrappers with comprehensive validationtarfile.data_filterfor robust protection against all archive attacksValueErroron any traversal or symlink escape attemptDocumentation
SECURITY_NOTES.mdwith technical analysis (CWE references, likelihood, impact, symlink attack scenarios)SECURITY_ANALYSIS_SUMMARY.mdfor non-technical stakeholders with clarification that symlink protection is fully implementedTesting
tests/core/test_security.pywith 7 comprehensive test casestests/core/__init__.py)Code Quality
import tarfileandimport zipfilestatements (already imported at module level)Your checklist for this pull request
Original prompt
💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.