
Security: Fix path traversal and symlink attacks in archive extraction #1225

Merged
bact merged 9 commits into dev from
copilot/identify-security-issues
Jan 26, 2026
Contributor

Copilot AI commented Jan 26, 2026

What does this change

Fixes path traversal vulnerability (CWE-22) in corpus archive extraction including symlink-based attacks, and documents pickle deserialization security considerations (CWE-502).

What was wrong

Path Traversal (High Severity)

  • tar.extractall() and zipfile.extractall() called without path validation
  • Malicious archives could write to arbitrary filesystem locations
  • Example: archive member ../../../.ssh/authorized_keys would escape target directory
  • Symlinks in archives could point outside extraction directory, enabling directory escape attacks
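To make the escape concrete, here is a minimal sketch of how a member name containing `..` resolves once it is joined to the extraction directory. The directory `/tmp/corpus` and the helper name `resolve_member` are illustrative assumptions, not names from this PR.

```python
import os


def resolve_member(dest: str, member_name: str) -> str:
    """Return the normalized absolute path an extractor would write to."""
    return os.path.abspath(os.path.join(dest, member_name))


# Hypothetical extraction directory, used only for illustration.
dest = os.path.abspath("/tmp/corpus")

safe = resolve_member(dest, "data/words.txt")
evil = resolve_member(dest, "../../../.ssh/authorized_keys")

# A member is inside the target only if its resolved path starts with
# the target directory followed by a path separator.
print(safe.startswith(dest + os.sep))  # True
print(evil.startswith(dest + os.sep))  # False
```

The second path normalizes to a location above the extraction directory, which is why each member has to be checked before anything is written.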

Undocumented Security Risks

  • Pickle deserialization executes arbitrary code if file is malicious
  • SSL certificate validation enabled by default but not explicit in code
  • MD5 usage for integrity checks (already acknowledged with noqa)
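As a harmless illustration of the pickle risk noted above, the sketch below shows that unpickling calls whatever callable the stream names. The class `Payload` is hypothetical and uses `int` as a stand-in; a real attacker would substitute something like `os.system`.

```python
import pickle


class Payload:
    """Hypothetical stand-in for attacker-controlled data."""

    def __reduce__(self):
        # pickle stores "call int('42')" in the stream; the call runs
        # at load time, before the caller ever sees the object.
        return (int, ("42",))


blob = pickle.dumps(Payload())
result = pickle.loads(blob)

# The loaded value is the return value of the call, not a Payload.
print(type(result).__name__, result)  # int 42
```

This is why `pickle.load()` is only acceptable on files whose origin and integrity are trusted.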

Code Quality Issues

  • Security tests not included in main test suite
  • Redundant module imports in extraction code

How this fixes it

Archive Extraction

  • Added _is_within_directory() to validate extraction paths with proper path separator handling
  • Added _safe_extract_tar() and _safe_extract_zip() wrappers with comprehensive validation
  • Python 3.12+: Uses built-in tarfile.data_filter for robust protection against all archive attacks
  • Python 3.9-3.11: Custom validation of all archive members including symlink target validation
  • Validates both file paths and symlink targets before extraction
  • Raises ValueError on any traversal or symlink escape attempt
import os
import tarfile

def _safe_extract_tar(tar: tarfile.TarFile, path: str) -> None:
    if hasattr(tarfile, 'data_filter'):
        # Extraction filters available (Python 3.12+, or a security release
        # of 3.9-3.11 that backports them): use the built-in filter
        try:
            tar.extractall(path=path, filter='data')
        except (tarfile.OutsideDestinationError, tarfile.LinkOutsideDestinationError) as e:
            # Chain the original exception so the traceback keeps the cause
            raise ValueError(str(e)) from e
    else:
        # No extraction filters: manual validation including symlink targets
        for member in tar.getmembers():
            # Validate file path
            member_path = os.path.join(path, member.name)
            if not _is_within_directory(path, member_path):
                raise ValueError(f"Attempted path traversal in tar file: {member.name}")

            # Validate symlink targets
            if member.issym() or member.islnk():
                link_target = member.linkname
                # ... resolve and validate symlink target ...
        # Extract only after every member has passed validation
        tar.extractall(path=path)
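
The snippet above relies on _is_within_directory(), whose body is not shown in this description. A minimal sketch of such a check, assuming it compares whole path components rather than raw string prefixes, might look like the following. The name `is_within_directory` and its implementation are assumptions for illustration, not the code merged in this PR.

```python
import os


def is_within_directory(directory: str, target: str) -> bool:
    """Return True if target resolves to directory or a path beneath it."""
    abs_directory = os.path.abspath(directory)
    abs_target = os.path.abspath(target)
    try:
        # commonpath works on path components, so "/tmp/corpus_evil"
        # is not treated as a child of "/tmp/corpus".
        common = os.path.commonpath([abs_directory, abs_target])
    except ValueError:
        # Raised on Windows when the paths are on different drives.
        return False
    return common == abs_directory


print(is_within_directory("/tmp/corpus", "/tmp/corpus/data/words.txt"))
print(is_within_directory("/tmp/corpus", "/tmp/corpus_evil/words.txt"))
print(is_within_directory("/tmp/corpus", "/tmp/corpus/../etc/passwd"))
```

A plain `startswith` on the directory string would accept the second path, which is the separator-handling pitfall the description refers to.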

Documentation

  • Added security warnings to pickle.load() explaining RCE risk from untrusted data
  • Noted corpus files come from trusted source with MD5 verification
  • Documented SSL certificate validation is enabled by default in Python 3.9+
  • Created SECURITY_NOTES.md with technical analysis (CWE references, likelihood, impact, symlink attack scenarios)
  • Created SECURITY_ANALYSIS_SUMMARY.md for non-technical stakeholders with clarification that symlink protection is fully implemented
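
For readers who want to see what the documented behaviour amounts to, the sketch below computes an MD5 checksum in chunks and builds an SSL context with the standard library defaults. The function name `md5_of_file` is hypothetical; the PR's own download code is not reproduced here. Note that MD5 detects accidental corruption only and does not protect against deliberate tampering.

```python
import hashlib
import ssl


def md5_of_file(path: str, chunk_size: int = 65536) -> str:
    """Hypothetical helper: hex MD5 digest of a file, read in chunks."""
    # usedforsecurity=False (Python 3.9+) marks this as an integrity
    # check rather than a cryptographic use.
    digest = hashlib.md5(usedforsecurity=False)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# The default context verifies certificates and host names; building it
# explicitly makes that visible in the code.
context = ssl.create_default_context()
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```

A caller would compare `md5_of_file(path)` against the checksum published in the corpus catalog before unpickling or extracting the file.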

Testing

  • Added tests/core/test_security.py with 7 comprehensive test cases
  • Integrated security tests into main test suite (tests/core/__init__.py)
  • Validates path traversal detection for both tar and zip
  • Validates symlink attack detection for malicious symlinks pointing outside directory
  • Verifies legitimate archives still extract correctly
  • Tests symlink path validation logic
  • All existing corpus tests passing
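
The test cases themselves are not quoted in this description. A minimal sketch of what a path traversal test could look like is shown below; it builds the malicious archive in memory so nothing dangerous touches disk. `check_members`, `make_tar`, and the test class name are hypothetical stand-ins, not the contents of tests/core/test_security.py.

```python
import io
import os
import tarfile
import tempfile
import unittest


def check_members(tar: tarfile.TarFile, dest: str) -> None:
    """Hypothetical validator: reject members that resolve outside dest."""
    abs_dest = os.path.abspath(dest)
    for member in tar.getmembers():
        target = os.path.abspath(os.path.join(abs_dest, member.name))
        if os.path.commonpath([abs_dest, target]) != abs_dest:
            raise ValueError(f"Attempted path traversal: {member.name}")


def make_tar(member_name: str) -> bytes:
    """Build an in-memory tar archive holding one small regular file."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        data = b"sample"
        info = tarfile.TarInfo(name=member_name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


class ArchiveTraversalTest(unittest.TestCase):
    def test_rejects_parent_traversal(self):
        with tempfile.TemporaryDirectory() as dest:
            with tarfile.open(fileobj=io.BytesIO(make_tar("../evil.txt"))) as tar:
                with self.assertRaises(ValueError):
                    check_members(tar, dest)

    def test_accepts_legitimate_member(self):
        with tempfile.TemporaryDirectory() as dest:
            with tarfile.open(fileobj=io.BytesIO(make_tar("data/words.txt"))) as tar:
                check_members(tar, dest)  # should not raise
```

Saved as a module, this can be run with `python -m unittest`.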

Code Quality

  • Removed redundant import tarfile and import zipfile statements (already imported at module level)
  • Improved code organization and maintainability

Your checklist for this pull request

  • Passed code styles and structures
  • Passed code linting checks and unit tests
Original prompt

Find potential security issues in the codebase.

Explain why each of them could lead to security issues, how likely it can happen, the possible impact, and provide fixes.



Copilot AI and others added 3 commits January 26, 2026 17:59
…ialization

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Co-authored-by: bact <128572+bact@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Identify potential security issues in codebase" to "Security: Fix path traversal in archive extraction and document deserialization risks" Jan 26, 2026
Copilot AI requested a review from bact January 26, 2026 18:03

coveralls commented Jan 26, 2026

Coverage Status

coverage: 53.661% (+0.005%) from 53.656%
when pulling 5567f02 on copilot/identify-security-issues
into c441c66 on dev.

@bact bact marked this pull request as ready for review January 26, 2026 19:30
@bact bact requested a review from Copilot January 26, 2026 19:49
Member

bact commented Jan 26, 2026

@copilot address the remaining symlink bug or if it is not possible, address that clearly in code comments.

implements all the suggestions

Contributor

Copilot AI left a comment


Pull request overview

This PR introduces security hardening around corpus downloads and archive extraction, and adds documentation describing identified risks and mitigations (path traversal, insecure deserialization, SSL verification, MD5 usage).

Changes:

  • Added _is_within_directory(), _safe_extract_tar(), and _safe_extract_zip() in pythainlp.corpus.core and wired them into the corpus download() flow instead of calling extractall() directly.
  • Updated thai2fit.py to use a context manager for pickle loading and added explicit security warnings about deserialization from corpus files.
  • Introduced dedicated security documentation (SECURITY_NOTES.md, SECURITY_ANALYSIS_SUMMARY.md) and new unit tests in tests/core/test_security.py for path traversal protection and archive extraction behavior.
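
The body of _safe_extract_zip() is not shown in this thread. As a rough sketch of the same idea for zip archives, assuming every member path is validated before anything is extracted, one could write the following. `safe_extract_zip` and `build_zip` are hypothetical names for illustration. The standard library's zipfile already strips `..` components when extracting, so an explicit check like this mainly turns silent sanitizing into a visible error.

```python
import io
import os
import tempfile
import zipfile


def safe_extract_zip(zf: zipfile.ZipFile, dest: str) -> None:
    """Hypothetical helper: validate every member path, then extract."""
    abs_dest = os.path.abspath(dest)
    for info in zf.infolist():
        target = os.path.abspath(os.path.join(abs_dest, info.filename))
        # Compare whole path components so a sibling such as
        # "corpus_evil" is not mistaken for a child of "corpus".
        if os.path.commonpath([abs_dest, target]) != abs_dest:
            raise ValueError(
                f"Attempted path traversal in zip file: {info.filename}"
            )
    # Extract only after all members have passed validation.
    zf.extractall(abs_dest)


def build_zip(member_name: str) -> bytes:
    """Build an in-memory zip archive holding one small file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(member_name, "sample")
    return buf.getvalue()


with tempfile.TemporaryDirectory() as dest:
    with zipfile.ZipFile(io.BytesIO(build_zip("data/words.txt"))) as zf:
        safe_extract_zip(zf, dest)
    extracted = os.path.exists(os.path.join(dest, "data", "words.txt"))
    print(extracted)
```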

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

File | Description
pythainlp/corpus/core.py | Adds safe archive extraction helpers, documents SSL verification and MD5-based integrity checks, and uses the new helpers in download() to mitigate straightforward ..-style path traversal.
pythainlp/generate/thai2fit.py | Clarifies the security implications of pickle.load() on corpus data, uses a context manager, and explicitly acknowledges the Bandit S301 finding.
tests/core/test_security.py | Introduces unit tests for _is_within_directory() and for safe/unsafe tar and zip extraction; does not yet cover symlink-based traversal and is not wired into tests/core/__init__.py’s aggregated suite.
SECURITY_NOTES.md | Documents the path traversal, insecure deserialization, SSL verification, and MD5 issues with CWE references and describes the implemented fixes/mitigations.
SECURITY_ANALYSIS_SUMMARY.md | Provides an executive and technical summary of the security findings, fixes, and testing, though it currently overstates that all path traversal vectors are fully fixed.
Comments suppressed due to low confidence (1)

pythainlp/corpus/core.py:561

  • This import of module tarfile is redundant, as it was previously imported on line 13.
                import tarfile

- Add symlink target validation for tar and zip extraction
- Use tarfile.data_filter for Python 3.12+ with fallback for 3.9-3.11
- Validate both file paths and symlink targets to prevent escape
- Improve _is_within_directory to properly handle path separators
- Add comprehensive tests for symlink attacks
- Update security documentation with symlink attack scenarios

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Contributor Author

Copilot AI commented Jan 26, 2026

@copilot address the remaining symlink bug or if it is not possible, address that clearly in code comments.

implements all the suggestions

Addressed the symlink vulnerability in commit 337ac1d. The fix includes:

For Python 3.12+:

  • Uses built-in tarfile.data_filter which comprehensively handles symlink attacks
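
A small sketch of the built-in filter in action, assuming a Python build that has extraction filters, is shown below. The helper names are hypothetical. It catches the base class `tarfile.FilterError`, which is one way to also cover filter rejections such as `AbsoluteLinkError` that are not subclasses of the two outside-destination errors.

```python
import io
import tarfile
import tempfile


def make_symlink_tar(name: str, linkname: str) -> bytes:
    """Build an in-memory tar archive containing a single symlink."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(name=name)
        info.type = tarfile.SYMTYPE
        info.linkname = linkname
        tar.addfile(info)
    return buf.getvalue()


def extract_with_data_filter(archive: bytes, dest: str) -> None:
    """Hypothetical helper: extract with filter='data', raising ValueError."""
    if not hasattr(tarfile, "data_filter"):
        raise RuntimeError("this Python build has no extraction filters")
    with tarfile.open(fileobj=io.BytesIO(archive)) as tar:
        try:
            tar.extractall(path=dest, filter="data")
        except tarfile.FilterError as exc:
            raise ValueError(str(exc)) from exc


if hasattr(tarfile, "data_filter"):
    with tempfile.TemporaryDirectory() as dest:
        try:
            extract_with_data_filter(make_symlink_tar("link", "../outside"), dest)
        except ValueError as exc:
            print("rejected:", exc)
```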

For Python 3.9-3.11:

  • Custom validation of all tar/zip members before extraction
  • Validates both file paths and symlink targets
  • Rejects symlinks pointing outside extraction directory
  • Handles relative and absolute symlinks
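
The custom validation is described here but not quoted. A minimal sketch of how a link target could be resolved and checked, assuming symlink targets are interpreted relative to the link's own directory and hard link targets relative to the archive root, is shown below. `link_escapes` is a hypothetical name and `/tmp/corpus` is an example directory; this is not the code from commit 337ac1d.

```python
import os
import tarfile


def link_escapes(dest: str, member: tarfile.TarInfo) -> bool:
    """Return True if the member's link target resolves outside dest."""
    abs_dest = os.path.abspath(dest)
    if member.issym():
        # Symlink targets are relative to the directory holding the link.
        base = os.path.dirname(os.path.join(abs_dest, member.name))
    else:
        # Hard link targets name another member, relative to the root.
        base = abs_dest
    # An absolute linkname discards base entirely in os.path.join.
    target = os.path.abspath(os.path.join(base, member.linkname))
    try:
        return os.path.commonpath([abs_dest, target]) != abs_dest
    except ValueError:
        # Different drives on Windows: treat as an escape.
        return True


def make_link(name: str, linkname: str, link_type: bytes) -> tarfile.TarInfo:
    """Build a TarInfo describing a symlink or hard link."""
    info = tarfile.TarInfo(name=name)
    info.type = link_type
    info.linkname = linkname
    return info


dest = "/tmp/corpus"  # hypothetical extraction directory
print(link_escapes(dest, make_link("data/link", "../../etc/passwd", tarfile.SYMTYPE)))
print(link_escapes(dest, make_link("data/link", "../other.txt", tarfile.SYMTYPE)))
print(link_escapes(dest, make_link("link", "/etc/passwd", tarfile.SYMTYPE)))
```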

Security improvements:

  • Prevents symlink-based directory escape attacks
  • Validates symlink targets resolve within the extraction directory
  • Added tests for symlink attacks (7 security tests total, all passing)
  • Updated documentation with symlink attack scenarios

The implementation properly addresses the symlink bug while maintaining backward compatibility with Python 3.9+.

Copilot AI changed the title from "Security: Fix path traversal in archive extraction and document deserialization risks" to "Security: Fix path traversal and symlink attacks in archive extraction" Jan 26, 2026
Member

bact commented Jan 26, 2026

@copilot apply changes based on the comments in this thread

…undant imports, update docs

- Add test_security to test_packages list in tests/core/__init__.py
- Remove redundant tarfile and zipfile imports (already imported at top)
- Update SECURITY_ANALYSIS_SUMMARY.md to clarify symlink protection is implemented
- All tests passing including security tests now in main suite

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Contributor Author

Copilot AI commented Jan 26, 2026

@copilot apply changes based on the comments in this thread

Applied all code review feedback in commit a24bd75:

Comment 2729047939 (symlink validation): Already addressed in commit 337ac1d - both tar and zip extraction now validate symlink targets for Python 3.12+ (via data_filter) and Python 3.9-3.11 (custom validation).

Comment 2729047981 (symlink tests): Already present in commit 337ac1d - test_safe_extract_tar_rejects_symlink_escape validates rejection of malicious symlinks.

Comment 2729048016 (test suite integration): ✅ Fixed - Added tests.core.test_security to test_packages list so security tests run in main suite.

Comment 2729048056 (documentation): ✅ Fixed - Updated SECURITY_ANALYSIS_SUMMARY.md to clarify symlink protection IS implemented with details on Python 3.12+ and 3.9-3.11 approaches.

Comment 2729048091 (redundant imports): ✅ Fixed - Removed redundant import tarfile and import zipfile statements (already imported at top of file).

All 7 security tests passing, including symlink attack validation.

bact added 3 commits January 26, 2026 20:41
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
@sonarqubecloud

Quality Gate failed

Failed conditions
6 Security Hotspots
E Reliability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud


@bact bact added the security label Jan 26, 2026
@bact bact merged commit 1ed8149 into dev Jan 26, 2026
30 of 32 checks passed
@bact bact added this to the 5.3 milestone Jan 30, 2026
@bact bact deleted the copilot/identify-security-issues branch January 30, 2026 13:59