Skip to content

Windows: load() accepts drive-relative archive member names #2

@pomponchik

Description

@pomponchik

DirectoryIsolate.load() validates names taken from a tar archive before
writing their contents into a staging directory. It rejects ordinary absolute
paths such as /outside.txt, Windows absolute paths such as
C:/outside.txt, backslashes and .. traversal.

On Windows it does not reject a different Windows path form:

C:payload.txt

This is not a normal filename on Windows. It is a drive-relative path: it
refers to payload.txt relative to the current directory remembered for drive
C:. In particular, it is not the same as either:

payload.txt       # an ordinary relative member name
C:/payload.txt    # an absolute path on drive C:

The current validation uses PurePosixPath and only rejects a first path
component exactly equal to C:. Therefore it rejects C:/payload.txt, whose
first component is C:, but it accepts C:payload.txt, whose first component
is the longer string C:payload.txt.

Why This Matters

During load, the implementation constructs a destination path approximately as
follows:

target_path = staging_directory / normalized_archive_member_name

On Windows, joining a staging directory with a drive-relative member has
special behavior:

from pathlib import PureWindowsPath

stage = PureWindowsPath(r'C:\Temp\throng-load-123')

assert stage / 'C:payload.txt' == PureWindowsPath(r'C:\Temp\throng-load-123\payload.txt')
assert stage / 'D:payload.txt' == PureWindowsPath(r'D:payload.txt')

There are two observable problems:

  1. If the member names the same drive as the staging directory, the name is
    not preserved: an archive member named C:payload.txt is restored as
    payload.txt.
  2. If the member names a different available drive, the joined destination is
    no longer rooted in the staging directory. The write can occur relative to
    that drive's current directory instead of inside the temporary staging tree.

The second case is the serious one: loading an archive must never write outside
the staging directory before commit.

OS Boundary

This problem is Windows-specific.

On POSIX systems, C:payload.txt is an ordinary legal filename containing a
colon. It has no drive semantics and remains below the staging directory:

from pathlib import PurePosixPath

stage = PurePosixPath('/tmp/throng-load-123')
assert stage / 'C:payload.txt' == PurePosixPath('/tmp/throng-load-123/C:payload.txt')

Consequently, a fix should not globally reject colon-containing archive member
names on Linux or macOS. That would change valid POSIX behavior and prevent a
POSIX archive containing a legal file named C:payload.txt from round-tripping
on POSIX.

When such a POSIX-created archive is loaded on Windows, rejecting the member is
appropriate: its meaning on the destination operating system is unsafe or
ambiguous. Archive load validation must follow the rules of the system where
files will actually be written.

Situations Where This Can Occur

This can occur when all of the following are true:

  1. load() is executed on Windows.
  2. The loaded bytes are not guaranteed to have been produced by a trusted
    current version of DirectoryIsolate.dump(), for example:
    • an archive is received from another process, machine or user;
    • an archive was generated with another tar implementation;
    • an archive was created on POSIX and contains a colon in a filename;
    • persisted archive data has been modified or corrupted.
  3. The archive includes a regular-file member whose name has Windows drive
    syntax without a root separator, such as C:payload.txt or
    D:nested.txt.

Writing outside staging additionally requires that the member identifies a
different usable drive from the staging directory. For example, if staging is
on C: and drive D: is available, D:payload.txt has the escaping behavior.

Situations Where This Cannot Occur

This issue does not arise:

  • during load() on Linux or macOS, because those systems do not interpret
    drive-letter syntax;
  • for archives containing only ordinary portable relative members such as
    dir/file.txt;
  • for Windows absolute members such as C:/file.txt or C:\file.txt, because
    the current validation already rejects them;
  • when bytes are produced by the current dump() from a Windows directory,
    because Windows cannot create an ordinary file whose name contains :.

The final point does not make load() safe for external archives: load() is
already designed to validate hostile or malformed archive member names.

Minimal Reproduction Of The Validation Gap

The following is a non-destructive Windows reproduction. It uses the same
drive as the system temporary directory, so it demonstrates that the invalid
member is accepted and changed into another filename without intentionally
writing outside a temporary test tree.

from io import BytesIO
from pathlib import Path
from tarfile import TarInfo, open as open_tar
from tempfile import TemporaryDirectory, gettempdir

from throng.plugins.temporary_directory_throng import (
    TemporaryDirectoryIsolationConfig,
    TemporaryDirectoryThrong,
)


drive = Path(gettempdir()).drive
assert drive, 'Run this reproduction on Windows.'
member_name = f'{drive}payload.txt'

archive_buffer = BytesIO()
with open_tar(fileobj=archive_buffer, mode='w:') as archive:
    payload = b'outside-safe-reproduction'
    member = TarInfo(member_name)
    member.size = len(payload)
    archive.addfile(member, BytesIO(payload))

with TemporaryDirectory() as base_directory:
    isolate = TemporaryDirectoryThrong(
        config=TemporaryDirectoryIsolationConfig(
            base_directory=base_directory,
            compression='none',
        ),
    ).get_isolate()

    isolate.load(archive_buffer.getvalue())

    restored_names = [path.name for path in isolate.directory.iterdir()]
    print(member_name)                                # for example: C:payload.txt
    print(restored_names)                             # ['payload.txt']
    assert restored_names == ['payload.txt']
    assert (isolate.directory / 'payload.txt').read_bytes() == payload

Expected secure behavior is for load() to reject the archive before writing
any member. Current behavior accepts it and restores a differently named
file. The example deliberately checks iterdir() names: constructing
isolate.directory / member_name again would trigger Windows drive-relative
path interpretation again and would not inspect the literal archive name.

The outside-staging consequence can be demonstrated only in a controlled
Windows environment with two disposable accessible drives. An archive member
using the drive different from the staging directory must be used there; it
should not be used as a general CI reproduction because it deliberately tests
a write destination outside staging.

Minimal Regression Test

This test states the desired Windows-only contract without trying to create an
outside-staging write. It should fail against the current implementation and
pass after the validation is fixed.

from os import name as os_name

import pytest

from full_match import match
from tests.helpers import make_tar_bytes
from throng import ArchiveUnpackError


@pytest.mark.skipif(os_name != 'nt', reason='Windows drive-relative archive paths have no special meaning on POSIX')
def test_load_rejects_windows_drive_relative_member_path(temporary_isolate):
    """Verify that load rejects a Windows drive-relative path before it can leave staging."""
    isolate = temporary_isolate(compression='none')

    with pytest.raises(
        ArchiveUnpackError,
        match=match('Archive contains an unsafe Windows drive-relative path: C:payload.txt'),
    ):
        isolate.load(make_tar_bytes({'C:payload.txt': b'unsafe'}))

The test is intentionally Windows-only. A POSIX test requiring rejection of
the same string would impose a Windows filename rule on systems where the
filename is harmless and valid.

Proposed Fix

Preferred Plan: Windows-Only Validation

  1. Keep the existing POSIX-oriented checks unchanged for every operating
    system.
  2. When running on Windows, additionally parse the raw archive member name
    using Windows path rules.
  3. Reject any member with a drive component, including both:
    • absolute drive paths: C:/payload.txt;
    • drive-relative paths: C:payload.txt.
  4. Add the Windows-only regression test above.
  5. Keep existing tests for absolute, traversal and backslash paths.

For example, the additional check could be based on PureWindowsPath:

from os import name as os_name
from pathlib import PureWindowsPath

if os_name == 'nt' and PureWindowsPath(raw_name).drive:
    raise ArchiveUnpackError(f'Archive contains an unsafe Windows drive path: {raw_name}')

The exact error text can preserve the current distinction between absolute and
drive-relative paths if that is preferable for diagnostics. The significant
requirement is that the extra rejection is active only when extraction targets
Windows paths.

Alternative Plan: Reject Non-Portable Members Everywhere

The implementation could reject Windows drive syntax on every operating
system, making loaded archives portable across all supported platforms.

This is simpler to reason about, but it changes POSIX behavior: a legal POSIX
file named C:payload.txt would no longer be loadable. Because the bug arises
from Windows path interpretation, this alternative is not preferred unless the
project explicitly chooses a portable-subset archive format.

Acceptance Criteria

  • On Windows, load() rejects C:payload.txt and D:nested.txt before any
    content is written.
  • Existing rejection of /outside.txt, C:/outside.txt, backslash paths and
    traversal paths remains unchanged.
  • On POSIX, loading a regular member literally named C:payload.txt continues
    to work unless the project deliberately adopts the portable-subset
    alternative.
  • The Windows-only regression test is included and coverage remains complete.

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No fields configured for Bug.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions