[PATCH] urllib.parse: Restrict IPv6 ZoneID characters to RFC 6874-compliant set

mauricelambert · web-flow · commit 58f0d09e888c · 2025-07-27T14:50:44.000Z
The current parsing logic for IPv6 addresses with Zone Identifiers (ZoneIDs)
relies on the `ipaddress` module, which validates ZoneIDs according to RFC 4007
and accepts any non-empty string. In a URL, however, a ZoneID must follow
the percent-encoded format defined in RFC 6874.
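
To illustrate the leniency described above, the following sketch calls `ipaddress` directly with the same host used in the examples further down. It assumes Python 3.9 or later, where `IPv6Address.scope_id` is available.

```py
import ipaddress

# ipaddress splits the address at "%" and keeps the remainder as the
# scope (zone) ID; it does not restrict that remainder to the
# RFC 6874 character set.
ip = ipaddress.ip_address("::1%2|test")

print(type(ip).__name__)  # IPv6Address
print(ip.scope_id)        # 2|test
```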

This patch adds a check to restrict ZoneIDs to the allowed characters:

  ALPHA / DIGIT / "-" / "." / "_" / "~" / "%" HEXDIG HEXDIG

RFC 6874 §2 specifies the format of an IPv6 address with a ZoneID in a URI as:
  `IPv6addrz = IPv6address "%25" ZoneID`

Additionally, RFC 6874 recommends accepting a bare `%` without hex digits as a
liberal extension, but that leniency does not widen the set of characters
allowed in the ZoneID itself. This patch therefore keeps accepting a bare `%`
delimiter and rejects ZoneIDs that contain characters outside the permitted set.
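
The character check can be sketched on its own as follows. This is a minimal illustration, not the code added by the patch: `is_valid_zone_id` and `_ZONE_ID_RE` are hypothetical names, and the character classes are spelled out in ASCII on the assumption that non-ASCII letters should be rejected.

```py
import re

# unreserved / pct-encoded, written with explicit ASCII classes.
# (In a str pattern, \w would also match non-ASCII letters and digits.)
_ZONE_ID_RE = re.compile(r"(?:%[0-9A-Fa-f]{2}|[A-Za-z0-9._~-])+")


def is_valid_zone_id(zone_id: str) -> bool:
    """Return True if zone_id is non-empty and uses only permitted characters."""
    # fullmatch() anchors at both ends, so no explicit \A or end anchor is needed.
    return _ZONE_ID_RE.fullmatch(zone_id) is not None


print(is_valid_zone_id("25eth0"))  # True
print(is_valid_zone_id("eth0"))    # True
print(is_valid_zone_id("2|test"))  # False
print(is_valid_zone_id(""))        # False
```

Here the argument is whatever follows the first `%` in the bracketed host, so both `[fe80::1%25eth0]` (remainder `25eth0`) and the bare form `[fe80::1%eth0]` (remainder `eth0`) would pass.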

### Before the fix:

```py
>>> import urllib.parse
>>> urllib.parse.urlparse("http://[::1%2|test]/path")
ParseResult(scheme='http', netloc='[::1%2|test]', path='/path', ...)
```

Invalid characters such as `|` were incorrectly accepted in ZoneIDs.

### After the fix:

```py
>>> import urllib.parse
>>> urllib.parse.urlparse("http://[::1%2|test]/path")
Traceback (most recent call last):
    ...
ValueError: IPv6 ZoneID is invalid
```
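
Code that parses untrusted URLs may want to treat this error like any other malformed-URL error. The sketch below assumes a hypothetical helper, `parse_or_none`, that is not part of `urllib.parse`; it relies only on the fact that `urlparse` already raises `ValueError` for malformed bracketed hosts.

```py
from urllib.parse import urlparse


def parse_or_none(url: str):
    """Hypothetical helper: return the parsed URL, or None if it is rejected."""
    try:
        return urlparse(url)
    except ValueError:
        # Covers existing errors such as "Invalid IPv6 URL" (unbalanced
        # brackets) as well as the ZoneID error this patch would add.
        return None


print(parse_or_none("http://[::1]/path").hostname)  # ::1
print(parse_or_none("http://[::1/path"))            # None
```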

With this patch, `urllib.parse` rejects ZoneIDs that contain invalid characters.
This improves compliance with RFC 6874 and keeps malformed hosts such as
`[::1%2|test]` out of `netloc`, where they could cause subtle bugs or security
vulnerabilities in downstream code.
diff --git a/Lib/urllib/parse.py b/Lib/urllib/parse.py
@@ -466,6 +466,8 @@ def _check_bracketed_host(hostname):
         ip = ipaddress.ip_address(hostname) # Throws Value Error if not IPv6 or IPv4
         if isinstance(ip, ipaddress.IPv4Address):
             raise ValueError(f"An IPv4 address cannot be in brackets")
+        if "%" in hostname and not re.match(r"\A(%[a-fA-F0-9]{2}|[A-Za-z0-9._~-])+\z", hostname.split("%", 1)[1]):
+            raise ValueError(f"IPv6 ZoneID is invalid")
 
 # typed=True avoids BytesWarnings being emitted during cache key
 # comparison since this API supports both bytes and str input.