Commit 2d53a49

Cherry picked 2f630e1, with some minor fixups by hand
1 parent ecd9946 commit 2d53a49

4 files changed: +99 -2 lines changed

Doc/library/urllib.parse.rst

Lines changed: 32 additions & 0 deletions
@@ -338,6 +338,9 @@ or on combining URL components into a URL string.
    .. versionchanged:: 3.7.17
       Leading WHATWG C0 control and space characters are stripped from the URL.

+   .. versionchanged:: 3.7.17.1
+      Leading WHATWG C0 control and space characters are stripped from the URL.
+
 .. _WHATWG spec: https://url.spec.whatwg.org/#concept-basic-url-parser

 .. function:: urlunsplit(parts)
@@ -428,6 +431,35 @@ code before trusting a returned component part. Does that ``scheme`` make
 sense? Is that a sensible ``path``? Is there anything strange about that
 ``hostname``? etc.

+.. _url-parsing-security:
+
+URL parsing security
+--------------------
+
+The :func:`urlsplit` and :func:`urlparse` APIs do not perform **validation** of
+inputs. They may not raise errors on inputs that other applications consider
+invalid. They may also succeed on some inputs that might not be considered
+URLs elsewhere. Their purpose is for practical functionality rather than
+purity.
+
+Instead of raising an exception on unusual input, they may instead return some
+component parts as empty strings. Or components may contain more than perhaps
+they should.
+
+We recommend that users of these APIs where the values may be used anywhere
+with security implications code defensively. Do some verification within your
+code before trusting a returned component part. Does that ``scheme`` make
+sense? Is that a sensible ``path``? Is there anything strange about that
+``hostname``? etc.
+
+What constitutes a URL is not universally well defined. Different applications
+have different needs and desired constraints. For instance the living `WHATWG
+spec`_ describes what user facing web clients such as a web browser require.
+While :rfc:`3986` is more general. These functions incorporate some aspects of
+both, but cannot be claimed compliant with either. The APIs and existing user
+code with expectations on specific behaviors predate both standards leading us
+to be very cautious about making API behavior changes.
+
 .. _parsing-ascii-encoded-bytes:

 Parsing ASCII Encoded Bytes
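
The defensive checks recommended in the added "URL parsing security" section could look roughly like the sketch below. The function name `is_trusted_url` and both allow-lists are hypothetical and not part of this commit; the policy is only one possible set of checks.

```python
from urllib.parse import urlsplit

# Hypothetical policy for illustration only; adjust to the application.
ALLOWED_SCHEMES = {"http", "https"}
ALLOWED_HOSTS = {"www.python.org", "docs.python.org"}


def is_trusted_url(url):
    """Return True only if the parsed components pass explicit checks."""
    try:
        parts = urlsplit(url)
        # Accessing .port raises ValueError for a non-numeric or
        # out-of-range port, so it is read inside the try block.
        port = parts.port
    except ValueError:
        return False
    # urlsplit does not validate, so each component is checked by hand.
    if parts.scheme not in ALLOWED_SCHEMES:
        return False
    if parts.hostname not in ALLOWED_HOSTS:
        return False
    # Reject embedded credentials, a common source of confusion.
    if parts.username is not None or parts.password is not None:
        return False
    return port in (None, 80, 443)
```

An allow-list is used here rather than a blocklist because unexpected input then fails closed instead of slipping past a pattern that did not anticipate it.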

Lib/test/test_urlparse.py

Lines changed: 59 additions & 0 deletions
@@ -719,6 +719,65 @@ def test_urlsplit_strip_url(self):
             self.assertEqual(p.scheme, "https")
             self.assertEqual(p.geturl(), "https://www.python.org/")

+    def test_urlsplit_strip_url(self):
+        noise = bytes(range(0, 0x20 + 1))
+        base_url = "http://User:Pass@www.python.org:080/doc/?query=yes#frag"
+
+        url = noise.decode("utf-8") + base_url
+        p = urllib.parse.urlsplit(url)
+        self.assertEqual(p.scheme, "http")
+        self.assertEqual(p.netloc, "User:Pass@www.python.org:080")
+        self.assertEqual(p.path, "/doc/")
+        self.assertEqual(p.query, "query=yes")
+        self.assertEqual(p.fragment, "frag")
+        self.assertEqual(p.username, "User")
+        self.assertEqual(p.password, "Pass")
+        self.assertEqual(p.hostname, "www.python.org")
+        self.assertEqual(p.port, 80)
+        self.assertEqual(p.geturl(), base_url)
+
+        url = noise + base_url.encode("utf-8")
+        p = urllib.parse.urlsplit(url)
+        self.assertEqual(p.scheme, b"http")
+        self.assertEqual(p.netloc, b"User:Pass@www.python.org:080")
+        self.assertEqual(p.path, b"/doc/")
+        self.assertEqual(p.query, b"query=yes")
+        self.assertEqual(p.fragment, b"frag")
+        self.assertEqual(p.username, b"User")
+        self.assertEqual(p.password, b"Pass")
+        self.assertEqual(p.hostname, b"www.python.org")
+        self.assertEqual(p.port, 80)
+        self.assertEqual(p.geturl(), base_url.encode("utf-8"))
+
+        # Test that trailing space is preserved as some applications rely on
+        # this within query strings.
+        query_spaces_url = "https://www.python.org:88/doc/?query= "
+        p = urllib.parse.urlsplit(noise.decode("utf-8") + query_spaces_url)
+        self.assertEqual(p.scheme, "https")
+        self.assertEqual(p.netloc, "www.python.org:88")
+        self.assertEqual(p.path, "/doc/")
+        self.assertEqual(p.query, "query= ")
+        self.assertEqual(p.port, 88)
+        self.assertEqual(p.geturl(), query_spaces_url)
+
+        p = urllib.parse.urlsplit("www.pypi.org ")
+        # That "hostname" gets considered a "path" due to the
+        # trailing space and our existing logic... YUCK...
+        # and re-assembles via geturl aka unurlsplit into the original.
+        # django.core.validators.URLValidator (at least through v3.2) relies on
+        # this, for better or worse, to catch it in a ValidationError via its
+        # regular expressions.
+        # Here we test the basic round trip concept of such a trailing space.
+        self.assertEqual(urllib.parse.urlunsplit(p), "www.pypi.org ")
+
+        # with scheme as cache-key
+        url = "//www.python.org/"
+        scheme = noise.decode("utf-8") + "https" + noise.decode("utf-8")
+        for _ in range(2):
+            p = urllib.parse.urlsplit(url, scheme=scheme)
+            self.assertEqual(p.scheme, "https")
+            self.assertEqual(p.geturl(), "https://www.python.org/")
+
     def test_attributes_bad_port(self):
         """Check handling of invalid ports."""
         for bytes in (False, True):
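
The trailing-space behavior described in the test comments can be reproduced outside the test suite with a short sketch. It uses only the public `urlsplit` and `urlunsplit` functions; the variable names are illustrative.

```python
from urllib.parse import urlsplit, urlunsplit

# With no scheme and no "//" prefix, the whole string lands in the path
# component, trailing space included.
bare = urlsplit("www.pypi.org ")
round_trip = urlunsplit(bare)

# Trailing whitespace inside a query string is likewise left untouched.
with_query = urlsplit("https://www.python.org:88/doc/?query= ")

print(repr(bare.netloc), repr(bare.path), repr(round_trip))
print(repr(with_query.query))
```

Because the input reassembles unchanged, downstream validators that reject a trailing space still see the original string.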

Lib/urllib/parse.py

Lines changed: 5 additions & 2 deletions
@@ -432,12 +432,15 @@ def urlsplit(url, scheme='', allow_fragments=True):
     Note that we don't break the components up in smaller bits
     (e.g. netloc is a single string) and we don't expand % escapes."""
     url, scheme, _coerce_result = _coerce_args(url, scheme)
-    url = _remove_unsafe_bytes_from_url(url)
-    scheme = _remove_unsafe_bytes_from_url(scheme)
     # Only lstrip url as some applications rely on preserving trailing space.
     # (https://url.spec.whatwg.org/#concept-basic-url-parser would strip both)
     url = url.lstrip(_WHATWG_C0_CONTROL_OR_SPACE)
     scheme = scheme.strip(_WHATWG_C0_CONTROL_OR_SPACE)
+
+    for b in _UNSAFE_URL_BYTES_TO_REMOVE:
+        url = url.replace(b, "")
+        scheme = scheme.replace(b, "")
+
     allow_fragments = bool(allow_fragments)
     key = url, scheme, allow_fragments, type(url), type(scheme)
     cached = _parse_cache.get(key, None)
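
The clean-up order in this hunk can be sketched as a standalone function. The name `sanitize` is hypothetical, and the two constants are restated here on the assumption that they match the private `_WHATWG_C0_CONTROL_OR_SPACE` and `_UNSAFE_URL_BYTES_TO_REMOVE` values in `Lib/urllib/parse.py`, which this diff does not show.

```python
# Assumed values, restated so the sketch is self-contained: every code point
# from U+0000 through U+0020 (C0 controls plus space), and the three
# characters that are removed wherever they occur.
WHATWG_C0_CONTROL_OR_SPACE = "".join(chr(c) for c in range(0x21))
UNSAFE_URL_CHARS_TO_REMOVE = ("\t", "\r", "\n")


def sanitize(url, scheme=""):
    """Mirror the order of operations shown in the hunk above."""
    # Only the left side of the URL is stripped; trailing space is kept.
    url = url.lstrip(WHATWG_C0_CONTROL_OR_SPACE)
    # The default scheme is stripped on both sides.
    scheme = scheme.strip(WHATWG_C0_CONTROL_OR_SPACE)
    # Tab, carriage return, and newline are removed everywhere.
    for ch in UNSAFE_URL_CHARS_TO_REMOVE:
        url = url.replace(ch, "")
        scheme = scheme.replace(ch, "")
    return url, scheme
```

Stripping before building the cache key means that noisy and clean forms of the same input share one cache entry, which is what the "with scheme as cache-key" test exercises.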
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+:func:`urllib.parse.urlsplit` now strips leading C0 control and space
+characters following the specification for URLs defined by WHATWG in
+response to CVE-2023-24329. Patch by Illia Volochii.
