-
-
Notifications
You must be signed in to change notification settings - Fork 33.3k
gh-138284 : urllib.parse.parse_qsl now raises ValueError if illegal characters is passed, according to RFC 3986 #138291
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
|
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
Lib/urllib/parse.py
Outdated
| _UNSAFE_URL_BYTES_TO_REMOVE = ['\t', '\r', '\n'] | ||
|
|
||
| # Allowed valid characters in parse_qsl | ||
| _VALID_QUERY_CHARS = re.compile(r"^[A-Za-z0-9\-._~!$&'()*+,;=:@/?%]*$") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be replaced with str.isascii, str.isdecimal and a strings with the others, this should be faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I will do it and add new commit.
|
Please add a NEWS entry, and this does break existing code. |
…k for performance
|
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
|
@StanFromIreland I have added your suggestion. Can you please review it again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am the reporter of the issue which is being solved here. The changes look good to me, and I think they solve the issue very well. Approving from my end.
Note: @Davda-James do get it reviewed by Stan.
Suggestion: Squash your commits below to have a single commit. It is a good practice to have :)
cc: @StanFromIreland
|
I have requested the expert for this module.
Please do not, it confuses gh making it difficult to review. They will be squashed when merged anyway. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming that this change is expected, the following must be done:
- The documentation of
urllib.parse.parse_qslmust be updated accordingly. - Test coverage must be increased.
I'm not entirely sure that we necessarily need to consider this as a bug fix. The rationale is as follows:
The urlsplit() and urlparse() APIs do not perform validation of inputs. They may not raise errors on inputs that other applications consider invalid. They may also succeed on some inputs that might not be considered URLs elsewhere. Their purpose is for practical functionality rather than purity.
I do not know whether we should consider this is a pitfall or not.
Misc/NEWS.d/next/Library/2025-08-31-13-00-22.gh-issue-138284.6MOp4k.rst
Outdated
Show resolved
Hide resolved
Lib/urllib/parse.py
Outdated
| raise ValueError("bad query field: %r" % (name_value,)) | ||
| if strict_parsing: | ||
| # Validate RFC3986 characters | ||
| to_check = (name_value.decode() if isinstance(name_value, bytes) else name_value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use _unquote as this handles the %-encoded values and takes care of the encoding parameter as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if strict_parsing:
# Validate RFC3986 characters
to_check = _unquote(name_value)
if isinstance(to_check, (bytes, bytearray)):
to_check = to_check.decode(encoding, errors)
if not _is_valid_rfc3986_query(to_check): using like this is it good as we need to decode back as _unquote returns bytes and _is_valid_rfc3986_query accepts the string ?
…MOp4k.rst Updated it according to suggestion Co-authored-by: Bénédikt Tran <[email protected]>
|
Hello @Davda-James any updates on the suggestion provided? |
|
@shloktech have changed according to suggestions but it was left upto @orsenthil if we needed this or not .. if acknowledgement comes from him, will then update docs and test coverage |
|
Hello @orsenthil can you please help with review and approval? CC: @StanFromIreland / @picnixz / @Davda-James |
|
Hello @orsenthil, Gentle Reminder 3: This request has been pending review on your end for nearly 2 months, so it would be great if we could expedite the process. CC: @StanFromIreland / @picnixz / @Davda-James Thank you for your time and support! |
|
Please avoid tagging unrelated users. Usually, we ask for experts to chime in and we have a huge PR backlog.
Some PRs sit for even longer and the words "expedite the process" won't make the process faster. I would also suggest not to use LLM-generated answers. Finally, as it was said, this PR is still not in a mergeable state as docs and tests are lacking (on purpose, so it's fine for now). @serhiy-storchaka What do you think about this one? I personally don't consider this as a bug because we explicitly say:
I'm inclined to say "well, why not" since this check is only performed if strict is true but OTOH, it could open a can of worms where users would want more and more checks. |
|
Hello @picnixz,
Sure, I will ensure to tag relevant users. I agree I have got this bad habit of tagging top contributors of every repo to which I contribute to it is a simple hack which I found to get things done quickly :) Will ensure to not do it for this repo.
As for the urlsplit() behavior, I agree with your point. let's wait for @serhiy-storchaka reply.
Yes, the previous message was enhanced using AI have ensured to not do it for this one :) Happy Sunday!! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a comment on the issue.
As for this concrete code, I have two comments:
- Valid characters should be checked before calling
_unquote().%5eis valid even if^is not. - The code is not particularly efficient. Perhaps using regular expressions will be more efficient. You can also check the whole input instead of separate fields.
urllib.parse.parse_qsl earlier it was accepting the illegal characters as well.
Proof (that I reproduce) :

Closes issue : #138284
Proof (after fixing error):

I added the test for it as well.

Test for urlparse only :
All tests:

urllib.parse.parse_qslis accepting illegal characters #138284