
fix UrlReader did not support URL like http://filename.txt?a=1&b=2 #5782

Closed

levalup wants to merge 1 commit into biolab:master from levalup:fix_url_with_params

Conversation

@levalup
Contributor

@levalup levalup commented Jan 14, 2022

Issue
Description of changes

Fix UrlReader not supporting URLs like http://filename.txt?a=1&b=2, which were quoted into http://filename.txt%3Fa%3D1%26b%3D2
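The mangling described above can be reproduced directly with urllib.parse.quote (a minimal illustration using quote's default safe='/', not code from this PR; the exact safe set used in Orange may differ):

```python
from urllib.parse import quote

# quote() percent-escapes '?', '&', and '=' unless they are listed as safe,
# so running it over a full URL destroys the query-string delimiters.
url = "http://filename.txt?a=1&b=2"
print(quote(url))                # http%3A//filename.txt%3Fa%3D1%26b%3D2
print(quote(url, safe=":/?&="))  # unchanged: http://filename.txt?a=1&b=2
```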

Includes
  • Code changes
  • Tests
  • Documentation

@CLAassistant

CLAassistant commented Jan 14, 2022

CLA assistant check
All committers have signed the CLA.

@markotoplak
Member

Thanks!

I believe the particular issue, #5723, is a symptom of some deeper problem, so the fix should likely be more general. While this may fix the particular issue with Google Sheets, are there any other cases where running quote on a string may destroy the address? Judging by the documentation of the quote function, there are. Did you look into this as well?

quote('abc def') -> 'abc%20def'

    Each part of a URL, e.g. the path info, the query, etc., has a
    different set of reserved characters that must be quoted. The
    quote function offers a cautious (not minimal) way to quote a
    string for most of these parts.

    RFC 3986 Uniform Resource Identifier (URI): Generic Syntax lists
    the following (un)reserved characters.

    unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
    reserved      = gen-delims / sub-delims
    gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
    sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

    Each of the reserved characters is reserved in some component of a URL,
    but not necessarily in all of them.

    The quote function %-escapes all characters that are neither in the
    unreserved chars ("always safe") nor the additional chars set via the
    safe arg.

Also, could you explain your reasoning for handling ? as an exception rather than adding it to the list of safe characters for quote?
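For reference, the alternative being asked about here, passing extra delimiters through quote's safe argument instead of special-casing them, would look roughly like this (a sketch of the idea, not code from the PR):

```python
from urllib.parse import quote

# Default behaviour: only '/' is safe, so spaces are escaped
# and '?' would be escaped too.
print(quote("/data/some file.txt"))  # /data/some%20file.txt

# Adding '?' and '=' to safe leaves them untouched, at the cost of
# no longer escaping them when they appear as literal data.
print(quote("/a?b=1", safe="/?="))   # /a?b=1
```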

Finally, we need better testing. #5412 was merged because it solved a particular problem, but I do not see tests that cover its effect. We need to add a URL that fails without quoting but works with the current version, just to ensure that we do not break anything additionally.

@codecov

codecov bot commented Jan 14, 2022

Codecov Report

Merging #5782 (255c9f2) into master (5984653) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #5782   +/-   ##
=======================================
  Coverage   86.12%   86.13%           
=======================================
  Files         316      316           
  Lines       66400    66412   +12     
=======================================
+ Hits        57186    57201   +15     
+ Misses       9214     9211    -3     

@levalup
Contributor Author

levalup commented Jan 14, 2022

@markotoplak
This problem starts with a download link from MinIO, which looks like http://127.0.0.1:9000/data/d321fa3c6aba/a.txt?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=kier%2F20220114%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20220114T103309Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=f7c67d3280b17255f15949755483e05f6d5d223a1f2aceeb3d2fec31e30ec511.
Well, the real reason for this issue and #5723 is that an already encoded URL should not be redundantly encoded with quote again. Once a URL is encoded twice, http://filename.txt?a=1&b=2 becomes http://filename.txt%3Fa%3D1%26b%3D2, which can no longer be parsed correctly, since & and = lose their meaning.
Issue #5412 arose because the URL was not properly encoded, so quoting had to be applied.
Sadly, there is no direct way to tell whether a URL has already been encoded; ideally, a URL would be properly encoded as soon as it is created.
Based on the existing test cases and the URL components below, the code splits out the path, which may include special characters, and quotes only the path, leaving the other parts unquoted.

def urlparse(url, scheme='', allow_fragments=True):
    """Parse a URL into 6 components:
    <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
    Return a 6-tuple: (scheme, netloc, path, params, query, fragment).
    Note that we don't break the components up in smaller bits
    (e.g. netloc is a single string) and we don't expand % escapes."""
    url, scheme, _coerce_result = _coerce_args(url, scheme)
    splitresult = urlsplit(url, scheme, allow_fragments)
    scheme, netloc, url, query, fragment = splitresult
    if scheme in uses_params and ';' in url:
        url, params = _splitparams(url)
    else:
        params = ''
    result = ParseResult(scheme, netloc, url, params, query, fragment)
    return _coerce_result(result)
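The split-then-quote-only-the-path approach described above can be sketched like this (my illustration of the idea, not the exact patch):

```python
from urllib.parse import quote, urlsplit, urlunsplit

def quote_path_only(url):
    """Percent-encode only the path component, leaving the netloc,
    query, and fragment untouched so their delimiters keep meaning."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    return urlunsplit((scheme, netloc, quote(path), query, fragment))

url = "http://127.0.0.1:9000/some file.txt?a=1&b=2"
print(quote_path_only(url))  # http://127.0.0.1:9000/some%20file.txt?a=1&b=2
```

Note that a path already containing % would still be double-encoded by this sketch, which matches the caveat below.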

Maybe this is not a perfect solution when the netloc or path includes delimiters like %, but it is perhaps the least invasive change that fixes the above problems.
I added some new tests on UrlReader; they may help explain why I did this.

Love & peace; ^_^

@levalup
Contributor Author

levalup commented Jan 14, 2022

Hi~
I think I now have the whole picture of the problem. I am considering a more "perfect" solution, so this PR can be closed. I will open another PR as soon as possible. ^_^

@levalup levalup closed this Jan 14, 2022