
fix UrlReader did not support URL like http://filename.txt?a=1&b=2 #5782

Closed

levalup wants to merge 1 commit into biolab:master from levalup:fix_url_with_params

Conversation

@levalup
Contributor

@levalup levalup commented Jan 14, 2022

Issue
Description of changes

Fix UrlReader not supporting URLs like http://filename.txt?a=1&b=2, which were quoted into http://filename.txt%3Fa%3D1%26b%3D2
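The mangling described above can be reproduced directly with urllib.parse.quote (a minimal illustration using quote's default safe='/', not code from this PR; the exact safe set used in Orange may differ):

```python
from urllib.parse import quote

# quote() percent-escapes '?', '&', and '=' unless they are listed as safe,
# so running it over a full URL destroys the query-string delimiters.
url = "http://filename.txt?a=1&b=2"
print(quote(url))                # http%3A//filename.txt%3Fa%3D1%26b%3D2
print(quote(url, safe=":/?&="))  # unchanged: http://filename.txt?a=1&b=2
```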

Includes
  • Code changes
  • Tests
  • Documentation

@CLAassistant

CLAassistant commented Jan 14, 2022

CLA assistant check
All committers have signed the CLA.

@markotoplak
Member

Thanks!

I believe the particular issue, #5723, is a symptom of some deeper problem, so the fix should likely be more general. While this may fix the particular issue with Google Sheets, are there any other cases where running quote on a string may destroy the address? Judging by the documentation of the quote function, there are. Did you look into this as well?

quote('abc def') -> 'abc%20def'

    Each part of a URL, e.g. the path info, the query, etc., has a
    different set of reserved characters that must be quoted. The
    quote function offers a cautious (not minimal) way to quote a
    string for most of these parts.

    RFC 3986 Uniform Resource Identifier (URI): Generic Syntax lists
    the following (un)reserved characters.

    unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
    reserved      = gen-delims / sub-delims
    gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
    sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

    Each of the reserved characters is reserved in some component of a URL,
    but not necessarily in all of them.

    The quote function %-escapes all characters that are neither in the
    unreserved chars ("always safe") nor the additional chars set via the
    safe arg.

Also, could you explain your reasoning for handling ? as an exception rather than adding it to the list of safe characters for quote?
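For reference, the alternative being asked about here, passing extra delimiters through quote's safe argument instead of special-casing them, would look roughly like this (a sketch of the idea, not code from the PR):

```python
from urllib.parse import quote

# Default behaviour: only '/' is safe, so spaces are escaped
# and '?' would be escaped too.
print(quote("/data/some file.txt"))  # /data/some%20file.txt

# Adding '?' and '=' to safe leaves them untouched, at the cost of
# no longer escaping them when they appear as literal data.
print(quote("/a?b=1", safe="/?="))   # /a?b=1
```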

Finally, we need better testing. #5412 was merged because it solved a particular problem, but I do not see tests that cover its effect. We need to add a URL that fails without quoting but works with the current version, just to ensure that we do not break anything additionally.

@codecov

codecov bot commented Jan 14, 2022

Codecov Report

Merging #5782 (255c9f2) into master (5984653) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #5782   +/-   ##
=======================================
  Coverage   86.12%   86.13%           
=======================================
  Files         316      316           
  Lines       66400    66412   +12     
=======================================
+ Hits        57186    57201   +15     
+ Misses       9214     9211    -3     

@levalup
Contributor Author

levalup commented Jan 14, 2022

@markotoplak
This problem starts with a download link from MinIO, which looks like http://127.0.0.1:9000/data/d321fa3c6aba/a.txt?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=kier%2F20220114%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20220114T103309Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=f7c67d3280b17255f15949755483e05f6d5d223a1f2aceeb3d2fec31e30ec511.
Well, the real reason for this issue and #5723 is that an already encoded URL should not be redundantly encoded with quote again. Once a URL is encoded twice, http://filename.txt?a=1&b=2 becomes http://filename.txt%3Fa%3D1%26b%3D2, which can no longer be parsed correctly, since & and = lose their meaning.
Issue #5412 arose because the URL was not properly encoded, so quoting had to be applied.
Sadly, there is no direct way to tell whether a URL has already been encoded; ideally, a URL would be properly encoded as soon as it is created.
Based on the existing test cases and the URL components below, the code splits out the path, which may include special characters, and quotes only the path, leaving the other parts unquoted.

def urlparse(url, scheme='', allow_fragments=True):
    """Parse a URL into 6 components:
    <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
    Return a 6-tuple: (scheme, netloc, path, params, query, fragment).
    Note that we don't break the components up in smaller bits
    (e.g. netloc is a single string) and we don't expand % escapes."""
    url, scheme, _coerce_result = _coerce_args(url, scheme)
    splitresult = urlsplit(url, scheme, allow_fragments)
    scheme, netloc, url, query, fragment = splitresult
    if scheme in uses_params and ';' in url:
        url, params = _splitparams(url)
    else:
        params = ''
    result = ParseResult(scheme, netloc, url, params, query, fragment)
    return _coerce_result(result)
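The split-then-quote-only-the-path approach described above can be sketched like this (my illustration of the idea, not the exact patch):

```python
from urllib.parse import quote, urlsplit, urlunsplit

def quote_path_only(url):
    """Percent-encode only the path component, leaving the netloc,
    query, and fragment untouched so their delimiters keep meaning."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    return urlunsplit((scheme, netloc, quote(path), query, fragment))

url = "http://127.0.0.1:9000/some file.txt?a=1&b=2"
print(quote_path_only(url))  # http://127.0.0.1:9000/some%20file.txt?a=1&b=2
```

Note that a path already containing % would still be double-encoded by this sketch, which matches the caveat below.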

Maybe this is not a perfect solution when the netloc or path includes delimiters like %, but it is perhaps the least invasive change that fixes the above problems.
I added some new tests on UrlReader; they may help explain why I did this.

Love & peace; ^_^

@levalup
Contributor Author

levalup commented Jan 14, 2022

Hi~
I think I now have the whole picture of the problem. I am considering a more "perfect" solution, so this PR can be closed. I will open another PR as soon as possible. ^_^

@levalup levalup closed this Jan 14, 2022