Spurious slash added to my base_url, causing 404 errors #2663

dubslow · 2023-04-16T02:59:41Z

Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import httpx
>>> client = httpx.Client(base_url='https://example.com/db.php')
>>> req = client.get(url='', params={'key1': 'val1', 'key2': 'val2'})
>>> req.url
URL('https://example.com/db.php/?key1=val1&key2=val2')

This is not good, because this URL is clearly not what I told the library to target. I told it to target 'db.php' and suddenly, as if by magic, it's targeting 'db.php/', which doesn't exist.

My actual code is async, but same result. When I run this on my real target, I get 404 errors naturally:

DEBUG [2023-04-15 21:34:52] httpcore - http11.receive_response_headers.complete return_value=(b'HTTP/1.1', 404, b'Not Found', [(b'Server', b'NgxFence'), (b'Date', b'Sun, 16 Apr 2023 02:34:52 GMT'), (b'Content-Type', b'text/html'), (b'Transfer-Encoding', b'chunked'), (b'Connection', b'keep-alive'), (b'Content-Encoding', b'gzip'), (b'X-Cache', b'DYNAMIC')])
DEBUG [2023-04-15 21:34:52] httpx - HTTP Request: GET https://example.com/db.php/?key1=val1&key2=val2 "HTTP/1.1 404 Not Found"
...
<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>nginx</center>
</body>
</html>

(And yes, I confirmed that when I discard base_url and just pass the URL whole to every get(), I do in fact get a proper response from the DB API -- the 404 is due exclusively to the spurious slash added by the Client.)

I struggle to imagine that this is anything other than a bug. I only ever query this one simple URL, and I don't want the backend trying to be smarter than I am at basic string processing.

My use case, as you might guess, is only and entirely interacting with this DB API. My URL will never, ever change unless the remote server API changes. (I'm trying to build a Python interface to the API, and scripts atop that to interact with it. In principle, I want to enable users to have dozens or hundreds of queries in parallel, all to this one and only URL.)

In fact at first I was even a little bit surprised that get() required url at all, when I thought I should be able to set Client(base_url=blah) once and forget it forever more, but I got over that. But then I was terribly surprised to find that my base_url was essentially worthless regardless, as I am effectively forced to just pass the actual, proper URL in every get(), quite defeating the purpose of base_url as far as I can see.

And certainly the documentation doesn't mention any sort of backend processing of URLs. I read it multiple times before even writing code, and a couple more times since. What I see there:

base_url - (optional) A URL to use as the base when building request URLs.

...

The url argument is merged with any base_url set on the client.

...

For example, base_url allows you to prepend an URL to all outgoing requests:

>>> with httpx.Client(base_url='http://httpbin.org') as client:
...     r = client.get('/headers')
...
>>> r.request.url
URL('http://httpbin.org/headers')

...

with httpx.Client(app=app, base_url="http://testserver") as client:
    r = client.get("/")

All of this documentation leads me to believe that the final URL will merely be base_url+url, a simple string concatenation, with no attempt to second guess my / placements, but alas it appears the documentation misleads me.

A brief search did point me to #846, which seems fairly related, and that points to #1139. urljoin()'s semantics are pretty unintuitive to me (I put my slashes where I mean dammit, and I don't want any other code to implicitly add them willy nilly), but in any case the behavior I see here is compatible neither with the docs nor with urljoin(). And #1139 was apparently closed for "the docs appear adequate", which, well, no the heck they aren't. (To be fair, even urljoin()'s docs are abysmal, I'm not actually sure what its semantics are supposed to be either, those docs don't actually say what the function does... incredible.)

I'd be willing to make a pull request for the httpx docs on base_url, if that meets with maintainer approval, but frankly I think some behavior needs to change here as well. (I was inclined to directly file an issue, but the repo directs that issues cannot be filed without first filing a discussion.)

But hey, what do I know. This is after all essentially my first foray into programmatic HTTP requests, so I'm very much a noob in this field... a very surprised noob.

Edit: I now also see #843, and I can say that had it been accepted, it would have saved me the bother of writing this discussion post. The relevant RFC absolutely should be mentioned too, as that would also have saved me some trouble. I see now that the RFC is what urljoin() is supposed to implement, although again, the urljoin() docs should mention the RFC just as much as the httpx docs should. The relevant section of the RFC: https://www.rfc-editor.org/rfc/rfc3986#section-5.4.1

Actually, reading that now, it specifies that an empty string appended to an existing URL preserves the URL as-is, so the httpx behavior breaks the specification (presuming that it's intended to meet the specification, which the docs don't say one way or the other).

dubslow · 2023-04-17T08:38:16Z

Author

Confirmed that urljoin does the right thing, at least in my case:

python
Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib.parse import urljoin
>>> urljoin('http://www.chessdb.cn/cdb.php', '')
'http://www.chessdb.cn/cdb.php'
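
For anyone who wants to check more than the empty-string case, here is a small sketch that runs a handful of the RFC 3986 section 5.4.1 examples through urljoin. The base URI and the expected results are taken from the RFC's example table; the names BASE, RFC_EXAMPLES and resolve_all are only illustrative and not part of any library.

```python
from urllib.parse import urljoin

# Base URI used by the examples in RFC 3986, section 5.4.
BASE = "http://a/b/c/d;p?q"

# A few (reference -> expected target) pairs from section 5.4.1.
RFC_EXAMPLES = {
    "": "http://a/b/c/d;p?q",      # empty reference keeps the base as-is
    "g": "http://a/b/c/g",         # replaces the last path segment
    "g/": "http://a/b/c/g/",
    "/g": "http://a/g",            # leading slash replaces the whole path
    "?y": "http://a/b/c/d;p?y",    # query-only reference keeps the path
    "../g": "http://a/b/g",
}


def resolve_all(base: str, references) -> dict:
    """Resolve each reference against base using the standard library."""
    return {ref: urljoin(base, ref) for ref in references}


if __name__ == "__main__":
    for ref, resolved in resolve_all(BASE, RFC_EXAMPLES).items():
        print(f"{ref!r:8} -> {resolved}")
```

The first entry is the case from this thread: an empty reference resolves to the base URL unchanged, with no slash appended.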

Is there any reason that httpx doesn't simply delegate to urljoin? If someone else has already done it, why redo it?

1 reply

ods Apr 28, 2023

Since httpx (along with starlette and fastapi) is one of those projects where bug reports and PRs are not welcome, I think there's a low chance of this being fixed. Any ideas on the best way to work around this bug?

lovelydinosaur · 2023-05-03T13:31:54Z

Maintainer

Is there any reason that httpx doesn't simply delegate to urljoin ?

Because as you mention... "urljoin()'s semantics are pretty unintuitive to me"

The code comment in _merge_url helps here...

httpx/httpx/_client.py

Lines 378 to 388 in df5dbc0

    
    # To merge URLs we always append to the base URL. To get this
    # behaviour correct we always ensure the base URL ends in a '/'
    # separator, and strip any leading '/' from the merge URL.
    #
    # So, eg...
    #
    # >>> client = Client(base_url="https://www.example.com/subpath")
    # >>> client.base_url
    # URL('https://www.example.com/subpath/')
    # >>> client.build_request("GET", "/path").url
    # URL('https://www.example.com/subpath/path')
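
To make the rule in that comment concrete, here is a rough standard-library sketch of "always append" merging. It is not the httpx implementation: append_style_merge is a hypothetical helper, and it assumes the second argument is a plain relative path with no scheme, host or query string.

```python
from urllib.parse import urlsplit, urlunsplit


def append_style_merge(base_url: str, url: str) -> str:
    """Hypothetical helper: always append `url` beneath `base_url`.

    Mirrors the rule described in the comment, not httpx's actual code.
    """
    scheme, netloc, path, query, fragment = urlsplit(base_url)
    # Ensure the base path ends in a '/' separator.
    if not path.endswith("/"):
        path += "/"
    # Strip any leading '/' from the relative part, then append it.
    merged_path = path + url.lstrip("/")
    return urlunsplit((scheme, netloc, merged_path, query, fragment))


if __name__ == "__main__":
    print(append_style_merge("https://www.example.com/subpath", "/path"))
    print(append_style_merge("https://example.com/db.php", ""))
```

Under this rule an empty url still gets the forced separator, so 'https://example.com/db.php' becomes 'https://example.com/db.php/', which is the URL reported at the top of this discussion.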

Either we need to...

Document the base_url behaviour clearly.
Deal with the edge case you're highlighting here, where the base URL doesn't end with a "/" and "" is passed as the url.
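
One possible shape for the second option, sketched with the standard library only. This is an assumption about how the edge case could be handled, not a patch against httpx; merge_preserving_empty is a hypothetical name, and the helper again assumes the url argument is either empty or a plain relative path.

```python
from urllib.parse import urlsplit, urlunsplit


def merge_preserving_empty(base_url: str, url: str) -> str:
    """Hypothetical helper: append-style merge that leaves the base
    URL untouched when the url to merge is empty."""
    if url == "":
        # Nothing to append, so do not force a trailing '/'.
        return base_url
    parts = urlsplit(base_url)
    base_path = parts.path if parts.path.endswith("/") else parts.path + "/"
    # Non-empty urls keep the "always append" behaviour.
    return urlunsplit(parts._replace(path=base_path + url.lstrip("/")))


if __name__ == "__main__":
    print(merge_preserving_empty("https://example.com/db.php", ""))
    print(merge_preserving_empty("https://www.example.com/subpath", "/path"))
```

With this special case, a non-empty url is still appended beneath the base, while an empty url leaves 'https://example.com/db.php' exactly as given.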

There might be some more useful context if you dig into the code history.

I recall that the previous behaviour was "use urljoin" which resulted in a number of issues being raised.

1 reply

dubslow May 3, 2023
Author

urljoin is only unintuitive if one is unfamiliar with the relevant RFC, as I linked. Having read that RFC, I now better understand what the basic idea is (that being to preserve the folder part of the path, but replace the file part on demand). At the very least, base_url should document the fact that it breaks the industry-standard semantics. If using industry-standard semantics caused problems, then either it was bugged or else those users were unaware of the industry standard -- which is why I said that linking the RFC from the httpx docs is crucial.
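
A short standard-library illustration of the "folder part" versus "file part" idea. The URLs reuse the example.com paths from earlier in the thread, and join_cases is just an illustrative name.

```python
from urllib.parse import urljoin


def join_cases() -> dict:
    """Resolve the same kind of reference against different bases."""
    return {
        # No trailing slash: 'subpath' is treated as the file part and replaced.
        "no_slash": urljoin("https://www.example.com/subpath", "path"),
        # Trailing slash: 'subpath/' is the folder part and is kept.
        "with_slash": urljoin("https://www.example.com/subpath/", "path"),
        # A reference starting with '/' replaces the whole path.
        "rooted": urljoin("https://www.example.com/subpath/", "/path"),
    }


if __name__ == "__main__":
    for name, resolved in join_cases().items():
        print(f"{name:10} -> {resolved}")
```

Only the base with the trailing slash yields 'https://www.example.com/subpath/path'; the other two cases resolve to 'https://www.example.com/path'. That difference is the part of the standard semantics that tends to surprise people who expect plain concatenation.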
