Description
I need to process big XML responses as a stream. The uncompressed responses can be multiple hundred megabytes in size, so loading them entirely into memory before handing them to the XML parser is not an option.
I'm using lxml to parse, and I just hand response.raw to its iterparse() function, as described in the requests docs. This works fine for uncompressed responses.
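For reference, a minimal sketch of that streaming pattern, using the stdlib parser and an in-memory stream standing in for response.raw so it's self-contained (with lxml it would be lxml.etree.iterparse and the real raw stream):

```python
import io
import xml.etree.ElementTree as ET

# In the real code this is `response.raw` from a streaming call:
#   response = requests.get(url, stream=True)
# Here an in-memory stream stands in for it.
stream = io.BytesIO(b"<items><item>a</item><item>b</item></items>")

count = 0
# iterparse reads the file-like object incrementally instead of
# loading the whole document into memory first.
for event, elem in ET.iterparse(stream, events=("end",)):
    if elem.tag == "item":
        count += 1
        elem.clear()  # free the element once processed to bound memory

print(count)  # 2
```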
Unfortunately, the API I'm calling isn't particularly well behaved: it will sometimes return Content-Encoding: gzip even when I explicitly ask for uncompressed data. On top of that, the compression ratio on these extremely repetitive, verbose XML files is really good (10x+), so I'd actually like to make use of compressed responses.
Is this possible with requests? I couldn't find it in the documentation. Digging deeper into urllib3, its HTTPResponse.read() method supports a decode_content parameter; if it isn't set, urllib3 falls back to whatever was passed to the constructor. When requests calls that constructor in requests.adapters.HTTPAdapter.send(), it explicitly sets decode_content to False.
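As far as I can tell, what decode_content=True does for a gzip body is roughly equivalent to wrapping the raw stream so reads decompress on the fly. A sketch with an in-memory stream in place of the network response (gzip only; urllib3 also handles deflate, and this is my reading of the behavior, not its actual implementation):

```python
import gzip
import io

# Simulate a gzip-encoded response body:
body = gzip.compress(b"<items><item>a</item></items>")
raw = io.BytesIO(body)  # stands in for response.raw

# Manual equivalent of decode_content=True: wrap the raw stream so
# each read() decompresses transparently, still without buffering
# the whole response in memory.
decoded = gzip.GzipFile(fileobj=raw)
data = decoded.read()
print(data)  # b"<items><item>a</item></items>"
```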
Is there a reason why requests does that?
Strangely, iter_content() actually sets decode_content=True while reading. Why there but not elsewhere? It all seems a bit arbitrary, and I don't really understand the motivation for doing it one way in one place and another way in the other.
Of course, I can't use iter_content() myself, because lxml needs a file-like object.
I previously wrote my own file-like object to hook in between requests and lxml, but buffering is hard to get right, and I suspect smarter people than me have already written this, so I'd prefer not to roll my own.
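For the record, the wrapper I rolled is roughly this shape — a sketch, not battle-tested, with a fake chunk list standing in for response.iter_content() (the class name is mine):

```python
import io

class IterStream(io.RawIOBase):
    """Minimal read-only file-like wrapper around an iterator of
    byte chunks, e.g. response.iter_content(chunk_size=...)."""

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._leftover = b""

    def readable(self):
        return True

    def readinto(self, b):
        # Refill the internal buffer from the iterator when empty.
        while not self._leftover:
            try:
                self._leftover = next(self._chunks)
            except StopIteration:
                return 0  # EOF
        n = min(len(b), len(self._leftover))
        b[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n

# Fake chunks standing in for iter_content(); BufferedReader adds
# the read()/readline() conveniences lxml expects from a file object.
stream = io.BufferedReader(IterStream([b"<it", b"ems/", b">"]))
print(stream.read())  # b"<items/>"
```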
What's your advice on how to handle this? Should requests be changed to default to decode_content=True in urllib3?