Files retrieved with urllib over https are truncated #129264

@electricworry

Bug report

Bug description:

Some files downloaded with urllib over HTTPS are consistently truncated. My demonstration file is 187527168 bytes of NULs.

If I download it with wget, it is always retrieved intact:

root@697bf25b6113:~# wget https://electricworry-public.s3.eu-west-1.amazonaws.com/test -O test-wget
--2025-01-24 14:41:27--  https://electricworry-public.s3.eu-west-1.amazonaws.com/test
Resolving electricworry-public.s3.eu-west-1.amazonaws.com (electricworry-public.s3.eu-west-1.amazonaws.com)... 52.218.90.80, 52.218.108.120, 3.5.72.214, ...
Connecting to electricworry-public.s3.eu-west-1.amazonaws.com (electricworry-public.s3.eu-west-1.amazonaws.com)|52.218.90.80|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 187527168 (179M) [binary/octet-stream]
Saving to: 'test-wget'

test-wget                                                   100%[=========================================================================================================================================>] 178.84M  5.57MB/s    in 33s     

2025-01-24 14:42:01 (5.38 MB/s) - 'test-wget' saved [187527168/187527168]

root@697bf25b6113:~# ls -l
total 183132
-rw-r--r-- 1 root root 187527168 Jan 24 14:31 test-wget

If I run the following Python 3 code, I end up with a slightly truncated file:

import urllib.request
import shutil
request = urllib.request.Request("https://electricworry-public.s3.eu-west-1.amazonaws.com/test")
r = urllib.request.urlopen(request, None, 1000)
f = open("test-python", "wb")
shutil.copyfileobj(r, f)
f.close()

Here is what I end up with; test-python is 184313073 bytes, which is 3214095 bytes short of the expected 187527168:

root@697bf25b6113:~# python3
Python 3.11.2 (main, Nov 30 2024, 21:22:50) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.request
>>> import shutil
>>> request = urllib.request.Request("https://electricworry-public.s3.eu-west-1.amazonaws.com/test")
>>> r = urllib.request.urlopen(request, None, 1000)
>>> f = open("test-python", "wb")
>>> shutil.copyfileobj(r, f)
>>> f.close()
>>> 
root@697bf25b6113:~# ls -l
total 363136
-rw-r--r-- 1 root root 184313073 Jan 24 14:43 test-python
-rw-r--r-- 1 root root 187527168 Jan 24 14:31 test-wget
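
The size mismatch above can also be detected from inside Python by comparing the Content-Length header the server declares with the number of bytes actually received. The following is a minimal diagnostic sketch, not a fix and not a statement about the cause: the helper names `copy_and_count` and `download_and_check` are hypothetical, and it assumes the server sends a Content-Length header (the wget output suggests it does).

```python
import urllib.request


def copy_and_count(src, dst, chunk_size=64 * 1024):
    """Copy src to dst in chunks and return the number of bytes copied."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # empty read: end of stream, or the peer stopped sending
            break
        dst.write(chunk)
        total += len(chunk)
    return total


def download_and_check(url, path, timeout=1000):
    """Download url to path and return (declared_length, received_length).

    declared_length is None if the response has no Content-Length header.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        header = response.headers.get("Content-Length")
        declared = int(header) if header is not None else None
        with open(path, "wb") as out:
            received = copy_and_count(response, out)
    return declared, received
```

Calling `download_and_check("https://electricworry-public.s3.eu-west-1.amazonaws.com/test", "test-python")` and comparing the two returned values should show whether the stream ended before the declared length was reached, without needing `ls -l`.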

I've tried this on several computers:

  • Physical host Dell XPS 13 running Ubuntu 24.04
  • Physical own-build workstation running Linux Mint 22.1 Xia
  • Docker container running debian:bookworm

A Wireshark packet capture seems to indicate that the remote side finishes sending and closes the connection (FIN, PSH, ACK). That is expected, since urllib sends "Connection: close" in the request headers by default.
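
The request and response headers can also be inspected without a packet capture, by enabling http.client debug output through an opener. This is a small sketch under that assumption; `make_debug_opener` is a hypothetical helper name, while `HTTPSHandler(debuglevel=...)` and `build_opener` are standard urllib.request APIs.

```python
import urllib.request


def make_debug_opener():
    """Return an opener that prints HTTPS request and response headers to stdout."""
    # debuglevel=1 is passed down to the underlying http.client connection,
    # which then prints "send:", "reply:" and "header:" lines.
    handler = urllib.request.HTTPSHandler(debuglevel=1)
    return urllib.request.build_opener(handler)
```

Using `make_debug_opener().open(url, timeout=1000)` in place of `urllib.request.urlopen(...)` should print the outgoing "Connection: close" header along with the response headers, including Content-Length if the server sends one.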

The problem does not occur when I switch the URL from https to http. Is this a known problem?

CPython versions tested on:

3.11, 3.12

Operating systems tested on:

Linux

Labels: stdlib, type-bug
