5 changes: 3 additions & 2 deletions img2dataset/downloader.py
@@ -38,9 +38,10 @@ def download_image(row, timeout, user_agent_token, disallowed_header_directives)
     """Download an image with urllib"""
     key, url = row
     img_stream = None
-    user_agent_string = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0"
+    user_agent_string = "img2dataset/1.x ("
Contributor

Any reason not to use {user_agent_token} rather than hard-coding img2dataset here?

Author

@ephphatha, Apr 26, 2023

The reference to the repository was hardcoded previously if any user-agent was specified, so it seemed appropriate to use it as the base tool name with the user-provided string added in the comment section.

Edit: after double-checking main(), the default user-agent token is None, not "img2dataset" as I had assumed, so the old default UA does not identify the tool at all. With this change the strings are:
default UA: img2dataset/1.x (+https://github.com/rom1504/img2dataset)
user-provided UA: img2dataset/1.x (compatible; <user-provided>; +https://github.com/rom1504/img2dataset)

previous strings were:
default UA: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
user-provided UA: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0 (compatible; <user-provided>; +https://github.com/rom1504/img2dataset)
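
For illustration, the new construction can be sketched as a standalone function. This is only a sketch: the helper name build_user_agent is hypothetical (the diff inlines this logic in download_image), and "mybot" is a made-up token.

```python
def build_user_agent(user_agent_token=None):
    """Hypothetical helper mirroring the UA logic added in this diff."""
    # The tool name always comes first, so the default UA identifies img2dataset.
    user_agent_string = "img2dataset/1.x ("
    if user_agent_token:
        # The user-provided token goes inside the comment section.
        user_agent_string += f"compatible; {user_agent_token}; "
    # The repository link closes the comment section in both cases.
    user_agent_string += "+https://github.com/rom1504/img2dataset)"
    return user_agent_string


print(build_user_agent())
# img2dataset/1.x (+https://github.com/rom1504/img2dataset)
print(build_user_agent("mybot"))
# img2dataset/1.x (compatible; mybot; +https://github.com/rom1504/img2dataset)
```

An empty or None token falls through to the default string, so the tool is identified either way.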

     if user_agent_token:
-        user_agent_string += f" (compatible; {user_agent_token}; +https://github.com/rom1504/img2dataset)"
+        user_agent_string += f"compatible; {user_agent_token}; "
+    user_agent_string += "+https://github.com/rom1504/img2dataset)"
     try:
         request = urllib.request.Request(url, data=None, headers={"User-Agent": user_agent_string})
         with urllib.request.urlopen(request, timeout=timeout) as r: