Bug in get_filename loses images #126

@ed2050

I ran into a bug in get_filename that loses some images. The core problem seems to be that the download counter is not threadsafe. It should be a simple fix: add a lock around the counter.

Issue

Here's the observed behavior. I ran the crawler asking for 50 google images:

    import icrawler.builtin
    # storage is the storage config passed to the crawler; it pointed at an empty dir
    crawler = icrawler.builtin.GoogleImageCrawler(downloader_threads=4, storage=storage)
    crawler.crawl(keyword='happy face', max_num=50)

The storage dir ends up with only 48 images. Some numbers are missing: there's no 000003.jpg and no 000007.jpg.

Diagnosing

So I modified ImageDownloader.get_filename to save all url-filename pairs in a threadsafe dict. That gives the dict shown below under Output. There are several name collisions: four 1's, two 4's, and no 3's or 7's. The other numbers seem to be present.
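For anyone who wants to repeat the diagnosis, here is a minimal sketch of what such a threadsafe dict could look like. `FilenameLog`, `record` and `snapshot` are hypothetical names made up for illustration. They are not part of icrawler, and this is not the exact code used for the run above.

```python
import threading


class FilenameLog:
    """Hypothetical helper: a lock-protected url -> filename mapping."""

    def __init__(self):
        self._lock = threading.Lock()
        self._names = {}

    def record(self, url, filename):
        # Hold the lock so concurrent downloader threads cannot interleave writes.
        with self._lock:
            self._names[url] = filename

    def snapshot(self):
        # Return a copy so the caller can inspect it without holding the lock.
        with self._lock:
            return dict(self._names)


filename_log = FilenameLog()
```

A modified get_filename would call `filename_log.record(url, filename)` just before returning, where `url` is whatever URL the download task carries. Dumping `filename_log.snapshot()` after the crawl then gives a dict like the one under Output.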

ImageDownloader.get_filename uses an index number to assign a filename to each url. Index is generated with this line:

    file_idx = self.fetched_num + self.file_idx_offset

file_idx_offset should be 0, since I didn't pass any value and the storage dir is empty.

I didn't trace where self.fetched_num is generated, but it doesn't appear to be threadsafe. This looks like a race condition on the counter behind fetched_num, possibly combined with an initialization error: four threads and four 1's suggests every thread starts at 1.
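To illustrate the kind of fix I have in mind, here is a rough sketch of a lock-protected counter shared by all downloader threads. It assumes nothing about how icrawler actually maintains fetched_num. `IndexAllocator`, `next_index` and `allocate_from_threads` are hypothetical names used only for this example.

```python
import threading


class IndexAllocator:
    """Hypothetical shared counter: each call returns a unique index."""

    def __init__(self, offset=0):
        self._lock = threading.Lock()
        self._count = 0
        self._offset = offset

    def next_index(self):
        # Increment and read under one lock so two threads never see the same value.
        with self._lock:
            self._count += 1
            return self._count + self._offset


def allocate_from_threads(num_threads=4, per_thread=25):
    """Have several threads draw indexes from one allocator; return them sorted."""
    allocator = IndexAllocator()
    results = []
    results_lock = threading.Lock()

    def worker():
        for _ in range(per_thread):
            idx = allocator.next_index()
            with results_lock:
                results.append(idx)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)
```

The important part is that the increment and the read happen inside the same `with self._lock:` block. If the filename index is derived from a value read outside the lock, two threads can still end up with the same number.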

Output

Here's the dict logged from the modified get_filename.

{
    "https://wikimedia.com/contents/1909694/image.jpg": "000001.jpg",
    "https://wikimedia.com/contents/1774931/image.jpg": "000001.jpg",
    "https://wikimedia.com/1/1/image.jpg": "000001.jpg",
    "https://wikimedia.com/contents/1734139/image.jpg": "000001.jpg",
    "https://wikimedia.com/media/6765494/8.jpg": "000002.jpg",
    "https://wikimedia.com/10649496/1/0/image.jpg": "000004.jpg",
    "https://wikimedia.tv/contents/10738072/image.jpg": "000004.jpg",
    "https://wikimedia.com/260/8.jpg": "000005.jpg",
    "https://wikimedia.com/contents/9317509/image.jpg": "000006.jpg",
    "https://wikimedia.com/media/image.jpg": "000008.jpg",
    ...
    "https://www.wikimedia.com/contents/684404/image.jpg": "000049.jpg",
    "https://v.wikimedia.com/contents/99637/image.jpg": "000050.jpg",
}
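The collisions and gaps can also be checked mechanically. This sketch runs over only the first ten entries shown above (the part before the elided `...`). `summarize` is a hypothetical helper name.

```python
from collections import Counter

# First ten entries of the logged dict, copied from the output above.
logged = {
    "https://wikimedia.com/contents/1909694/image.jpg": "000001.jpg",
    "https://wikimedia.com/contents/1774931/image.jpg": "000001.jpg",
    "https://wikimedia.com/1/1/image.jpg": "000001.jpg",
    "https://wikimedia.com/contents/1734139/image.jpg": "000001.jpg",
    "https://wikimedia.com/media/6765494/8.jpg": "000002.jpg",
    "https://wikimedia.com/10649496/1/0/image.jpg": "000004.jpg",
    "https://wikimedia.tv/contents/10738072/image.jpg": "000004.jpg",
    "https://wikimedia.com/260/8.jpg": "000005.jpg",
    "https://wikimedia.com/contents/9317509/image.jpg": "000006.jpg",
    "https://wikimedia.com/media/image.jpg": "000008.jpg",
}


def summarize(names):
    """Return (filenames assigned more than once, index numbers never assigned)."""
    counts = Counter(names.values())
    duplicates = {name: n for name, n in counts.items() if n > 1}
    used = {int(name.split(".")[0]) for name in counts}
    missing = [i for i in range(1, max(used) + 1) if i not in used]
    return duplicates, missing


duplicates, missing = summarize(logged)
print(duplicates)  # {'000001.jpg': 4, '000004.jpg': 2}
print(missing)     # [3, 7]
```

Each duplicated filename means later downloads overwrite earlier ones on disk, which matches ending up with fewer than 50 files.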

Hope this helps.
