Description
I ran into a bug in get_filename that loses some images. The core problem seems to be that the download counter is not thread-safe. The fix should be simple: put a lock around the counter.
Issue
Here's the observed behavior. I ran the crawler asking for 50 Google images:
crawler = icrawler.builtin.GoogleImageCrawler(downloader_threads=4, storage=storage)
crawler.crawl(keyword='happy face', max_num=50)
The storage dir ends up with 48 images. Some numbers are missing: there's no 000003.jpg and no 000007.jpg.
Diagnosing
So I modified ImageDownloader.get_filename to save every url-filename pair in a thread-safe dict. That gives the dict shown below under Output. There are several name collisions: four 1's and two 4's, and no 3's or 7's. The other numbers seem to be present.
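For anyone who wants to repeat this kind of diagnosis, here is a minimal sketch of the logging approach. It is not the exact code I used: FilenameLog and find_collisions_and_gaps are hypothetical names, and the sketch assumes the filenames are six-digit, 1-based indices with a .jpg extension, as in the output.

```python
import threading
from collections import Counter


class FilenameLog:
    """Hypothetical helper: collects url -> filename pairs from several threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pairs = {}

    def record(self, url, filename):
        # Guard the dict so concurrent downloader threads cannot interleave writes.
        with self._lock:
            self._pairs[url] = filename

    def snapshot(self):
        # Return a copy so callers can inspect it without holding the lock.
        with self._lock:
            return dict(self._pairs)


def find_collisions_and_gaps(pairs, expected_count):
    """Return (filenames assigned more than once, filenames never assigned)."""
    counts = Counter(pairs.values())
    collisions = {name: n for name, n in counts.items() if n > 1}
    # Assumption: names are zero-padded to six digits and start at 1.
    expected = {f"{i:06d}.jpg" for i in range(1, expected_count + 1)}
    missing = sorted(expected - set(counts))
    return collisions, missing
```

Calling record(url, filename) from inside the modified get_filename and then running find_collisions_and_gaps(log.snapshot(), 50) after the crawl would report the duplicated and missing names directly.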
ImageDownloader.get_filename uses an index number to assign a filename to each url. Index is generated with this line:
file_idx = self.fetched_num + self.file_idx_offset
file_idx_offset should be 0 as I didn't pass any value and storage dir is empty.
I didn't trace where self.fetched_num is updated, but it does not appear to be thread-safe. It looks like a race condition on the counter behind fetched_num, perhaps combined with an initialization error (four threads, four 1's: it seems all threads start at 1).
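To illustrate the kind of fix I have in mind, here is a minimal standalone sketch. It is not icrawler's actual code: FilenameAllocator and allocate_concurrently are hypothetical names. The point is only that the increment and the read of the counter happen under one lock, so no two threads can compute the same index.

```python
import threading


class FilenameAllocator:
    """Hypothetical stand-in for the counter logic behind get_filename."""

    def __init__(self, offset=0):
        self._lock = threading.Lock()
        self._fetched_num = 0
        self._offset = offset

    def next_filename(self, extension="jpg"):
        # Increment and read under the same lock so each index is handed out once.
        with self._lock:
            self._fetched_num += 1
            file_idx = self._fetched_num + self._offset
        return f"{file_idx:06d}.{extension}"


def allocate_concurrently(num_threads=4, total=50):
    """Request `total` filenames from `num_threads` threads and return them all."""
    allocator = FilenameAllocator()
    names = []
    names_lock = threading.Lock()

    def worker(count):
        for _ in range(count):
            name = allocator.next_filename()
            with names_lock:
                names.append(name)

    # Split the work as evenly as possible across the threads.
    counts = [total // num_threads + (1 if i < total % num_threads else 0)
              for i in range(num_threads)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in counts]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return names
```

With the lock in place, allocate_concurrently(4, 50) should yield 50 distinct names covering 000001.jpg through 000050.jpg. In the library itself the lock would need to cover wherever fetched_num is incremented and read, which I have not located.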
Output
Here's the dict logged from modified get_filename.
{
"https://wikimedia.com/contents/1909694/image.jpg": "000001.jpg",
"https://wikimedia.com/contents/1774931/image.jpg": "000001.jpg",
"https://wikimedia.com/1/1/image.jpg": "000001.jpg",
"https://wikimedia.com/contents/1734139/image.jpg": "000001.jpg",
"https://wikimedia.com/media/6765494/8.jpg": "000002.jpg",
"https://wikimedia.com/10649496/1/0/image.jpg": "000004.jpg",
"https://wikimedia.tv/contents/10738072/image.jpg": "000004.jpg",
"https://wikimedia.com/260/8.jpg": "000005.jpg",
"https://wikimedia.com/contents/9317509/image.jpg": "000006.jpg",
"https://wikimedia.com/media/image.jpg": "000008.jpg",
...
"https://www.wikimedia.com/contents/684404/image.jpg": "000049.jpg",
"https://v.wikimedia.com/contents/99637/image.jpg": "000050.jpg",
}
Hope this helps.