Hi,
first of all, thanks for publishing the RadImageNet dataset!
While working with it, I checked the MD5 hashes of the files and discovered quite a few duplicate entries. They fall into the following cases:
1. Different pathology (i.e. different folder). This would then essentially be a multi-label setting, e.g.
   CT/lung/interstitial_lung_disease/lung009382.png and CT/lung/Nodule/lung009382.png (Note: same filename)
   MR/af/Plantar_plate_tear/foot040499.png and MR/af/plantar_fascia_pathology/ankle027288.png (Note: different filename)
2. Same pathology, e.g.
   MR/af/hematoma/foot079779.png and MR/af/hematoma/ankle053088.png
3. Neighboring samples, e.g.
   US/gb/usn309850.png and US/gb/usn309851.png
4. Others, e.g.
   US/ovary/usn326815.png and US/kidney/usn348701.png
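As a rough way to separate these cases programmatically, the duplicate groups can be bucketed by whether their paths share the same folder and the same modality. This is only a sketch; it assumes a hash-to-paths mapping in the format of the attached duplicates.json (described further below).

import json
from pathlib import PurePosixPath

with open("duplicates.json") as f:  # attachment described below: {md5_hash: [relative paths, ...]}
    duplicates = json.load(f)

counts = {"same_folder": 0, "different_folder_same_modality": 0, "different_modality": 0}
for paths in duplicates.values():
    folders = {str(PurePosixPath(p).parent) for p in paths}
    modalities = {p.split("/")[0] for p in paths}
    if len(folders) == 1:
        counts["same_folder"] += 1                     # roughly cases 2 and 3
    elif len(modalities) == 1:
        counts["different_folder_same_modality"] += 1  # mostly case 1
    else:
        counts["different_modality"] += 1              # remaining case 4
print(counts)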
So far, I haven't checked whether the duplicates cross your dataset splits, but since you write in your paper that you split patient-wise, this shouldn't be the case.
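For completeness, such a check could look roughly like the sketch below. The split file names (train.csv, val.csv, test.csv) and the filename column are assumptions on my side and may not match the actual release format.

import csv
import json

# Assumed split files and column name -- adjust to the actual RadImageNet split format.
def load_split(csv_path):
    with open(csv_path, newline="") as f:
        return {row["filename"] for row in csv.DictReader(f)}

splits = {name: load_split(f"{name}.csv") for name in ("train", "val", "test")}

with open("duplicates.json") as f:  # the attachment described below
    duplicates = json.load(f)

spanning = 0
for paths in duplicates.values():
    # Which splits does this duplicate group touch?
    touched = {name for name, files in splits.items() if any(p in files for p in paths)}
    if len(touched) > 1:
        spanning += 1
print(f"Duplicate groups spanning more than one split: {spanning}")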
However, the following questions arise:
- Since, from my understanding of the paper, this dataset is intended as a single-label rather than a multi-label dataset, I am confused to find samples like those in the first case. Can the dataset then be considered a multi-label dataset where all 165 pathologies are labeled in all images, if present?
- For cases 2.-4., those duplicates just create an imbalance but don't provide additional information. Are you planning to remove them?
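To make the first question concrete, a multi-label view could in principle be derived from the duplicate groups by giving every member the union of its group's folder labels. This is only an illustration of the idea, not something the current release defines:

import json
from collections import defaultdict
from pathlib import PurePosixPath

with open("duplicates.json") as f:  # the attachment described below
    duplicates = json.load(f)

# Treat the containing folder (modality/anatomy/pathology) as the label and
# assign each image the union of the labels in its duplicate group.
multi_labels = defaultdict(set)
for paths in duplicates.values():
    labels = {str(PurePosixPath(p).parent) for p in paths}
    for p in paths:
        multi_labels[p] |= labels

print(sorted(multi_labels["CT/lung/interstitial_lung_disease/lung009382.png"]))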
In total this results in:
Number of duplicate groups: 62751
Total duplicate files: 126074
I attached a duplicates.json with all the duplicates found.
It's a dictionary where each key is an MD5 hash and its value is a list of image paths with that hash.
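For reference, the attachment can be loaded like this:

import json

with open("duplicates.json") as f:
    duplicates = json.load(f)  # {md5_hash: [relative image paths, ...]}

# Show a few duplicate groups.
for md5_hash, paths in list(duplicates.items())[:3]:
    print(md5_hash, paths)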
For reproducibility, here is the script I wrote to detect the duplicates.
import argparse
import hashlib
import json
from pathlib import Path
from typing import Dict, List, Tuple

from tqdm import tqdm


def process_image_md5(image_path: Path) -> Tuple[Path, str]:
    """
    Generate the MD5 hash for the given image.

    Parameters:
        image_path (Path): The path to the image file.

    Returns:
        tuple: A tuple containing the image path and its MD5 hash.
    """
    hash_md5 = hashlib.md5()
    with open(image_path, "rb") as f:
        # Hash the file in chunks so large images don't have to fit in memory.
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return image_path, hash_md5.hexdigest()


def find_duplicates(root_directory: Path) -> Dict[str, List[str]]:
    """
    Find duplicate images in the given directory based on MD5 hash.

    Parameters:
        root_directory (Path): The root directory to search for images.

    Returns:
        dict: A dictionary where each key is an MD5 hash and its value is a
            list of image paths (relative to root_directory) with that hash.
    """
    image_paths: List[Path] = list(root_directory.rglob("*.png"))
    results: List[Tuple[Path, str]] = [
        process_image_md5(image_path) for image_path in tqdm(image_paths)
    ]
    hash_paths_dict: Dict[str, List[str]] = {}
    for image_path, md5_hash in results:
        relative_path: str = str(image_path.relative_to(root_directory))
        if md5_hash in hash_paths_dict:
            hash_paths_dict[md5_hash].append(relative_path)
        else:
            hash_paths_dict[md5_hash] = [relative_path]
    return hash_paths_dict


def save_duplicates_to_json(duplicates: Dict[str, List[str]], filename: Path) -> None:
    """
    Save the duplicates dictionary to a JSON file.

    Parameters:
        duplicates (dict): The duplicates dictionary.
        filename (Path): The path to the JSON file where the results will be saved.
    """
    with open(filename, "w") as file:
        json.dump(duplicates, file, indent=4)


def main() -> None:
    """
    Handle command line arguments and invoke duplicate finding and saving.
    """
    parser = argparse.ArgumentParser(
        description="Find and save duplicates in a dataset."
    )
    parser.add_argument(
        "root_directory", type=Path, help="Root directory of the images"
    )
    parser.add_argument(
        "json_filename", type=Path, help="Filename to save the duplicates JSON"
    )
    args = parser.parse_args()

    print(
        f"Searching for duplicates in {args.root_directory} and writing to {args.json_filename}"
    )
    hash_paths_dict: Dict[str, List[str]] = find_duplicates(args.root_directory)
    # Keep only hashes that occur more than once, i.e. actual duplicates.
    duplicates: Dict[str, List[str]] = {
        md5_hash: paths
        for md5_hash, paths in hash_paths_dict.items()
        if len(paths) > 1
    }
    save_duplicates_to_json(duplicates, args.json_filename)

    # Number of duplicate groups and total files involved.
    num_duplicate_groups: int = len(duplicates)
    num_duplicate_files: int = sum(len(paths) for paths in duplicates.values())
    print(f"Duplicates saved to {args.json_filename}")
    print(f"Number of duplicate groups: {num_duplicate_groups}")
    print(f"Total duplicate files: {num_duplicate_files}")


if __name__ == "__main__":
    main()
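Assuming the script is saved as find_duplicates.py (the name is arbitrary) and /path/to/RadImageNet is a placeholder for the dataset root, it is invoked with the root directory and an output path:

python find_duplicates.py /path/to/RadImageNet duplicates.json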