Conversation
lorossi commented Dec 24, 2025
- Added image caching
- Images are now downloaded by the Reddit instance instead of simply being loaded from their URL
Pull request overview
This PR implements local image caching for the Telegram bot, transitioning from serving images directly via URLs to downloading and caching them locally. The main changes include downloading Reddit images to a local cache folder, serving images from file paths instead of URLs, and renaming the getUrl() method to getPhoto() to reflect this architectural change.
Key Changes:
- Implemented image downloading and local caching in the Reddit module with new async file operations
- Added cache folder configuration and management (creation/deletion)
- Updated the Telegram module to use getPhoto() instead of getUrl()
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| corgos_telegram_bot/modules/reddit.py | Core implementation: added image download logic, cache folder management, and file-based queue system |
| corgos_telegram_bot/modules/telegram.py | Updated to call getPhoto() instead of getUrl() and renamed variables from url to photo |
| tests/test_reddit.py | Updated test to use new getPhoto() method name |
| settings.json.example | Added reddit_cache_folder configuration parameter and normalized golden corgo URL placeholder |
| tests/data/reddit_unit_test_settings.json | Added reddit_cache_folder test configuration |
| tests/data/telegram_unit_test_settings.json | Added reddit_cache_folder test configuration |
| corgos_telegram_bot/main.py | Changed default logging level from INFO to WARNING |
| .gitignore | Added .reddit_cache/ to ignore cached images |
Note: Several critical issues were identified related to concurrency, error handling, and resource management that should be addressed before merging. See detailed comments below.
```python
    def __del__(self) -> None:
        """Destructor to clean up resources."""
```
The destructor is defined but empty. If resources need cleanup on deletion, this should either be removed or properly implemented. Given that async cleanup is needed for the temp folder and Reddit connection, destructors are not appropriate for async cleanup - consider relying on explicit stop() calls instead.
Suggested change:

```diff
-    def __del__(self) -> None:
-        """Destructor to clean up resources."""
```
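A minimal sketch of the explicit-cleanup pattern this comment recommends, assuming a hypothetical `stop()` coroutine that owns the async teardown (the `_deleteTempFolder` name is taken from this PR; the rest is illustrative):

```python
import asyncio
import logging


class Reddit:
    """Sketch: rely on an explicit async stop() instead of __del__."""

    def __init__(self) -> None:
        self._stopped = False

    async def _deleteTempFolder(self) -> None:
        # Placeholder for the cache-folder cleanup implemented in this PR.
        logging.info("Deleting temporary folder for caching images")

    async def stop(self) -> None:
        # Idempotent: safe to call more than once, e.g. from a finally block
        # and again from a signal handler.
        if self._stopped:
            return
        await self._deleteTempFolder()
        self._stopped = True


async def main() -> None:
    reddit = Reddit()
    try:
        pass  # ... run the bot ...
    finally:
        await reddit.stop()  # cleanup is explicit, not left to __del__


asyncio.run(main())
```

Because `__del__` cannot await anything, an explicit `stop()` called from the shutdown path is the only reliable place for async cleanup.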
```diff
     async def _downloadImage(self, url: str, filepath: str) -> None:
         """Download an image from the specified url to the specified filepath.

         Args:
             url (str): url of the image
             filepath (str): filepath to save the image

         """
-        async with self._http_requests_semaphore:
-            logging.debug(f"Requesting url {url}")
+        logging.debug(f"Downloading image from url {url} to filepath {filepath}")
         try:
             async with aiohttp.ClientSession() as session:
-                async with session.get(url) as response:
-                    logging.debug(f"Request to url {url} completed")
-                    return response
+                async with session.get(url) as resp:
+                    if resp.headers["content-type"] not in self._image_formats:
+                        logging.debug(f"Url {url} is not an image, skipping download")
+                        return
+
+                    async with aiofiles.open(filepath, "wb") as f:
+                        async for chunk in resp.content.iter_chunked(1024):
+                            await f.write(chunk)
         except Exception as e:
             logging.error(f"Cannot download image from url {url}, error {e}")
```
The _http_requests_semaphore that was previously used to limit concurrent HTTP requests is no longer being used in _downloadImage. This could lead to too many concurrent downloads. Consider wrapping the download logic with async with self._http_requests_semaphore: to maintain the rate limiting.
```python
                logging.debug("Downloading image %d/%d from post", x + 1, len(urls))
                filepath = await self._extractFilenameFromUrl(x, url)
                await self._downloadImage(url, filepath)
                self._temp_queue.add(filepath)
```
The _temp_queue.add(filepath) operation is not protected by the _temp_queue_lock. This creates a race condition where multiple concurrent tasks could modify the set simultaneously, potentially leading to data corruption or lost entries. Wrap this operation with async with self._temp_queue_lock:.
Suggested change:

```diff
-                self._temp_queue.add(filepath)
+                async with self._temp_queue_lock:
+                    self._temp_queue.add(filepath)
```
```diff
         if randint(1, 1000) == 1:
             # if we are lucky enough, we get a golden corgo!
-            url = await self._settings.get("telegram_golden_corgo_url")
+            photo = await self._settings.get("telegram_golden_corgo_url")
```
The variable name was changed from url to photo, but telegram_golden_corgo_url still contains a URL (as evidenced by the _botCheckCommand which manipulates it as a URL with .replace(".jpg", "s.jpg")). The Telegram send_photo API can accept both URLs and file paths, but the inconsistent naming (calling it photo when it's still a URL) may be confusing. Consider either: 1) keeping the variable name as url for clarity, or 2) if golden corgo images should also be cached, update the logic to download and cache the golden corgo as well.
Suggested change:

```diff
-            photo = await self._settings.get("telegram_golden_corgo_url")
+            golden_corgo_url = await self._settings.get("telegram_golden_corgo_url")
+            photo = golden_corgo_url
```
```python
    _settings_path: str
    _image_formats: tuple[str, ...] = (
        "image/png",
        "image/jpeg",
```
The _image_formats tuple was updated to include "image/jpg", but "image/jpg" is not a standard MIME type. The correct MIME type for JPEG images is "image/jpeg". While some servers might return "image/jpg", it's non-standard. Consider removing "image/jpg" or adding a comment explaining why it's needed if you've encountered servers that use this non-standard type.
Suggested change:

```diff
         "image/jpeg",
+        # "image/jpg" is non-standard but included for compatibility with servers
+        # that incorrectly use this MIME type for JPEG images.
```
```python
    async def _deleteTempFolder(self) -> None:
        """Delete the temporary folder for caching images."""
        logging.info("Deleting temporary folder for caching images")
        cache_folder = await self._settings.get("reddit_cache_folder")
        for filename in await aiofiles.os.listdir(cache_folder):
            filepath = path.join(cache_folder, filename)
            await aiofiles.os.remove(filepath)
        await aiofiles.os.rmdir(cache_folder)
        logging.info(f"Temporary folder {cache_folder} deleted")
```
The _deleteTempFolder method will fail if any files remain in the cache folder (e.g., if a download is in progress or if the folder contains unexpected files). The cleanup loop at lines 74-76 only removes files, not subdirectories. If the cache folder somehow contains subdirectories, rmdir will fail. Consider adding error handling or using rmtree for more robust cleanup, or document that the cache folder must not contain subdirectories.
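A sketch of the more robust cleanup suggested here: `shutil.rmtree` removes nested subdirectories and (with `ignore_errors=True`) tolerates a missing folder, while `asyncio.to_thread` keeps the blocking call off the event loop. The settings lookup from the PR is replaced by a plain argument for illustration:

```python
import asyncio
import logging
import shutil
import tempfile
from pathlib import Path


async def delete_cache_folder(cache_folder: str) -> None:
    def _rmtree() -> None:
        # Removes files AND subdirectories; no error if the folder is gone.
        shutil.rmtree(cache_folder, ignore_errors=True)

    await asyncio.to_thread(_rmtree)
    logging.info("Cache folder %s deleted", cache_folder)


# Usage: works even when the cache folder contains a subdirectory.
folder = tempfile.mkdtemp()
(Path(folder) / "sub").mkdir()
(Path(folder) / "sub" / "a.jpg").write_bytes(b"\xff\xd8")
asyncio.run(delete_cache_folder(folder))
print(Path(folder).exists())  # False
```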
```diff
     async def _extractFilenameFromUrl(self, x: int, url: str) -> str:
         """Extract the filename from the specified url.

         Args:
-            url (str): url to be requested
+            x (int): index of the image
+            url (str): url to extract the filename from

         Returns:
-            Response: response of the request
+            str: filename extracted from the url
         """
         logging.debug(f"Extracting filename from url {url}")
         folder = await self._settings.get("reddit_cache_folder")
         filename = url.split("/")[-1].split("?")[0]
         filepath = path.join(folder, f"{x}_{filename}")
```
The filename extraction uses url.split("/")[-1].split("?")[0] which may not handle all edge cases. URLs without a path component or with unusual structures could result in problematic filenames. Additionally, the index x is prefixed to make filenames unique, but if loadPostsAsync is called multiple times with the same URLs, this could lead to filename collisions across loads. Consider using a more robust filename generation strategy, such as hashing the URL or using a timestamp-based prefix.
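The hash-based naming proposed here could look like the following sketch: deriving the cache filename from a digest of the full URL makes names stable across reloads and collision-free for distinct URLs. The function name and the `.jpg` fallback are assumptions for illustration:

```python
import hashlib
from os import path


def filename_from_url(folder: str, url: str) -> str:
    # Digest of the whole URL (including the query string) gives a unique,
    # reload-stable name; 16 hex chars is plenty for a small cache.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    # Keep the original extension when one is present, default to .jpg.
    tail = url.split("/")[-1].split("?")[0]
    ext = path.splitext(tail)[1] or ".jpg"
    return path.join(folder, f"{digest}{ext}")


a = filename_from_url(".reddit_cache", "https://i.redd.it/abc123.png?width=640")
b = filename_from_url(".reddit_cache", "https://i.redd.it/abc123.png?width=640")
c = filename_from_url(".reddit_cache", "https://i.redd.it/other.png")
print(a == b, a == c)  # True False: same URL -> same name, different URL -> different
```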
```diff
         try:
             async with aiohttp.ClientSession() as session:
-                async with session.get(url) as response:
-                    logging.debug(f"Request to url {url} completed")
-                    return response
+                async with session.get(url) as resp:
+                    if resp.headers["content-type"] not in self._image_formats:
+                        logging.debug(f"Url {url} is not an image, skipping download")
+                        return
+
+                    async with aiofiles.open(filepath, "wb") as f:
+                        async for chunk in resp.content.iter_chunked(1024):
+                            await f.write(chunk)
         except Exception as e:
             logging.error(f"Cannot download image from url {url}, error {e}")
```
When the response content-type is not an image format, the method returns early, but the file might have been partially created or opened. While aiofiles.open hasn't been called yet, consider checking the content-type before attempting to write to the file. Additionally, if an exception occurs during download, the partially written file will remain in the cache. Consider wrapping the file writing in a try-finally block that removes the file on failure.
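The partial-file cleanup recommended here can be sketched as follows; the failing "download" is a stand-in for the aiohttp/aiofiles code in the PR, and the function name is hypothetical:

```python
import asyncio
import logging
import os
import tempfile


async def download_image(url: str, filepath: str) -> None:
    try:
        with open(filepath, "wb") as f:
            f.write(b"\xff\xd8")              # partial write...
            raise ConnectionError("dropped")  # ...then the transfer fails
    except Exception as e:
        logging.error("Cannot download image from url %s, error %s", url, e)
        # Do not leave a truncated file in the cache: it would later be
        # rotated through the queue and sent to users as a broken image.
        if os.path.exists(filepath):
            os.remove(filepath)


target = os.path.join(tempfile.mkdtemp(), "img.jpg")
asyncio.run(download_image("https://example.com/img.jpg", target))
print(os.path.exists(target))  # False
```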
```python
        if self._queue.empty():
            logging.warning("Reddit queue is empty")
            raise EmptyQueueException("Reddit queue is empty")
        # rotate the queue
        filepath = self._queue.get()
        self._queue.put(filepath)

        return filepath
```
The queue rotation logic (get then put) means the same files will be served repeatedly in the same order. While this is likely intentional for cycling through images, it also means that if a file is deleted from the filesystem (e.g., by an external process or disk cleanup), the filepath will remain in the queue and cause repeated errors when trying to send it. Consider adding file existence validation in getPhoto or implementing a mechanism to remove invalid filepaths from the queue when errors occur.
Suggested change:

```diff
-        if self._queue.empty():
-            logging.warning("Reddit queue is empty")
-            raise EmptyQueueException("Reddit queue is empty")
-        # rotate the queue
-        filepath = self._queue.get()
-        self._queue.put(filepath)
-        return filepath
+        # Iterate through the queue to find a valid, existing file.
+        while not self._queue.empty():
+            # rotate the queue
+            filepath = self._queue.get()
+            if path.exists(filepath):
+                # Re-queue valid filepath to maintain rotation behavior.
+                self._queue.put(filepath)
+                return filepath
+            # Do not re-queue missing files; effectively remove them.
+            logging.warning(
+                "File '%s' no longer exists on disk; removing from Reddit queue",
+                filepath,
+            )
+        logging.warning("Reddit queue is empty")
+        raise EmptyQueueException("Reddit queue is empty")
```
```python
        Args:
            media_metadata (dict): media metadata of the gallery


```
There's an unnecessary extra blank line in the docstring. The standard Google-style docstring format uses a single blank line between sections. Remove the extra blank line.