fix(scrapy): async-thread startup race, shutdown lifecycle, and timeout setting #979
Draft
vdusek wants to merge 8 commits into
Codecov Report
❌ Patch coverage is
@@            Coverage Diff             @@
##           master     #979      +/-   ##
==========================================
+ Coverage   89.90%   91.14%   +1.24%
==========================================
  Files          49       49
  Lines        3091     3118      +27
==========================================
+ Hits         2779     2842      +63
+ Misses        312      276      -36
…lure or shutdown error
Description
Fixes several defects in the Scrapy integration's background event-loop thread (`AsyncThread`), the scheduler, and the HTTP cache storage.

Fixes

- `run_coro` startup race. The guard `if not self._eventloop.is_running()` fired spuriously during startup. `__init__` starts the loop thread and returns immediately, but callers (`ApifyScheduler.open`, the HTTP cache extension) invoke `run_coro` right after construction, before the thread reaches `run_forever()`, so `is_running()` was `False` and the run crashed (observed ~122/500 times in `scheduler.open()`). The guard now checks `is_closed()`. A coroutine submitted to a created-but-not-yet-running loop is queued by `asyncio.run_coroutine_threadsafe` and runs once the loop starts; only a genuinely closed loop raises.
- `close()` could leak the loop thread when task cancellation failed. If `_shutdown_tasks()` timed out or raised, `close()` returned before stopping the loop and joining the thread. The stop, join, and forced-shutdown fallback now run in a `finally`, so the thread is always torn down. The original error still propagates.
- `close()` raised on a second call. A repeated close (e.g. `ApifyScheduler.open()` closing on failure, then Scrapy closing again) called into the already-closed loop and raised `RuntimeError: Event loop is closed`. An `is_closed()` early-return makes a second close a no-op.
- `close()` ignored its `timeout` argument for the task-cancellation step (it used the constructor default). It now passes the caller's `timeout` to that step too.
- `run_coro` timeout left the coroutine running. On timeout it now cancels the future so the coroutine does not outlive the timeout.
- In `ApifyCacheStorage.close_spider`, the expiration sweep ran outside the `try`, so a failure there skipped `AsyncThread.close()`. The sweep now runs inside `try`, with `close()` in a `finally` so it always runs.
- `open_spider` started the `AsyncThread` and then opened the key-value store, but on failure it did not close the thread (and `close_spider` may never run if `open_spider` fails). It now closes the thread on failure, matching `ApifyScheduler.open`.
- New `APIFY_ASYNC_THREAD_TIMEOUT_SECS` Scrapy setting, wired into the scheduler (via `from_crawler`) and the cache storage, so the `AsyncThread` timeout is configurable.

Error logging
`run_coro` no longer logs-and-raises, so it no longer double-reports its own errors. Each call site keeps a `traceback.print_exc()`, so the raw traceback is still printed alongside Scrapy's own logging of the propagated exception.

Tests

New `tests/unit/scrapy/test_async_thread.py` covers the startup race, normal run, run-after-close, timeout cancellation, the no-self-logging behaviour, idempotent close, the caller timeout reaching the shutdown step, and stop/join even when task cancellation fails. Additions to the scheduler and HTTP cache test modules cover the timeout-setting wiring, closing the thread on open failure, and the cleanup-failure path still closing the thread.
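
For reviewers who want the `AsyncThread` lifecycle in one place, the fixes listed above can be sketched roughly as follows. This is an illustrative reduction, not the code in this PR: the class name `AsyncThreadSketch`, the `default_timeout` parameter, and the `add` helper are hypothetical, and the forced-shutdown fallback is left out. It shows the `is_closed()` guard, cancelling the future on timeout, the idempotent `close()`, and the stop/join in a `finally`.

```python
import asyncio
import threading
from concurrent.futures import TimeoutError as FutureTimeoutError


class AsyncThreadSketch:
    """Hypothetical stand-in for AsyncThread (assumption, not the SDK code)."""

    def __init__(self, default_timeout: float = 60.0) -> None:
        self._default_timeout = default_timeout
        self._eventloop = asyncio.new_event_loop()
        # __init__ returns right after start(); the thread may not have
        # reached run_forever() yet when the first run_coro() arrives.
        self._thread = threading.Thread(
            target=self._eventloop.run_forever, daemon=True
        )
        self._thread.start()

    def run_coro(self, coro, timeout=None):
        # Guard on is_closed(), not is_running(): a created-but-not-yet-running
        # loop queues the coroutine and runs it once the loop starts.
        if self._eventloop.is_closed():
            coro.close()  # avoid a "coroutine was never awaited" warning
            raise RuntimeError("Event loop is closed")
        if timeout is None:
            timeout = self._default_timeout
        future = asyncio.run_coroutine_threadsafe(coro, self._eventloop)
        try:
            return future.result(timeout=timeout)
        except FutureTimeoutError:
            # Cancel so the coroutine does not outlive the timeout.
            future.cancel()
            raise

    async def _shutdown_tasks(self) -> None:
        # Cancel every task on the loop except the one running this coroutine.
        current = asyncio.current_task()
        pending = [t for t in asyncio.all_tasks() if t is not current]
        for task in pending:
            task.cancel()
        await asyncio.gather(*pending, return_exceptions=True)

    def close(self, timeout=None) -> None:
        # A second close is a no-op instead of raising on the closed loop.
        if self._eventloop.is_closed():
            return
        if timeout is None:
            timeout = self._default_timeout
        try:
            # The caller's timeout applies to the task-cancellation step too.
            self.run_coro(self._shutdown_tasks(), timeout=timeout)
        finally:
            # Always stop the loop and join the thread, even if the step
            # above timed out or raised; that error still propagates.
            self._eventloop.call_soon_threadsafe(self._eventloop.stop)
            self._thread.join(timeout=timeout)
            if not self._thread.is_alive():
                self._eventloop.close()


async def add(a: int, b: int) -> int:
    return a + b


thread = AsyncThreadSketch()
result = thread.run_coro(add(1, 2))  # submitted right after construction
thread.close()
thread.close()  # second call returns early
```

The sketch leans on `asyncio.run_coroutine_threadsafe` scheduling through `call_soon_threadsafe`, which is why submitting before the loop is running is safe as long as the loop has not been closed.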