Skip to content

Commit 6921e26

Browse files
Fix critical ingest processor issues: lock mechanism, timeout coordination, and error handling
Problems Identified: - Issue crocodilestick#652: Broken lock files causing "Failed to queue upload for processing" - Issue crocodilestick#654: lsof hanging with high file descriptor limits - Issue crocodilestick#656: Timeout mismatches preventing file ingestion - Race conditions in lock acquisition and cleanup - Database connection leaks without proper context management - Permission errors crashing upload process Changes Made: 1. Implemented robust ProcessLock class with PID tracking - Added fcntl-based file locking with proper acquisition/release - Implemented stale lock detection and cleanup with process validation - Fixed race conditions in lock file management - Added proper error handling for lock operations 2. Created cwa-process-recovery startup service - Cleans up stale lock files and orphaned processes on container start - Removes temporary files older than 1 hour - Resets stuck processing status automatically - Ensures clean slate after container restarts 3. Fixed timeout coordination between service and processor - Service uses safety timeout (3x configured timeout) as last resort - Processor handles internal timeout logic independently - Coordinated timeout values prevent conflicts - Added proper timeout error handling and file backup 4. Upgraded lsof to version 4.99.5 compiled from source - Resolves hanging issues with high file descriptor limits - Ensures reliable file-in-use detection for large systems 5. Enhanced database connection management - All SQLite connections now use context managers (with statements) - Automatic connection cleanup prevents resource leaks - Added 30-second timeouts to prevent deadlocks 6. Improved error handling and permission management - Fixed permission errors in ingest directory creation - Added graceful fallback for network share environments - Enhanced main() function argument validation - Comprehensive try-finally blocks ensure cleanup 7. Fixed bash syntax in process recovery service - Corrected file counting logic with proper conditionals - Improved error handling in cleanup operations Why These Changes Matter: - Eliminates the primary causes of failed upload processing - Prevents resource leaks and orphaned processes - Provides automatic recovery from stuck states - Ensures reliable operation in containerized environments - Maintains backward compatibility while fixing critical issues These fixes address the root causes of upload failures and provide a robust, self-healing ingest system that can recover from various failure scenarios.
1 parent 47e24ca commit 6921e26

File tree

8 files changed

+536
-170
lines changed

8 files changed

+536
-170
lines changed

CONTRIBUTORS

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
CONTRIBUTORS
22

33
This file is automatically generated. DO NOT EDIT MANUALLY.
4-
Generated on: 2025-09-23T07:36:59.860977Z
4+
Generated on: 2025-09-23T09:50:10.069578Z
55

66
Upstream project: https://github.com/janeczku/calibre-web
77
Fork project (Calibre-Web Automated, since 2024): https://github.com/crocodilestick/calibre-web-automated

cps/editbooks.py

Lines changed: 7 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -319,11 +319,13 @@ def _get_ingest_path(uploaded_file, prefix_parts=None):
319319
if not nsm:
320320
# Set ownership to abc:abc (uid=1000, gid=1000)
321321
os.chown(ingest_dir, 1000, 1000)
322-
except OSError as e:
323-
logger.log.warning('Failed to set ownership of ingest directory %s: %s', ingest_dir, e)
324-
except Exception:
325-
# Silently ignore any other permission-related errors
326-
pass
322+
except (OSError, PermissionError) as e:
323+
# Log warning but don't crash the upload process
324+
log.warning('Failed to set ownership of ingest directory %s: %s', ingest_dir, e)
325+
log.warning("If you're using a network share, consider setting NETWORK_SHARE_MODE=true in your environment variables to skip this step.")
326+
except Exception as e:
327+
# Silently ignore any other permission-related errors but log for debugging
328+
log.debug('Other permission error setting ingest directory ownership: %s', e)
327329

328330
base_name = secure_filename(uploaded_file.filename)
329331
# CWA change: use timestamp for more predictable sorting vs uuid

root/etc/s6-overlay/s6-rc.d/cwa-ingest-service/run

Lines changed: 12 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -94,8 +94,9 @@ process_retry_queue() {
9494
while IFS= read -r queued_file; do
9595
if [ -f "$queued_file" ]; then
9696
echo "[cwa-ingest-service] Retrying: $queued_file"
97-
local timeout=$(get_timeout_from_db) # Get timeout from database
98-
timeout $timeout python3 /app/calibre-web-automated/scripts/ingest_processor.py "$queued_file"
97+
local configured_timeout=$(get_timeout_from_db) # Get configured timeout from database
98+
local safety_timeout=$((configured_timeout * 3)) # Safety timeout is 3x the configured timeout
99+
timeout $safety_timeout python3 /app/calibre-web-automated/scripts/ingest_processor.py "$queued_file"
99100
local retry_exit=$?
100101

101102
if [ $retry_exit -eq 2 ]; then
@@ -128,7 +129,8 @@ process_retry_queue() {
128129

129130
handle_event() {
130131
local filepath="$1"
131-
local timeout=$(get_timeout_from_db) # Get timeout from database
132+
local configured_timeout=$(get_timeout_from_db) # Get configured timeout from database
133+
local safety_timeout=$((configured_timeout * 3)) # Safety timeout is 3x the configured timeout
132134
local filename=$(basename "$filepath")
133135

134136
# temp suffixes
@@ -141,19 +143,21 @@ handle_event() {
141143
fi
142144

143145
echo "[cwa-ingest-service] New file detected - $filepath - Starting Ingest Processor..."
146+
echo "[cwa-ingest-service] Configured timeout: ${configured_timeout}s, Safety timeout: ${safety_timeout}s"
144147
echo "processing:$filename:$(date '+%Y-%m-%d %H:%M:%S')" > "$STATUS_FILE"
145148

146-
# Add timeout protection to prevent hanging processes
147-
timeout $timeout python3 /app/calibre-web-automated/scripts/ingest_processor.py "$filepath"
149+
# Use safety timeout as last resort - processor should handle its own timeout internally
150+
timeout $safety_timeout python3 /app/calibre-web-automated/scripts/ingest_processor.py "$filepath"
148151
local exit_code=$?
149152

150153
if [ $exit_code -eq 124 ]; then
151-
echo "[cwa-ingest-service] TIMEOUT: $filepath took longer than ${timeout} seconds"
152-
echo "timeout:$filename:$(date '+%Y-%m-%d %H:%M:%S')" > "$STATUS_FILE"
154+
echo "[cwa-ingest-service] SAFETY TIMEOUT: $filepath took longer than safety timeout of ${safety_timeout} seconds"
155+
echo "[cwa-ingest-service] This indicates a serious issue - processor should have timed out internally at ${configured_timeout} seconds"
156+
echo "safety_timeout:$filename:$(date '+%Y-%m-%d %H:%M:%S')" > "$STATUS_FILE"
153157
# Move problematic file to failed backup with timestamp
154158
if [ -d "/config/processed_books/failed" ]; then
155159
local timestamp=$(date '+%Y%m%d_%H%M%S')
156-
local failed_filename="${timestamp}_timeout_${filename}"
160+
local failed_filename="${timestamp}_safety_timeout_${filename}"
157161
echo "[cwa-ingest-service] Moving $filename to failed backup as $failed_filename"
158162
cp "$filepath" "/config/processed_books/failed/$failed_filename" 2>/dev/null || true
159163
fi
Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
cwa-init
Lines changed: 187 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,187 @@
1+
#!/usr/bin/with-contenv bash
2+
3+
echo "[cwa-process-recovery] Starting process recovery and cleanup..."
4+
5+
# Function to check if a process is running
6+
is_process_running() {
7+
local pid="$1"
8+
[ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
9+
}
10+
11+
# Function to safely kill a process
12+
safe_kill_process() {
13+
local pid="$1"
14+
local name="$2"
15+
16+
if is_process_running "$pid"; then
17+
echo "[cwa-process-recovery] Killing stale process $name (PID: $pid)"
18+
19+
# Try SIGTERM first
20+
kill -TERM "$pid" 2>/dev/null || true
21+
sleep 2
22+
23+
# If still running, use SIGKILL
24+
if is_process_running "$pid"; then
25+
echo "[cwa-process-recovery] Process $pid didn't respond to SIGTERM, using SIGKILL"
26+
kill -KILL "$pid" 2>/dev/null || true
27+
fi
28+
29+
echo "[cwa-process-recovery] Process $name (PID: $pid) terminated"
30+
return 0
31+
else
32+
echo "[cwa-process-recovery] Process $name (PID: $pid) is not running"
33+
return 1
34+
fi
35+
}
36+
37+
# Clean up stale lock files and associated processes
38+
cleanup_stale_locks() {
39+
echo "[cwa-process-recovery] Checking for stale lock files and processes..."
40+
41+
local lock_files=(
42+
"/tmp/ingest_processor.lock"
43+
"/tmp/convert_library.lock"
44+
"/tmp/cover_enforcer.lock"
45+
"/tmp/kindle_epub_fixer.lock"
46+
)
47+
48+
local cleaned_count=0
49+
50+
for lock_file in "${lock_files[@]}"; do
51+
if [ -f "$lock_file" ]; then
52+
local lock_name=$(basename "$lock_file" .lock)
53+
echo "[cwa-process-recovery] Found stale lock file: $lock_file"
54+
55+
# Try to read PID from lock file
56+
local pid=""
57+
if [ -r "$lock_file" ]; then
58+
pid=$(cat "$lock_file" 2>/dev/null | head -1 | tr -d '\n' | grep -E '^[0-9]+$' || true)
59+
fi
60+
61+
if [ -n "$pid" ]; then
62+
# Kill the process if it's still running
63+
safe_kill_process "$pid" "$lock_name"
64+
fi
65+
66+
# Remove the lock file
67+
rm -f "$lock_file"
68+
echo "[cwa-process-recovery] Removed stale lock file: $lock_file"
69+
((cleaned_count++))
70+
fi
71+
done
72+
73+
if [ $cleaned_count -eq 0 ]; then
74+
echo "[cwa-process-recovery] No stale lock files found"
75+
else
76+
echo "[cwa-process-recovery] Cleaned up $cleaned_count stale lock file(s)"
77+
fi
78+
}
79+
80+
# Clean up stale temporary processing files
81+
cleanup_temp_files() {
82+
echo "[cwa-process-recovery] Cleaning up stale temporary files..."
83+
84+
local temp_dirs=(
85+
"/config/.cwa_conversion_tmp"
86+
"/tmp"
87+
)
88+
89+
local cleaned_count=0
90+
91+
for temp_dir in "${temp_dirs[@]}"; do
92+
if [ -d "$temp_dir" ]; then
93+
# Clean up old staging directories (older than 1 hour)
94+
if find "$temp_dir" -name "staging*" -type d -mmin +60 -exec rm -rf {} + 2>/dev/null; then
95+
((cleaned_count++))
96+
fi
97+
98+
# Clean up old conversion temp files (older than 1 hour)
99+
if find "$temp_dir" -name "*.tmp" -mmin +60 -delete 2>/dev/null; then
100+
((cleaned_count++))
101+
fi
102+
if find "$temp_dir" -name "*_converting_*" -mmin +60 -delete 2>/dev/null; then
103+
((cleaned_count++))
104+
fi
105+
fi
106+
done
107+
108+
if [ $cleaned_count -eq 0 ]; then
109+
echo "[cwa-process-recovery] No stale temporary files found"
110+
else
111+
echo "[cwa-process-recovery] Cleaned up $cleaned_count stale temporary file(s)"
112+
fi
113+
}
114+
115+
# Reset processing status if stuck
116+
reset_processing_status() {
117+
echo "[cwa-process-recovery] Checking processing status..."
118+
119+
local status_file="/config/cwa_ingest_status"
120+
121+
if [ -f "$status_file" ]; then
122+
local status_content=$(cat "$status_file" 2>/dev/null || echo "unknown")
123+
124+
# Check if status indicates processing but no processor is running
125+
if [[ "$status_content" == processing:* ]]; then
126+
# Extract the timestamp from the status
127+
local timestamp=$(echo "$status_content" | cut -d':' -f3- 2>/dev/null || echo "")
128+
129+
if [ -n "$timestamp" ]; then
130+
# Convert timestamp to epoch for comparison
131+
local status_epoch=$(date -d "$timestamp" +%s 2>/dev/null || echo "0")
132+
local current_epoch=$(date +%s)
133+
local age_minutes=$(( (current_epoch - status_epoch) / 60 ))
134+
135+
# If processing status is older than 30 minutes, reset it
136+
if [ $age_minutes -gt 30 ]; then
137+
echo "[cwa-process-recovery] Processing status is $age_minutes minutes old, resetting to idle"
138+
echo "idle" > "$status_file"
139+
else
140+
echo "[cwa-process-recovery] Processing status is $age_minutes minutes old, keeping as-is"
141+
fi
142+
else
143+
echo "[cwa-process-recovery] Invalid processing status timestamp, resetting to idle"
144+
echo "idle" > "$status_file"
145+
fi
146+
else
147+
echo "[cwa-process-recovery] Processing status is: $status_content"
148+
fi
149+
else
150+
echo "[cwa-process-recovery] No processing status file found, creating with idle status"
151+
echo "idle" > "$status_file"
152+
fi
153+
}
154+
155+
# Check for orphaned processes that might be related to CWA
156+
cleanup_orphaned_processes() {
157+
echo "[cwa-process-recovery] Checking for orphaned CWA processes..."
158+
159+
# Look for Python processes running CWA scripts
160+
local cwa_processes=$(pgrep -f "python.*ingest_processor.py\|python.*convert_library.py\|python.*kindle_epub_fixer.py" 2>/dev/null || true)
161+
162+
if [ -n "$cwa_processes" ]; then
163+
echo "[cwa-process-recovery] Found potential orphaned CWA processes:"
164+
echo "$cwa_processes" | while read -r pid; do
165+
if [ -n "$pid" ]; then
166+
local cmd=$(ps -p "$pid" -o args= 2>/dev/null || echo "unknown")
167+
echo "[cwa-process-recovery] PID $pid: $cmd"
168+
169+
# For now, just log them - we could add logic to kill very old ones
170+
# safe_kill_process "$pid" "orphaned-cwa-process"
171+
fi
172+
done
173+
else
174+
echo "[cwa-process-recovery] No orphaned CWA processes found"
175+
fi
176+
}
177+
178+
# Main recovery sequence
179+
echo "[cwa-process-recovery] ========== Starting Recovery Sequence =========="
180+
181+
cleanup_stale_locks
182+
cleanup_temp_files
183+
reset_processing_status
184+
cleanup_orphaned_processes
185+
186+
echo "[cwa-process-recovery] ========== Recovery Sequence Complete =========="
187+
echo "[cwa-process-recovery] Process recovery service finished successfully"
Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
oneshot
Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
cwa-process-recovery

0 commit comments

Comments
 (0)