Conversation

@pull pull bot commented Dec 31, 2025

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)


claude and others added 30 commits December 31, 2025 00:21
- Add _derive_persona_paths() in configset.py to automatically derive
  CHROME_USER_DATA_DIR and CHROME_EXTENSIONS_DIR from ACTIVE_PERSONA
  when not explicitly set. This allows plugins to use these paths
  without knowing about the persona system.

- Update chrome_utils.js launchChromium() to accept userDataDir option
  and pass --user-data-dir to Chrome. Also cleans up SingletonLock
  before launch.

- Update killZombieChrome() to clean up SingletonLock files from all
  persona chrome_user_data directories after killing zombies.

- Update chrome_cleanup() in misc/util.py to handle persona-based
  user data directories when cleaning up stale Chrome state.

- Simplify on_Crawl__20_chrome_launch.bg.js to use CHROME_USER_DATA_DIR
  and CHROME_EXTENSIONS_DIR from env (derived by get_config()).

Config priority flow:
  ACTIVE_PERSONA=WorkAccount (set on crawl/snapshot)
  -> get_config() derives:
     CHROME_USER_DATA_DIR = PERSONAS_DIR/WorkAccount/chrome_user_data
     CHROME_EXTENSIONS_DIR = PERSONAS_DIR/WorkAccount/chrome_extensions
  -> hooks receive these as env vars without needing persona logic
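
A minimal sketch of this derivation, using hypothetical names (derive_persona_paths, a PERSONAS_DIR default) rather than the exact configset.py implementation:

```python
from pathlib import Path

def derive_persona_paths(config: dict) -> dict:
    """Sketch: fill in Chrome paths from ACTIVE_PERSONA when they are
    not explicitly set, so plugins never need persona-aware logic."""
    persona = config.get('ACTIVE_PERSONA')
    if not persona:
        return config
    personas_dir = Path(config.get('PERSONAS_DIR', 'personas'))
    config.setdefault('CHROME_USER_DATA_DIR', str(personas_dir / persona / 'chrome_user_data'))
    config.setdefault('CHROME_EXTENSIONS_DIR', str(personas_dir / persona / 'chrome_extensions'))
    return config
```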
Documents a 7-phase refactoring to use machine.Process as the core data
model for all subprocess management:

- Phase 1: Add parent FK and process_type to Process model
- Phase 2: Add lifecycle methods (launch, kill, poll, wait)
- Phase 3: Update hook system to create Process records
- Phase 4-5: Track workers/orchestrator/supervisord as Process
- Phase 6: Create root Process on CLI invocation
- Phase 7: Admin UI with tree visualization

Enables full process hierarchy tracking from CLI → binary execution.
Key addition: Process.current() class method (like Machine.current())
that auto-creates/retrieves the Process record for the current OS process.

Benefits:
- Uses PPID lookup to find parent Process automatically
- Detects process_type from sys.argv
- Cached with validation (like Machine.current())
- Eliminates need for thread-local context management

Simplified Phase 3 (workers) and Phase 4 (CLI) to just call
Process.current() instead of manual Process creation.
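
A rough sketch of how Process.current() could work under these constraints; model_cls, machine, and the _guess_type() helper are illustrative stand-ins, not the actual API:

```python
import os
import psutil

def _guess_type(cmdline: list[str]) -> str:
    # Crude stand-in for the real sys.argv-based detection.
    joined = ' '.join(cmdline)
    if 'supervisord' in joined:
        return 'supervisord'
    if 'worker' in joined:
        return 'worker'
    return 'cli'

def process_current(model_cls, machine):
    """Sketch of Process.current(): get-or-create the DB record for the
    current OS process, linking the parent Process via PPID lookup."""
    proc = psutil.Process(os.getpid())
    # Find the parent Process record via PPID, if one was registered.
    parent = model_cls.objects.filter(machine=machine, pid=proc.ppid()).first()
    record, _ = model_cls.objects.get_or_create(
        machine=machine,
        pid=proc.pid,
        defaults={
            'parent': parent,
            'cmd': proc.cmdline(),             # more reliable than sys.argv
            'process_type': _guess_type(proc.cmdline()),
            'started_at': proc.create_time(),  # epoch seconds; the real field is likely a datetime
        },
    )
    return record
```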
PIDs are recycled by the OS, so all Process queries now:
- Filter by machine=Machine.current() (PIDs unique per machine)
- Filter by started_at within PID_REUSE_WINDOW (24h)
- Validate start time matches OS via psutil.Process.create_time()

Added:
- ProcessManager.get_by_pid() for safe PID lookups
- Process.cleanup_stale_running() to mark orphaned RUNNING as EXITED
- START_TIME_TOLERANCE (5s) for start time comparison
- Uses psutil.Process.create_time() for accurate started_at
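
A sketch of the safe lookup, assuming a Django-style manager; the PID_REUSE_WINDOW constant mirrors the 24h window described above:

```python
from datetime import timedelta
from django.utils import timezone

PID_REUSE_WINDOW = timedelta(hours=24)

def get_by_pid(model_cls, machine, pid: int):
    """Sketch: only match Process rows started recently on this machine,
    so a recycled PID cannot resolve to a stale record."""
    cutoff = timezone.now() - PID_REUSE_WINDOW
    return (model_cls.objects
            .filter(machine=machine, pid=pid, started_at__gte=cutoff)
            .order_by('-started_at')
            .first())
```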
Phase 2 now includes line-by-line mapping of:

- run_hook(): Create Process record, use Process.launch(), parse
  JSONL for child binary Process records
- process_is_alive(): Accept Path or Process, use Process.is_alive()
- kill_process(): Accept Path or Process, use Process.kill()
- ArchiveResult.run(): Pass self.process as parent_process to run_hook()
- ArchiveResult.update_from_output(): Read from Process.stdout/stderr
- Snapshot.cleanup(): Kill via Process model, fallback to PID files
- Snapshot.has_running_background_hooks(): Check via Process model

Hook JSONL contract updated to support {"type": "Process"} records
for tracking binary executions within hooks.
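
A sketch of parsing that contract, assuming hooks emit one JSON object per line interleaved with plain log output:

```python
import json

def parse_hook_jsonl(stdout_text: str) -> list[dict]:
    """Sketch: pull {"type": "Process"} records out of a hook's JSONL
    stdout so child binary executions can be tracked."""
    records = []
    for line in stdout_text.splitlines():
        line = line.strip()
        if not line.startswith('{'):
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # hooks may interleave non-JSON log lines
        if record.get('type') == 'Process':
            records.append(record)
    return records
```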
Phase 3.3 now includes:
- Module-level _supervisord_db_process variable
- start_new_supervisord_process(): Create Process record after Popen
- stop_existing_supervisord_process(): Update Process status on shutdown
- Process hierarchy diagram showing CLI → supervisord → workers chain

Key insight: PPID-based linking works because workers call Process.current()
in on_startup(), which finds supervisord's Process via PPID lookup.
New section 1.5 adds @property proc that returns psutil.Process ONLY if:
- PID exists in OS
- OS start time matches our started_at (within tolerance)
- We're on the same machine

Safety features:
- Validates start time via psutil.Process.create_time()
- Optional command validation (binary name matches)
- Returns None instead of wrong process on PID reuse

Also adds convenience methods:
- is_running: Check via validated psutil
- get_memory_info(): RSS/VMS if running
- get_cpu_percent(): CPU usage if running
- get_children_pids(): Child PIDs from OS

Updated kill() to use self.proc for safe killing - never kills
a recycled PID since we validate start time first.
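
A sketch of the validation behind the proc property, assuming started_at is available as epoch seconds and using the 5s START_TIME_TOLERANCE noted above:

```python
import psutil

START_TIME_TOLERANCE = 5  # seconds

def validated_proc(pid: int, started_at_ts: float):
    """Sketch of the `proc` property: return a psutil.Process only when
    the OS start time matches ours; return None on PID reuse."""
    try:
        proc = psutil.Process(pid)
    except psutil.NoSuchProcess:
        return None
    if abs(proc.create_time() - started_at_ts) > START_TIME_TOLERANCE:
        return None  # same PID, different process
    return proc
```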
- Add comprehensive default CHROME_ARGS in config.json with 55+ flags
  for deterministic rendering, security, performance, and UI suppression

- Update chrome_utils.js launchChromium() to read CHROME_ARGS and
  CHROME_ARGS_EXTRA from environment variables (set by get_config())

- Add getEnvArray() helper to parse JSON arrays or comma-separated
  strings from environment variables

- Separate args into three categories:
  1. baseArgs: Static flags from CHROME_ARGS config (configurable)
  2. dynamicArgs: Runtime-computed flags (port, sandbox, headless, etc.)
  3. extraArgs: User overrides from CHROME_ARGS_EXTRA

- Add CHROME_SANDBOX config option to control --no-sandbox flag

Args are now configurable via:
  - config.json defaults
  - ArchiveBox.conf file
  - Environment variables
  - Per-crawl/snapshot config overrides
- Create Persona class in personas/models.py for managing browser
  profiles/identities used for archiving sessions

- Each Persona has:
  - chrome_user_data_dir: Chrome profile directory
  - chrome_extensions_dir: Installed extensions
  - cookies_file: Cookies for wget/curl
  - config_file: Persona-specific config overrides

- Add Persona methods:
  - cleanup_chrome(): Remove stale SingletonLock/SingletonSocket files
  - get_config(): Load persona config from config.json
  - save_config(): Save persona config to config.json
  - ensure_dirs(): Create persona directory structure
  - all(): Iterator over all personas
  - get_active(): Get persona based on ACTIVE_PERSONA config
  - cleanup_chrome_all(): Clean up all personas

- Update chrome_cleanup() in misc/util.py to use Persona.cleanup_chrome_all()
  instead of manual directory iteration

- Add convenience functions:
  - cleanup_chrome_for_persona(name)
  - cleanup_chrome_all_personas()
- Remove standalone convenience functions (cleanup_chrome_for_persona,
  cleanup_chrome_all_personas) to reduce LOC
- Change Persona.get_active(config) to accept config dict as argument
  instead of calling get_config() internally, since the caller needs
  to pass user/crawl/snapshot/archiveresult context for proper config
- Convert Persona from plain Python class to Django model with ModelWithConfig
- Add config JSONField for persona-specific config overrides
- Add get_derived_config() method that returns config with derived paths:
  - CHROME_USER_DATA_DIR, CHROME_EXTENSIONS_DIR, COOKIES_FILE, ACTIVE_PERSONA

- Update get_config() to accept persona parameter in merge chain:
  get_config(persona=crawl.persona, crawl=crawl, snapshot=snapshot)

- Remove _derive_persona_paths() - derivation now happens in Persona model

- Merge order (highest to lowest priority):
  1. snapshot.config
  2. crawl.config
  3. user.config
  4. persona.get_derived_config()  <- NEW
  5. environment variables
  6. ArchiveBox.conf file
  7. plugin defaults
  8. core defaults

Usage:
  config = get_config(persona=crawl.persona, crawl=crawl)
  config['CHROME_USER_DATA_DIR']  # derived from persona
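
A sketch of that merge order; the empty placeholder dicts stand in for the real core/plugin/file/env loaders, which are not shown here:

```python
def merged_config(sources: list[dict]) -> dict:
    """Apply sources lowest-priority first so later dicts win."""
    config: dict = {}
    for source in sources:
        config.update(source)
    return config

def get_config(persona=None, user=None, crawl=None, snapshot=None):
    # Numbered by the priority list above (8 = lowest, 1 = highest).
    sources = [
        {},  # 8. core defaults (placeholder)
        {},  # 7. plugin defaults (placeholder)
        {},  # 6. ArchiveBox.conf file (placeholder)
        {},  # 5. environment variables (placeholder)
        persona.get_derived_config() if persona else {},   # 4
        user.config if user else {},                       # 3
        crawl.config if crawl else {},                     # 2
        snapshot.config if snapshot else {},               # 1
    ]
    return merged_config(sources)
```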
Comprehensive plan for implementing JSONL-based CLI piping:
- Phase 1: Model prerequisites (ArchiveResult.from_json, tags_str fix)
- Phase 2: Extract shared apply_filters() to cli_utils.py
- Phase 3: Implement pass-through behavior for all create commands
- Phase 4-6: Test infrastructure with pytest-django, unit/integration tests

Key changes from original plan:
- ArchiveResult.from_json() identified as missing prerequisite
- Pass-through documented as new feature to implement
- archivebox run updated to create-or-update pattern
- conftest.py redesigned to use pytest-django with isolated tmp_path
- Standardized on tags_str field name across all models
- Reordered phases: implement before test
Multiple hooks in the same plugin directory were overwriting each
other's stdout.log, stderr.log, hook.pid, and cmd.sh files. Now
each hook uses filenames prefixed with its hook name:
- on_Snapshot__20_chrome_tab.bg.stdout.log
- on_Snapshot__20_chrome_tab.bg.stderr.log
- on_Snapshot__20_chrome_tab.bg.pid
- on_Snapshot__20_chrome_tab.bg.sh

Updated:
- hooks.py run_hook() to use hook-specific names
- core/models.py cleanup and update_from_output methods
- Plugin scripts to no longer write redundant hook.pid files
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
---
## Summary by cubic
Prevented hook file collisions by giving each hook its own stdout, stderr, pid, and cmd filenames. This fixes mixed logs and ensures correct cleanup and status checks when multiple hooks run in the same plugin directory.

- **Bug Fixes**
  - hooks.py: write hook-specific stdout/stderr/pid/cmd files and exclude them from new_files; derive cmd.sh from pid for safe kill.
  - core/models.py: read hook-specific logs; exclude hook output files when computing outputs; cleanup and background detection use *.pid.
  - Plugins: stop writing redundant hook.pid files; minor chrome utils cleanup.

Written for commit 754b096.
Simplifies the comma-separated parsing logic to:
- If value contains '[', parse as JSON array
- Otherwise, parse as comma-separated values

This prevents incorrect splitting of arguments containing internal commas
when there's only one argument. For arguments with commas, users should
use JSON format: CHROME_ARGS='["--arg1,val", "--arg2"]'
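
A Python rendering of the same rule, for illustration (the real getEnvArray lives in chrome_utils.js):

```python
import json

def get_env_array(value: str) -> list[str]:
    """Sketch of the parsing rule: '[' means JSON array, otherwise the
    value is split on commas."""
    value = (value or '').strip()
    if not value:
        return []
    if '[' in value:
        return json.loads(value)
    return [part.strip() for part in value.split(',') if part.strip()]
```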

Also exports getEnvArray in module.exports for consistency.

Co-authored-by: Nick Sweeting <[email protected]>
…ling logic on model methods (#1734)

---
## Summary by cubic
Added an implementation plan to centralize subprocess handling on the machine.Process model. It covers process hierarchy, Process.current(), safe lifecycle methods (launch/kill/wait), PID reuse protection, and phased changes across hooks, workers, CLI, migrations, and admin.

Written for commit 3ae9410.
…#1735)

This change consolidates duplicated logic between chrome_utils.js and
extension installer hooks, as well as between Python plugin tests:

JavaScript changes:
- Add getExtensionsDir() to centralize extension directory path calculation
- Add installExtensionWithCache() to handle extension install + cache workflow
- Add CLI commands for new utilities
- Refactor all 3 extension installers (ublock, istilldontcareaboutcookies,
  twocaptcha) to use shared utilities, reducing each from ~115 lines to ~60
- Update chrome_launch hook to use getExtensionsDir()

Python test changes:
- Add chrome_test_helpers.py with shared Chrome session management utilities
- Refactor infiniscroll and modalcloser tests to use shared helpers
- setup_chrome_session(), cleanup_chrome(), get_test_env() now centralized
- Add chrome_session() context manager for automatic cleanup

Net result: ~208 lines of code removed while maintaining the same functionality.
- Update Crawl.output_dir_parent to use username instead of user_id
  for consistency with Snapshot paths
- Add domain from first URL to Crawl path structure for easier debugging:
  users/{username}/crawls/YYYYMMDD/{domain}/{crawl_id}/
- Add CRAWL_OUTPUT_DIR to config passed to Snapshot hooks so chrome_tab
  can find the shared Chrome session from the Crawl
- Update comment in chrome_tab hook to reflect new config source
- Add setup_test_env, launch_chromium_session, kill_chromium_session
  to chrome_test_helpers.py for extension tests
- Add chromium_session context manager for cleaner test code
- Refactor ublock, istilldontcareaboutcookies, twocaptcha tests to use
  shared helpers (~450 lines removed)
- Refactor screenshot, dom, pdf tests to use shared get_test_env
  and get_lib_dir (~60 lines removed)
- Net reduction: 228 lines of duplicate code
- Add get_machine_type() to chrome_test_helpers.py
- Update get_test_env() to include MACHINE_TYPE
- Refactor test_chrome.py to import from shared helpers
- Removes ~50 lines of duplicate code
- Import shared Chrome test helpers
- Add test_singlefile_with_chrome_session() to verify CDP connection
- Add test_singlefile_disabled_skips() for config testing
- Update existing test to use get_test_env()
claude and others added 26 commits December 31, 2025 09:02
New helpers in chrome_test_helpers.py:
- get_plugin_dir(__file__) - get plugin dir from test file path
- get_hook_script(dir, pattern) - find hook script by glob pattern
- run_hook() - run hook script and return (returncode, stdout, stderr)
- parse_jsonl_output() - parse JSONL from hook output
- run_hook_and_parse() - convenience combo of above two
- LIB_DIR, NODE_MODULES_DIR - lazy-loaded module constants
- _LazyPath class for deferred path resolution

Updated test files to use simpler patterns:
- screenshot/tests/test_screenshot.py
- dom/tests/test_dom.py
- pdf/tests/test_pdf.py
- singlefile/tests/test_singlefile.py

Before: PLUGIN_DIR = Path(__file__).parent.parent
After:  PLUGIN_DIR = get_plugin_dir(__file__)

Before: LIB_DIR = get_lib_dir(); NODE_MODULES_DIR = LIB_DIR / 'npm' / 'node_modules'
After:  from chrome_test_helpers import LIB_DIR, NODE_MODULES_DIR
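
A sketch of the deferred-resolution idea behind _LazyPath; the actual class may differ:

```python
from pathlib import Path

class _LazyPath:
    """Sketch: defer a (possibly slow) path lookup until first use, so
    module-level constants like LIB_DIR cost nothing at import time."""
    def __init__(self, resolver):
        self._resolver = resolver
        self._path = None

    def __fspath__(self) -> str:
        if self._path is None:
            self._path = Path(self._resolver())
        return str(self._path)

    def __truediv__(self, other):
        return Path(self.__fspath__()) / other

# e.g. LIB_DIR = _LazyPath(lambda: '/usr/local/share/archivebox/lib')  # hypothetical resolver
```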
Changed Snapshot.cleanup() to gracefully terminate background hooks:
1. Send SIGTERM to all background hook processes first
2. Wait up to each hook's plugin-specific timeout
3. Send SIGKILL only to hooks still running after their timeout

Added graceful_terminate_background_hooks() function in hooks.py that:
- Collects all .pid files from output directory
- Validates process identity using mtime
- Sends SIGTERM to all valid processes in phase 1
- Polls each process for up to its plugin-specific timeout
- Sends SIGKILL as last resort if timeout expires
- Returns status for each hook (sigterm/sigkill/already_dead/invalid)
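
A condensed sketch of the two-phase flow (the mtime-based identity validation and returncode capture are omitted here):

```python
import os
import signal
import time

def graceful_terminate(pids_with_timeouts: list[tuple[int, float]]) -> dict:
    """Sketch: SIGTERM every hook first, then give each PID up to its
    plugin-specific timeout before escalating to SIGKILL."""
    results: dict[int, str] = {}
    for pid, _ in pids_with_timeouts:              # phase 1: SIGTERM all
        try:
            os.kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            results[pid] = 'already_dead'
    for pid, timeout in pids_with_timeouts:        # phase 2: poll, then SIGKILL
        if pid in results:
            continue
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            try:
                os.kill(pid, 0)                    # signal 0 = liveness probe
            except ProcessLookupError:
                results[pid] = 'sigterm'           # exited within its timeout
                break
            time.sleep(0.1)
        else:                                      # timeout expired, still alive
            try:
                os.kill(pid, signal.SIGKILL)
                results[pid] = 'sigkill'
            except ProcessLookupError:
                results[pid] = 'sigterm'
    return results
```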
- Add getMachineType, getLibDir, getNodeModulesDir, getTestEnv CLI commands to chrome_utils.js
  These are now the single source of truth for path calculations
- Update chrome_test_helpers.py with call_chrome_utils() dispatcher
- Add get_test_env_from_js(), get_machine_type_from_js(), kill_chrome_via_js() helpers
- Update cleanup_chrome and kill_chromium_session to use JS killChrome
- Remove unused Chrome binary search lists from singlefile hook (~25 lines)
- Update readability, mercury, favicon, title tests to use shared helpers
Added 10 practical examples demonstrating the JSONL piping architecture:
1. Basic archive with auto-cascade
2. Retry failed extractions (by status, plugin, domain)
3. Pinboard bookmark import with jq
4. GitHub repo filtering with jq regex
5. Selective extraction (screenshots only)
6. Bulk tag management
7. Deep documentation crawling
8. RSS feed monitoring
9. Archive audit with jq aggregation
10. Incremental backup with diff

Also added auto-cascade principle: `archivebox run` automatically
creates Snapshots from Crawls and ArchiveResults from Snapshots,
so intermediate commands are only needed for customization.
Extended graceful_terminate_background_hooks() to:
- Reap processes with os.waitpid() to get exit codes
- Write returncode to .returncode file for update_from_output()
- Return detailed result dict with status, returncode, and pid

Updated update_from_output() to:
- Read .returncode and .stderr.log files
- Determine status from returncode if no ArchiveResult JSONL record
- Include stderr in output_str for failed hooks
- Handle signal termination (negative returncodes like -9 for SIGKILL)
- Clean up .returncode files along with other hook output files
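
A sketch of the returncode-to-status inference, using the Unix convention that negative values mean death by signal:

```python
import signal

def interpret_returncode(returncode: int) -> tuple[str, str]:
    """Sketch: map an exit code from a .returncode file to (status, note)
    when the hook emitted no ArchiveResult JSONL record."""
    if returncode == 0:
        return 'succeeded', ''
    if returncode < 0:
        # Negative returncodes mean death by signal, e.g. -9 == SIGKILL.
        return 'failed', f'killed by {signal.Signals(-returncode).name}'
    return 'failed', f'exited with code {returncode}'
```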
- get_machine_type() matches JS getMachineType()
- get_lib_dir() matches JS getLibDir()
- get_node_modules_dir() matches JS getNodeModulesDir()
- get_extensions_dir() matches JS getExtensionsDir()
- find_chromium() matches JS findChromium()
- kill_chrome() matches JS killChrome()
- get_test_env() matches JS getTestEnv()

All functions now try JS first (single source of truth) with Python fallback.
Added backward compatibility aliases for old names.
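
A sketch of the try-JS-first dispatcher, assuming chrome_utils.js prints JSON to stdout; the script path, timeout, and fallback format are illustrative:

```python
import json
import platform
import subprocess

def call_chrome_utils(command: str, *args: str):
    """Sketch: ask chrome_utils.js first (single source of truth), and
    return None so the caller can fall back to pure Python."""
    try:
        result = subprocess.run(
            ['node', 'chrome_utils.js', command, *args],
            capture_output=True, text=True, timeout=15, check=True,
        )
        return json.loads(result.stdout)
    except (OSError, subprocess.SubprocessError, json.JSONDecodeError):
        return None

def get_machine_type() -> str:
    value = call_chrome_utils('getMachineType')
    if value is not None:
        return value
    # Pure-Python fallback (assumed shape, e.g. 'linux-x86_64').
    return f'{platform.system().lower()}-{platform.machine()}'
```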
- Trimmed from 10 to 8 focused examples
- Emphasize CLI args for DB filtering (efficient), jq for transforms
- Added key examples showing that `run` emits JSONL, enabling chained processing:
  - #4: Retry failed with different binary/timeout via jq transform
  - #8: Recursive link following (run → jq filter → crawl → run)
- Removed redundant jq domain filtering (use --url__icontains instead)
- Updated summary table with "Retry w/ Changes" and "Chain Processing" patterns
---
## Summary by cubic
Switch background hook cleanup to a graceful termination flow using plugin-specific timeouts, only SIGKILLing if needed. This improves reliability and records accurate exit codes and stderr for better result reporting.

- **Refactors**
  - Added graceful_terminate_background_hooks(): send SIGTERM to all hooks, wait per plugin timeout, SIGKILL remaining, reap with waitpid, write .returncode files.
  - Snapshot.cleanup() now uses merged config (get_config) to apply plugin-specific timeouts and terminate hooks gracefully.
  - update_from_output() reads .returncode and .stderr.log, infers status when no JSONL (handles signals like -9/-15), includes stderr on failures, and cleans up .returncode files.

Written for commit 524e8e9.
Phase 1: Model Prerequisites
- Add ArchiveResult.from_json() and from_jsonl() methods
- Fix Snapshot.to_json() to use tags_str (consistent with Crawl)

Phase 2: Shared Utilities
- Create archivebox/cli/cli_utils.py with shared apply_filters()
- Update 7 CLI files to import from cli_utils.py instead of duplicating

Phase 3: Pass-Through Behavior
- Add pass-through to crawl create (non-Crawl records pass unchanged)
- Add pass-through to snapshot create (Crawl records + others pass through)
- Add pass-through to archiveresult create (Snapshot records + others)
- Add create-or-update behavior to run command:
  - Records WITHOUT id: Create via Model.from_json()
  - Records WITH id: Lookup existing, re-queue
  - Outputs JSONL of all processed records for chaining

Phase 4: Test Infrastructure
- Create archivebox/tests/conftest.py with pytest-django fixtures
- Include CLI helpers, output assertions, database assertions

Phase 6: Config Update
- Update supervisord_util.py: orchestrator -> run command

This enables Unix-style piping:
  archivebox crawl create URL | archivebox run
  archivebox archiveresult list --status=failed | archivebox run
  curl API | jq transform | archivebox crawl create | archivebox run
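
A sketch of the create-or-update loop behind `archivebox run`; model_registry and retry_at are assumptions, and to_json()'s exact return type isn't specified here:

```python
import json
import sys

from django.utils import timezone

def run_from_stdin(model_registry: dict):
    """Sketch of the create-or-update pattern over piped JSONL:
    records without an id are created via Model.from_json(), records
    with an id are looked up and re-queued, and every processed record
    is re-emitted as JSONL for chaining."""
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        Model = model_registry[record['type']]        # e.g. 'Crawl', 'Snapshot'
        if record.get('id'):
            obj = Model.objects.get(id=record['id'])  # existing: look up
        else:
            obj = Model.from_json(record)             # new: create
        obj.retry_at = timezone.now()                 # hypothetical re-queue field
        obj.save()
        print(json.dumps(obj.to_json()))              # assuming to_json() returns a dict
```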
This consolidates scattered subprocess management logic into the Process model:

- terminate(): Graceful SIGTERM → wait → SIGKILL (replaces stop_worker, etc.)
- kill_tree(): Kill process and all OS children (replaces os.killpg logic)
- kill_children_db(): Kill DB-tracked child processes
- get_running(): Query running processes by type (replaces get_all_worker_pids)
- get_running_count(): Count running processes (replaces get_running_worker_count)
- stop_all(): Stop all processes of a type
- get_next_worker_id(): Get next worker ID for spawning

Added Phase 8 to TODO documenting ~390 lines that can be deleted after
consolidation, including workers/pid_utils.py which becomes obsolete.

Also includes migration 0002 for parent FK and process_type fields.
DELETED:
- workers/pid_utils.py (-192 lines) - replaced by Process model methods

SIMPLIFIED:
- crawls/models.py Crawl.cleanup() (80 lines -> 10 lines)
- hooks.py: deleted process_is_alive() and kill_process() (-45 lines)

UPDATED to use Process model:
- core/models.py: Snapshot.cleanup() and has_running_background_hooks()
- machine/models.py: Binary.cleanup()
- workers/worker.py: Worker.on_startup/shutdown, get_running_workers, start
- workers/orchestrator.py: Orchestrator.on_startup/shutdown, is_running

All subprocess management now uses:
- Process.current() for registering current process
- Process.get_running() / get_running_count() for querying
- Process.cleanup_stale_running() for cleanup
- safe_kill_process() for validated PID killing

Total line reduction: ~250 lines
Adds a new CLI command `archivebox pluginmap` that displays:
- ASCII art diagrams of all core state machines (Crawl, Snapshot,
  ArchiveResult, Binary)
- Lists all auto-detected on_Modelname_xyz hooks grouped by model/event
- Shows hook execution order (step 0-9), plugin name, and background status

Usage:
  archivebox pluginmap              # Show all diagrams and hooks
  archivebox pluginmap -m Snapshot  # Filter to specific model
  archivebox pluginmap -a           # Include disabled plugins
  archivebox pluginmap -q           # Output JSON only
Add comprehensive unit tests for the CLI piping architecture:
- test_cli_crawl.py: crawl create/list/update/delete tests
- test_cli_snapshot.py: snapshot create/list/update/delete tests
- test_cli_archiveresult.py: archiveresult create/list/update/delete tests
- test_cli_run.py: run command create-or-update and pass-through tests

Extend tests_piping.py with:
- TestPassThroughBehavior: tests for pass-through behavior in all commands
- TestPipelineAccumulation: tests for accumulating records through pipeline

All tests use pytest fixtures from conftest.py with isolated DATA_DIR.
- Add pwd validation in Process.launch() to prevent crashes
- Fix psutil returncode handling (use wait() return value, not returncode attr)
- Add None check for proc.pid in cleanup_stale_running()
- Add stale process cleanup in Orchestrator.is_running()
- Ensure orchestrator process_type is correctly set to ORCHESTRATOR
- Fix KeyboardInterrupt handling (exit code 0 for graceful shutdown)
- Throttle cleanup_stale_running() to once per 30 seconds for performance
- Fix worker process_type to use TypeChoices.WORKER consistently
- Fix get_running_workers() API to return list of dicts (not Process objects)
- Only delete PID files after successful kill or confirmed stale
- Fix migration index names to match between SQL and Django state
- Remove db_index=True from process_type (index created manually)
- Update documentation to reflect actual implementation
- Add explanatory comments to empty except blocks
- Fix exit codes to use Unix convention (128 + signal number)

Co-authored-by: Nick Sweeting <[email protected]>
- Add prominent view mode switcher with List/Grid toggle buttons
- Improve filter sidebar CSS with modern styling, rounded corners
- Add live progress bar for in-progress snapshots showing hooks status
- Show plugin icons only when output directory has content
- Display archive result output_size sum from new field
- Show hooks succeeded/total count in size column
- Add get_progress_stats() method to Snapshot model
- Add CSS for progress spinner and status badges
- Update grid view template with progress indicator for archiving cards
- Add tests for admin views, search, and progress stats
Resolved conflicts by keeping Process model changes and accepting dev changes for unrelated files. Ensured pid_utils.py remains deleted as intended by this PR.

Co-authored-by: Nick Sweeting <[email protected]>
…anup, and migration

- Fix Process.current() to store psutil cmdline instead of sys.argv for accurate validation
- Fix worker process_type detection: explicitly set to WORKER after registration
- Fix ArchiveResultWorker.start() to use Process.TypeChoices.WORKER consistently
- Fix migration to be explicitly irreversible (SQLite doesn't support DROP COLUMN)
- Fix get_running_workers() to return process_id instead of incorrectly named worker_id
- Fix safe_kill_process() to wait for termination and escalate to SIGKILL if needed
- Fix migration to include all indexes in state_operations (parent_id, process_type)
- Fix documentation to use Machine.current() scoping and StatusChoices constants

Co-authored-by: Nick Sweeting <[email protected]>
@pull pull bot locked and limited conversation to collaborators Dec 31, 2025
@pull pull bot added the ⤵️ pull label Dec 31, 2025
@pull pull bot merged commit bbbfffd into CrazyForks:dev Dec 31, 2025
1 check passed