Skip to content

Commit 54f91c1

Browse files
authored
Improve concurrency control between plugin hooks (ArchiveBox#1721)
<!-- IMPORTANT: Do not submit PRs with only formatting / PEP8 / line length changes. --> # Summary <!--e.g. This PR fixes ABC or adds the ability to do XYZ...--> # Related issues <!-- e.g. ArchiveBox#123 or Roadmap goal # https://github.com/pirate/ArchiveBox/wiki/Roadmap --> # Changes these areas - [ ] Bugfixes - [ ] Feature behavior - [ ] Command line interface - [ ] Configuration options - [ ] Internal architecture - [ ] Snapshot data layout on disk
2 parents 6d991a0 + 057b49a commit 54f91c1

30 files changed

+327
-127
lines changed

TODO_hook_concurrency.md

Lines changed: 48 additions & 36 deletions
Original file line numberDiff line numberDiff line change
@@ -310,37 +310,40 @@ archivebox/plugins/{plugin_name}/
310310
## Implementation Checklist
311311

312312
### Phase 1: Schema Migration ✅
313-
- [ ] Add `Snapshot.current_step` (IntegerField 0-9, default=0)
314-
- [ ] Add `ArchiveResult.hook_name` (CharField, nullable) - just filename
315-
- [ ] Create migration: `0033_snapshot_current_step_archiveresult_hook_name.py`
313+
- [x] Add `Snapshot.current_step` (IntegerField 0-9, default=0)
314+
- [x] Add `ArchiveResult.hook_name` (CharField, nullable) - just filename
315+
- [x] Create migration: `0034_snapshot_current_step.py`
316316

317-
### Phase 2: Core Logic Updates
318-
- [ ] Add `extract_step(hook_name)` utility in `archivebox/hooks.py`
317+
### Phase 2: Core Logic Updates
318+
- [x] Add `extract_step(hook_name)` utility in `archivebox/hooks.py`
319319
- Extract first digit from `__XX_` pattern
320320
- Default to 9 for unnumbered hooks
321-
- [ ] Update `Snapshot.create_pending_archiveresults()` in `archivebox/core/models.py`:
321+
- [x] Add `is_background_hook(hook_name)` utility in `archivebox/hooks.py`
322+
- Check for `.bg.` in filename
323+
- [x] Update `Snapshot.create_pending_archiveresults()` in `archivebox/core/models.py`:
322324
- Discover all hooks (not plugins)
323325
- Create one AR per hook with `hook_name` set
324-
- [ ] Update `ArchiveResult.run()` in `archivebox/core/models.py`:
326+
- [x] Update `ArchiveResult.run()` in `archivebox/core/models.py`:
325327
- If `hook_name` set: run single hook
326328
- If `hook_name` None: discover all plugin hooks (existing behavior)
327-
- [ ] Add `Snapshot.advance_step_if_ready()` method:
329+
- [x] Add `Snapshot.advance_step_if_ready()` method:
328330
- Check if all foreground ARs in current step finished
329331
- Increment `current_step` if ready
330332
- Ignore background hooks (.bg) in completion check
331-
- [ ] Integrate with `SnapshotMachine.is_finished()` in `archivebox/core/statemachines.py`:
333+
- [x] Integrate with `SnapshotMachine.is_finished()` in `archivebox/core/statemachines.py`:
332334
- Call `advance_step_if_ready()` before checking if done
333335

334-
### Phase 3: Worker Coordination
335-
- [ ] Update worker AR claiming query in `archivebox/workers/worker.py`:
336+
### Phase 3: Worker Coordination
337+
- [x] Update worker AR claiming query in `archivebox/workers/worker.py`:
336338
- Filter: `extract_step(ar.hook_name) <= snapshot.current_step`
337-
- Note: May need to denormalize or use clever query since step is derived
338-
- Alternative: Claim any AR in QUEUED state, check step in Python before processing
339+
- Claims ARs in QUEUED state, checks step in Python before processing
340+
- Orders by hook_name for deterministic execution within step
339341

340-
### Phase 4: Hook Renumbering
341-
- [ ] Renumber hooks per renumbering map below
342-
- [ ] Add `.bg` suffix to long-running hooks
343-
- [ ] Test all hooks still work after renumbering
342+
### Phase 4: Hook Renumbering ✅
343+
- [x] Renumber hooks per renumbering map below
344+
- [x] Add `.bg` suffix to long-running hooks (media, gallerydl, forumdl, papersdl)
345+
- [x] Move parse_* hooks to step 7 (70-79)
346+
- [x] Test all hooks still work after renumbering
344347

345348
## Migration Path
346349

@@ -353,25 +356,34 @@ No special migration needed:
353356

354357
### Renumbering Map
355358

356-
**Current → New:**
357-
```
358-
git/on_Snapshot__12_git.py → git/on_Snapshot__62_git.py
359-
media/on_Snapshot__51_media.py → media/on_Snapshot__63_media.bg.py
360-
gallerydl/on_Snapshot__52_gallerydl.py → gallerydl/on_Snapshot__64_gallerydl.bg.py
361-
forumdl/on_Snapshot__53_forumdl.py → forumdl/on_Snapshot__65_forumdl.bg.py
362-
papersdl/on_Snapshot__54_papersdl.py → papersdl/on_Snapshot__66_papersdl.bg.py
363-
364-
readability/on_Snapshot__52_readability.py → readability/on_Snapshot__55_readability.py
365-
mercury/on_Snapshot__53_mercury.py → mercury/on_Snapshot__56_mercury.py
366-
367-
singlefile/on_Snapshot__37_singlefile.py → singlefile/on_Snapshot__50_singlefile.py
368-
screenshot/on_Snapshot__34_screenshot.js → screenshot/on_Snapshot__51_screenshot.js
369-
pdf/on_Snapshot__35_pdf.js → pdf/on_Snapshot__52_pdf.js
370-
dom/on_Snapshot__36_dom.js → dom/on_Snapshot__53_dom.js
371-
title/on_Snapshot__32_title.js → title/on_Snapshot__54_title.js
372-
headers/on_Snapshot__33_headers.js → headers/on_Snapshot__55_headers.js
373-
374-
wget/on_Snapshot__50_wget.py → wget/on_Snapshot__61_wget.py
359+
**Completed Renames:**
360+
```
361+
# Step 5: DOM Extraction (sequential, non-background)
362+
singlefile/on_Snapshot__37_singlefile.py → singlefile/on_Snapshot__50_singlefile.py ✅
363+
screenshot/on_Snapshot__34_screenshot.js → screenshot/on_Snapshot__51_screenshot.js ✅
364+
pdf/on_Snapshot__35_pdf.js → pdf/on_Snapshot__52_pdf.js ✅
365+
dom/on_Snapshot__36_dom.js → dom/on_Snapshot__53_dom.js ✅
366+
title/on_Snapshot__32_title.js → title/on_Snapshot__54_title.js ✅
367+
readability/on_Snapshot__52_readability.py → readability/on_Snapshot__55_readability.py ✅
368+
headers/on_Snapshot__33_headers.js → headers/on_Snapshot__55_headers.js ✅
369+
mercury/on_Snapshot__53_mercury.py → mercury/on_Snapshot__56_mercury.py ✅
370+
htmltotext/on_Snapshot__54_htmltotext.py → htmltotext/on_Snapshot__57_htmltotext.py ✅
371+
372+
# Step 6: Post-DOM Extraction (background for long-running)
373+
wget/on_Snapshot__50_wget.py → wget/on_Snapshot__61_wget.py ✅
374+
git/on_Snapshot__12_git.py → git/on_Snapshot__62_git.py ✅
375+
media/on_Snapshot__51_media.py → media/on_Snapshot__63_media.bg.py ✅
376+
gallerydl/on_Snapshot__52_gallerydl.py → gallerydl/on_Snapshot__64_gallerydl.bg.py ✅
377+
forumdl/on_Snapshot__53_forumdl.py → forumdl/on_Snapshot__65_forumdl.bg.py ✅
378+
papersdl/on_Snapshot__54_papersdl.py → papersdl/on_Snapshot__66_papersdl.bg.py ✅
379+
380+
# Step 7: URL Extraction (parse_* hooks moved from step 6)
381+
parse_html_urls/on_Snapshot__60_parse_html_urls.py → parse_html_urls/on_Snapshot__70_parse_html_urls.py ✅
382+
parse_txt_urls/on_Snapshot__62_parse_txt_urls.py → parse_txt_urls/on_Snapshot__71_parse_txt_urls.py ✅
383+
parse_rss_urls/on_Snapshot__61_parse_rss_urls.py → parse_rss_urls/on_Snapshot__72_parse_rss_urls.py ✅
384+
parse_netscape_urls/on_Snapshot__63_parse_netscape_urls.py → parse_netscape_urls/on_Snapshot__73_parse_netscape_urls.py ✅
385+
parse_jsonl_urls/on_Snapshot__64_parse_jsonl_urls.py → parse_jsonl_urls/on_Snapshot__74_parse_jsonl_urls.py ✅
386+
parse_dom_outlinks/on_Snapshot__40_parse_dom_outlinks.js → parse_dom_outlinks/on_Snapshot__75_parse_dom_outlinks.js ✅
375387
```
376388

377389
## Testing Strategy

archivebox/cli/archivebox_status.py

Lines changed: 27 additions & 41 deletions
Original file line numberDiff line numberDiff line change
@@ -11,18 +11,6 @@
1111
from archivebox.config import DATA_DIR, CONSTANTS, ARCHIVE_DIR
1212
from archivebox.config.common import SHELL_CONFIG
1313
from archivebox.misc.legacy import parse_json_links_details
14-
from archivebox.misc.folders import (
15-
get_indexed_folders,
16-
get_archived_folders,
17-
get_invalid_folders,
18-
get_unarchived_folders,
19-
get_present_folders,
20-
get_valid_folders,
21-
get_duplicate_folders,
22-
get_orphaned_folders,
23-
get_corrupted_folders,
24-
get_unrecognized_folders,
25-
)
2614
from archivebox.misc.system import get_dir_size
2715
from archivebox.misc.logging_util import printable_filesize
2816

@@ -55,42 +43,40 @@ def status(out_dir: Path=DATA_DIR) -> None:
5543
size = printable_filesize(num_bytes)
5644
print(f' Size: {size} across {num_files} files in {num_dirs} directories')
5745

58-
num_indexed = len(get_indexed_folders(links, out_dir=out_dir))
59-
num_archived = len(get_archived_folders(links, out_dir=out_dir))
60-
num_unarchived = len(get_unarchived_folders(links, out_dir=out_dir))
61-
print(f' > indexed: {num_indexed}'.ljust(36), f'({get_indexed_folders.__doc__})')
62-
print(f' > archived: {num_archived}'.ljust(36), f'({get_archived_folders.__doc__})')
63-
print(f' > unarchived: {num_unarchived}'.ljust(36), f'({get_unarchived_folders.__doc__})')
64-
65-
num_present = len(get_present_folders(links, out_dir=out_dir))
66-
num_valid = len(get_valid_folders(links, out_dir=out_dir))
46+
# Use DB as source of truth for snapshot status
47+
num_indexed = links.count()
48+
num_archived = links.filter(status='archived').count() or links.exclude(downloaded_at=None).count()
49+
num_unarchived = links.filter(status='queued').count() or links.filter(downloaded_at=None).count()
50+
print(f' > indexed: {num_indexed}'.ljust(36), '(total snapshots in DB)')
51+
print(f' > archived: {num_archived}'.ljust(36), '(snapshots with archived content)')
52+
print(f' > unarchived: {num_unarchived}'.ljust(36), '(snapshots pending archiving)')
53+
54+
# Count directories on filesystem
55+
num_present = 0
56+
orphaned_dirs = []
57+
if ARCHIVE_DIR.exists():
58+
for entry in ARCHIVE_DIR.iterdir():
59+
if entry.is_dir():
60+
num_present += 1
61+
if not links.filter(timestamp=entry.name).exists():
62+
orphaned_dirs.append(str(entry))
63+
64+
num_valid = min(num_present, num_indexed) # approximate
6765
print()
68-
print(f' > present: {num_present}'.ljust(36), f'({get_present_folders.__doc__})')
69-
print(f' > [green]valid:[/green] {num_valid}'.ljust(36), f' ({get_valid_folders.__doc__})')
70-
71-
duplicate = get_duplicate_folders(links, out_dir=out_dir)
72-
orphaned = get_orphaned_folders(links, out_dir=out_dir)
73-
corrupted = get_corrupted_folders(links, out_dir=out_dir)
74-
unrecognized = get_unrecognized_folders(links, out_dir=out_dir)
75-
num_invalid = len({**duplicate, **orphaned, **corrupted, **unrecognized})
76-
print(f' > [red]invalid:[/red] {num_invalid}'.ljust(36), f' ({get_invalid_folders.__doc__})')
77-
print(f' > duplicate: {len(duplicate)}'.ljust(36), f'({get_duplicate_folders.__doc__})')
78-
print(f' > orphaned: {len(orphaned)}'.ljust(36), f'({get_orphaned_folders.__doc__})')
79-
print(f' > corrupted: {len(corrupted)}'.ljust(36), f'({get_corrupted_folders.__doc__})')
80-
print(f' > unrecognized: {len(unrecognized)}'.ljust(36), f'({get_unrecognized_folders.__doc__})')
66+
print(f' > present: {num_present}'.ljust(36), '(directories in archive/)')
67+
print(f' > [green]valid:[/green] {num_valid}'.ljust(36), ' (directories with matching DB entry)')
68+
69+
num_orphaned = len(orphaned_dirs)
70+
print(f' > [red]orphaned:[/red] {num_orphaned}'.ljust(36), ' (directories without matching DB entry)')
8171

8272
if num_indexed:
83-
print(' [violet]Hint:[/violet] You can list link data directories by status like so:')
84-
print(' [green]archivebox list --status=<status> (e.g. indexed, corrupted, archived, etc.)[/green]')
73+
print(' [violet]Hint:[/violet] You can list snapshots by status like so:')
74+
print(' [green]archivebox list --status=<status> (e.g. archived, queued, etc.)[/green]')
8575

86-
if orphaned:
76+
if orphaned_dirs:
8777
print(' [violet]Hint:[/violet] To automatically import orphaned data directories into the main index, run:')
8878
print(' [green]archivebox init[/green]')
8979

90-
if num_invalid:
91-
print(' [violet]Hint:[/violet] You may need to manually remove or fix some invalid data directories, afterwards make sure to run:')
92-
print(' [green]archivebox init[/green]')
93-
9480
print()
9581
print('[green]\\[*] Scanning recent archive changes and user logins:[/green]')
9682
print(f'[yellow] {CONSTANTS.LOGS_DIR}/*[/yellow]')

archivebox/core/migrations/0032_alter_archiveresult_binary_and_more.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
# Generated by Django 6.0 on 2025-12-28 05:12
22

33
import django.db.models.deletion
4-
import uuid
4+
from archivebox import uuid_compat
55
from django.conf import settings
66
from django.db import migrations, models
77

@@ -49,7 +49,7 @@ class Migration(migrations.Migration):
4949
migrations.AlterField(
5050
model_name='archiveresult',
5151
name='uuid',
52-
field=models.UUIDField(blank=True, db_index=True, default=uuid.uuid7, null=True),
52+
field=models.UUIDField(blank=True, db_index=True, default=uuid_compat.uuid7, null=True),
5353
),
5454
migrations.AddConstraint(
5555
model_name='snapshot',
Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,23 @@
1+
# Generated by Django 6.0 on 2025-12-28
2+
# Add Snapshot.current_step field for hook step-based execution
3+
4+
from django.db import migrations, models
5+
6+
7+
class Migration(migrations.Migration):
8+
9+
dependencies = [
10+
('core', '0033_rename_extractor_add_hook_name'),
11+
]
12+
13+
operations = [
14+
migrations.AddField(
15+
model_name='snapshot',
16+
name='current_step',
17+
field=models.PositiveSmallIntegerField(
18+
default=0,
19+
db_index=True,
20+
help_text='Current hook step being executed (0-9). Used for sequential hook execution.'
21+
),
22+
),
23+
]

0 commit comments

Comments
 (0)