Commit 42b2f78

fix: examples accuracy, export stalling, missing images, third-party asset filtering
Docs & examples:
- Fix install instructions (not on PyPI, use git clone + local install)
- Fix invalid format name markdown-github -> html,md
- Fix placeholder URLs (your-org -> 19-84, security@example.com)
- Fix EXPORT_FORMATS docs to use valid names (html, md, hybrid)
- Remove 9 deadwood config options ([export.html/markdown/github])

Docker:
- Dockerfiles now pip install the package (not bare deps)
- Fix healthcheck to use python -m chronicon.cli
- Fix postgres compose: API volume not read-only, watch gets external network
- Fix double entrypoint in postgres compose usage comment
- Remove deprecated version key and duplicate restart in prod compose

Export stalling:
- Add iter_topics_batched() to database layer for paginated iteration
- Rewrite search indexer to stream JSON to disk (single pass, no list accumulation)
- Fix SEO context to use current page posts instead of loading all
- Sitemap writes directly to file instead of building in-memory list
- HTML and markdown exporters use batched topic iteration

Missing images:
- Extract data-src/data-original attributes (Discourse lazy loading)
- Fix emoji URL filter (class="emoji" is sufficient, don't require "emoji" in path)
- Normalize protocol-relative URLs (//domain) to https:// everywhere
- Fix lightbox URL extraction doubling domain on // URLs
- Log download failures with URL and error type
- Handle filename collisions with hash suffix
- Fix Tier 3 fallback: exact filename match instead of substring
- Fix Windows path separator in get_assets_for_topic()
- Remove bogus /images/favicon.ico and /images/logo.png fallback downloads
- Filter third-party URLs before downloading (only forum domain + CDN)
- Skip emoji class images in extract_image_sets (handled separately)

CLI:
- Add --base-url flag for canonical URL (was config-only)
- Add --timeout, --retry-max, --posts-per-page flags
- Wire timeout/retry into API client from config

Tests:
- Add test_search_indexer_streaming.py (10 tests)
- Add test_examples_accuracy.py (21 e2e tests for examples)
- Add data-src/data-original extraction tests
- Add protocol-relative URL tests
- Add iter_topics_batched and cross-platform path tests
- Update emoji filter and site asset tests for new behavior
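
The batched iteration named in the "Export stalling" items could look roughly like the sketch below. The actual implementation of iter_topics_batched() is not part of this excerpt, so the signature, the `topics` table, its columns, and the keyset pagination strategy are all assumptions made for illustration.

```python
import sqlite3
from collections.abc import Iterator


def iter_topics_batched(
    conn: sqlite3.Connection, batch_size: int = 500
) -> Iterator[list[tuple[int, str]]]:
    """Yield (id, title) rows in id order, at most batch_size rows at a time.

    Hypothetical sketch: the real function's signature and schema may differ.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    last_id = 0  # keyset cursor; assumes topic ids are positive integers
    while True:
        rows = conn.execute(
            "SELECT id, title FROM topics WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # resume after the highest id seen so far


# Demo against an in-memory database with an assumed two-column schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE topics (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO topics (id, title) VALUES (?, ?)",
    [(i, f"Topic {i}") for i in range(1, 6)],
)
for batch in iter_topics_batched(conn, batch_size=2):
    print([topic_id for topic_id, _ in batch])
```

Only one batch is held in memory at a time, which is the property the exporters would rely on for large forums. Paging by `id > last_id` rather than by OFFSET keeps each query cheap as the cursor advances.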
1 parent 59dd1d1 commit 42b2f78
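
Two of the "Missing images" items in the commit message, normalizing protocol-relative URLs and filtering third-party URLs, can be illustrated with the standard library alone. This is a sketch under stated assumptions: the function names and the example hosts are hypothetical and are not taken from the Chronicon source.

```python
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Turn a protocol-relative URL (//host/path) into an https:// URL."""
    return "https:" + url if url.startswith("//") else url


def is_first_party(url: str, allowed_hosts: set[str]) -> bool:
    """Return True only when the URL's host is an allowed forum or CDN host.

    Relative paths have no host and return False here; a caller would
    resolve them against the forum base URL before checking.
    """
    host = urlsplit(normalize_url(url)).hostname  # already lower-cased
    return host is not None and host in allowed_hosts


# Hypothetical forum and CDN hosts, for illustration only.
allowed = {"forum.example.com", "cdn.example.com"}
for candidate in (
    "//cdn.example.com/uploads/a.png",
    "https://tracker.example.net/pixel.gif",
):
    print(candidate, is_first_party(candidate, allowed))
```

Normalizing before the host check means `//cdn.example.com/...` and `https://cdn.example.com/...` are treated identically, so a download step sees one canonical form.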

31 files changed: +1052 -308 lines changed
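
The commit message describes rewriting the search indexer to stream JSON to disk in a single pass. The indexer itself is not shown in this excerpt, so the following is only a generic sketch of that technique; the function name, record shape, and file name are assumptions.

```python
import json
import tempfile
from collections.abc import Iterable
from pathlib import Path


def write_index_streaming(records: Iterable[dict], path: Path) -> int:
    """Write records as one JSON array, one record at a time.

    The full list is never built in memory; only the current record is.
    """
    count = 0
    with path.open("w", encoding="utf-8") as fh:
        fh.write("[")
        for record in records:
            if count:
                fh.write(",")  # separator goes before every record but the first
            json.dump(record, fh, ensure_ascii=False)
            count += 1
        fh.write("]")
    return count


# Demo: feed a generator so that no intermediate list exists.
with tempfile.TemporaryDirectory() as tmp:
    index_path = Path(tmp) / "search_index.json"
    written = write_index_streaming(
        ({"id": i, "title": f"Topic {i}"} for i in range(3)), index_path
    )
    loaded = json.loads(index_path.read_text(encoding="utf-8"))
    print(written, loaded)
```

The output is still a single valid JSON document, so a consumer that expects a plain array does not need to change.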

.chronicon.toml.example

Lines changed: 0 additions & 15 deletions
@@ -22,21 +22,6 @@ text_only = false
 # Leave commented out for offline-only archives
 # canonical_base_url = "https://example.github.io/forum-archive"
 
-[export.html]
-theme_adaptation = "simplified" # simplified, full, minimal
-enable_search = true
-responsive = true
-
-[export.markdown]
-convert_html = true
-preserve_formatting = true
-include_metadata_header = true
-
-[export.github]
-generate_readme = true
-relative_image_paths = true
-gfm_syntax = true
-
 # Continuous mode settings
 [continuous]
 polling_interval_minutes = 10

Dockerfile

Lines changed: 5 additions & 4 deletions
@@ -17,13 +17,14 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
 RUN python3 -m venv /opt/venv
 ENV PATH="/opt/venv/bin:$PATH"
 
-# Copy only requirements first for caching
+# Copy project files for install
 WORKDIR /build
 COPY pyproject.toml ./
+COPY src/ ./src/
 
-# Install dependencies only (not the package itself)
+# Install package with all dependencies
 RUN pip install --no-cache-dir --upgrade pip setuptools wheel && \
-    pip install --no-cache-dir beautifulsoup4 html2text jinja2 rich
+    pip install --no-cache-dir .
 
 # Stage 2: Runtime stage (minimal)
 FROM python:3.12-slim
@@ -67,7 +68,7 @@ RUN chmod -R go-w /app 2>/dev/null || true
 
 # Health check
 HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
-    CMD chronicon validate --output-dir /archives || exit 1
+    CMD python -m chronicon.cli validate --output-dir /archives || exit 1
 
 # Labels for metadata
 LABEL maintainer="Chronicon" \

Dockerfile.alpine

Lines changed: 5 additions & 4 deletions
@@ -16,13 +16,14 @@ RUN apk add --no-cache \
 RUN python3 -m venv /opt/venv
 ENV PATH="/opt/venv/bin:$PATH"
 
-# Copy only requirements first for caching
+# Copy project files for install
 WORKDIR /build
 COPY pyproject.toml ./
+COPY src/ ./src/
 
-# Install dependencies only (not the package itself)
+# Install package with all dependencies
 RUN pip install --no-cache-dir --upgrade pip setuptools wheel && \
-    pip install --no-cache-dir beautifulsoup4 html2text jinja2 rich
+    pip install --no-cache-dir .
 
 # Stage 2: Runtime stage (minimal)
 FROM python:3.12-alpine3.20
@@ -67,7 +68,7 @@ RUN chmod -R go-w /app 2>/dev/null || true
 
 # Health check (for validation/status commands)
 HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
-    CMD chronicon validate --output-dir /archives || exit 1
+    CMD python -m chronicon.cli validate --output-dir /archives || exit 1
 
 # Labels for metadata
 LABEL maintainer="Chronicon" \

README.md

Lines changed: 9 additions & 8 deletions
@@ -32,17 +32,18 @@ If you find this useful, consider giving it a star on GitHub — it helps others
 ## Installation
 
 ```bash
-# Using uv (recommended)
-curl -LsSf https://astral.sh/uv/install.sh | sh
-uv tool install chronicon
+# Clone and install with uv (recommended)
+git clone https://github.com/19-84/chronicon.git
+cd chronicon
+uv sync --all-extras
 
-# Or with pip
-pip install chronicon
+# Or install with pip from source
+git clone https://github.com/19-84/chronicon.git
+cd chronicon
+pip install .
 
 # With PostgreSQL support (optional)
-pip install chronicon[postgres]
-# or
-uv tool install chronicon[postgres]
+pip install ".[postgres]"
 ```
 
 ## Quick Start

examples/README.md

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ chronicon archive \
   --urls https://meta.discourse.org \
   --categories 61 \
   --output-dir ./my-archive \
-  --formats html,markdown-github \
+  --formats html,md \
   --search-backend static
 
 # Export with canonical URLs for GitHub Pages

examples/docker/.env.postgres.example

Lines changed: 2 additions & 1 deletion
@@ -3,7 +3,8 @@
 POSTGRES_PASSWORD=change_me_in_production
 
 # Export Formats
-# Options: html, markdown, html,markdown (hybrid)
+# Valid format names: html, md, hybrid
+# Comma-separate for multiple: html,md
 EXPORT_FORMATS=html
 
 # Git Push Integration (optional)

examples/docker/.env.sqlite.example

Lines changed: 2 additions & 1 deletion
@@ -2,7 +2,8 @@
 # No database credentials needed — SQLite uses a local file at /archives/archive.db
 
 # Export Formats
-# Options: html, markdown, html,markdown (hybrid)
+# Valid format names: html, md, hybrid
+# Comma-separate for multiple: html,md
 EXPORT_FORMATS=html
 
 # Git Push Integration (optional)
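
The two .env changes above document the valid format names (html, md, hybrid) and the comma-separated form. A check along the following lines would reject stale values such as markdown-github early. This is a hypothetical helper, not Chronicon's own parser, and it makes no assumption about what hybrid expands to.

```python
# Names taken from the comments in the .env examples above.
VALID_FORMATS = ("html", "md", "hybrid")


def parse_export_formats(value: str) -> list[str]:
    """Split a comma-separated EXPORT_FORMATS value and validate each name.

    Hypothetical helper for illustration; the real CLI may parse differently.
    """
    names = [part.strip() for part in value.split(",") if part.strip()]
    unknown = [name for name in names if name not in VALID_FORMATS]
    if unknown:
        raise ValueError(
            f"unknown export format(s): {', '.join(unknown)}; "
            f"valid names are {', '.join(VALID_FORMATS)}"
        )
    return names


print(parse_export_formats("html,md"))
try:
    parse_export_formats("html,markdown-github")
except ValueError as exc:
    print(f"rejected: {exc}")
```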

examples/docker/Dockerfile.alpine

Lines changed: 5 additions & 4 deletions
@@ -16,13 +16,14 @@ RUN apk add --no-cache \
 RUN python3 -m venv /opt/venv
 ENV PATH="/opt/venv/bin:$PATH"
 
-# Copy only requirements first for caching
+# Copy project files for install
 WORKDIR /build
 COPY pyproject.toml ./
+COPY src/ ./src/
 
-# Install dependencies only (not the package itself)
+# Install package with all dependencies
 RUN pip install --no-cache-dir --upgrade pip setuptools wheel && \
-    pip install --no-cache-dir beautifulsoup4 html2text jinja2 rich
+    pip install --no-cache-dir .
 
 # Stage 2: Runtime stage (minimal)
 FROM python:3.12-alpine3.20
@@ -67,7 +68,7 @@ RUN chmod -R go-w /app 2>/dev/null || true
 
 # Health check (for validation/status commands)
 HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
-    CMD chronicon validate --output-dir /archives || exit 1
+    CMD python -m chronicon.cli validate --output-dir /archives || exit 1
 
 # Labels for metadata
 LABEL maintainer="Chronicon" \

examples/docker/Dockerfile.alpine-watch

Lines changed: 4 additions & 3 deletions
@@ -16,13 +16,14 @@ RUN apk add --no-cache \
 RUN python3 -m venv /opt/venv
 ENV PATH="/opt/venv/bin:$PATH"
 
-# Copy only requirements first for caching
+# Copy project files for install
 WORKDIR /build
 COPY pyproject.toml ./
+COPY src/ ./src/
 
-# Install dependencies only (not the package itself)
+# Install package with all dependencies
 RUN pip install --no-cache-dir --upgrade pip setuptools wheel && \
-    pip install --no-cache-dir beautifulsoup4 html2text jinja2 rich
+    pip install --no-cache-dir .
 
 # Stage 2: Runtime stage (minimal)
 FROM python:3.12-alpine3.20

examples/docker/README-ALPINE.md

Lines changed: 4 additions & 4 deletions
@@ -461,12 +461,12 @@ docker-compose -f docker-compose.prod.yml up -d
 - Watch Mode: `../../WATCH_MODE.md`
 
 ### Issues
-Report security issues privately to: security@example.com
-Report bugs: https://github.com/your-org/chronicon/issues
+Report security issues privately: https://github.com/19-84/chronicon/security/advisories
+Report bugs: https://github.com/19-84/chronicon/issues
 
 ### Community
-- Discussions: https://github.com/your-org/chronicon/discussions
-- Discord: https://discord.gg/chronicon (if available)
+- Discussions: https://github.com/19-84/chronicon/discussions
+- Issues: https://github.com/19-84/chronicon/issues
 
 ## License
 