docflow automates your personal document pipeline (Instapaper posts, podcasts, Markdown notes, PDFs, images, and tweets) and serves everything locally from `BASE_DIR`.
Podcast snippets are typically captured in Snipd and then exported into this pipeline.
- Single local source of truth: `BASE_DIR` (resolved from `DOCFLOW_BASE_DIR`, typically set in `~/.docflow_env`).
- Single local server: `python utils/docflow_server.py`.
- Static site output under `BASE_DIR/_site`.
- Local state under `BASE_DIR/state`.
- No remote deploy flow in this repository.
- `BASE_DIR` is no longer hardcoded in `config.py`; it now comes from the environment variable `DOCFLOW_BASE_DIR`.
- Canonical place to set it: `~/.docflow_env`.
- If `DOCFLOW_BASE_DIR` is missing, importing `config.py` fails with a clear error.
- For direct commands from this repo, load your environment first: `source ~/.docflow_env`.

Recommended `~/.docflow_env` snippet:

```bash
export DOCFLOW_BASE_DIR="/path/to/BASE_DIR"
export INTRANET_BASE_DIR="$DOCFLOW_BASE_DIR"
export HIGHLIGHTS_DAILY_DIR="/path/to/Obsidian/Subrayados"
export DONE_LINKS_FILE="/path/to/Obsidian/Leidos.md"
```

`BASE_DIR` is expected to contain:

```
Incoming/
Posts/Posts <YEAR>/
Tweets/Tweets <YEAR>/
Podcasts/Podcasts <YEAR>/
Pdfs/Pdfs <YEAR>/
Images/Images <YEAR>/
_site/    (generated)
state/    (generated)
```
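The fail-fast environment check described above could look roughly like this (a minimal sketch; the actual resolution logic and error message in `config.py` may differ):

```python
import os
from pathlib import Path


def resolve_base_dir(env: dict) -> Path:
    """Resolve BASE_DIR from DOCFLOW_BASE_DIR, failing fast when unset.

    Hypothetical sketch of the fail-fast check in config.py.
    """
    raw = env.get("DOCFLOW_BASE_DIR")
    if not raw:
        raise RuntimeError(
            "DOCFLOW_BASE_DIR is not set. Define it in ~/.docflow_env "
            "and run `source ~/.docflow_env` before importing config.py."
        )
    return Path(raw).expanduser()
```

At import time, `config.py` would call something like `resolve_base_dir(os.environ)`, so any command run without sourcing `~/.docflow_env` stops immediately with an actionable message instead of failing later on a missing path.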
- Python 3.10+
- Core dependencies:
```bash
pip install requests beautifulsoup4 markdownify openai pillow pytest markdown
```

Optional for X likes queue:

```bash
pip install "playwright>=1.55"
playwright install chromium
```

- Configure environment variables (as needed):
```bash
export OPENAI_API_KEY=...
export INSTAPAPER_USERNAME=...
export INSTAPAPER_PASSWORD=...
export DOCFLOW_BASE_DIR="/path/to/BASE_DIR"
export TWEET_LIKES_STATE="$HOME/.secrets/docflow/x_state.json"
export TWEET_LIKES_URL=https://x.com/<user>/likes
export TWEET_LIKES_MAX=50
export HIGHLIGHTS_DAILY_DIR="/path/to/Obsidian/Subrayados"
export DONE_LINKS_FILE="/path/to/Obsidian/Leidos.md"
```

Keep `TWEET_LIKES_STATE` outside the repo so cleanup operations do not delete it.
- Run the processing pipeline:

```bash
python process_documents.py all --year 2026
```

- Build local intranet pages:

```bash
python utils/build_browse_index.py --base-dir "$DOCFLOW_BASE_DIR"
python utils/build_reading_index.py --base-dir "$DOCFLOW_BASE_DIR"
python utils/build_working_index.py --base-dir "$DOCFLOW_BASE_DIR"
python utils/build_done_index.py --base-dir "$DOCFLOW_BASE_DIR"
```

- Run the local server:
```bash
source ~/.docflow_env
python utils/docflow_server.py --base-dir "$DOCFLOW_BASE_DIR" --host localhost --port 8080
```

Optional full rebuild at startup:
```bash
source ~/.docflow_env
python utils/docflow_server.py --base-dir "$DOCFLOW_BASE_DIR" --rebuild-on-start
```

Preferred day-to-day usage is the LaunchAgent-managed intranet service:

```bash
launchctl kickstart -k "gui/$(id -u)/com.domingo.docflow.intranet"
```

Useful service commands:
```bash
launchctl print "gui/$(id -u)/com.domingo.docflow.intranet" | rg 'state =|pid =|last exit code ='
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8080/
```

If the agent is not loaded yet in a new environment:

```bash
launchctl bootstrap "gui/$(id -u)" ~/Library/LaunchAgents/com.domingo.docflow.intranet.plist
```

Use this command to download/process all document types:
```bash
bash bin/docflow.sh all
```

Behavior:
- Loads `~/.docflow_env` if present.
- Runs `process_documents.py` with your arguments (`all` for full ingestion).
- Rebuilds the intranet browse/reading/working/done pages (`utils/build_browse_index.py`, `utils/build_reading_index.py`, `utils/build_working_index.py`, and `utils/build_done_index.py`) when processing succeeds.
Optional override:

```bash
INTRANET_BASE_DIR="/path/to/base" bash bin/docflow.sh all
```

Use this dedicated process for daily tweet consolidation:
```bash
bash bin/docflow_tweet_daily.sh
```

Behavior:
- Loads `~/.docflow_env` if present.
- Runs `bin/build_tweet_consolidated.sh --yesterday`.
- Rebuilds the intranet browse/reading/working/done pages when consolidation succeeds.
`utils/docflow_server.py` serves:

- Static files from `BASE_DIR/_site`
- Raw files from `BASE_DIR` routes (`/posts/raw/...`, `/tweets/raw/...`, etc.)
- `browse` list default ordering: by file recency (items in Reading/Working/Done are hidden from browse)
- `browse` pages include a top `Highlights first` toggle to prioritize highlighted items
- `reading` list ordering: by `reading_at` (oldest first)
- `working` list ordering: by `working_at` (newest first)
- `done` list ordering: by `done_at` (newest first)
- `to-done` can be triggered from Browse, Reading, or Working, and preserves stage start metadata in `state/done.json` when available (`reading_started_at`, `working_started_at`)
- JSON API actions:
  - `POST /api/to-reading`
  - `POST /api/to-working`
  - `POST /api/to-done`
  - `POST /api/to-browse`
  - `POST /api/reopen`
  - `POST /api/delete`
  - `POST /api/rebuild`
  - `POST /api/rebuild-file`
  - `GET /api/export-pdf?path=<rel_path>`
  - `GET /api/highlights?path=<rel_path>`
  - `PUT /api/highlights?path=<rel_path>`
If `DONE_LINKS_FILE` is set, each `POST /api/to-done` transition appends a Markdown link entry to that file.
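As an illustration, a state transition could be triggered from a script roughly like this. This is a hedged sketch: the JSON payload shape (in particular the `path` field) is an assumption, so check `utils/docflow_server.py` for the actual request contract.

```python
import json
from urllib import request


def post_action(base_url: str, action: str, rel_path: str) -> int:
    """POST a JSON state-transition request to the local docflow server.

    The body shape {"path": <rel_path>} is an assumption made for
    illustration; the real server may expect a different payload.
    """
    req = request.Request(
        f"{base_url}/api/{action}",
        data=json.dumps({"path": rel_path}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return resp.status


# Example (requires the server to be running):
# post_action("http://127.0.0.1:8080", "to-done", "Posts/Posts 2026/example.html")
```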
All state is stored under `BASE_DIR/state/`:

- `reading.json`: per-path `reading_at` timestamp.
- `working.json`: per-path `working_at` timestamp.
- `done.json`: per-path `done_at` timestamp and optional transition metadata copied on `to-done`:
  - `reading_started_at` (from `reading_at` when moving from Reading to Done)
  - `working_started_at` (from `working_at` when moving from Working to Done)
These fields allow post-hoc lead-time calculations for completed items (for example `done_at - working_started_at`).
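Such a lead-time calculation could be done offline with a short script like the sketch below. It assumes `done.json` maps relative paths to entries with ISO-8601 timestamps; the actual schema is defined by the server, so verify it before relying on this.

```python
import json
from datetime import datetime
from pathlib import Path


def working_lead_times(done_json: Path) -> dict:
    """Compute done_at - working_started_at per completed item.

    Assumes done.json maps relative paths to entries such as
    {"Posts/foo.html": {"done_at": "...", "working_started_at": "..."}}
    with ISO-8601 timestamps -- the exact schema is an assumption.
    """
    entries = json.loads(done_json.read_text())
    leads = {}
    for rel_path, meta in entries.items():
        started = meta.get("working_started_at")
        done = meta.get("done_at")
        if started and done:
            leads[rel_path] = datetime.fromisoformat(done) - datetime.fromisoformat(started)
    return leads
```

Items that went straight to Done without a Working stage simply lack `working_started_at` and are skipped.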
- Queue from likes feed:

```bash
python process_documents.py tweets
```

- One-time browser state creation:

```bash
python utils/create_x_state.py --state-path "$HOME/.secrets/docflow/x_state.json"
```

- Daily consolidated tweets helper:

```bash
bash bin/build_tweet_consolidated.sh
bash bin/build_tweet_consolidated.sh --day 2026-02-13
bash bin/build_tweet_consolidated.sh --all-days
bash bin/build_tweet_consolidated.sh --all-days --cleanup-existing
```

By default, daily grouping for tweet source files uses a local rollover hour of 03:00 to include just-after-midnight downloads in the previous day. Override with `DOCFLOW_TWEET_DAY_ROLLOVER_HOUR` (0-23) when needed.

`--cleanup-existing` removes only source tweet `.html` files for consolidated days and keeps source `.md`.
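The rollover rule above can be sketched as a small day-mapping function (an illustration of the described behavior, not the repo's actual implementation):

```python
from datetime import date, datetime, timedelta


def tweet_day(ts: datetime, rollover_hour: int = 3) -> date:
    """Map a local timestamp to its grouping day.

    Sketch of the rollover rule: anything before rollover_hour
    (03:00 by default) is grouped with the previous day, so
    just-after-midnight downloads land on the prior day's page.
    """
    if not 0 <= rollover_hour <= 23:
        raise ValueError("rollover hour must be in 0-23")
    if ts.hour < rollover_hour:
        return (ts - timedelta(days=1)).date()
    return ts.date()
```

Setting `DOCFLOW_TWEET_DAY_ROLLOVER_HOUR=0` would correspond to plain calendar-day grouping.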
- Daily highlights report helper:

```bash
python utils/build_daily_highlights_report.py --day 2026-02-13 --output "/tmp/highlights-2026-02-13.md"
```

- Daily highlights report runner (previous day, to the Obsidian `Subrayados` folder):

```bash
bash bin/docflow_highlights_daily.sh
```

Use this helper to clean rich content from the macOS clipboard and convert it to compact Markdown (optimized for Obsidian paste behavior):
```bash
bin/mdclip
```

Behavior:
- Reads HTML from the clipboard when available (`pbpaste -Prefer html`, then macOS pasteboard fallbacks).
- Converts it to Markdown and removes extra blank lines between list items.
- Writes the cleaned Markdown back to the clipboard by default.
Useful flags:

```bash
bin/mdclip --print
bin/mdclip --no-copy
bin/mdclip --from-stdin --no-copy --print < /path/to/input.html
```

Keyboard shortcut bindings (for example `cmd+shift+L`) are configured outside this repo (Shortcuts/automation tool). The versioned command to invoke is `bin/mdclip`.
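The blank-line cleanup between list items could look roughly like this (a hedged sketch; `bin/mdclip` may use different rules):

```python
import re

# Matches a bullet ("-", "*", "+") or numbered ("1.", "2)") list item line.
_ITEM = re.compile(r"^\s*(?:[-*+]|\d+[.)])\s")


def compact_list_items(md: str) -> str:
    """Drop blank lines sandwiched between two Markdown list items.

    Illustrative sketch of the "compact Markdown" cleanup: a blank line
    between two list-item lines is removed so lists paste tightly in
    Obsidian; blank lines elsewhere are preserved.
    """
    lines = md.splitlines()
    out = []
    for i, line in enumerate(lines):
        between_items = (
            line.strip() == ""
            and 0 < i < len(lines) - 1
            and _ITEM.match(lines[i - 1])
            and _ITEM.match(lines[i + 1])
        )
        if not between_items:
            out.append(line)
    return "\n".join(out)
```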
Run all tests:

```bash
pytest -v
```

Targeted example:

```bash
pytest tests/test_docflow_server.py -q
```

You can expose the local intranet through a private VPN (for example, Tailscale).
