domingogallardo/docflow

docflow (Local Intranet)

docflow automates your personal document pipeline (Instapaper posts, podcasts, Markdown notes, PDFs, images, and tweets) and serves everything locally from BASE_DIR.

Podcast snippets are typically captured in Snipd and then exported into this pipeline.

Architecture overview

(Architecture diagram: docflow architecture)

What this repo does now

  • Single local source of truth: BASE_DIR (resolved from DOCFLOW_BASE_DIR, typically in ~/.docflow_env).
  • Single local server: python utils/docflow_server.py.
  • Static site output under BASE_DIR/_site.
  • Local state under BASE_DIR/state.
  • No remote deploy flow in this repository.

BASE_DIR Location (Important)

  • BASE_DIR is no longer hardcoded in config.py.
  • BASE_DIR now comes from environment variable DOCFLOW_BASE_DIR.
  • Canonical place to set it: ~/.docflow_env.
  • If DOCFLOW_BASE_DIR is missing, importing config.py fails with a clear error.
  • For direct commands from this repo, load your environment first:
source ~/.docflow_env
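
The environment-based resolution can be pictured with a minimal sketch (the function name resolve_base_dir and the exact error message are illustrative, not the repo's actual config.py code):

```python
import os
from pathlib import Path

def resolve_base_dir() -> Path:
    # Illustrative sketch of how config.py resolves BASE_DIR from the
    # DOCFLOW_BASE_DIR environment variable and fails fast when it is unset.
    value = os.environ.get("DOCFLOW_BASE_DIR")
    if not value:
        raise RuntimeError(
            "DOCFLOW_BASE_DIR is not set; add it to ~/.docflow_env and "
            "run `source ~/.docflow_env` before docflow commands."
        )
    return Path(value).expanduser()
```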

Recommended ~/.docflow_env snippet:

export DOCFLOW_BASE_DIR="/path/to/BASE_DIR"
export INTRANET_BASE_DIR="$DOCFLOW_BASE_DIR"
export HIGHLIGHTS_DAILY_DIR="/path/to/Obsidian/Subrayados"
export DONE_LINKS_FILE="/path/to/Obsidian/Leidos.md"

Main folders

BASE_DIR is expected to contain:

  • Incoming/
  • Posts/Posts <YEAR>/
  • Tweets/Tweets <YEAR>/
  • Podcasts/Podcasts <YEAR>/
  • Pdfs/Pdfs <YEAR>/
  • Images/Images <YEAR>/
  • _site/ (generated)
  • state/ (generated)

Requirements

  • Python 3.10+
  • Core dependencies:
pip install requests beautifulsoup4 markdownify openai pillow pytest markdown

Optional for X likes queue:

pip install "playwright>=1.55"
playwright install chromium

Quick start

  1. Configure environment variables (as needed):
export OPENAI_API_KEY=...
export INSTAPAPER_USERNAME=...
export INSTAPAPER_PASSWORD=...
export DOCFLOW_BASE_DIR="/path/to/BASE_DIR"
export TWEET_LIKES_STATE="$HOME/.secrets/docflow/x_state.json"
export TWEET_LIKES_URL="https://x.com/<user>/likes"
export TWEET_LIKES_MAX=50
export HIGHLIGHTS_DAILY_DIR="/path/to/Obsidian/Subrayados"
export DONE_LINKS_FILE="/path/to/Obsidian/Leidos.md"

Keep TWEET_LIKES_STATE outside the repo so cleanup operations do not delete it.

  2. Run the processing pipeline:
python process_documents.py all --year 2026
  3. Build local intranet pages:
python utils/build_browse_index.py --base-dir "$DOCFLOW_BASE_DIR"
python utils/build_reading_index.py --base-dir "$DOCFLOW_BASE_DIR"
python utils/build_working_index.py --base-dir "$DOCFLOW_BASE_DIR"
python utils/build_done_index.py --base-dir "$DOCFLOW_BASE_DIR"
  4. Run local server:
source ~/.docflow_env
python utils/docflow_server.py --base-dir "$DOCFLOW_BASE_DIR" --host localhost --port 8080

Optional full rebuild at startup:

source ~/.docflow_env
python utils/docflow_server.py --base-dir "$DOCFLOW_BASE_DIR" --rebuild-on-start

Preferred day-to-day usage is the LaunchAgent-managed intranet service:

launchctl kickstart -k "gui/$(id -u)/com.domingo.docflow.intranet"

Useful service commands:

launchctl print "gui/$(id -u)/com.domingo.docflow.intranet" | rg 'state =|pid =|last exit code ='
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8080/

If the agent is not loaded yet in a new environment:

launchctl bootstrap "gui/$(id -u)" ~/Library/LaunchAgents/com.domingo.docflow.intranet.plist

Full document ingestion runner (bin/docflow.sh)

Use this command to download/process all document types:

bash bin/docflow.sh all

Behavior:

  • Loads ~/.docflow_env if present.
  • Runs process_documents.py with your arguments (all for full ingestion).
  • Rebuilds intranet browse/reading/working/done pages (utils/build_browse_index.py, utils/build_reading_index.py, utils/build_working_index.py, and utils/build_done_index.py) when processing succeeds.

Optional override:

INTRANET_BASE_DIR="/path/to/base" bash bin/docflow.sh all

Daily tweet consolidation runner (bin/docflow_tweet_daily.sh)

Use this dedicated process for daily tweet consolidation:

bash bin/docflow_tweet_daily.sh

Behavior:

  • Loads ~/.docflow_env if present.
  • Runs bin/build_tweet_consolidated.sh --yesterday.
  • Rebuilds intranet browse/reading/working/done pages when consolidation succeeds.

Intranet server API

utils/docflow_server.py serves:

  • Static files from BASE_DIR/_site
  • Raw files from BASE_DIR routes (/posts/raw/..., /tweets/raw/..., etc.)
  • browse list default ordering: by file recency (items in Reading/Working/Done are hidden from browse)
  • browse pages include a top Highlights first toggle to prioritize highlighted items
  • reading list ordering: by reading_at (oldest first)
  • working list ordering: by working_at (newest first)
  • done list ordering: by done_at (newest first)
  • to-done can be triggered from Browse, Reading, or Working, and preserves stage start metadata in state/done.json when available (reading_started_at, working_started_at)
  • JSON API actions:
    • POST /api/to-reading
    • POST /api/to-working
    • POST /api/to-done
    • POST /api/to-browse
    • POST /api/reopen
    • POST /api/delete
    • POST /api/rebuild
    • POST /api/rebuild-file
    • GET /api/export-pdf?path=<rel_path>
    • GET /api/highlights?path=<rel_path>
    • PUT /api/highlights?path=<rel_path>

If DONE_LINKS_FILE is set, each POST /api/to-done transition appends a Markdown link entry to that file.
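
A minimal client for the POST actions could be sketched as follows (the `{"path": ...}` payload field and the JSON content type are assumptions; check utils/docflow_server.py for the exact request schema):

```python
import json
from urllib import request

API = "http://localhost:8080"

def api_request(action: str, rel_path: str) -> request.Request:
    # Build a POST for one of the JSON API actions (to-reading, to-working,
    # to-done, ...). The payload field name "path" is an assumption about
    # the server's schema, not confirmed from the source.
    body = json.dumps({"path": rel_path}).encode("utf-8")
    return request.Request(
        f"{API}/api/{action}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Send with: urllib.request.urlopen(api_request("to-done", "Posts/Posts 2026/x.html"))
```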

Local state files

All state is stored under BASE_DIR/state/:

  • reading.json: per-path reading_at timestamp.
  • working.json: per-path working_at timestamp.
  • done.json: per-path done_at timestamp and optional transition metadata copied on to-done:
    • reading_started_at (from reading_at when moving from Reading to Done)
    • working_started_at (from working_at when moving from Working to Done)

These fields allow post-hoc lead-time calculations for completed items (for example done_at - working_started_at).
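
Assuming done.json maps each path to an object with ISO 8601 timestamps, such a lead-time calculation could be sketched as:

```python
import json
from datetime import datetime
from pathlib import Path

def lead_times(done_json: Path) -> dict[str, float]:
    # Hours from working_started_at to done_at, per path. The per-path
    # object layout and ISO 8601 timestamp format are assumptions about
    # state/done.json; inspect a real file to confirm.
    entries = json.loads(done_json.read_text())
    out: dict[str, float] = {}
    for path, meta in entries.items():
        started = meta.get("working_started_at")
        if started:
            delta = datetime.fromisoformat(meta["done_at"]) - datetime.fromisoformat(started)
            out[path] = delta.total_seconds() / 3600
    return out
```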

Tweet pipeline

  • Queue from likes feed:
python process_documents.py tweets
  • One-time browser state creation:
python utils/create_x_state.py --state-path "$HOME/.secrets/docflow/x_state.json"
  • Daily consolidated tweets helper:
bash bin/build_tweet_consolidated.sh
bash bin/build_tweet_consolidated.sh --day 2026-02-13
bash bin/build_tweet_consolidated.sh --all-days
bash bin/build_tweet_consolidated.sh --all-days --cleanup-existing

By default, daily grouping of tweet source files uses a 03:00 local rollover hour, so downloads made just after midnight are grouped with the previous day. Override with DOCFLOW_TWEET_DAY_ROLLOVER_HOUR (0-23) when needed.
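
The rollover rule can be sketched as follows (illustrative only; the actual logic lives in the consolidation helper):

```python
import os
from datetime import date, datetime, timedelta

def tweet_day(ts: datetime) -> date:
    # Timestamps before the rollover hour count toward the previous day,
    # so just-after-midnight downloads group with the prior evening.
    rollover = int(os.environ.get("DOCFLOW_TWEET_DAY_ROLLOVER_HOUR", "3"))
    if ts.hour < rollover:
        return (ts - timedelta(days=1)).date()
    return ts.date()
```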

--cleanup-existing removes only source tweet .html files for consolidated days and keeps source .md.

  • Daily highlights report helper:
python utils/build_daily_highlights_report.py --day 2026-02-13 --output "/tmp/highlights-2026-02-13.md"
  • Daily highlights report runner (previous day to Obsidian Subrayados):
bash bin/docflow_highlights_daily.sh

Clipboard Markdown helper (bin/mdclip)

Use this helper to clean rich content from the macOS clipboard and convert it to compact Markdown (optimized for Obsidian paste behavior):

bin/mdclip

Behavior:

  • Reads HTML from clipboard when available (pbpaste -Prefer html, then macOS pasteboard fallbacks).
  • Converts to Markdown and removes extra blank lines between list items.
  • Writes cleaned Markdown back to clipboard by default.
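
The blank-line cleanup between list items can be approximated with a small stdlib-only sketch (mdclip's actual implementation may differ):

```python
import re

# Matches a Markdown list item line followed by a blank line, when the
# next line is also a list item; the blank line is dropped so lists
# paste compactly into Obsidian.
_LOOSE_ITEM = re.compile(
    r"(^[ \t]*(?:[-*+]|\d+\.)\s.*)\n\n(?=[ \t]*(?:[-*+]|\d+\.)\s)",
    re.MULTILINE,
)

def tighten_lists(md: str) -> str:
    return _LOOSE_ITEM.sub(r"\1\n", md)
```

Blank lines between a paragraph and the first list item are left alone; only gaps between consecutive items are collapsed.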

Useful flags:

bin/mdclip --print
bin/mdclip --no-copy
bin/mdclip --from-stdin --no-copy --print < /path/to/input.html

Keyboard shortcut bindings (for example cmd+shift+L) are configured outside this repo (Shortcuts/automation tool). The versioned command to invoke is bin/mdclip.

Tests

Run all tests:

pytest -v

Targeted example:

pytest tests/test_docflow_server.py -q

Optional remote access

You can expose the local intranet through a private VPN (for example, Tailscale).
