Good—let’s turn this into a tight, execution-ready sprint. You’ll keep your Python pipeline, drop Sanity from Stablo, stand up a thin read API, add a robust fetcher, and ship a public site. I’ll lay out scope tiers, a workplan, infra choices with Apache, and acceptance criteria—plus the sharp edges you’ll hit.
- Target: automated news site live, with hourly ingestion → synthesis → publish → website render (ISR + revalidation).
- Plan B (acceptable): site live pulling from API, manual/limited publish path; automation runs for a subset of topics.
- Plan C (still useful): static site showing articles ingested from your legacy artifacts; API and fetcher wired; publishing manual.
- Pipeline (Python): ingestion/digest/synthesis, scraper, publisher.
- API (FastAPI): read endpoints; internal publish endpoint.
- Web (Next.js/Stablo): strip Sanity, connect to API, ISR + revalidation.
- Infra/Ops: Postgres + Meilisearch; Apache reverse proxy + TLS; systemd services and logs.
- DB: Postgres (Neon or local VM). Tables: `articles`, `citations`.
- Search: Meilisearch (or Typesense) for `/search` and fast filters.
- API: FastAPI (uvicorn) read endpoints + internal `POST /publish` (calls Next.js revalidate).
- Web: Next.js (Stablo UI without Sanity). ISR (e.g., 120s). `POST /api/revalidate`.
- Pipeline: your existing stages + (a) canonical URL resolver, (b) extractor (requests → trafilatura → Playwright fallback), (c) QA gates, (d) publisher writing to DB & search, then hitting revalidate.
- IDs & models
  - Add `stable_index_id(title, source, url)`; unify `digest_id_hour=YYYYMMDDTHH`.
  - Introduce Pydantic models: `ScrapeRecord`, `ArticleDraft`, `Article`.
DB & search
-
Postgres schema:
create table articles ( article_id text primary key, slug text unique not null, status text not null check (status in ('draft','ready','published','archived')), lang text not null, headline text not null, dek text, body_html text not null, topic text not null, tags text[] not null default '{}', citations jsonb not null default '[]', first_seen_at timestamptz not null, published_at timestamptz, cluster_id text, version int not null default 1, meta jsonb not null default '{}' ); create index on articles (status); create index on articles (topic, published_at desc); create index on articles using gin (tags);
-
Meilisearch index
articleswith fields:id, slug, headline, dek, topic, tags, published_at, lang.
-
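Keeping the search document in lockstep with the table is easiest through one projection function; a sketch (field mapping per the list above — pushing is then `client.index('articles').add_documents([...])` with the official `meilisearch` Python client):

```python
def to_search_doc(row: dict) -> dict:
    # Project an articles row onto the Meilisearch document shape.
    # Dates go out as ISO strings; Meilisearch has no datetime type.
    return {
        "id": row["article_id"],
        "slug": row["slug"],
        "headline": row["headline"],
        "dek": row.get("dek"),
        "topic": row["topic"],
        "tags": row.get("tags") or [],
        "published_at": row["published_at"].isoformat()
                        if row.get("published_at") else None,
        "lang": row["lang"],
    }
```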
- FastAPI
  - Public:
    - `GET /v1/articles?page=&limit=&topic=&tag=&lang=` → teasers + pagination
    - `GET /v1/articles/{slug}` → full article (only `status='published'`)
    - `GET /v1/search?q=&page=&limit=&topic=` → proxy to Meilisearch
  - Internal:
    - `POST /internal/articles` (upsert draft/ready)
    - `POST /internal/articles/{id}/publish` → set `published`, push to search, POST to Next `/api/revalidate` with `["/", "/tema/{topic}", "/articulo/{slug}"]`
- Publisher step (Python)
  - Take validated `ArticleDraft` → `Article` row → push to DB → push to search → call revalidate webhook.
  - Idempotent upsert by `slug`/`article_id`.
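The publish flow stays testable if the three side effects are injected rather than hard-wired; a sketch (function names are assumptions):

```python
from typing import Callable


def publish(article: dict,
            db_upsert: Callable[[dict], None],
            search_push: Callable[[dict], None],
            revalidate: Callable[[list[str]], None]) -> None:
    # Publisher step: DB upsert -> search push -> revalidate, in that
    # order, so search and the page cache never reference a row that
    # is not yet committed in Postgres.
    db_upsert(article)
    search_push(article)
    revalidate(["/",
                f"/tema/{article['topic']}",
                f"/articulo/{article['slug']}"])
```

In production `db_upsert` runs the `ON CONFLICT` upsert, `search_push` calls Meilisearch, and `revalidate` hits the Next webhook; in tests all three are stubs.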
- Strip Sanity
  - Remove `lib/sanity/*`, GROQ, Sanity deps.
  - Replace data calls with fetches to your API:
    - `/` → `GET /v1/articles`
    - `/articulo/[slug]` → `GET /v1/articles/{slug}`
    - `/tema/[topic]` → `GET /v1/articles?topic=...`
    - `/search` (basic) → `GET /v1/search?q=...`
  - Add ISR on pages (`export const revalidate = 120;`).
- Revalidation endpoint (Next)
  - `POST /api/revalidate` with `REVALIDATE_SECRET`; call `revalidatePath` for the paths in the payload.
- Fetcher
  - Resolve Google News redirect → `final_url`.
  - `requests` + `trafilatura` (fast path). If `main_text_len < threshold` / odd status → Playwright fallback (`page.content()` then extract).
  - Pydantic-validate to `ScrapeRecord`; write JSONL.
- QA gates
  - Input: `main_text_len ≥ 1200`, `lang in {'es','en'}`, domain allow-list, canonical present.
  - Draft: headline length ≤ 90, at least 2 distinct citations, no empty `<p>`.
  - Publish only if gates pass.
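The two gates can be plain predicates over the record/draft dicts; a sketch (the `canonical_url` field name and the citation-dict shape are assumptions):

```python
from urllib.parse import urlparse

ALLOWED_LANGS = {"es", "en"}
MIN_TEXT_LEN = 1200


def input_gate(rec: dict, domain_allowlist: set[str]) -> bool:
    # Input gate: length, language, domain allow-list, canonical present.
    host = urlparse(rec.get("final_url") or "").netloc
    return (len(rec.get("main_text", "")) >= MIN_TEXT_LEN
            and rec.get("lang") in ALLOWED_LANGS
            and host in domain_allowlist
            and bool(rec.get("canonical_url")))


def draft_gate(draft: dict) -> bool:
    # Draft gate: headline <= 90 chars, >= 2 distinct citations, no empty <p>.
    distinct = {c.get("url") for c in draft.get("citations", [])}
    return (len(draft.get("headline", "")) <= 90
            and len(distinct) >= 2
            and "<p></p>" not in draft.get("body_html", ""))
```

The publisher calls both and refuses to promote a draft unless each returns `True`, which also gives you the "reject a purposely bad draft" acceptance test for free.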
- Apache reverse proxy
  - You can keep Apache. Run Next.js and FastAPI as services, and proxy through Apache:

    ```apache
    # /etc/apache2/sites-available/news.conf
    <VirtualHost *:80>
      ServerName yourdomain.example
      ProxyPreserveHost On
      ProxyPass        /api http://127.0.0.1:8000
      ProxyPassReverse /api http://127.0.0.1:8000
      ProxyPass        / http://127.0.0.1:3000/
      ProxyPassReverse / http://127.0.0.1:3000/
    </VirtualHost>
    ```

  - Use Let's Encrypt for TLS; then switch to `<VirtualHost *:443>`.
  - If you can, prefer Vercel for Next.js and keep Apache for redirects; but the Apache proxy works.
- Systemd services
  - FastAPI (uvicorn) and Next.js (production build + `node server.js` or `next start`) supervised by systemd or PM2. Log to journald.
- Smoke tests
  - Publish one article via the internal endpoint → verify it appears on `/` and `/articulo/{slug}` within the ISR window or after revalidate.
  - Check `/search?q=palabra` returns the expected items.
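A small runner makes the smoke tests repeatable after every deploy; a sketch (the `/api/v1/search` path assumes the Apache `/api` proxy prefix, and `get` is `requests.get` in production, injected here so the checks are testable without a live site):

```python
def smoke_check(get, base: str, slug: str, query: str) -> dict[str, bool]:
    # Post-publish smoke checks: home page, article page, search.
    home = get(f"{base}/")
    article = get(f"{base}/articulo/{slug}")
    search = get(f"{base}/api/v1/search?q={query}")
    return {
        "home_ok": home.status_code == 200,
        "article_ok": article.status_code == 200 and slug in article.text,
        "search_ok": search.status_code == 200 and query in search.text,
    }
```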
- API: `GET /v1/articles`, `GET /v1/articles/{slug}`, `GET /v1/search` working against Postgres + search.
- Website: Stablo UI, no Sanity, pages render API data; ISR active; `/api/revalidate` secured and functional.
- Pipeline: scraper produces validated `ScrapeRecord`; publisher promotes draft → published, upserts DB+search, triggers revalidate.
- Ops: Apache proxy in front; TLS; systemd units; `.env` documented; basic runbook.
Pass if:
- Publishing a new article shows up within ≤2 minutes (ISR=120) or instantly after revalidate.
- Search returns that article by headline/keywords.
- At least one topic page paginates correctly.
- QA gate rejects a purposely short or citation-free draft.
Next.js service (systemd)

```ini
# /etc/systemd/system/next.service
[Unit]
Description=Next.js
After=network.target

[Service]
WorkingDirectory=/srv/news-web
Environment=NODE_ENV=production
Environment=NEWS_API=https://yourdomain.example/api
Environment=REVALIDATE_SECRET=...
ExecStart=/usr/bin/npm run start -- --port 3000
Restart=always
User=www-data
Group=www-data

[Install]
WantedBy=multi-user.target
```

FastAPI service (systemd)
```ini
# /etc/systemd/system/news-api.service
[Unit]
Description=News API (FastAPI)
After=network.target

[Service]
WorkingDirectory=/srv/news-api
Environment=DATABASE_URL=postgres://...
Environment=MEILI_URL=http://127.0.0.1:7700
Environment=MEILI_KEY=...
Environment=NEXT_REVALIDATE_URL=https://yourdomain.example/api/revalidate
Environment=REVALIDATE_SECRET=...
ExecStart=/usr/bin/uvicorn api:app --host 0.0.0.0 --port 8000 --workers 2
Restart=always
User=www-data
Group=www-data

[Install]
WantedBy=multi-user.target
```

Publisher revalidate call (Python)
```python
import os

import requests


def revalidate(paths: list[str]) -> None:
    url = os.environ["NEXT_REVALIDATE_URL"]
    secret = os.environ["REVALIDATE_SECRET"]
    resp = requests.post(url, json={"secret": secret, "paths": paths},
                         timeout=10)
    resp.raise_for_status()  # fail loudly so the publisher can retry/alert
```

Sharp edges:
- Apache-only hosting: Next.js needs Node; proxying through Apache is fine. If constraints bite (memory, PM2 flakiness), move the web tier to Vercel and keep the API behind your server.
- Playwright headless on server: install dependencies (Chromium). Keep it as fallback only to control CPU/RAM.
- ISR confusion: without revalidation you wait for cache to expire. Ensure publisher calls revalidate immediately on publish.
- Data drift: enforce Pydantic at write time; refuse invalid JSONL; log quarantined items.
Out of scope:
- Full editorial CMS, user auth, comments.
- Fancy personalization or per-user feeds.
- Multi-region deployment.
Get the pipeline publishing, the API serving, the site rendering, and the revalidation firing. That’s the kernel of a durable system.
If you want, I can produce a file-by-file checklist for the Stablo fork (which files to delete/edit) and a Docker Compose with Postgres + Meilisearch to get local parity fast.