Data Layer Lineage Memory

This document is the durable operating memory for the textbook project. Before any data rebuild, deploy, rollback, or search-quality debugging, read this file first.

It records:

what data exists
what each count actually means
how source PDFs become OCR, DB, vectors, page images, and runtime assets
how Docker and VPS consume those assets
which external services are part of the real runtime path
which update steps are mandatory and must not be skipped

This file is intended to be the first stop before future updates so the project does not lose critical context between iterations.

Read-first rules

Before changing anything, answer these questions from this document:

Am I looking at local pending state or production current state?
Is the problem in primary DB, primary FAISS, supplemental page index, supplemental FAISS, page images, frontend rendering, or AI gateway?
Which count am I quoting:
- source PDFs
- cleaned PDFs
- primary searchable books
- page-image books
- supplemental manifest books
- visible /api/books books
- DB chunks
- primary vectors
- supplemental pages
- supplemental vectors
Did a mapping rule change happen without rebuilding supplemental vectors?
Did production receive only the DB while FAISS / supplemental assets stayed old?
Are frontend version markers, backend behavior, and GitHub docs aligned for the same release?
Is the result showing the right book identity, edition, and page-image source?
If a query is wrong, is it a data identity problem, a retrieval problem, or a UI labeling problem?
Has textbook_version_manifest.json been regenerated, and are unresolved_primary_books=0, duplicate_primary_identity_groups=0, and safe_merge_candidates=0?
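
The last question above can be checked mechanically rather than by eye. The following is a minimal sketch, assuming the three counters are top-level integer keys in textbook_version_manifest.json (the real nesting is not documented here, so the lookup may need adjusting). `manifest_gate_failures` and `load_manifest` are hypothetical helper names.

```python
import json
from pathlib import Path

# Counters that must all be zero before a release (names taken from the checklist).
REQUIRED_ZERO_KEYS = (
    "unresolved_primary_books",
    "duplicate_primary_identity_groups",
    "safe_merge_candidates",
)


def manifest_gate_failures(manifest: dict) -> list[str]:
    """Return one message per counter that is missing or non-zero.

    Assumption: the counters are top-level keys of the manifest dict.
    """
    failures = []
    for key in REQUIRED_ZERO_KEYS:
        value = manifest.get(key)
        if value != 0:
            failures.append(f"{key}={value!r} (expected 0)")
    return failures


def load_manifest(path: str) -> dict:
    """Read a manifest JSON file from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

An empty list means the gate passes; a missing counter is reported as a failure rather than silently treated as zero.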

Canonical states

Two states must always be tracked separately:

Production current: what sun.bdfz.net is actually serving
Local pending rollout: what has been rebuilt locally but not yet deployed

As of 2026-03-10:

Layer	Production current	Local pending rollout	Notes
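Main DB	`textbook_mineru_fts.db`, `21925` rows	same DB file currently in local `data/index/`	Only this DB can be synced at startup by `sync_db.py`, and only when `RUNTIME_DB_SYNC_MODE` is explicitly enabled; nothing auto-syncs by default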
Main dense vectors	`17896` vectors, loaded	same local primary FAISS	Physical DB filter is `source != 'gaokao'`, not `source='textbook'`
Supplemental page index	corrected build loaded: `books=176`, `primary_books=52`, `supplemental_only_books=124`, `pages=22844`	rebuilt product-scoped source: `books=175`, `primary_books=57`, `supplemental_only_books=118`, `pages=2843`, `source_pages=31170`, `unsupported_pages_omitted=17603`	Local pending rollout now keeps only the public support scope (`人教版全部` + `英语·北师大版` + `化学·鲁科版`) in the runtime supplemental corpus
Supplemental vectors	loaded on production: `22844` vectors, manifest present, health `loaded=true`	local rebuild verified: `2843` vectors against the scoped `2843`-page source	These assets still must be explicitly transported; GitHub deploy does not pull them automatically from local `data/index/`
Frontend version marker	`2026.03.10-r23`	local code is prepared for `2026.03.10-r27`	Frontend marker must move with the support-scope rollout so book labels and runtime behavior stay in sync

Never mix production current counts with local rebuilt counts in release notes or debugging conclusions.

Project topology

Main runtime project

GitHub/repo runtime project: platform/
Production container app:
- FastAPI backend under platform/backend/
- static frontend under platform/frontend/
Deploy workflow:
- GitHub Actions clones a fresh checkout on the VPS
- platform/scripts/deploy_vps.sh builds the image and cuts over

Important repository boundary:

platform/ is the GitHub-tracked deploy repo
the workspace-root scripts tree, /Users/ylsuen/textbook_ai_migration/scripts, contains local data-processing tooling and is not part of the platform/ Git repository
changes under the workspace-root scripts/ tree do not reach production through git push; they affect local rebuild capability only, unless separately copied or mirrored into a tracked repo
as of 2026-03-10, the workspace-root scripts scripts/pdf_to_pages.py and scripts/33_rebuild_mineru_chunks_from_content_list.py understand textbook_version_manifest.json schema v2 (by_content_id + by_book_key); future local page-image or chunk rebuilds should preserve that compatibility

External runtime dependencies

Production site: sun.bdfz.net
Image CDN: img.rdfzer.com
AI gateway canonical domain: ai.bdfz.net
AI gateway implementation currently routes to Cloudflare Worker service apis / production
Worker code lives outside this repo at:
- /Users/ylsuen/CF/upgrade_staging/apis/apis.js

Runtime host role

Production VPS hosts Docker runtime only
Production VPS is not the place to run OCR, MinerU, or FAISS rebuild jobs
Large rebuilds belong on a local or offline processing machine

Environment matrix

Local machine roles

code editing, data inspection, manifest checks, and lightweight validation
preferred Python environment for general project operations: /Users/ylsuen/.venv
current supplemental vector rebuild path uses a dedicated local environment:
- /Users/ylsuen/textbook_ai_migration/.venv-vector
- reason: isolate heavy sentence-transformers build execution on macOS and avoid the earlier local crash path

Local network environment

The primary workstation runs sing-box.app with TUN mode enabled.
Current confirmed route checks on 2026-03-10:
- route -n get github.com -> interface: utun8
- route -n get 23.19.231.173 -> interface: utun8
Practical consequence:
- large uploads and deploy traffic from the workstation are expected to traverse the sing-box TUN path
- when a transfer is unexpectedly slow, verify route first before assuming application-layer failure
- do not hardcode one artifact path forever; benchmark the available paths for the current session before choosing
- candidate paths for large runtime artifacts:
  - direct workstation -> VPS over SSH / scp / rsync
  - workstation -> R2, then VPS -> curl
- current measured outcome on 2026-03-10:
  - direct SSH upload to VPS over utun8 was slower and less stable
  - R2 -> VPS curl was the more reliable choice for this rollout
Route and process retrieval points:
- ps -axo pid,etime,command | rg 'sing-box|singbox|tun'
- ifconfig | rg -n 'utun|tun'
- route -n get github.com
- route -n get 23.19.231.173

Production runtime environment

base image: python:3.13-slim
single-container FastAPI runtime
host-mounted:
- /data
- /state
shared HF cache root inside container environment:
- /state/cache/huggingface/hub

Environment rule

Do not assume that a local artifact exists on the VPS just because local code can see it. Local data/index/ and production /data/index/ are different asset stores. Also do not assume that an upload problem is a code problem before confirming whether the transfer path is using the expected sing-box TUN route.

Identity model

The project has multiple identity layers. Mixing them causes bad mappings and search corruption.

Source identity

subject
normalized title
edition
content_id when available
file path lineage

content_id is the strongest identity signal and should win whenever present.

Runtime book identity

book_key
- primary books use stable SmartEdu-derived book keys
- supplemental-only books use synthetic suppbook:*
display_title
- user-facing title with edition or disambiguating suffix
short_key
- only for books that have page-image mapping in book_map.json

Page identity

physical page index: zero-based page used by R2 page images
logical page number: printed page number in the book, when known
page_num in runtime/page-image code should be treated as physical page index unless a separate logical-page field is explicitly present

Chunk identity

DB chunks.id: physical row identity
DB source: current physical values are mineru and gaokao
Runtime “textbook” is a logical concept built on source != 'gaokao'
Analytics helper tables may use logical labels instead of physical DB labels
- for example, keyword_counts.source currently uses textbook / gaokao

Important: when debugging DB counts, do not filter local rows by source='textbook'. The current DB stores textbook rows as source='mineru'.
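
The difference between the logical and physical filter is easy to demonstrate. This is a minimal sketch against an in-memory stand-in; the two-column `chunks` schema is illustrative only, and `count_rows_by_scope` and `demo_connection` are hypothetical helper names.

```python
import sqlite3

# Logical "textbook" rows are everything that is not gaokao.
TEXTBOOK_FILTER = "source != 'gaokao'"


def count_rows_by_scope(conn: sqlite3.Connection) -> dict:
    """Count total, textbook-runtime, and gaokao rows in the chunks table."""
    total = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
    textbook = conn.execute(
        f"SELECT COUNT(*) FROM chunks WHERE {TEXTBOOK_FILTER}"
    ).fetchone()[0]
    gaokao = conn.execute(
        "SELECT COUNT(*) FROM chunks WHERE source = 'gaokao'"
    ).fetchone()[0]
    return {"total": total, "textbook": textbook, "gaokao": gaokao}


def demo_connection() -> sqlite3.Connection:
    """Build a tiny in-memory stand-in for the real DB (illustrative schema only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, source TEXT)")
    conn.executemany(
        "INSERT INTO chunks (source) VALUES (?)",
        [("mineru",), ("mineru",), ("gaokao",)],
    )
    return conn


counts = count_rows_by_scope(demo_connection())
# The tempting but wrong filter matches nothing, because no row is labeled 'textbook'.
wrong = demo_connection().execute(
    "SELECT COUNT(*) FROM chunks WHERE source = 'textbook'"
).fetchone()[0]
```

Against the real DB, the same function should be pointed at a read-only connection to data/index/textbook_mineru_fts.db.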

Canonical directories and current local counts

All counts below are current local filesystem counts and are not interchangeable.

Source and preprocessing trees

Directory	Meaning	Current count / size
`data/raw_pdf`	source PDFs downloaded from textbook acquisition flows	`53` recursive PDFs, about `5.0G`
`data/clean_pdfs`	curated / normalized PDFs used for later processing	`63` PDFs, about `6.4G`
`data/parsed`	earlier Markdown and JSON parsing output	`315` Markdown + `315` JSON, about `42G`
`data/mineru_output`	primary OCR corpus for main searchable books	`69` Markdown + `207` JSON + embedded PDFs, about `23G`
`data/mineru_output_backup`	backup OCR corpus for supplemental page-level recall	`253` Markdown + `759` JSON + embedded PDFs, about `80G`

Notes:

Recursive PDF counts inside mineru_output and mineru_output_backup are not book counts. These trees include embedded PDFs and processing artifacts. Use Markdown-file count or manifest counts as the book proxy there.
The public textbook download library count mentioned elsewhere, such as “316 textbooks”, is not the same thing as the current local actively processed runtime corpus. Treat that as a separate public-library metric and re-verify independently before using it publicly.

Runtime-oriented local artifacts

Artifact	Meaning	Current local status
`data/index/textbook_mineru_fts.db`	main runtime DB	about `56M`
`data/index/textbook_chunks.index`	primary FAISS	about `70M`, `17896` vectors
`data/index/textbook_chunks.manifest.json`	primary FAISS manifest	present
`platform/backend/supplemental_textbook_pages.jsonl.gz`	supplemental page index source bundled in repo	about `1.8M`, `2843` searchable rows (`31170` merged source pages before omission/dedupe)
`platform/backend/supplemental_textbook_pages.manifest.json`	supplemental page manifest bundled in repo	about `110K`
`data/index/supplemental_textbook_pages.index`	supplemental FAISS target path	about `11M`, verified against the scoped `2843`-page source
`data/index/supplemental_textbook_pages.vector.manifest.json`	supplemental FAISS manifest	present, verify required before release

Other data directories that matter operationally:

data/dict_pages
- dictionary page images staged for R2
data/gaokao_raw
data/gaokao_scraped
data/gaokao_exam_images
data/_img_tmp
- temporary page-image working tree

Canonical runtime asset ledger

Use this ledger before every upload, sync, or rollback. Do not transfer a runtime asset until its path, size, and SHA256 have been matched against this table or intentionally refreshed.

Local canonical release artifacts

Relative path	Role	Size (bytes)	SHA256
`data/index/textbook_mineru_fts.db`	main runtime DB	`58892288`	`5a92fff4f33c4891a7b6916ce26eda69b413c8a3f852e1b8687c70e75fa45c71`
`data/index/textbook_chunks.index`	primary FAISS index	`73445274`	`2c5a5aa221c6e42ae0e3ca6e841c1a8dbe7b40fba606d5cf2345e59eccde0331`
`data/index/textbook_chunks.manifest.json`	primary FAISS manifest	`891`	`394d69870d116106fdcf7a5f17af9aa0275139340c41a8b029bb7a43f1664155`
`platform/backend/supplemental_textbook_pages.jsonl.gz`	bundled supplemental page source	`1889049`	`16937cbe0a7034ccc40b29e7db65fc12f3db5ecf0c0f29cfd431c07e9a75344e`
`platform/backend/supplemental_textbook_pages.manifest.json`	bundled supplemental page manifest	`111821`	`bd071850079a96976ad6b495c91143ae8fca0bd1f6dd70bc41d6066cb72f2a9c`
`data/index/supplemental_textbook_pages.index`	supplemental FAISS index	`11644973`	`879a02d544e999bfc31813eab76bbd5bf1b8b91a7ec70fb3f4cd65e2d2c5f4ca`
`data/index/supplemental_textbook_pages.vector.manifest.json`	supplemental FAISS manifest	`717`	`f2fbbbb91988d58f891d90ab8677c7a05732de3308bfde7a1222c53e8bc9425f`
`platform/frontend/assets/pages/book_map.json`	page-image identity map	`40261`	`b005aabe1a5f5ce3311fc849999de54db3ae3bf2b0afb10cc2921b7aeba7485d`
`platform/backend/textbook_version_manifest.json`	version-label manifest	`72847`	`674494d6de4c0acca0e4e9a2f3c265e80d8865bfd4167f6ea9a021e934c88c93`
`platform/frontend/assets/version.json`	public frontend version ledger	`6303`	`5369c2d567f9223647d5a34857d2098fbab25c3c0ce5a6007a235afb74c0bc74`

Important DB note:

textbook_mineru_fts.db is a mutable runtime file because local smoke tests and production traffic can append telemetry tables and log rows.
File-level SHA256 for this DB is therefore not a stable content identity signal by itself.
For release verification, pair the DB file SHA with a stable runtime identity fingerprint derived from the search-critical content tables plus FTS shadow table counts, while excluding mutable telemetry tables such as search_logs and ai_chat_logs.
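
One way to derive such a fingerprint is sketched below. It is a simplification, not the project's actual method: it hashes only (table name, row count) pairs for every table except the telemetry tables named above, and does not hash row contents. `runtime_fingerprint` is a hypothetical helper name.

```python
import hashlib
import sqlite3

# Telemetry tables named above; they change with traffic and must not affect the fingerprint.
MUTABLE_TABLES = {"search_logs", "ai_chat_logs"}


def runtime_fingerprint(conn: sqlite3.Connection) -> str:
    """Hash (table name, row count) pairs for every non-telemetry table.

    Simplification: only row counts are hashed, not row contents.
    """
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
        if row[0] not in MUTABLE_TABLES and not row[0].startswith("sqlite_")
    ]
    digest = hashlib.sha256()
    for name in names:
        # Names come from sqlite_master, so double-quoting them is sufficient here.
        count = conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        digest.update(f"{name}={count}\n".encode("utf-8"))
    return digest.hexdigest()
```

Because FTS shadow tables are listed in sqlite_master as ordinary tables, their counts are included automatically. Appending rows to search_logs leaves the fingerprint unchanged, while any change to a content table's row count alters it.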

Production runtime destinations

These are the runtime destinations that must match the intended release artifact set:

/root/cross-subject-knowledge/data/index/textbook_mineru_fts.db
/root/cross-subject-knowledge/data/index/textbook_chunks.index
/root/cross-subject-knowledge/data/index/textbook_chunks.manifest.json
/root/cross-subject-knowledge/data/index/supplemental_textbook_pages.jsonl.gz
/root/cross-subject-knowledge/data/index/supplemental_textbook_pages.manifest.json
/root/cross-subject-knowledge/data/index/supplemental_textbook_pages.index
/root/cross-subject-knowledge/data/index/supplemental_textbook_pages.vector.manifest.json

Artifact verification rule

Before any transfer:

confirm the exact source path you are about to copy
compute its size and SHA256
compare that output to this ledger or to a deliberately updated replacement ledger
only then copy it to R2 or directly to the VPS
after the remote copy completes, recompute remote size and SHA256 before cutover

This rule exists because a Git checkout, a local build output, and a repo-bundled fallback file may have the same filename while representing different release states.
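
The size and SHA256 steps can be scripted so that the same check runs locally and on the remote host. This is a minimal sketch using only the standard library; `file_identity` and `matches_ledger` are hypothetical helper names, and the expected values are meant to be copied from the ledger table.

```python
import hashlib


def file_identity(path: str, chunk_size: int = 1 << 20) -> tuple[int, str]:
    """Return (size in bytes, SHA256 hex digest), reading in chunks to bound memory."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()


def matches_ledger(path: str, expected_size: int, expected_sha256: str) -> bool:
    """True only when both the size and the digest equal the ledger entry."""
    size, sha256 = file_identity(path)
    return size == expected_size and sha256 == expected_sha256.lower()
```

For example, `matches_ledger("data/index/textbook_chunks.index", 73445274, "2c5a5aa2...")` with the full digest from the ledger should return True before the file is copied anywhere.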

Current-vector verification note for this round:

platform/scripts/build_supplemental_vector_index.py verify has passed against the current platform/backend/supplemental_textbook_pages.jsonl.gz
verified result: 2843 vector rows, 1024 dimensions, fingerprint 5d2ecfc643d22aa32026b4f94dba14dd320d797c3f1408f7d6c9fb8768886948

Current corpus counts and relationships

Main runtime corpus

Local main DB facts:

DB total rows: 21925
textbook-runtime rows: 17896
- physical DB filter: source != 'gaokao'
gaokao rows: 4029
distinct primary textbook book_keys in DB: 69
distinct gaokao book_keys: 416
books in platform/frontend/assets/pages/book_map.json: 69
platform/backend/textbook_version_manifest.json schema: 2
version manifest by_book_key entries: 69
version manifest by_content_id entries: 33
unresolved primary editions in version manifest: 0
duplicate primary identities in version manifest: 0
remaining safe merge candidates after rebuild: 0

Main DB subject row counts for source != 'gaokao':

数学: 5116
英语: 3580
化学: 1736
物理: 1629
思想政治: 1250
语文: 1228
地理: 1153
生物学: 1151
历史: 1053

Do not confuse:

distinct books in DB
books with page-image maps
books in the version manifest
books with real content_id entries in the version manifest

Those are four different sets.

Supplemental page corpus

Corrected local supplemental manifest facts:

indexed source files: 251 / 251
manifest books: 175
searchable runtime pages: 2843
merged source pages before omission/dedupe: 31170
books safely merged back to primary book_key: 57
supplemental-only books in the identity manifest: 118
supported supplemental-only visible books in the current public release: 27
primary-bound duplicate pages omitted from runtime search: 10724
unsupported-version supplemental pages omitted from runtime search: 17603
duplicate OCR pages collapsed inside the release-scoped supplemental corpus: 0
unresolved books: 0
unresolved pages: 0
edition conflicts: 0
remaining same-identity cross-source conflicts: 0 (checked by audit, and future rebuilds should fail if this rises above 0)
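
The page figures above reconcile arithmetically, which gives a quick sanity check after any rebuild. The variable names below are illustrative; the values are the local pending-rollout figures from this list.

```python
# Local pending-rollout figures from the supplemental manifest facts above.
merged_source_pages = 31170
primary_bound_duplicates_omitted = 10724
unsupported_version_pages_omitted = 17603
collapsed_ocr_duplicates = 0

# Searchable pages are what remains after every omission category is removed.
searchable_pages = (
    merged_source_pages
    - primary_bound_duplicates_omitted
    - unsupported_version_pages_omitted
    - collapsed_ocr_duplicates
)
```

If a future rebuild changes any one of these figures, `searchable_pages` should still equal the manifest's searchable runtime page count (currently 2843).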

Supplemental page row facts:

total searchable rows: 2843
searchable rows with content_id: 2843
searchable rows without content_id: 0
searchable rows bound to primary page images: 0
searchable rows carrying an explicit book_map_key field: 0
empty-text rows: 0

Supplemental-only source PDF coverage facts:

books with *_origin.pdf: 27 / 27
books with *_layout.pdf: 27 / 27
books with *_span.pdf: 27 / 27
all 27 supported supplemental-only books in the current public release have generated page-image products and valid book_map.json entries
release rule: do not remap unsupported parallel editions to a supported primary book merely to surface a page image
future expansion note: unsupported parallel editions remain preserved in the audit data layer and can be productized later as a separate page-image rollout

Supplemental page counts by subject:

英语: 869
物理: 767
地理: 714
化学: 367
生物学: 126

Subjects intentionally absent from searchable supplemental rows after identity cleanup:

数学, 语文, 历史, and 思想政治 contribute 0 searchable supplemental rows in the current public release scope

Supplemental manifest book counts by subject:

英语: 42
数学: 37
地理: 31
物理: 17
化学: 15
生物学: 15
思想政治: 8
历史: 5
语文: 5

Supplemental manifest edition distribution highlights:

人教版: 48
沪教版: 19
沪科版: 16
苏教版: 14
中图版: 10
湘教版: 9
北师大版: 8
鄂教版: 7
B版: 7
上外教版: 7
重大版: 7

Relationship rule that must stay explicit:

supplemental manifest books: 175
supplemental books merged into primary: 57
supplemental-only books in the identity manifest: 118
supported supplemental-only books currently released: 27
visible /api/books total in the current public release: 35 + 27 = 62

Never report 175 or 118 as the current public /api/books total.
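
The book-level relationships can be asserted the same way. This sketch uses the local pending-rollout figures and the per-subject manifest breakdown listed above; `relationship_errors` is a hypothetical helper name.

```python
# Local pending-rollout book figures quoted above.
manifest_books = 175
merged_into_primary = 57
supplemental_only = 118
released_primary = 35
released_supplemental_only = 27

# Supplemental manifest book counts by subject, as listed above.
books_by_subject = {
    "英语": 42, "数学": 37, "地理": 31, "物理": 17, "化学": 15,
    "生物学": 15, "思想政治": 8, "历史": 5, "语文": 5,
}

# The only number that may be quoted as the public /api/books total.
visible_api_books = released_primary + released_supplemental_only


def relationship_errors() -> list[str]:
    """Return a message for each relationship that no longer holds."""
    errors = []
    if merged_into_primary + supplemental_only != manifest_books:
        errors.append("merged + supplemental-only != manifest books")
    if sum(books_by_subject.values()) != manifest_books:
        errors.append("per-subject counts do not sum to manifest books")
    if released_supplemental_only > supplemental_only:
        errors.append("released supplemental-only exceeds identity manifest")
    return errors
```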

Count and terminology audit points

These are the recurring counting-basis (caliber) problems that must be checked every round.

Non-interchangeable counts

The following must never be conflated:

source PDF count
cleaned PDF count
primary OCR Markdown count
backup OCR Markdown count
primary searchable books in the DB
books with page-image mapping in book_map.json
books covered by textbook_version_manifest.json.by_book_key
books covered by textbook_version_manifest.json.by_content_id
supplemental manifest books
supplemental-only visible books
visible /api/books total
DB textbook-runtime rows
primary FAISS vector rows
supplemental page rows
supplemental FAISS vector rows

Primary corpus wording

When describing the main searchable textbook corpus:

logical runtime wording: “textbook corpus” or “non-gaokao textbook rows”
physical DB filter: source != 'gaokao'
physical row label in chunks: mineru

Do not write “source='textbook' in the main DB” unless the DB schema actually changes to that.

Analytics wording

For analytics helper tables, verify the table’s own semantics first.

keyword_counts.source currently uses logical labels:
- textbook
- gaokao
these logical labels do not match the physical chunks.source labels (mineru / gaokao)

Book totals wording

When describing books:

69 = all primary books present in the runtime DB and the page-image registry
35 = supported primary books in the current public release
175 = corrected supplemental manifest books after identity audit
118 = supplemental-only books in the full identity manifest after removing the 57 books merged back into primary identities
27 = supported supplemental-only books in the current public release
62 = expected visible /api/books total in the current public release

Production-vs-local wording

Before writing any status update, release note, or GitHub summary:

explicitly say whether the number is:
- production current
- local pending rollout

Never silently switch between the two.

Data lineage: source to runtime

flowchart TD
    A["Source PDFs / acquisition metadata"] --> B["clean_pdfs / curated inputs"]
    B --> C["parsed / early md+json output"]
    B --> D["MinerU primary OCR -> data/mineru_output"]
    B --> E["MinerU backup OCR -> data/mineru_output_backup"]
    D --> F["Unified DB build -> textbook_mineru_fts.db"]
    F --> G["Primary FAISS -> textbook_chunks.index"]
    D --> H["Page mapping -> book_map.json + R2 page images"]
    E --> I["Supplemental page build -> supplemental_textbook_pages.jsonl.gz"]
    I --> J["Supplemental FAISS -> supplemental_textbook_pages.index"]
    F --> K["FastAPI runtime"]
    G --> K
    I --> K
    J --> K
    H --> L["img.rdfzer.com page CDN"]
    K --> M["sun.bdfz.net"]
    N["ai.bdfz.net / Worker apis"] --> K

Stage 1: acquisition

Primary scripts:

scripts/01_download_textbooks.sh
scripts/01_download_textbooks_via_images.py

Outputs:

data/raw_pdf

Stage 2: early parsing and OCR intermediates

Primary scripts:

scripts/02_pdf_to_md.py
scripts/06_ocr_pages_to_jsonl.py
scripts/07_ocr_fullpage.py
scripts/08_mineru_batch.py

Outputs:

data/parsed
data/mineru_output
data/mineru_output_backup
data/index/mineru_chunks.jsonl and related intermediate JSONL files

Stage 3: main search DB and concept data

Primary scripts:

scripts/09_build_unified_index.py
scripts/19_build_concept_map.py

Outputs:

data/index/textbook_mineru_fts.db

Stage 4: primary dense vectors

Primary script:

scripts/21_build_vector_index.py

Outputs:

data/index/textbook_chunks.index
data/index/textbook_chunks.manifest.json

Stage 5: page mapping and page-image delivery

Primary scripts:

scripts/31_generate_page_maps.py
scripts/32_apply_page_mapping.py
scripts/upload_pages_r2.py

Outputs:

platform/frontend/assets/pages/book_map.json
platform/frontend/assets/pages/{short_key}/p{N}.webp
R2 pages/{short_key}/p{N}.webp

Current operational boundary:

the current page-image product covers the 69 primary books in book_map.json
it does not yet cover the full set of 118 supplemental-only books in the identity manifest (118 is the identity-manifest count, not the visible /api/books count; only 27 are visible in the current public release), even though those books already have origin/layout/span PDFs in data/mineru_output_backup
therefore, a missing 查看原文 ("view original") link on a supplemental-only result should currently be interpreted as "page-image product not generated for this edition yet", not "the OCR text was mapped to the wrong primary book"

Stage 6: supplemental page index

Primary script:

platform/scripts/build_supplemental_textbook_index.py

Inputs:

data/mineru_output_backup
data/index/textbook_mineru_fts.db
platform/frontend/assets/pages/book_map.json

Outputs:

platform/backend/supplemental_textbook_pages.jsonl.gz
platform/backend/supplemental_textbook_pages.manifest.json

Current safe mapping rules:

prefer direct content_id match
require edition consistency if edition_hint exists
if no edition hint exists, allow title-based match only when (subject, normalized title) resolves uniquely
otherwise generate independent suppbook:*

These rules exist to prevent cross-edition corruption while still allowing safe rebinding to primary page-image books.
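
The four rules above can be sketched as a single resolution function. This is one reading of the rules, not the code in build_supplemental_textbook_index.py: the edition check is applied to both content_id and title matches, title matches are refused when both sides carry different content_ids, and the `suppbook:` suffix scheme is made up for illustration. `PrimaryBook` and `resolve_book_key` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PrimaryBook:
    book_key: str
    subject: str
    normalized_title: str
    edition: str
    content_id: Optional[str] = None


def resolve_book_key(
    primaries: list,
    subject: str,
    normalized_title: str,
    content_id: Optional[str] = None,
    edition_hint: Optional[str] = None,
) -> str:
    """Return a primary book_key when binding is safe, else a synthetic suppbook:* key."""

    def edition_ok(book: PrimaryBook) -> bool:
        # Edition consistency is enforced only when a hint exists.
        return edition_hint is None or book.edition == edition_hint

    # Rule 1: prefer a direct content_id match.
    if content_id is not None:
        for book in primaries:
            if book.content_id == content_id and edition_ok(book):
                return book.book_key

    # Rules 2-3: title-based match, only when it resolves to exactly one book.
    candidates = [
        book
        for book in primaries
        if (book.subject, book.normalized_title) == (subject, normalized_title)
        and edition_ok(book)
        # Never bind across two different known content_ids.
        and not (content_id and book.content_id and book.content_id != content_id)
    ]
    if len(candidates) == 1:
        return candidates[0].book_key

    # Rule 4: otherwise mint an independent key (the suffix scheme here is made up).
    seed = "|".join([subject, normalized_title, edition_hint or "", content_id or ""])
    return "suppbook:" + hashlib.sha1(seed.encode("utf-8")).hexdigest()[:12]
```

The important property is the fallback: any ambiguity produces a separate suppbook:* identity instead of a guess, which is what keeps parallel editions from contaminating a primary book.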

Stage 7: supplemental dense vectors

Primary script:

platform/scripts/build_supplemental_vector_index.py

Outputs:

data/index/supplemental_textbook_pages.index
data/index/supplemental_textbook_pages.vector.manifest.json

Runtime rule:

if the supplemental page source fingerprint or source sha256 no longer matches the vector manifest, the supplemental FAISS must be treated as stale and disabled until rebuilt
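
A staleness check along these lines might look like the sketch below. The manifest key names `source_fingerprint` and `source_sha256` are assumptions; the real vector manifest may name them differently. `supplemental_vector_state` is a hypothetical helper name.

```python
def supplemental_vector_state(
    source_fingerprint: str,
    source_sha256: str,
    vector_manifest: dict,
) -> tuple[bool, str]:
    """Return (usable, reason); any mismatch marks the vector index as stale.

    Assumption: the vector manifest records the page source it was built from
    under the keys 'source_fingerprint' and 'source_sha256'.
    """
    if vector_manifest.get("source_fingerprint") != source_fingerprint:
        return False, "stale: source fingerprint mismatch"
    if vector_manifest.get("source_sha256") != source_sha256:
        return False, "stale: source sha256 mismatch"
    return True, "ok"
```

A missing key counts as a mismatch, so an incomplete manifest disables the vectors rather than enabling them by accident.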

Docker and runtime data contract

What is inside the image

platform/Dockerfile copies only:

backend/
frontend/
runtime Python dependencies

The image does not bake the heavy runtime data tree under data/.

Repo-bundled fallback assets currently inside platform/backend/:

supplemental_textbook_pages.jsonl.gz
supplemental_textbook_pages.manifest.json
optionally supplemental vector files if explicitly copied there for a release

What is runtime-mounted

Required host-mounted roots:

/data
/state

Required runtime assets:

/data/index/textbook_mineru_fts.db
/data/index/textbook_chunks.index
/data/index/textbook_chunks.manifest.json
/data/index/supplemental_textbook_pages.jsonl.gz
/data/index/supplemental_textbook_pages.manifest.json

Conditionally required runtime assets:

/data/index/supplemental_textbook_pages.index
/data/index/supplemental_textbook_pages.vector.manifest.json

Startup behavior from platform/backend/entrypoint.sh:

optionally run sync_db.py only when RUNTIME_DB_SYNC_MODE is explicitly enabled
run preflight.py
start uvicorn

The runtime does not rebuild FAISS or supplemental assets on the VPS.

DB drift rule

platform/backend/sync_db.py is now an explicit emergency path only. When manually enabled, it can sync only:

textbook_mineru_fts.db

By default, production startup keeps RUNTIME_DB_SYNC_MODE=disabled, so it does not auto-sync any runtime DB.

It never auto-syncs:

textbook_chunks.index
textbook_chunks.manifest.json
supplemental_textbook_pages.jsonl.gz
supplemental_textbook_pages.manifest.json
supplemental_textbook_pages.index
supplemental_textbook_pages.vector.manifest.json

Therefore the DB can move ahead while FAISS and supplemental assets stay old. This is one of the main failure modes of the current release model.

Cloudflare and image-storage contract

R2 / CDN naming

Page-image naming is part of the data contract.

Textbook page images:

local source: platform/frontend/assets/pages/{short_key}/p{N}.webp
remote R2/CDN path: pages/{short_key}/p{N}.webp
CDN base: https://img.rdfzer.com/pages/{short_key}/p{N}.webp

Dictionary page images:

remote protected dirs include:
- pages/dict_xuci/
- pages/dict_changyong/
- pages/dict_ciyuan/

Inline book-origin images shown in results:

https://img.rdfzer.com/orig/{urlencoded_book_key}/{filename}

Gaokao images:

https://img.rdfzer.com/gaokao/{filename}
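
The three URL shapes above can be expressed as small builders. This is a sketch under two assumptions: `N` is the zero-based physical page index described in the identity model, and the book key is percent-encoded with no characters left unescaped. The function names are hypothetical; only the path patterns come from this document.

```python
from urllib.parse import quote

# CDN host from this document; the constant name mirrors IMG_CDN in the backend.
IMG_CDN = "https://img.rdfzer.com"


def page_image_url(short_key: str, page_index: int) -> str:
    """Build a textbook page-image URL from a short_key and a physical page index."""
    if page_index < 0:
        raise ValueError("page_index is a zero-based physical index and cannot be negative")
    return f"{IMG_CDN}/pages/{short_key}/p{page_index}.webp"


def origin_image_url(book_key: str, filename: str) -> str:
    """Build an inline book-origin image URL; the book_key is percent-encoded."""
    # safe='' also escapes '/', so the key always stays a single path segment.
    return f"{IMG_CDN}/orig/{quote(book_key, safe='')}/{filename}"


def gaokao_image_url(filename: str) -> str:
    """Build a gaokao image URL."""
    return f"{IMG_CDN}/gaokao/{filename}"
```

Keeping these builders in one place makes it harder for the backend and frontend to drift apart on the path contract.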

R2 sync rule

scripts/upload_pages_r2.py stages textbook and dictionary page trees together before rclone sync.

Do not sync only textbook page roots to the pages/ prefix. That would delete remote dictionary assets.
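
A pre-sync guard can make this rule hard to violate. The sketch below assumes the staging tree mirrors the remote pages/ prefix, so the protected dictionary directories appear as direct children of the staged root; the directory list covers only the three named above and may not be exhaustive. `missing_protected_dirs` is a hypothetical helper name.

```python
from pathlib import Path

# Protected dictionary directories named in this document (possibly not exhaustive).
PROTECTED_DICT_DIRS = ("dict_xuci", "dict_changyong", "dict_ciyuan")


def missing_protected_dirs(staging_pages_root: str) -> list[str]:
    """Return protected directories that are absent or empty in the staging tree.

    A non-empty result means a sync to the pages/ prefix could delete the
    corresponding remote dictionary assets.
    """
    root = Path(staging_pages_root)
    missing = []
    for name in PROTECTED_DICT_DIRS:
        directory = root / name
        if not directory.is_dir() or not any(directory.iterdir()):
            missing.append(name)
    return missing
```

Aborting the upload whenever this list is non-empty is cheaper than restoring deleted dictionary pages.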

AI integration contract

Backend defaults in platform/backend/main.py currently point to:

AI service URL default: https://apis.bdfz.workers.dev/
AI label default: Gemini
AI model default: gemini-flash-latest

External worker implementation currently has its own defaults in:

/Users/ylsuen/CF/upgrade_staging/apis/apis.js

Current worker defaults include:

generic text/chat default model: gemini-3.1-flash-lite-preview
vision fallback model: gemini-flash-latest

This means model naming and default behavior must be checked on both sides when debugging AI output drift. Do not assume backend-requested model and worker-internal fallback are the same thing.

Parameter surfaces and retrieval points

These are the main parameter/control surfaces that must be checked before release. If a parameter changes, its retrieval point must be verified in the corresponding file or endpoint.

Runtime roots and filesystem parameters

Retrieval points:

platform/Dockerfile
- PROJECT_ROOT
- DATA_ROOT
- STATE_ROOT
- PORT
- HF_HOME
- SENTENCE_TRANSFORMERS_HOME
- TRANSFORMERS_CACHE
platform/backend/main.py
- local/runtime path resolution
- bundled vs runtime supplemental asset discovery
platform/backend/preflight.py
- required runtime assets
platform/backend/sync_db.py
- DB auto-sync source and target paths
platform/scripts/deploy_vps.sh
- runtime mount destinations and rollout gate behavior

Search and retrieval parameters

Retrieval points in platform/backend/main.py:

SQLITE_BUSY_TIMEOUT_MS
FAISS_SCORE_THRESHOLD
SUPPLEMENTAL_VECTOR_ENABLED
SUPPLEMENTAL_VECTOR_SCORE_THRESHOLD
QUERY_TERM_PLAN_LIMIT
SUPPLEMENTAL_FALLBACK_LIMIT
RERANKER_ENABLED
RERANKER_PRELOAD
RERANKER_MAX_CANDIDATES
RERANKER_FINAL_LIMIT
GRAPH_RAG_ENABLED
GRAPH_RAG_MAX_RELATIONS
evidence-span cache and semantic cache parameters

AI gateway parameters

Retrieval points:

platform/backend/main.py
- AI_SERVICE_URL
- AI_SERVICE_LABEL
- AI_SERVICE_MODEL
- AI_SERVICE_TIMEOUT_SEC
- AI_SERVICE_RETRIES
- AI_SERVICE_RETRY_DELAY_SEC
- AI_SERVICE_ORIGIN
- AI_SERVICE_REFERER
- AI_SERVICE_USER_AGENT
- AI_SERVICE_PROJECT
- AI_SERVICE_TASK_TYPE
- AI_SERVICE_THINKING_LEVEL
- AI_INTERNAL_TOKEN
external worker implementation:
- CF/upgrade_staging/apis/apis.js
- check worker default model and fallback model there

Supplemental data gates

Retrieval points:

platform/backend/main.py
- SUPPLEMENTAL_REQUIRED
- supplemental source fallback order
- supplemental vector source fallback order
platform/backend/preflight.py
- SUPPLEMENTAL_REQUIRED
- SUPPLEMENTAL_VECTOR_REQUIRED
platform/scripts/deploy_vps.sh
- SUPPLEMENTAL_VECTOR_BUNDLED
- HEALTH_REQUIRE_RERANKER
- HEALTH_REQUIRE_SUPPLEMENTAL_VECTOR

Frontend version and asset markers

Retrieval points:

platform/frontend/assets/version.json
- public version history and current version marker
platform/frontend/index.html
- cache-buster query strings for style.css and app.js
platform/frontend/assets/app.js
- IMG_CDN
- API request behavior and UI rendering assumptions

Image/CDN contract

Retrieval points:

platform/backend/main.py
- IMG_CDN
- /api/page-image
platform/frontend/assets/app.js
- textbook inline image path
- gaokao image path
scripts/upload_pages_r2.py
- textbook page upload roots
- dictionary protected roots
- final R2 path contract

Health and live runtime retrieval points

Retrieval points:

live endpoint: /api/health
live endpoint: /assets/version.json
live endpoint: /api/books
live endpoint: /api/search
live endpoint: /api/page-image

Important /api/health fields to check:

status
db.chunks
faiss.ok
faiss.vectors
faiss.manifest.vector_rows
model.ok
reranker.loaded
supplemental.ok
supplemental.source
supplemental.manifest.source_files_total
supplemental.manifest.source_files_indexed
supplemental.manifest.books
supplemental.manifest.primary_books
supplemental.manifest.supplemental_only_books
supplemental.manifest.pages
supplemental.manifest.unresolved_books
supplemental.manifest.unresolved_pages
supplemental.manifest.edition_conflicts
supplemental_vectors.enabled
supplemental_vectors.loaded
supplemental_vectors.vectors
supplemental_vectors.reason

Production runtime facts

Current production VPS facts already confirmed:

root filesystem: 99G total, about 42G available
/root/cross-subject-knowledge/data/index: about 244M
/root/cross-subject-knowledge/state/cache/huggingface/hub: about 5.4G
/var/lib/docker: about 2.7G
memory: 5.8Gi, available about 3.3Gi
swap: 0

Implications:

production has enough disk for the pending supplemental assets
production should still not be used for OCR or FAISS rebuilds
lack of swap means Docker build spikes or large model warmup should stay conservative

Release and deploy contract

Current deploy path

platform/.github/workflows/deploy.yml does this:

SSH into the VPS
create a fresh temporary release checkout
git clone --depth 1 --branch main ...
run platform/scripts/deploy_vps.sh

This means:

the VPS deploy does not see arbitrary local files unless they are committed to the repo or copied to the runtime host by a separate step
a locally built supplemental vector under local data/index/ will not magically appear on the VPS
docs-only pushes are now expected to be filtered by workflow paths-ignore; release pushes that should touch production must include runtime-affecting files

Mandatory asset groups

If the main DB changes, verify at least:

textbook_mineru_fts.db
textbook_chunks.index
textbook_chunks.manifest.json

If supplemental mapping or supplemental page source changes, rebuild and ship all of:

supplemental_textbook_pages.jsonl.gz
supplemental_textbook_pages.manifest.json
supplemental_textbook_pages.index
supplemental_textbook_pages.vector.manifest.json

If frontend presentation of new data behavior changes, update together:

frontend/index.html cache-buster
frontend/assets/version.json
frontend code
backend behavior
GitHub docs
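
The first two asset groups lend themselves to a completeness check before a release is cut. In this sketch the group names `main_db` and `supplemental` are illustrative labels, and `missing_assets` is a hypothetical helper; the file names are copied from the lists above.

```python
# Asset groups copied from the lists above; the group names are illustrative.
ASSET_GROUPS = {
    "main_db": (
        "textbook_mineru_fts.db",
        "textbook_chunks.index",
        "textbook_chunks.manifest.json",
    ),
    "supplemental": (
        "supplemental_textbook_pages.jsonl.gz",
        "supplemental_textbook_pages.manifest.json",
        "supplemental_textbook_pages.index",
        "supplemental_textbook_pages.vector.manifest.json",
    ),
}


def missing_assets(changed_groups: list[str], confirmed: set[str]) -> list[str]:
    """List every asset required by the changed groups that has not been confirmed.

    'confirmed' holds the file names already verified or shipped for this release.
    """
    missing = []
    for group in changed_groups:
        for asset in ASSET_GROUPS[group]:
            if asset not in confirmed:
                missing.append(asset)
    return missing
```

This catches the most common partial release: shipping a rebuilt page source without its matching vector index and vector manifest.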

Health gate rule

Do not treat /api/health status=ok alone as release success.

For releases expecting semantic supplemental recall, release validation must also confirm:

supplemental manifest loaded
supplemental vectors loaded
reranker loaded when rerank is required
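
A stricter gate than `status=ok` can be sketched as follows. It assumes the /api/health payload is nested JSON whose structure follows the dotted field names listed earlier, and that `supplemental.ok` indicates a loaded supplemental manifest; both are assumptions. `health_gate_failures` is a hypothetical helper name.

```python
def _lookup(payload: dict, dotted_path: str):
    """Walk a dotted path such as 'supplemental_vectors.loaded'; return None if absent."""
    node = payload
    for part in dotted_path.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def health_gate_failures(
    health: dict,
    require_supplemental_vectors: bool = True,
    require_reranker: bool = False,
) -> list[str]:
    """Return one message per health field that does not have its required value."""
    checks = [("status", "ok"), ("supplemental.ok", True)]
    if require_supplemental_vectors:
        checks.append(("supplemental_vectors.loaded", True))
    if require_reranker:
        checks.append(("reranker.loaded", True))

    failures = []
    for path, expected in checks:
        actual = _lookup(health, path)
        if actual != expected:
            failures.append(f"{path}={actual!r} (expected {expected!r})")
    return failures
```

Feeding it the parsed response from the live endpoint yields an empty list only when every required component is actually loaded.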

Mandatory pre-update review checklist

Before every rebuild, refactor, deploy, or rollback, check all of the following.

A. Working tree and release scope

git status --short
git rev-parse --short HEAD
identify whether the change touches:
- DB
- primary FAISS
- supplemental page index
- supplemental FAISS
- page-image mapping
- frontend version markers
- deploy scripts
- AI gateway defaults
identify whether the release is code-only, data-only, or mixed

B. Current local data state

Check or regenerate:

platform/backend/supplemental_textbook_pages.manifest.json
data/index/textbook_chunks.manifest.json
DB row counts from data/index/textbook_mineru_fts.db
platform/frontend/assets/pages/book_map.json
platform/backend/textbook_version_manifest.json
local supplemental vector manifest if the vector exists

Local count definitions to confirm explicitly:

primary DB books
primary vector rows
supplemental manifest books
supplemental merged-primary books
supplemental-only visible books
visible /api/books target total
unresolved books/pages
edition conflicts
content-id-missing supplemental books
blank-title duplicate groups
if page-image scope changed in the release, whether new supplemental-edition page images were actually regenerated locally rather than only relabeled in metadata

C. Current production state

Check live:

/api/health
/assets/version.json
/api/books
search regression queries on live production
if any large artifact was shipped separately, confirm the remote size and SHA256 against the intended release source before restart
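
The size and SHA256 comparison can be sketched as below. The helper names are hypothetical; the remote values are assumed to have been collected separately on the VPS, for example with sha256sum.

```python
import hashlib


def file_fingerprint(path, chunk_size=1024 * 1024):
    """Return (size_in_bytes, sha256_hex) for a file, reading it in chunks."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
            size += len(block)
    return size, digest.hexdigest()


def matches_release_source(local_path, remote_size, remote_sha256):
    """True only when both the remote size and the remote SHA256 match the local file."""
    size, sha = file_fingerprint(local_path)
    return size == int(remote_size) and sha == remote_sha256.strip().lower()
```

Chunked reading keeps memory flat for multi-gigabyte index files.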

At minimum, compare production current against local pending for:

DB row count
primary vector count
supplemental manifest counts
supplemental vector loaded state
frontend version marker
page-image scope:
- book_map.json book count
- whether any supplemental-only editions are intended to gain page images in this release
- whether that change is reflected both in local page assets and in R2/CDN
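
One way to make this comparison explicit is to collect both states into flat dictionaries and list every key that differs. This is a sketch only; the function name and the key names in the usage note are hypothetical.

```python
def state_drift(production, local):
    """Return {key: (production_value, local_value)} for every key whose values differ."""
    keys = sorted(set(production) | set(local))
    return {
        key: (production.get(key), local.get(key))
        for key in keys
        if production.get(key) != local.get(key)
    }
```

For example, comparing `{"db_rows": 21925, "supplemental_vectors_loaded": False}` against a local state where the vectors are loaded reports only the `supplemental_vectors_loaded` key, so identical layers stay out of the release discussion.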

D. Deploy-path feasibility

Before release, confirm whether the changed artifacts will actually reach the VPS.

Check:

platform/.github/workflows/deploy.yml
platform/scripts/deploy_vps.sh
benchmark the current-session transfer options for artifacts larger than about 50M
- direct workstation -> VPS over SSH / scp / rsync
- workstation -> R2, then VPS -> curl
- choose based on the current network route and measured throughput; do not hardcode R2 forever
source and destination paths for:
- supplemental page index
- supplemental page manifest
- supplemental vector index
- supplemental vector manifest
whether the changed asset is:
- committed into the repo checkout
- copied separately to the VPS runtime root
- or not transported at all

E. VPS capacity and runtime constraints

Before shipping large artifacts, re-check:

free disk
available memory
swap presence
data/index size
HF cache size
Docker storage pressure

F. Public-doc and version sync

Before release, confirm whether these need updating together:

platform/frontend/assets/version.json
platform/frontend/index.html
platform/README.md
platform/docs/runtime_operations_overview.md
platform/docs/data_layer_lineage_memory.md

G. AI and external-service alignment

Before release, confirm:

backend AI defaults
worker AI defaults
current canonical public AI domain
image CDN path contract
R2 upload path contract

Mandatory post-update verification checklist

After any update, run all relevant checks below.

A. Local build and syntax checks

Run the project-relevant checks, including at least:

python3 -m py_compile for changed Python modules
node --check platform/frontend/assets/app.js if frontend JS changed
bash -n platform/scripts/deploy_vps.sh if deploy script changed
git diff --check

B. Artifact integrity checks

If data changed, verify:

supplemental manifest values are the intended ones
supplemental vectors were rebuilt if the supplemental page source changed
supplemental vector verify passes against the current source
DB / primary FAISS manifest alignment still holds
book/page/image identity still matches the intended mapping rules

C. Live deployment checks

After release, check live:

/api/health
/assets/version.json
/api/books
representative /api/page-image result
representative AI chat path if AI behavior changed
supplemental_vectors.loaded=true if the release expects supplemental semantic recall

D. Search regression checks

After release, re-run at minimum:

潜热
海面蒸发潜热
极性
极性键
晶体的定义

For each query, verify:

no server error
relevant top results
correct subject behavior
correct edition/book identity
correct “view original” target
no meaningless character-split fallback results

E. Documentation and release-state checks

After release, confirm:

frontend version marker matches the release
cache-buster matches the release
docs reflect the right production-vs-local wording
GitHub-visible notes do not describe local-only assets as live
if the release was mixed code+data, both sides are reflected in the public docs

Suggested command entry points

These commands are the minimal retrieval entry points to pair with the checklists above. Run them from the repo root unless noted otherwise.

Local state

git status --short
git rev-parse --short HEAD

python3 -m py_compile platform/backend/main.py \
  platform/backend/preflight.py \
  platform/scripts/build_supplemental_textbook_index.py \
  platform/scripts/build_supplemental_vector_index.py
node --check platform/frontend/assets/app.js
bash -n platform/scripts/deploy_vps.sh
git diff --check

Local data/manifests

python3 - <<'PY'
import json, sqlite3
from pathlib import Path
root = Path('.').resolve()
man = json.loads((root / 'platform/backend/supplemental_textbook_pages.manifest.json').read_text())
print('supp_books', man.get('books'))
print('supp_pages', man.get('pages'))
print('supp_source_pages', man.get('source_pages'))
print('primary_books', man.get('primary_books'))
print('supp_only_books', man.get('supplemental_only_books'))
print('primary_bound_pages_omitted', man.get('primary_bound_pages_omitted'))
print('primary_bound_page_lookup_misses', man.get('primary_bound_page_lookup_misses'))
print('unresolved_books', man.get('unresolved_books'))
print('unresolved_pages', man.get('unresolved_pages'))
print('edition_conflicts', man.get('edition_conflicts'))
print('cross_source_identity_conflicts', man.get('cross_source_identity_conflicts'))
print('content_id_missing_books', man.get('content_id_missing_books'))
print('blank_title_duplicate_groups', man.get('blank_title_duplicate_groups'))
ver = json.loads((root / 'platform/backend/textbook_version_manifest.json').read_text())
print('primary_manifest_books', ver.get('primary_books'))
print('resolved_primary_books', ver.get('resolved_primary_books'))
print('unresolved_primary_books', ver.get('unresolved_primary_books'))
print('duplicate_primary_identity_groups', ver.get('duplicate_primary_identity_groups'))
print('safe_merge_candidates', len(ver.get('safe_merge_candidates') or []))
con = sqlite3.connect(root / 'data/index/textbook_mineru_fts.db')
cur = con.cursor()
print('db_total', cur.execute("SELECT COUNT(*) FROM chunks").fetchone()[0])
print('db_textbook_runtime', cur.execute("SELECT COUNT(*) FROM chunks WHERE source != 'gaokao'").fetchone()[0])
print('db_gaokao', cur.execute("SELECT COUNT(*) FROM chunks WHERE source = 'gaokao'").fetchone()[0])
print('db_books', cur.execute("SELECT COUNT(DISTINCT book_key) FROM chunks WHERE source != 'gaokao' AND book_key IS NOT NULL AND book_key<>''").fetchone()[0])
PY

/Users/ylsuen/.venv/bin/python platform/scripts/verify_textbook_runtime_data.py

Supplemental vector verify

HF_HUB_OFFLINE=1 /Users/ylsuen/textbook_ai_migration/.venv-vector/bin/python \
  platform/scripts/build_supplemental_vector_index.py verify \
  --source platform/backend/supplemental_textbook_pages.jsonl.gz \
  --index data/index/supplemental_textbook_pages.index \
  --manifest data/index/supplemental_textbook_pages.vector.manifest.json

Live production

curl -sS https://sun.bdfz.net/api/health | jq
curl -sS https://sun.bdfz.net/assets/version.json | jq
curl -sS https://sun.bdfz.net/api/books | jq '.books | length'

curl -sS 'https://sun.bdfz.net/api/search?q=潜热&source=textbook&limit=10' | jq
curl -sS 'https://sun.bdfz.net/api/search?q=海面蒸发潜热&source=textbook&limit=10' | jq
curl -sS 'https://sun.bdfz.net/api/search?q=极性&source=textbook&limit=10' | jq
curl -sS 'https://sun.bdfz.net/api/search?q=极性键&source=textbook&limit=10' | jq
curl -sS 'https://sun.bdfz.net/api/search?q=晶体的定义&source=textbook&limit=10' | jq

VPS capacity and runtime assets

Use the production host shell to check:

df -h /
free -h
du -sh /root/cross-subject-knowledge/data/index
du -sh /root/cross-subject-knowledge/state/cache/huggingface/hub
docker ps

Release-path confirmation

sed -n '1,220p' platform/.github/workflows/deploy.yml
sed -n '1,320p' platform/scripts/deploy_vps.sh

Search-quality failure classes already seen

These failure modes are part of long-term memory and should not be rediscovered from scratch.

1. Cross-edition supplemental misbinding

Symptom:

result text is real, but the linked book/page belongs to another edition
user clicks “view original” and cannot find the text in that book

Root cause:

supplemental OCR pages were mapped onto the wrong primary book_key
different editions were allowed to share the same runtime book identity

Current guardrail:

direct content_id match first
edition-aware matching
unique-title fallback only when safe
otherwise independent suppbook:*
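
The matching order above can be sketched as follows. This illustrates the order of precedence only and is not the project's implementation: the field names `content_id`, `title`, `edition`, the definition of a "safe" title fallback, and the suffix placed after `suppbook:` are all assumptions.

```python
def resolve_book_key(supp, primaries):
    """Pick a runtime book identity for one supplemental book.

    supp and each entry of primaries are dicts with hypothetical fields:
    content_id, title, edition (plus book_key on primaries).
    """
    # 1. Direct content_id match first.
    if supp.get("content_id"):
        for book in primaries:
            if book.get("content_id") == supp["content_id"]:
                return book["book_key"]

    title = supp.get("title")
    same_title = [b for b in primaries if b.get("title") == title] if title else []

    # 2. Edition-aware match: same title, same non-empty edition, exactly one hit.
    if supp.get("edition"):
        same_edition = [b for b in same_title if b.get("edition") == supp["edition"]]
        if len(same_edition) == 1:
            return same_edition[0]["book_key"]

    # 3. Unique-title fallback, treated as safe here only when exactly one primary
    #    has this title and the two editions cannot contradict each other.
    if len(same_title) == 1:
        if not supp.get("edition") or not same_title[0].get("edition"):
            return same_title[0]["book_key"]

    # 4. Otherwise keep an independent supplemental identity.
    #    The suffix is a placeholder, not the real suppbook:* scheme.
    return "suppbook:" + str(supp.get("content_id") or title)
```

The key property is that a title match alone never binds a supplemental book to a primary book whose edition is known to differ.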

2. Meaningless character-split fallback

Symptom:

searching a term like 潜热 returns pages that merely contain the characters 潜 and 热 separately
other subjects get pulled in even though the concept is not present

Root cause:

over-broad character-level fallback in supplemental recall

Current guardrail:

do not split user concepts into meaningless single-character fallback just to force results
results without true term or sentence-level evidence should not survive final filtering

This rule is general. It applies to all queries, not only 潜热.
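
A minimal sketch of the term-level half of this guardrail, assuming each result row carries its page text under a hypothetical `text` field. Sentence-level evidence for longer queries is out of scope here.

```python
def has_term_evidence(query, text):
    """True when the whole query term appears contiguously in text (whitespace ignored)."""
    term = "".join(query.split())
    body = "".join(text.split())
    return bool(term) and term in body


def drop_character_split_hits(query, rows):
    """Keep only rows whose text contains the full term, not just its single characters."""
    return [row for row in rows if has_term_evidence(query, row.get("text", ""))]
```

With the query 潜热, a row containing 汽化潜热 survives, while a row that only contains 潜 and 热 in separate places is dropped.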

3. Hybrid sort instability

Symptom:

some terms such as 极性 return a server error while similar terms such as 极性键 work

Known cause encountered in this round:

mixed int / str IDs during sort/merge in hybrid ranking

Current guardrail:

stable sort identity handling must stay explicit in hybrid and rerank paths
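
The failure is easy to reproduce in Python: when two hits tie on score, tuple comparison falls through to the IDs, and comparing an int with a str raises TypeError. A minimal sketch of an explicit tie-break, assuming hypothetical `score` and `id` fields on each hit:

```python
def rank_hits(hits):
    """Sort by descending score, breaking ties on the string form of the ID."""
    # str(...) keeps int IDs and str IDs comparable; a key like
    # (-score, id) would raise TypeError on a tie between the two kinds.
    return sorted(hits, key=lambda hit: (-float(hit["score"]), str(hit["id"])))
```

Ties are ordered lexicographically by the string ID, which is arbitrary but deterministic; that is enough to keep merge and rerank output stable across runs.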

Required verification before release

Data verification

Confirm supplemental manifest reports:
- unresolved_books=0
- unresolved_pages=0
- edition_conflicts=0
Confirm no duplicate blank-edition book groups are leaking into single-book selector behavior
Confirm page-image-bound rows do not point at missing primary page maps
If supplemental vectors were rebuilt, run manifest and source verify against the current page source
If a release claims new page-image coverage for supplemental editions, confirm those editions have real local page assets and are not merely remapped to a primary edition
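
The three zero-count conditions on the supplemental manifest can be enforced mechanically. A sketch under two assumptions: the manifest is already parsed into a dict, and a value may be either a number or a list of offending entries. The helper name is hypothetical.

```python
REQUIRED_ZERO = ("unresolved_books", "unresolved_pages", "edition_conflicts")


def _count(value):
    """Treat containers as their length and anything else as an integer count."""
    if isinstance(value, (list, tuple, set, dict)):
        return len(value)
    return int(value)


def manifest_blockers(manifest):
    """Return {key: count or 'missing'} for every required-zero field that is not zero."""
    blockers = {}
    for key in REQUIRED_ZERO:
        if manifest.get(key) is None:
            # A missing field cannot be confirmed as zero, so it blocks the release.
            blockers[key] = "missing"
        elif _count(manifest[key]) != 0:
            blockers[key] = _count(manifest[key])
    return blockers
```

An empty dict means the data gate passes; the remaining checks in this list still need manual or separate verification.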

Search verification

At minimum, regression queries must include:

潜热
海面蒸发潜热
极性
极性键
晶体的定义

For each one, verify:

result relevance
correct subject scope
correct book identity and edition
correct “view original” page behavior
absence of meaningless character-split hits

Deployment verification

After deployment:

check /api/health
check /assets/version.json
confirm supplemental manifest counts are the intended release counts
confirm supplemental vector loaded state matches the release goal
if page-image scope changed, sample /api/page-image for one newly covered supplemental edition and one still-uncovered supplemental-only edition
rerun the regression queries above against live production
if the rollout touches frontend or page-image behavior, confirm the running container image digest is the intended rollback anchor; do not assume textbook-knowledge:latest still matches the running container after a manual rollback
if the rollout touches “查看原文” behavior, verify the built image contains /app/frontend/assets/pages/book_map.json and that a representative live search result returns a non-null page_url

Current actionable blockers for this round

As of 2026-03-10, the current local worktree still has these release blockers:

Final supplemental vector rebuild is still in progress and must finish with a successful verify against the corrected supplemental page source.
Even after local build success, the current GitHub Actions deploy path will not automatically move the local supplemental vector from local data/index/ to the VPS runtime.
Production is still on the old supplemental manifest and still exhibits old-query behavior such as noisy 潜热 results and the 极性 failure path.
Frontend version markers and GitHub-facing docs must be synchronized with the actual release contents before deployment.

Do not treat the local rebuild as deployable until all four blockers are cleared.

Future architecture direction

Short-term recommended stack remains:

SQLite FTS5
FAISS
CrossEncoder reranker
host-mounted runtime assets

The next meaningful improvement is not “move everything to a new database first”. It is:

versioned runtime data artifacts
explicit artifact transport into production
sentence-level evidence extraction for definition queries
supplemental FAISS fully integrated and deployed
automated regression evaluation before cutover

Prioritization rule:

for this project, data identity correctness, runtime asset consistency, and release verification come before broad framework rewrites
do not let generic advice such as “split the monolith”, “migrate to React”, or “move to PostgreSQL” outrank concrete live risks like:
- wrong edition/book binding
- supplemental asset drift between local and production
- stale frontend version markers
- missing vector transport to VPS
- broken live query regression cases

Recommended future release model:

build versioned artifact bundles off-box
publish bundle checksums
make VPS deploy pull a specific data-artifact version
keep one machine-readable release manifest that includes:
- DB sha256
- primary FAISS sha256
- supplemental page index sha256
- supplemental vector sha256
- row counts
- book counts
- page counts
separate textbook registry identity from page-image mapping
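
A machine-readable release manifest along these lines could be produced as sketched below. The layout (`version`, `artifacts`, `counts`) and the function names are hypothetical; the document does not define a manifest schema yet.

```python
import hashlib
import json


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def build_release_manifest(version, artifacts, counts):
    """artifacts maps a logical name to a file path; counts maps a name to an int."""
    return {
        "version": version,
        "artifacts": {
            name: {"sha256": sha256_of(path)} for name, path in sorted(artifacts.items())
        },
        "counts": dict(counts),
    }


def dump_manifest(manifest):
    """Serialize with sorted keys so two builds of the same release diff cleanly."""
    return json.dumps(manifest, indent=2, sort_keys=True)
```

The `artifacts` mapping would hold the DB, primary FAISS, supplemental page index, and supplemental vector files, and `counts` the row, book, and page counts, so that a VPS deploy can pull one named version and verify it before restart.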

Minimal pre-change checklist

Before any future update, read this document and explicitly confirm:

target state: local vs production
affected layer: DB / FAISS / supplemental / page images / frontend / AI gateway
affected artifacts to rebuild
required version/file sync
release verification queries
rollback artifact or previous release anchor

If those six items are not written down, the change is not ready.

2026-03-29 high-school geography page-image mapping note

Public CDN/R2 verification: representative high-school geography pages such as https://img.rdfzer.com/pages/890c235d20c4/p4.webp and https://img.rdfzer.com/pages/5d7ca2682888/p103.webp return HTTP 200; the page images already exist remotely
Live production symptom remains reproducible under frontend version 2026.03.26-r33: curl -sS 'https://sun.bdfz.net/api/search?q=自然灾害&source=textbook&subject=地理&phase=高中&limit=10' returns high-school geography supplemental rows with page_url=null
Root cause: the deployed image is missing the 10 suppbook:* entries for the high-school geography 人教版 textbook+atlas set, even though the OCR rows and CDN pages already exist
Local verified target state after the mapping restore:
- primary_db_books=117
- book_map_books=127
- book_map_primary_books=117
- book_map_supplemental_books=10
- supplemental_manifest.pages=2843
- supplemental_jsonl.book_map_key_rows=714
Release class: this is a page-mapping rollout, not a DB rebuild, FAISS rebuild, or R2 upload rollout
Production access warning: workstation SSH to sun.bdfz.net / 23.19.231.173 is currently not healthy for manual inspection from this environment; the server closes the connection before a shell is available. Do not assume manual VPS login works until SSH access is repaired or revalidated
Deploy-path warning: GitHub Actions uses repository secret VPS_SSH_KEY; that trust path is separate from the workstation SSH config and separate from any VPS root password change. Verify the active deploy key before relying on an automatic rollout

2026-03-11 deployment incident note

Incident: a VPS-side manual release was built from the stale runtime repo instead of a clean local release source, and the resulting image lost frontend/assets/pages/book_map.json
User-facing symptom: live search results degraded to page_url=null, so main-site “查看原文” disappeared even though the frontend button code still existed
Misleading rollback detail: the previously accepted rollback image also lacked book_map.json, so “roll back to the accepted image” was not enough to restore page images
Fix that worked: ship a clean off-box release bundle containing the current frontend plus frontend/assets/pages/book_map.json, then deploy from that temporary release checkout
Guardrail for future manual work: when latest may have drifted, tag the running container image digest explicitly before cutover and verify book_map.json inside the built image before calling the release good
