This document is the durable operating memory for the textbook project. Before any data rebuild, deploy, rollback, or search-quality debugging, read this file first.
It records:
- what data exists
- what each count actually means
- how source PDFs become OCR, DB, vectors, page images, and runtime assets
- how Docker and VPS consume those assets
- which external services are part of the real runtime path
- which update steps are mandatory and must not be skipped
This file is intended to be the first stop before future updates so the project does not lose critical context between iterations.
Before changing anything, answer these questions from this document:
- Am I looking at local pending state or production current state?
- Is the problem in primary DB, primary FAISS, supplemental page index, supplemental FAISS, page images, frontend rendering, or AI gateway?
- Which count am I quoting:
  - source PDFs
  - cleaned PDFs
  - primary searchable books
  - page-image books
  - supplemental manifest books
  - visible `/api/books` books
  - DB chunks
  - primary vectors
  - supplemental pages
  - supplemental vectors
- Did a mapping rule change happen without rebuilding supplemental vectors?
- Did production receive only the DB while FAISS / supplemental assets stayed old?
- Are frontend version markers, backend behavior, and GitHub docs aligned for the same release?
- Is the result showing the right book identity, edition, and page-image source?
- If a query is wrong, is it a data identity problem, a retrieval problem, or a UI labeling problem?
- Has `textbook_version_manifest.json` been regenerated, and are `unresolved_primary_books=0`, `duplicate_primary_identity_groups=0`, and `safe_merge_candidates=0`?
Two states must always be tracked separately:
- Production current: what sun.bdfz.net is actually serving
- Local pending rollout: what has been rebuilt locally but not yet deployed
As of 2026-03-10:
| Layer | Production current | Local pending rollout | Notes |
|---|---|---|---|
| Main DB | `textbook_mineru_fts.db`, 21925 rows | same DB file currently in local `data/index/` | `sync_db.py` can sync only this DB, and only when `RUNTIME_DB_SYNC_MODE` is explicitly enabled |
| Main dense vectors | 17896 vectors, loaded | same local primary FAISS | Physical DB filter is `source != 'gaokao'`, not `source='textbook'` |
| Supplemental page index | corrected build loaded: `books=176`, `primary_books=52`, `supplemental_only_books=124`, `pages=22844` | rebuilt product-scoped source: `books=175`, `primary_books=57`, `supplemental_only_books=118`, `pages=2843`, `source_pages=31170`, `unsupported_pages_omitted=17603` | Local pending rollout now keeps only the public support scope (all 人教版 + 英语·北师大版 + 化学·鲁科版) in the runtime supplemental corpus |
| Supplemental vectors | loaded on production: 22844 vectors, manifest present, health `loaded=true` | local rebuild verified: 2843 vectors against the scoped 2843-page source | These assets still must be explicitly transported; GitHub deploy does not pull them automatically from local `data/index/` |
| Frontend version marker | `2026.03.10-r23` | local code is prepared for `2026.03.10-r27` | Frontend marker must move with the support-scope rollout so book labels and runtime behavior stay in sync |
Never mix production current counts with local rebuilt counts in release notes or debugging conclusions.
- GitHub/repo runtime project: `platform/`
- Production container app:
  - FastAPI backend under `platform/backend/`
  - static frontend under `platform/frontend/`
- Deploy workflow:
  - GitHub Actions clones a fresh checkout on the VPS
  - `platform/scripts/deploy_vps.sh` builds the image and cuts over
Important repository boundary:
- `platform/` is the GitHub-tracked deploy repo
- the workspace root `/Users/ylsuen/textbook_ai_migration/scripts` contains local data-processing tooling, but that tree is not part of the `platform/` Git repository
- changes under the workspace-root `scripts/` tree do not reach production through `git push`; they affect local rebuild capability only unless separately copied or mirrored into a tracked repo
- as of 2026-03-10, the local workspace-root scripts `scripts/pdf_to_pages.py` and `scripts/33_rebuild_mineru_chunks_from_content_list.py` were updated to understand `textbook_version_manifest.json` schema v2 (`by_content_id` + `by_book_key`); future local page-image or chunk rebuilds should preserve that compatibility
- Production site: sun.bdfz.net
- Image CDN: img.rdfzer.com
- AI gateway canonical domain: ai.bdfz.net
- AI gateway implementation currently routes to Cloudflare Worker service `apis/production`
- Worker code lives outside this repo at: `/Users/ylsuen/CF/upgrade_staging/apis/apis.js`
- Production VPS hosts Docker runtime only
- Production VPS is not the place to run OCR, MinerU, or FAISS rebuild jobs
- Large rebuilds belong on a local or offline processing machine
- code editing, data inspection, manifest checks, and lightweight validation
- primary Python preference for general project operations: `/Users/ylsuen/.venv`
- current supplemental vector rebuild path uses a dedicated local environment: `/Users/ylsuen/textbook_ai_migration/.venv-vector`
  - reason: isolate heavy sentence-transformers build execution on macOS and avoid the earlier local crash path
- The primary workstation runs `sing-box.app` with TUN mode enabled.
- Current confirmed route checks on 2026-03-10:
  - `route -n get github.com` -> `interface: utun8`
  - `route -n get 23.19.231.173` -> `interface: utun8`
- Practical consequence:
  - large uploads and deploy traffic from the workstation are expected to traverse the sing-box TUN path
  - when a transfer is unexpectedly slow, verify route first before assuming application-layer failure
  - do not hardcode one artifact path forever; benchmark the available paths for the current session before choosing
- candidate paths for large runtime artifacts:
  - direct workstation -> VPS over SSH / `scp` / `rsync`
  - workstation -> R2, then VPS -> `curl`
- current measured outcome on 2026-03-10:
  - direct SSH upload to VPS over `utun8` was slower and less stable
  - `R2 -> VPS curl` was the more reliable choice for this rollout
- Route and process retrieval points:
  - `ps -axo pid,etime,command | rg 'sing-box|singbox|tun'`
  - `ifconfig | rg -n 'utun|tun'`
  - `route -n get github.com`
  - `route -n get 23.19.231.173`
- base image: `python:3.13-slim`
- single-container FastAPI runtime
- host-mounted:
  - `/data`
  - `/state`
- shared HF cache root inside container environment: `/state/cache/huggingface/hub`
Do not assume that a local artifact exists on the VPS just because local code can see it. Local data/index/ and production /data/index/ are different asset stores.
Also do not assume that an upload problem is a code problem before confirming whether the transfer path is using the expected sing-box TUN route.
The project has multiple identity layers. Mixing them causes bad mappings and search corruption.
- `subject`
- normalized `title`
- `edition`
- `content_id` when available
- file path lineage
content_id is the strongest identity signal and should win whenever present.
- `book_key`
  - primary books use stable SmartEdu-derived book keys
  - supplemental-only books use synthetic `suppbook:*`
- `display_title`
  - user-facing title with edition or disambiguating suffix
- `short_key`
  - only for books that have page-image mapping in `book_map.json`

- physical page index: zero-based page used by R2 page images
- logical page number: printed page number in the book, when known
- `page_num` in runtime/page-image code should be treated as physical page index unless a separate logical-page field is explicitly present

- DB `chunks.id`: physical row identity
- DB `source`: current physical values are `mineru` and `gaokao`
- Runtime “textbook” is a logical concept built on `source != 'gaokao'`
- Analytics helper tables may use logical labels instead of physical DB labels
  - for example, `keyword_counts.source` currently uses `textbook` / `gaokao`
Important: when debugging DB counts, do not filter local rows by source='textbook'. The current DB stores textbook rows as source='mineru'.
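The filter rule above can be checked with a short query helper. This is a minimal sketch, assuming only that the main DB has a `chunks` table with a `source` column, as described above; the in-memory database is a stand-in for `textbook_mineru_fts.db`, and the function names are hypothetical.

```python
import sqlite3


def count_rows_by_source(conn: sqlite3.Connection) -> dict[str, int]:
    """Return {physical source label: row count} for the chunks table."""
    rows = conn.execute(
        "SELECT source, COUNT(*) FROM chunks GROUP BY source"
    ).fetchall()
    return {source: count for source, count in rows}


def count_textbook_runtime_rows(conn: sqlite3.Connection) -> int:
    """Count textbook-runtime rows with the physical filter source != 'gaokao'."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM chunks WHERE source != 'gaokao'"
    ).fetchone()
    return count


# In-memory stand-in for the real DB (assumed minimal schema, demo rows only).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, source TEXT)")
demo.executemany(
    "INSERT INTO chunks (source) VALUES (?)",
    [("mineru",), ("mineru",), ("mineru",), ("gaokao",)],
)

by_source = count_rows_by_source(demo)
textbook_rows = count_textbook_runtime_rows(demo)
# The logical label 'textbook' matches nothing at the physical layer.
(wrong_filter_rows,) = demo.execute(
    "SELECT COUNT(*) FROM chunks WHERE source = 'textbook'"
).fetchone()
```

Against the real file, replace the in-memory connection with `sqlite3.connect("data/index/textbook_mineru_fts.db")` and compare the results with the counts recorded in this document.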
All counts below are current local filesystem counts and are not interchangeable.
| Directory | Meaning | Current count / size |
|---|---|---|
| `data/raw_pdf` | source PDFs downloaded from textbook acquisition flows | 53 recursive PDFs, about 5.0G |
| `data/clean_pdfs` | curated / normalized PDFs used for later processing | 63 PDFs, about 6.4G |
| `data/parsed` | earlier Markdown and JSON parsing output | 315 Markdown + 315 JSON, about 42G |
| `data/mineru_output` | primary OCR corpus for main searchable books | 69 Markdown + 207 JSON + embedded PDFs, about 23G |
| `data/mineru_output_backup` | backup OCR corpus for supplemental page-level recall | 253 Markdown + 759 JSON + embedded PDFs, about 80G |
Notes:
- Recursive PDF counts inside `mineru_output` and `mineru_output_backup` are not book counts. These trees include embedded PDFs and processing artifacts. Use Markdown-file count or manifest counts as the book proxy there.
- The public textbook download library count mentioned elsewhere, such as “316 textbooks”, is not the same thing as the current local actively processed runtime corpus. Treat that as a separate public-library metric and re-verify independently before using it publicly.
| Artifact | Meaning | Current local status |
|---|---|---|
| `data/index/textbook_mineru_fts.db` | main runtime DB | about 56M |
| `data/index/textbook_chunks.index` | primary FAISS | about 70M, 17896 vectors |
| `data/index/textbook_chunks.manifest.json` | primary FAISS manifest | present |
| `platform/backend/supplemental_textbook_pages.jsonl.gz` | supplemental page index source bundled in repo | about 1.8M, 2843 searchable rows (31170 merged source pages before omission/dedupe) |
| `platform/backend/supplemental_textbook_pages.manifest.json` | supplemental page manifest bundled in repo | about 110K |
| `data/index/supplemental_textbook_pages.index` | supplemental FAISS target path | about 11M, verified against the scoped 2843-page source |
| `data/index/supplemental_textbook_pages.vector.manifest.json` | supplemental FAISS manifest | present, verify required before release |
Other data directories that matter operationally:
- `data/dict_pages`
  - dictionary page images staged for R2
- `data/gaokao_raw`
- `data/gaokao_scraped`
- `data/gaokao_exam_images`
- `data/_img_tmp`
  - temporary page-image working tree
Use this ledger before every upload, sync, or rollback. Do not transfer a runtime asset until its path, size, and SHA256 have been matched against this table or intentionally refreshed.
| Relative path | Role | Size (bytes) | SHA256 |
|---|---|---|---|
| `data/index/textbook_mineru_fts.db` | main runtime DB | 58892288 | `5a92fff4f33c4891a7b6916ce26eda69b413c8a3f852e1b8687c70e75fa45c71` |
| `data/index/textbook_chunks.index` | primary FAISS index | 73445274 | `2c5a5aa221c6e42ae0e3ca6e841c1a8dbe7b40fba606d5cf2345e59eccde0331` |
| `data/index/textbook_chunks.manifest.json` | primary FAISS manifest | 891 | `394d69870d116106fdcf7a5f17af9aa0275139340c41a8b029bb7a43f1664155` |
| `platform/backend/supplemental_textbook_pages.jsonl.gz` | bundled supplemental page source | 1889049 | `16937cbe0a7034ccc40b29e7db65fc12f3db5ecf0c0f29cfd431c07e9a75344e` |
| `platform/backend/supplemental_textbook_pages.manifest.json` | bundled supplemental page manifest | 111821 | `bd071850079a96976ad6b495c91143ae8fca0bd1f6dd70bc41d6066cb72f2a9c` |
| `data/index/supplemental_textbook_pages.index` | supplemental FAISS index | 11644973 | `879a02d544e999bfc31813eab76bbd5bf1b8b91a7ec70fb3f4cd65e2d2c5f4ca` |
| `data/index/supplemental_textbook_pages.vector.manifest.json` | supplemental FAISS manifest | 717 | `f2fbbbb91988d58f891d90ab8677c7a05732de3308bfde7a1222c53e8bc9425f` |
| `platform/frontend/assets/pages/book_map.json` | page-image identity map | 40261 | `b005aabe1a5f5ce3311fc849999de54db3ae3bf2b0afb10cc2921b7aeba7485d` |
| `platform/backend/textbook_version_manifest.json` | version-label manifest | 72847 | `674494d6de4c0acca0e4e9a2f3c265e80d8865bfd4167f6ea9a021e934c88c93` |
| `platform/frontend/assets/version.json` | public frontend version ledger | 6303 | `5369c2d567f9223647d5a34857d2098fbab25c3c0ce5a6007a235afb74c0bc74` |
Important DB note:
- `textbook_mineru_fts.db` is a mutable runtime file because local smoke tests and production traffic can append telemetry tables and log rows.
- File-level SHA256 for this DB is therefore not a stable content identity signal by itself.
- For release verification, pair the DB file SHA with a stable runtime identity fingerprint derived from the search-critical content tables plus FTS shadow table counts, while excluding mutable telemetry tables such as `search_logs` and `ai_chat_logs`.
These are the runtime destinations that must match the intended release artifact set:
- `/root/cross-subject-knowledge/data/index/textbook_mineru_fts.db`
- `/root/cross-subject-knowledge/data/index/textbook_chunks.index`
- `/root/cross-subject-knowledge/data/index/textbook_chunks.manifest.json`
- `/root/cross-subject-knowledge/data/index/supplemental_textbook_pages.jsonl.gz`
- `/root/cross-subject-knowledge/data/index/supplemental_textbook_pages.manifest.json`
- `/root/cross-subject-knowledge/data/index/supplemental_textbook_pages.index`
- `/root/cross-subject-knowledge/data/index/supplemental_textbook_pages.vector.manifest.json`
Before any transfer:
- confirm the exact source path you are about to copy
- compute its size and SHA256
- compare that output to this ledger or to a deliberately updated replacement ledger
- only then copy it to R2 or directly to the VPS
- after the remote copy completes, recompute remote size and SHA256 before cutover
This rule exists because a Git checkout, a local build output, and a repo-bundled fallback file may have the same filename while representing different release states.
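The size-and-SHA256 step can be sketched in a few lines of standard-library Python. This is an illustration only: `file_fingerprint` and `matches_ledger` are hypothetical helper names, not scripts that exist in the repo, and the demo runs against a throwaway temporary file rather than a real artifact.

```python
import hashlib
import tempfile
from pathlib import Path


def file_fingerprint(path: Path, chunk_size: int = 1 << 20) -> tuple[int, str]:
    """Return (size in bytes, hex SHA256), reading the file in chunks."""
    digest = hashlib.sha256()
    size = 0
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()


def matches_ledger(path: Path, expected_size: int, expected_sha256: str) -> bool:
    """True only when both size and SHA256 equal the ledger entry."""
    size, sha256 = file_fingerprint(path)
    return size == expected_size and sha256 == expected_sha256.lower()


# Demo on a temporary file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"abc")
    sample_size, sample_sha = file_fingerprint(sample)
    sample_ok = matches_ledger(sample, sample_size, sample_sha)
```

For a real check, pass the ledger values, for example `matches_ledger(Path("data/index/textbook_chunks.index"), 73445274, "2c5a5aa221c6e42ae0e3ca6e841c1a8dbe7b40fba606d5cf2345e59eccde0331")`, and run the same comparison again on the remote copy before cutover.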
Current-vector verification note for this round:
- `platform/scripts/build_supplemental_vector_index.py verify` has passed against the current `platform/backend/supplemental_textbook_pages.jsonl.gz`
- verified result: `2843` vector rows, `1024` dimensions, fingerprint `5d2ecfc643d22aa32026b4f94dba14dd320d797c3f1408f7d6c9fb8768886948`
Local main DB facts:
- DB total rows: `21925`
- textbook-runtime rows: `17896`
  - physical DB filter: `source != 'gaokao'`
- gaokao rows: `4029`
- distinct primary textbook `book_key`s in DB: `69`
- distinct gaokao `book_key`s: `416`
- books in `platform/frontend/assets/pages/book_map.json`: `69`
- `platform/backend/textbook_version_manifest.json` schema: `2`
- version manifest `by_book_key` entries: `69`
- version manifest `by_content_id` entries: `33`
- unresolved primary editions in version manifest: `0`
- duplicate primary identities in version manifest: `0`
- remaining safe merge candidates after rebuild: `0`
Main DB subject row counts for source != 'gaokao':
- 数学: `5116`
- 英语: `3580`
- 化学: `1736`
- 物理: `1629`
- 思想政治: `1250`
- 语文: `1228`
- 地理: `1153`
- 生物学: `1151`
- 历史: `1053`
Do not confuse:
- distinct books in DB
- books with page-image maps
- books in the version manifest
- books with real `content_id` entries in the version manifest
Those are four different sets.
Corrected local supplemental manifest facts:
- indexed source files: `251 / 251`
- manifest books: `175`
- searchable runtime pages: `2843`
- merged source pages before omission/dedupe: `31170`
- books safely merged back to primary `book_key`: `57`
- supplemental-only books in the identity manifest: `118`
- supported supplemental-only visible books in the current public release: `27`
- primary-bound duplicate pages omitted from runtime search: `10724`
- unsupported-version supplemental pages omitted from runtime search: `17603`
- duplicate OCR pages collapsed inside the release-scoped supplemental corpus: `0`
- unresolved books: `0`
- unresolved pages: `0`
- edition conflicts: `0`
- remaining same-identity cross-source conflicts: `0` (checked by audit, and future rebuilds should fail if this rises above `0`)
Supplemental page row facts:
- total searchable rows: `2843`
- searchable rows with `content_id`: `2843`
- searchable rows without `content_id`: `0`
- searchable rows bound to primary page images: `0`
- searchable rows carrying an explicit `book_map_key` field: `0`
- empty-text rows: `0`
Supplemental-only source PDF coverage facts:
- books with `*_origin.pdf`: `27 / 27`
- books with `*_layout.pdf`: `27 / 27`
- books with `*_span.pdf`: `27 / 27`
- all `27` supported supplemental-only books in the current public release have generated page-image products and valid `book_map.json` entries
- release rule: do not remap unsupported parallel editions to a supported primary book merely to surface a page image
- future expansion note: unsupported parallel editions remain preserved in the audit data layer and can be productized later as a separate page-image rollout
Supplemental page counts by subject:
- 英语: `869`
- 地理: `714`
- 物理: `767`
- 化学: `367`
- 生物学: `126`
Subjects intentionally absent from searchable supplemental rows after identity cleanup:
- 数学, 语文, 历史, and 思想政治 currently contribute `0` searchable supplemental rows in the current public release scope
Supplemental manifest book counts by subject:
- 英语: `42`
- 数学: `37`
- 地理: `31`
- 物理: `17`
- 化学: `15`
- 生物学: `15`
- 思想政治: `8`
- 历史: `5`
- 语文: `5`
Supplemental manifest edition distribution highlights:
- 人教版: `48`
- 沪教版: `19`
- 中图版: `10`
- 沪科版: `16`
- 苏教版: `14`
- 湘教版: `9`
- 北师大版: `8`
- 鄂教版: `7`
- B版: `7`
- 上外教版: `7`
- 重大版: `7`
Relationship rule that must stay explicit:
- supplemental manifest books: `175`
- supplemental books merged into primary: `57`
- supplemental-only books in the identity manifest: `118`
- supported supplemental-only books currently released: `27`
- visible `/api/books` total in the current public release: `35 + 27 = 62`
Never report 175 or 118 as the current public /api/books total.
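The relationships between these counts can be written down as arithmetic checks. The sketch below uses the local figures recorded in this document; the class and function names are hypothetical. The page-accounting identity (searchable pages plus the two omitted groups equals merged source pages) is inferred from the figures adding up, not stated as a rule by the build scripts, so treat it as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SupplementalCounts:
    manifest_books: int
    merged_into_primary: int
    supplemental_only_in_manifest: int
    supported_primary_books: int
    supported_supplemental_only_books: int
    searchable_pages: int
    primary_duplicate_pages_omitted: int
    unsupported_pages_omitted: int
    merged_source_pages: int


def caliber_problems(c: SupplementalCounts) -> list[str]:
    """Return a list of broken relationships; empty means the counts agree."""
    problems = []
    if c.merged_into_primary + c.supplemental_only_in_manifest != c.manifest_books:
        problems.append("merged + supplemental-only != manifest books")
    # Assumption: every merged source page is either searchable or omitted.
    accounted = (
        c.searchable_pages
        + c.primary_duplicate_pages_omitted
        + c.unsupported_pages_omitted
    )
    if accounted != c.merged_source_pages:
        problems.append("searchable + omitted pages != merged source pages")
    return problems


def visible_books_total(c: SupplementalCounts) -> int:
    """Expected visible /api/books total: supported primary + supported supplemental-only."""
    return c.supported_primary_books + c.supported_supplemental_only_books


local_pending = SupplementalCounts(
    manifest_books=175,
    merged_into_primary=57,
    supplemental_only_in_manifest=118,
    supported_primary_books=35,
    supported_supplemental_only_books=27,
    searchable_pages=2843,
    primary_duplicate_pages_omitted=10724,
    unsupported_pages_omitted=17603,
    merged_source_pages=31170,
)
problems = caliber_problems(local_pending)
visible_total = visible_books_total(local_pending)
```

With the figures above, `problems` is empty and `visible_total` is `62`, which is the only number that should be quoted as the public `/api/books` total.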
These are the recurring caliber problems that must be checked every round.
The following must never be conflated:
- source PDF count
- cleaned PDF count
- primary OCR Markdown count
- backup OCR Markdown count
- primary searchable books in the DB
- books with page-image mapping in `book_map.json`
- books covered by `textbook_version_manifest.json.by_book_key`
- books covered by `textbook_version_manifest.json.by_content_id`
- supplemental manifest books
- supplemental-only visible books
- visible `/api/books` total
- DB textbook-runtime rows
- primary FAISS vector rows
- supplemental page rows
- supplemental FAISS vector rows
When describing the main searchable textbook corpus:
- logical runtime wording: “textbook corpus” or “non-gaokao textbook rows”
- physical DB filter: `source != 'gaokao'`
- physical row label in `chunks`: `mineru`
Do not write “source='textbook' in the main DB” unless the DB schema actually changes to that.
For analytics helper tables, verify the table’s own semantics first.
- `keyword_counts.source` currently uses logical labels:
  - `textbook`
  - `gaokao`
- that does not match the physical `chunks.source` labels
When describing books:
- `69` = full primary books present in the runtime DB and page-image registry
- `35` = supported primary books in the current public release
- `175` = corrected supplemental manifest books after identity audit
- `118` = supplemental-only books in the full identity manifest after removing the `57` books merged back into primary identities
- `27` = supported supplemental-only books in the current public release
- `62` = expected visible `/api/books` total in the current public release
Before writing any status update, release note, or GitHub summary:
- explicitly say whether the number is:
- production current
- local pending rollout
Never silently switch between the two.
```mermaid
flowchart TD
    A["Source PDFs / acquisition metadata"] --> B["clean_pdfs / curated inputs"]
    B --> C["parsed / early md+json output"]
    B --> D["MinerU primary OCR -> data/mineru_output"]
    B --> E["MinerU backup OCR -> data/mineru_output_backup"]
    D --> F["Unified DB build -> textbook_mineru_fts.db"]
    F --> G["Primary FAISS -> textbook_chunks.index"]
    D --> H["Page mapping -> book_map.json + R2 page images"]
    E --> I["Supplemental page build -> supplemental_textbook_pages.jsonl.gz"]
    I --> J["Supplemental FAISS -> supplemental_textbook_pages.index"]
    F --> K["FastAPI runtime"]
    G --> K
    I --> K
    J --> K
    H --> L["img.rdfzer.com page CDN"]
    K --> M["sun.bdfz.net"]
    N["ai.bdfz.net / Worker apis"] --> K
```
Primary scripts:
- `scripts/01_download_textbooks.sh`
- `scripts/01_download_textbooks_via_images.py`

Outputs:
- `data/raw_pdf`

Primary scripts:
- `scripts/02_pdf_to_md.py`
- `scripts/06_ocr_pages_to_jsonl.py`
- `scripts/07_ocr_fullpage.py`
- `scripts/08_mineru_batch.py`

Outputs:
- `data/parsed`
- `data/mineru_output`
- `data/mineru_output_backup`
- `data/index/mineru_chunks.jsonl` and related intermediate JSONL files

Primary scripts:
- `scripts/09_build_unified_index.py`
- `scripts/19_build_concept_map.py`

Outputs:
- `data/index/textbook_mineru_fts.db`

Primary script:
- `scripts/21_build_vector_index.py`

Outputs:
- `data/index/textbook_chunks.index`
- `data/index/textbook_chunks.manifest.json`

Primary scripts:
- `scripts/31_generate_page_maps.py`
- `scripts/32_apply_page_mapping.py`
- `scripts/upload_pages_r2.py`

Outputs:
- `platform/frontend/assets/pages/book_map.json`
- `platform/frontend/assets/pages/{short_key}/p{N}.webp`
- R2 `pages/{short_key}/p{N}.webp`
Current operational boundary:
- the current page-image product covers the `69` primary books in `book_map.json`
- it does not yet cover the `118` supplemental-only visible books, even though those books already have `origin` / `layout` / `span` PDFs in `data/mineru_output_backup`
- therefore, a missing `查看原文` ("view original page") link on a supplemental-only result should currently be interpreted as “page-image product not generated for this edition yet”, not “the OCR text was mapped to the wrong primary book”
Primary script:
- `platform/scripts/build_supplemental_textbook_index.py`

Inputs:
- `data/mineru_output_backup`
- `data/index/textbook_mineru_fts.db`
- `platform/frontend/assets/pages/book_map.json`

Outputs:
- `platform/backend/supplemental_textbook_pages.jsonl.gz`
- `platform/backend/supplemental_textbook_pages.manifest.json`
Current safe mapping rules:
- prefer direct `content_id` match
- require edition consistency if `edition_hint` exists
- if no edition hint exists, allow title-based match only when `(subject, normalized title)` resolves uniquely
- otherwise generate independent `suppbook:*`
These rules exist to prevent cross-edition corruption while still allowing safe rebinding to primary page-image books.
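One way to read the four mapping rules is as an ordered resolver. The sketch below is an interpretation, not the code in `build_supplemental_textbook_index.py`: the dataclasses, field names, and `fallback_id` are hypothetical, and it assumes that edition consistency applies to every match when an edition hint exists and that a title match must never override a conflicting `content_id`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PrimaryBook:
    book_key: str
    subject: str
    normalized_title: str
    edition: str
    content_id: Optional[str] = None


@dataclass(frozen=True)
class SupplementalBook:
    subject: str
    normalized_title: str
    content_id: Optional[str] = None
    edition_hint: Optional[str] = None


def resolve_book_key(
    supp: SupplementalBook, primaries: list[PrimaryBook], fallback_id: str
) -> str:
    """Bind to a primary book_key only when it is safe; otherwise mint suppbook:*."""

    def edition_ok(p: PrimaryBook) -> bool:
        # Rule 2: an edition hint, when present, must agree with the primary book.
        return supp.edition_hint is None or p.edition == supp.edition_hint

    def content_id_compatible(p: PrimaryBook) -> bool:
        # Assumption: a title match may not contradict a known content_id.
        return (
            supp.content_id is None
            or p.content_id is None
            or p.content_id == supp.content_id
        )

    # Rule 1: prefer a direct content_id match.
    if supp.content_id is not None:
        by_id = [
            p for p in primaries
            if p.content_id == supp.content_id and edition_ok(p)
        ]
        if len(by_id) == 1:
            return by_id[0].book_key

    # Rule 3: title-based match only when (subject, normalized title) is unique.
    by_title = [
        p for p in primaries
        if (p.subject, p.normalized_title) == (supp.subject, supp.normalized_title)
        and edition_ok(p)
        and content_id_compatible(p)
    ]
    if len(by_title) == 1:
        return by_title[0].book_key

    # Rule 4: otherwise keep the book independent.
    return f"suppbook:{fallback_id}"


primaries = [
    PrimaryBook("pk-physics-1", "physics", "compulsory-1", "PEP", "cid-1"),
    PrimaryBook("pk-math-1a", "math", "compulsory-1", "PEP-A"),
    PrimaryBook("pk-math-1b", "math", "compulsory-1", "PEP-B"),
    PrimaryBook("pk-geo-1", "geography", "compulsory-1", "PEP"),
]
by_content_id = resolve_book_key(
    SupplementalBook("physics", "compulsory-1", content_id="cid-1"), primaries, "x1")
edition_conflict = resolve_book_key(
    SupplementalBook("physics", "compulsory-1", "cid-1", edition_hint="other"), primaries, "x2")
unique_title = resolve_book_key(
    SupplementalBook("geography", "compulsory-1"), primaries, "x3")
ambiguous_title = resolve_book_key(
    SupplementalBook("math", "compulsory-1"), primaries, "x4")
```

The design choice is to fail closed: any ambiguity or edition mismatch yields an independent `suppbook:*` key rather than a guessed binding to a primary book.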
Primary script:
- `platform/scripts/build_supplemental_vector_index.py`

Outputs:
- `data/index/supplemental_textbook_pages.index`
- `data/index/supplemental_textbook_pages.vector.manifest.json`

Runtime rule:
- if the supplemental page source fingerprint or source `sha256` no longer matches the vector manifest, the supplemental FAISS must be treated as stale and disabled until rebuilt
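The staleness rule can be illustrated with a small freshness check. This is a sketch under stated assumptions: the manifest key `source_sha256` is a hypothetical name (the real vector manifest schema is not shown in this document), and the demo uses temporary stand-in files rather than the real assets.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of_file(path: Path) -> str:
    """Hex SHA256 of a file; fine for small files such as the page source."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def supplemental_vectors_usable(source_path: Path, vector_manifest_path: Path) -> bool:
    """True only when the manifest's recorded source hash matches the file on disk."""
    manifest = json.loads(vector_manifest_path.read_text(encoding="utf-8"))
    recorded = manifest.get("source_sha256")  # hypothetical key name
    return recorded is not None and recorded == sha256_of_file(source_path)


# Demo with stand-in files: build a matching manifest, then change the source.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "supplemental_textbook_pages.jsonl.gz"
    manifest_file = Path(tmp) / "supplemental_textbook_pages.vector.manifest.json"
    source.write_bytes(b"page rows v1")
    manifest_file.write_text(
        json.dumps({"source_sha256": sha256_of_file(source)}), encoding="utf-8"
    )
    usable_before = supplemental_vectors_usable(source, manifest_file)
    source.write_bytes(b"page rows v2")  # page source rebuilt, vectors not rebuilt
    usable_after = supplemental_vectors_usable(source, manifest_file)
```

Here `usable_before` is true and `usable_after` is false, which mirrors the case where a mapping rule changed the page source but the supplemental vectors were never rebuilt.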
platform/Dockerfile copies only:
- `backend/`
- `frontend/`
- runtime Python dependencies
The image does not bake the heavy runtime data tree under data/.
Repo-bundled fallback assets currently inside platform/backend/:
- `supplemental_textbook_pages.jsonl.gz`
- `supplemental_textbook_pages.manifest.json`
- optionally supplemental vector files if explicitly copied there for a release
Required host-mounted roots:
- `/data`
- `/state`

Required runtime assets:
- `/data/index/textbook_mineru_fts.db`
- `/data/index/textbook_chunks.index`
- `/data/index/textbook_chunks.manifest.json`
- `/data/index/supplemental_textbook_pages.jsonl.gz`
- `/data/index/supplemental_textbook_pages.manifest.json`

Conditionally required runtime assets:
- `/data/index/supplemental_textbook_pages.index`
- `/data/index/supplemental_textbook_pages.vector.manifest.json`
Startup behavior from platform/backend/entrypoint.sh:
- optionally run `sync_db.py` only when `RUNTIME_DB_SYNC_MODE` is explicitly enabled
- run `preflight.py`
- start `uvicorn`
The runtime does not rebuild FAISS or supplemental assets on the VPS.
platform/backend/sync_db.py is now an explicit emergency path only. When manually enabled, it can sync only:
- `textbook_mineru_fts.db`

By default, production startup keeps `RUNTIME_DB_SYNC_MODE=disabled`, so it does not auto-sync any runtime DB.

It never auto-syncs:
- `textbook_chunks.index`
- `textbook_chunks.manifest.json`
- `supplemental_textbook_pages.jsonl.gz`
- `supplemental_textbook_pages.manifest.json`
- `supplemental_textbook_pages.index`
- `supplemental_textbook_pages.vector.manifest.json`
Therefore the DB can move ahead while FAISS and supplemental assets stay old. This is one of the main failure modes of the current release model.
Page-image naming is part of the data contract.
Textbook page images:
- local source: `platform/frontend/assets/pages/{short_key}/p{N}.webp`
- remote R2/CDN path: `pages/{short_key}/p{N}.webp`
- CDN base: `https://img.rdfzer.com/pages/{short_key}/p{N}.webp`

Dictionary page images:
- remote protected dirs include:
  - `pages/dict_xuci/`
  - `pages/dict_changyong/`
  - `pages/dict_ciyuan/`

Inline book-origin images shown in results:
- `https://img.rdfzer.com/orig/{urlencoded_book_key}/{filename}`

Gaokao images:
- `https://img.rdfzer.com/gaokao/{filename}`
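The path contract above can be captured as three small URL builders. This is an illustrative sketch: the function names are hypothetical, `N` is assumed to be the zero-based physical page index, and the exact percent-encoding the frontend applies to `book_key` is an assumption (here every reserved character, including `/`, is encoded).

```python
from urllib.parse import quote

IMG_CDN = "https://img.rdfzer.com"


def textbook_page_url(short_key: str, page_index: int) -> str:
    """CDN URL for a textbook page image; page_index is the physical index."""
    if page_index < 0:
        raise ValueError("page_index must be a non-negative physical page index")
    return f"{IMG_CDN}/pages/{short_key}/p{page_index}.webp"


def book_origin_image_url(book_key: str, filename: str) -> str:
    """CDN URL for an inline book-origin image, with the book_key URL-encoded."""
    return f"{IMG_CDN}/orig/{quote(book_key, safe='')}/{filename}"


def gaokao_image_url(filename: str) -> str:
    """CDN URL for a gaokao image."""
    return f"{IMG_CDN}/gaokao/{filename}"


# Hypothetical keys and filenames, used only to show the URL shapes.
page_url = textbook_page_url("abc", 0)
origin_url = book_origin_image_url("a b/c", "x.jpg")
gaokao_url = gaokao_image_url("q1.png")
```

Keeping these builders in one place makes it easier to confirm that backend, frontend, and the R2 upload script agree on the same contract.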
scripts/upload_pages_r2.py stages textbook and dictionary page trees together before rclone sync.
Do not sync only textbook page roots to the pages/ prefix. That would delete remote dictionary assets.
Backend defaults in platform/backend/main.py currently point to:
- AI service URL default: `https://apis.bdfz.workers.dev/`
- AI label default: `Gemini`
- AI model default: `gemini-flash-latest`

External worker implementation currently has its own defaults in:
- `/Users/ylsuen/CF/upgrade_staging/apis/apis.js`

Current worker defaults include:
- generic text/chat default model: `gemini-3.1-flash-lite-preview`
- vision fallback model: `gemini-flash-latest`
This means model naming and default behavior must be checked on both sides when debugging AI output drift. Do not assume backend-requested model and worker-internal fallback are the same thing.
These are the main parameter/control surfaces that must be checked before release. If a parameter changes, its retrieval point must be verified in the corresponding file or endpoint.
Retrieval points:
- `platform/Dockerfile`
  - `PROJECT_ROOT`
  - `DATA_ROOT`
  - `STATE_ROOT`
  - `PORT`
  - `HF_HOME`
  - `SENTENCE_TRANSFORMERS_HOME`
  - `TRANSFORMERS_CACHE`
- `platform/backend/main.py`
  - local/runtime path resolution
  - bundled vs runtime supplemental asset discovery
- `platform/backend/preflight.py`
  - required runtime assets
- `platform/backend/sync_db.py`
  - DB auto-sync source and target paths
- `platform/scripts/deploy_vps.sh`
  - runtime mount destinations and rollout gate behavior
Retrieval points in platform/backend/main.py:
- `SQLITE_BUSY_TIMEOUT_MS`
- `FAISS_SCORE_THRESHOLD`
- `SUPPLEMENTAL_VECTOR_ENABLED`
- `SUPPLEMENTAL_VECTOR_SCORE_THRESHOLD`
- `QUERY_TERM_PLAN_LIMIT`
- `SUPPLEMENTAL_FALLBACK_LIMIT`
- `RERANKER_ENABLED`
- `RERANKER_PRELOAD`
- `RERANKER_MAX_CANDIDATES`
- `RERANKER_FINAL_LIMIT`
- `GRAPH_RAG_ENABLED`
- `GRAPH_RAG_MAX_RELATIONS`
- evidence-span cache and semantic cache parameters
Retrieval points:
- `platform/backend/main.py`
  - `AI_SERVICE_URL`
  - `AI_SERVICE_LABEL`
  - `AI_SERVICE_MODEL`
  - `AI_SERVICE_TIMEOUT_SEC`
  - `AI_SERVICE_RETRIES`
  - `AI_SERVICE_RETRY_DELAY_SEC`
  - `AI_SERVICE_ORIGIN`
  - `AI_SERVICE_REFERER`
  - `AI_SERVICE_USER_AGENT`
  - `AI_SERVICE_PROJECT`
  - `AI_SERVICE_TASK_TYPE`
  - `AI_SERVICE_THINKING_LEVEL`
  - `AI_INTERNAL_TOKEN`
- external worker implementation:
  - `CF/upgrade_staging/apis/apis.js`
  - check worker default model and fallback model there
Retrieval points:
- `platform/backend/main.py`
  - `SUPPLEMENTAL_REQUIRED`
  - supplemental source fallback order
  - supplemental vector source fallback order
- `platform/backend/preflight.py`
  - `SUPPLEMENTAL_REQUIRED`
  - `SUPPLEMENTAL_VECTOR_REQUIRED`
- `platform/scripts/deploy_vps.sh`
  - `SUPPLEMENTAL_VECTOR_BUNDLED`
  - `HEALTH_REQUIRE_RERANKER`
  - `HEALTH_REQUIRE_SUPPLEMENTAL_VECTOR`
Retrieval points:
- `platform/frontend/assets/version.json`
  - public version history and current version marker
- `platform/frontend/index.html`
  - cache-buster query strings for `style.css` and `app.js`
- `platform/frontend/assets/app.js`
  - `IMG_CDN`
  - API request behavior and UI rendering assumptions
Retrieval points:
- `platform/backend/main.py`
  - `IMG_CDN`
  - `/api/page-image`
- `platform/frontend/assets/app.js`
  - textbook inline image path
  - gaokao image path
- `scripts/upload_pages_r2.py`
  - textbook page upload roots
  - dictionary protected roots
  - final R2 path contract
Retrieval points:
- live endpoint: `/api/health`
- live endpoint: `/assets/version.json`
- live endpoint: `/api/books`
- live endpoint: `/api/search`
- live endpoint: `/api/page-image`
Important /api/health fields to check:
- `status`
- `db.chunks`
- `faiss.ok`
- `faiss.vectors`
- `faiss.manifest.vector_rows`
- `model.ok`
- `reranker.loaded`
- `supplemental.ok`
- `supplemental.source`
- `supplemental.manifest.source_files_total`
- `supplemental.manifest.source_files_indexed`
- `supplemental.manifest.books`
- `supplemental.manifest.primary_books`
- `supplemental.manifest.supplemental_only_books`
- `supplemental.manifest.pages`
- `supplemental.manifest.unresolved_books`
- `supplemental.manifest.unresolved_pages`
- `supplemental.manifest.edition_conflicts`
- `supplemental_vectors.enabled`
- `supplemental_vectors.loaded`
- `supplemental_vectors.vectors`
- `supplemental_vectors.reason`
Current production VPS facts already confirmed:
- root filesystem: `99G` total, about `42G` available
- `/root/cross-subject-knowledge/data/index`: about `244M`
- `/root/cross-subject-knowledge/state/cache/huggingface/hub`: about `5.4G`
- `/var/lib/docker`: about `2.7G`
- memory: `5.8Gi`, available about `3.3Gi`
- swap: `0`
Implications:
- production has enough disk for the pending supplemental assets
- production should still not be used for OCR or FAISS rebuilds
- lack of swap means Docker build spikes or large model warmup should stay conservative
platform/.github/workflows/deploy.yml does this:
- SSH into the VPS
- create a fresh temporary release checkout
- `git clone --depth 1 --branch main ...`
- run `platform/scripts/deploy_vps.sh`
This means:
- the VPS deploy does not see arbitrary local files unless they are committed to the repo or copied to the runtime host by a separate step
- a locally built supplemental vector under local `data/index/` will not magically appear on the VPS
- docs-only pushes are now expected to be filtered by workflow `paths-ignore`; release pushes that should touch production must include runtime-affecting files
If the main DB changes, verify at least:
- `textbook_mineru_fts.db`
- `textbook_chunks.index`
- `textbook_chunks.manifest.json`
If supplemental mapping or supplemental page source changes, rebuild and ship all of:
- `supplemental_textbook_pages.jsonl.gz`
- `supplemental_textbook_pages.manifest.json`
- `supplemental_textbook_pages.index`
- `supplemental_textbook_pages.vector.manifest.json`
If frontend presentation of new data behavior changes, update together:
- `frontend/index.html` cache-buster
- `frontend/assets/version.json`
- frontend code
- backend behavior
- GitHub docs
Do not treat /api/health status=ok alone as release success.
For releases expecting semantic supplemental recall, release validation must also confirm:
- supplemental manifest loaded
- supplemental vectors loaded
- reranker loaded when rerank is required
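A release gate along these lines can be sketched as a pure function over the parsed `/api/health` JSON. The field paths come from the list of health fields in this document; the value types (`"ok"` string, boolean flags), the mapping of "supplemental manifest loaded" to `supplemental.ok`, and the function names are assumptions for illustration.

```python
from typing import Any


def get_field(payload: dict, dotted: str) -> Any:
    """Look up a dotted path such as 'supplemental_vectors.loaded'; None if absent."""
    current: Any = payload
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def release_problems(health: dict, require_reranker: bool = True) -> list[str]:
    """Return reasons the release is not healthy; empty means all gates passed."""
    problems = []
    if get_field(health, "status") != "ok":
        problems.append("status is not ok")
    # status=ok alone is not enough: supplemental assets must be loaded too.
    for dotted in ("supplemental.ok", "supplemental_vectors.loaded"):
        if get_field(health, dotted) is not True:
            problems.append(f"{dotted} is not true")
    if require_reranker and get_field(health, "reranker.loaded") is not True:
        problems.append("reranker.loaded is not true")
    return problems


# Hand-written sample payloads, not captured from production.
healthy = {
    "status": "ok",
    "supplemental": {"ok": True},
    "supplemental_vectors": {"loaded": True},
    "reranker": {"loaded": True},
}
degraded = {
    "status": "ok",
    "supplemental": {"ok": True},
    "supplemental_vectors": {"loaded": False, "reason": "stale"},
    "reranker": {"loaded": True},
}
healthy_problems = release_problems(healthy)
degraded_problems = release_problems(degraded)
```

The `degraded` sample shows the failure mode this rule targets: `status` is `ok` while the supplemental vectors are not loaded.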
Before every rebuild, refactor, deploy, or rollback, check all of the following.
- `git status --short`
- `git rev-parse --short HEAD`
- identify whether the change touches:
  - DB
  - primary FAISS
  - supplemental page index
  - supplemental FAISS
  - page-image mapping
  - frontend version markers
  - deploy scripts
  - AI gateway defaults
- identify whether the release is code-only, data-only, or mixed
Check or regenerate:
- `platform/backend/supplemental_textbook_pages.manifest.json`
- `data/index/textbook_chunks.manifest.json`
- DB row counts from `data/index/textbook_mineru_fts.db`
- `platform/frontend/assets/pages/book_map.json`
- `platform/backend/textbook_version_manifest.json`
- local supplemental vector manifest if the vector exists
Local caliber points to confirm explicitly:
- primary DB books
- primary vector rows
- supplemental manifest books
- supplemental merged-primary books
- supplemental-only visible books
- visible `/api/books` target total
- unresolved books/pages
- edition conflicts
- content-id-missing supplemental books
- blank-title duplicate groups
- if page-image scope changed in the release, whether new supplemental-edition page images were actually regenerated locally rather than only relabeled in metadata
Check live:
- `/api/health`
- `/assets/version.json`
- `/api/books`
- search regression queries on live production
- if any large artifact was shipped separately, confirm the remote size and SHA256 against the intended release source before restart
At minimum, compare production current against local pending for:
- DB row count
- primary vector count
- supplemental manifest counts
- supplemental vector loaded state
- frontend version marker
- page-image scope:
  - `book_map.json` book count
  - whether any supplemental-only editions are intended to gain page images in this release
  - whether that change is reflected both in local page assets and in R2/CDN
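The comparison above can be sketched as a plain dictionary diff. The key names and the values in the example are illustrative assumptions; real values come from the manifests, the DB, and the live endpoints.

```python
def diff_release_state(production, local_pending):
    """Return {key: (production_value, local_value)} for every key that differs."""
    keys = sorted(set(production) | set(local_pending))
    return {
        key: (production.get(key), local_pending.get(key))
        for key in keys
        if production.get(key) != local_pending.get(key)
    }


# Illustrative values only, not real counts from this project.
production = {"db_rows": 100, "frontend_version": "r1", "supplemental_vectors_loaded": False}
local_pending = {"db_rows": 120, "frontend_version": "r1", "supplemental_vectors_loaded": True}

for key, (prod, local) in diff_release_state(production, local_pending).items():
    print(f"{key}: production={prod} local_pending={local}")
```

Every key that appears in the diff should be explainable by the intended release; an unexplained difference is a reason to stop.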
Before release, confirm whether the changed artifacts will actually reach the VPS.
Check:
- `.github/workflows/deploy.yml`
- `platform/scripts/deploy_vps.sh`
- benchmark the current-session transfer options for artifacts larger than about `50M`:
  - direct workstation -> VPS over SSH / `scp` / `rsync`
  - workstation -> R2, then VPS -> `curl`
  - choose based on the current network route and measured throughput; do not hardcode `R2` forever
- source and destination paths for:
  - supplemental page index
  - supplemental page manifest
  - supplemental vector index
  - supplemental vector manifest
- whether the changed asset is:
  - committed into the repo checkout
  - copied separately to the VPS runtime root
  - or not transported at all
Before shipping large artifacts, re-check:
- free disk
- available memory
- swap presence
- `data/index` size
- HF cache size
- Docker storage pressure
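A minimal sketch of the disk part of this re-check, assuming it runs on the target host. The `headroom` factor of 2.0 is an assumption (room for the old and new copy during a swap), not a project rule, and memory, swap, and Docker storage still need their own checks.

```python
import os
import shutil


def free_disk_bytes(path="/"):
    """Free bytes on the filesystem that contains path."""
    return shutil.disk_usage(path).free


def tree_size_bytes(root):
    """Sum of apparent file sizes under root (approximates `du`, ignores symlinks)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def fits_with_headroom(artifact_bytes, free_bytes, headroom=2.0):
    """True if free space covers the artifact size times a safety factor."""
    return free_bytes >= artifact_bytes * headroom
```

For example, `fits_with_headroom(tree_size_bytes("data/index"), free_disk_bytes("/"))` answers whether a full second copy of the index directory would fit before shipping.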
Before release, confirm whether these need updating together:
- `platform/frontend/assets/version.json`
- `platform/frontend/index.html`
- `platform/README.md`
- `platform/docs/runtime_operations_overview.md`
- `platform/docs/data_layer_lineage_memory.md`
Before release, confirm:
- backend AI defaults
- worker AI defaults
- current canonical public AI domain
- image CDN path contract
- R2 upload path contract
After any update, run all relevant checks below.
Run the project-relevant checks, including at least:
- `python3 -m py_compile` for changed Python modules
- `node --check platform/frontend/assets/app.js` if frontend JS changed
- `bash -n platform/scripts/deploy_vps.sh` if deploy script changed
- `git diff --check`
If data changed, verify:
- supplemental manifest values are the intended ones
- supplemental vectors were rebuilt if the supplemental page source changed
- supplemental vector `verify` passes against the current source
- DB / primary FAISS manifest alignment still holds
- book/page/image identity still matches the intended mapping rules
After release, check live:
- `/api/health`
- `/assets/version.json`
- `/api/books`
- representative `/api/page-image` result
- representative AI chat path if AI behavior changed
- `supplemental_vectors.loaded=true` if the release expects supplemental semantic recall
After release, re-run at minimum:
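- `潜热` (latent heat)
- `海面蒸发潜热` (latent heat of sea-surface evaporation)
- `极性` (polarity)
- `极性键` (polar bond)
- `晶体的定义` (definition of a crystal)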
For each query, verify:
- no server error
- relevant top results
- correct subject behavior
- correct edition/book identity
- correct “view original” target
- no meaningless character-split fallback results
After release, confirm:
- frontend version marker matches the release
- cache-buster matches the release
- docs reflect the right production-vs-local wording
- GitHub-visible notes do not describe local-only assets as live
- if the release was mixed code+data, both sides are reflected in the public docs
These commands are the minimal retrieval entry points to pair with the checklists above. Run them from the repo root unless noted otherwise.
git status --short
git rev-parse --short HEAD
python3 -m py_compile platform/backend/main.py \
platform/backend/preflight.py \
platform/scripts/build_supplemental_textbook_index.py \
platform/scripts/build_supplemental_vector_index.py
node --check platform/frontend/assets/app.js
bash -n platform/scripts/deploy_vps.sh
git diff --check
python3 - <<'PY'
import json, sqlite3
from pathlib import Path
root = Path('.').resolve()
man = json.loads((root / 'platform/backend/supplemental_textbook_pages.manifest.json').read_text())
print('supp_books', man.get('books'))
print('supp_pages', man.get('pages'))
print('supp_source_pages', man.get('source_pages'))
print('primary_books', man.get('primary_books'))
print('supp_only_books', man.get('supplemental_only_books'))
print('primary_bound_pages_omitted', man.get('primary_bound_pages_omitted'))
print('primary_bound_page_lookup_misses', man.get('primary_bound_page_lookup_misses'))
print('unresolved_books', man.get('unresolved_books'))
print('unresolved_pages', man.get('unresolved_pages'))
print('edition_conflicts', man.get('edition_conflicts'))
print('cross_source_identity_conflicts', man.get('cross_source_identity_conflicts'))
print('content_id_missing_books', man.get('content_id_missing_books'))
print('blank_title_duplicate_groups', man.get('blank_title_duplicate_groups'))
ver = json.loads((root / 'platform/backend/textbook_version_manifest.json').read_text())
print('primary_manifest_books', ver.get('primary_books'))
print('resolved_primary_books', ver.get('resolved_primary_books'))
print('unresolved_primary_books', ver.get('unresolved_primary_books'))
print('duplicate_primary_identity_groups', ver.get('duplicate_primary_identity_groups'))
print('safe_merge_candidates', len(ver.get('safe_merge_candidates') or []))
con = sqlite3.connect(root / 'data/index/textbook_mineru_fts.db')
cur = con.cursor()
print('db_total', cur.execute("SELECT COUNT(*) FROM chunks").fetchone()[0])
print('db_textbook_runtime', cur.execute("SELECT COUNT(*) FROM chunks WHERE source != 'gaokao'").fetchone()[0])
print('db_gaokao', cur.execute("SELECT COUNT(*) FROM chunks WHERE source = 'gaokao'").fetchone()[0])
print('db_books', cur.execute("SELECT COUNT(DISTINCT book_key) FROM chunks WHERE source != 'gaokao' AND book_key IS NOT NULL AND book_key<>''").fetchone()[0])
PY
/Users/ylsuen/.venv/bin/python platform/scripts/verify_textbook_runtime_data.py
HF_HUB_OFFLINE=1 /Users/ylsuen/textbook_ai_migration/.venv-vector/bin/python \
platform/scripts/build_supplemental_vector_index.py verify \
--source platform/backend/supplemental_textbook_pages.jsonl.gz \
--index data/index/supplemental_textbook_pages.index \
  --manifest data/index/supplemental_textbook_pages.vector.manifest.json
curl -sS https://sun.bdfz.net/api/health | jq
curl -sS https://sun.bdfz.net/assets/version.json | jq
curl -sS https://sun.bdfz.net/api/books | jq '.books | length'
curl -sS 'https://sun.bdfz.net/api/search?q=潜热&source=textbook&limit=10' | jq
curl -sS 'https://sun.bdfz.net/api/search?q=极性&source=textbook&limit=10' | jq
curl -sS 'https://sun.bdfz.net/api/search?q=极性键&source=textbook&limit=10' | jq
curl -sS 'https://sun.bdfz.net/api/search?q=晶体的定义&source=textbook&limit=10' | jq
Use the production host shell to check:
df -h /
free -h
du -sh /root/cross-subject-knowledge/data/index
du -sh /root/cross-subject-knowledge/state/cache/huggingface/hub
docker ps
sed -n '1,220p' platform/.github/workflows/deploy.yml
sed -n '1,320p' platform/scripts/deploy_vps.sh
These failure modes are part of long-term memory and should not be rediscovered from scratch.
Symptom:
- result text is real, but the linked book/page belongs to another edition
- user clicks “view original” and cannot find the text in that book
Root cause:
- supplemental OCR pages were mapped onto the wrong primary `book_key`
- different editions were allowed to share the same runtime book identity
Current guardrail:
- direct `content_id` match first
- edition-aware matching
- unique-title fallback only when safe
- otherwise independent `suppbook:*`
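The guardrail order above can be sketched as a single resolver. The record fields (`content_id`, `title`, `edition`, `supp_id`), the lookup structures, and the definition of "safe" used here (exactly one title candidate and no known edition on at least one side) are assumptions for illustration, not the actual mapping code.

```python
def resolve_book_key(supp, primary_by_content_id, primary_by_title):
    """Resolve a supplemental book to a primary book_key or an independent suppbook key.

    supp: dict with hypothetical fields content_id, title, edition, supp_id.
    primary_by_content_id: {content_id: book_key}
    primary_by_title: {title: [{"book_key": ..., "edition": ...}, ...]}
    """
    # 1. Direct content_id match first.
    content_id = supp.get("content_id")
    if content_id and content_id in primary_by_content_id:
        return primary_by_content_id[content_id]

    candidates = primary_by_title.get(supp.get("title") or "", [])
    edition = supp.get("edition")

    # 2. Edition-aware matching: same title and exactly one primary with the same edition.
    if edition:
        same_edition = [c for c in candidates if c.get("edition") == edition]
        if len(same_edition) == 1:
            return same_edition[0]["book_key"]

    # 3. Unique-title fallback only when safe: one candidate and no known edition conflict.
    if len(candidates) == 1 and (not edition or not candidates[0].get("edition")):
        return candidates[0]["book_key"]

    # 4. Otherwise keep an independent identity so editions never share a runtime book.
    return f"suppbook:{supp['supp_id']}"
```

The important property is the last line: when identity cannot be proven, the supplemental book stays separate instead of being attached to the nearest-looking primary edition.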
Symptom:
- searching a term like `潜热` returns pages that merely contain the characters `潜` and `热` separately
- other subjects get pulled in even though the concept is not present
Root cause:
- over-broad character-level fallback in supplemental recall
Current guardrail:
- do not split user concepts into meaningless single-character fallback just to force results
- results without true term or sentence-level evidence should not survive final filtering
This rule is general. It applies to all queries, not only 潜热.
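A minimal sketch of the final-filter idea, covering only the literal-term case: a result survives only if the whole query term occurs contiguously in its text. The `text` field name is an assumption, and sentence-level evidence for definition queries would need additional logic that this sketch does not attempt.

```python
def has_term_evidence(query, text):
    """True only if the whole query term occurs contiguously in the text."""
    # Whitespace is dropped on both sides so OCR line breaks do not hide a real match.
    term = "".join(query.split())
    return bool(term) and term in "".join(text.split())


def filter_results(query, results, text_field="text"):
    """Keep only results whose text carries true term evidence for the query."""
    return [r for r in results if has_term_evidence(query, r.get(text_field) or "")]
```

With this filter, a page containing `潜` in one word and `热` in another is dropped for the query `潜热`, while a page containing `潜热` itself is kept.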
Symptom:
- some terms such as `极性` error while similar terms such as `极性键` work
Known cause encountered in this round:
- mixed `int` / `str` IDs during sort/merge in hybrid ranking
Current guardrail:
- stable sort identity handling must stay explicit in hybrid and rerank paths
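One way such an error can arise, sketched under assumptions: when two results tie on score, Python falls back to comparing their IDs, and comparing `int` with `str` raises `TypeError`. The ID formats and field names below are hypothetical; the explicit key is one possible form of "stable sort identity handling".

```python
def stable_id_key(result):
    """Sort key that never compares int with str: every ID is normalized to str."""
    return (-float(result["score"]), str(result["id"]))


# Hypothetical merged result list: primary rows use int IDs, supplemental rows use str IDs.
results = [
    {"id": 12, "score": 0.5},
    {"id": "suppbook:a:3", "score": 0.5},
    {"id": 7, "score": 0.9},
]

try:
    # Naive key: the score tie forces a comparison between 12 and "suppbook:a:3".
    sorted(results, key=lambda r: (-r["score"], r["id"]))
except TypeError as exc:
    print("naive sort failed:", exc)

ranked = sorted(results, key=stable_id_key)
print([r["id"] for r in ranked])
```

This also explains why only some queries fail: the error appears only when a score tie happens to pair an `int` ID with a `str` ID.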
- Confirm supplemental manifest reports:
  - `unresolved_books=0`
  - `unresolved_pages=0`
  - `edition_conflicts=0`
- Confirm no duplicate blank-edition book groups are leaking into single-book selector behavior
- Confirm page-image-bound rows do not point at missing primary page maps
- If supplemental vectors were rebuilt, run manifest and source verify against the current page source
- If a release claims new page-image coverage for supplemental editions, confirm those editions have real local page assets and are not merely remapped to a primary edition
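The manifest part of this checklist can be sketched as a gate that reports every field that is missing or not exactly zero. The three field names come from the supplemental manifest as printed by the commands in this document; the function names are illustrative, and the sketch assumes those fields hold integers.

```python
import json
from pathlib import Path

ZERO_FIELDS = ("unresolved_books", "unresolved_pages", "edition_conflicts")


def load_manifest(path):
    """Parse a manifest JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def manifest_blockers(manifest, zero_fields=ZERO_FIELDS):
    """Return {field: value} for every field that is missing or not exactly 0."""
    return {f: manifest.get(f) for f in zero_fields if manifest.get(f) != 0}
```

A missing field is reported as `None` rather than treated as zero, so an old or truncated manifest cannot pass silently.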
At minimum, regression queries must include:
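- `潜热` (latent heat)
- `海面蒸发潜热` (latent heat of sea-surface evaporation)
- `极性` (polarity)
- `极性键` (polar bond)
- `晶体的定义` (definition of a crystal)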
For each one, verify:
- result relevance
- correct subject scope
- correct book identity and edition
- correct “view original” page behavior
- absence of meaningless character-split hits
After deployment:
- check `/api/health`
- check `/assets/version.json`
- confirm supplemental manifest counts are the intended release counts
- confirm supplemental vector loaded state matches the release goal
- if page-image scope changed, sample `/api/page-image` for one newly covered supplemental edition and one still-uncovered supplemental-only edition
- rerun the regression queries above against live production
- if the rollout touches frontend or page-image behavior, confirm the running container image digest is the intended rollback anchor; do not assume `textbook-knowledge:latest` still matches the running container after a manual rollback
- if the rollout touches “查看原文” (“view original”) behavior, verify the built image contains `/app/frontend/assets/pages/book_map.json` and that a representative live search result returns a non-null `page_url`
As of 2026-03-10, the current local worktree still has these release blockers:
- Final supplemental vector rebuild is still in progress and must finish with a successful `verify` against the corrected supplemental page source.
- Even after local build success, the current GitHub Actions deploy path will not automatically move the local supplemental vector from local `data/index/` to the VPS runtime.
- Production is still on the old supplemental manifest and still exhibits old-query behavior such as noisy `潜热` results and the `极性` failure path.
- Frontend version markers and GitHub-facing docs must be synchronized with the actual release contents before deployment.
Do not treat the local rebuild as deployable until all four blockers are cleared.
Short-term recommended stack remains:
- SQLite FTS5
- FAISS
- CrossEncoder reranker
- host-mounted runtime assets
The next meaningful improvement is not “move everything to a new database first”. It is:
- versioned runtime data artifacts
- explicit artifact transport into production
- sentence-level evidence extraction for definition queries
- supplemental FAISS fully integrated and deployed
- automated regression evaluation before cutover
Prioritization rule:
- for this project, data identity correctness, runtime asset consistency, and release verification come before broad framework rewrites
- do not let generic advice such as “split the monolith”, “migrate to React”, or “move to PostgreSQL” outrank concrete live risks like:
  - wrong edition/book binding
  - supplemental asset drift between local and production
  - stale frontend version markers
  - missing vector transport to VPS
  - broken live query regression cases
Recommended future release model:
- build versioned artifact bundles off-box
- publish bundle checksums
- make VPS deploy pull a specific data-artifact version
- keep one machine-readable release manifest that includes:
  - DB sha256
  - primary FAISS sha256
  - supplemental page index sha256
  - supplemental vector sha256
  - row counts
  - book counts
  - page counts
- separate textbook registry identity from page-image mapping
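A minimal sketch of what one machine-readable release manifest could look like, and how a deploy step could verify it. The JSON layout, labels, and function names are assumptions for illustration; only the listed contents (checksums and counts) come from the recommendation above.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path):
    """SHA256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_release_manifest(version, artifacts, counts):
    """artifacts: {label: path}; counts: {name: int} supplied by the build step."""
    return {
        "version": version,
        "artifacts": {
            label: {"path": str(path), "sha256": sha256_of(path)}
            for label, path in sorted(artifacts.items())
        },
        "counts": dict(sorted(counts.items())),
    }


def mismatched_artifacts(manifest, root="."):
    """Labels whose file under root is missing or does not match the manifest digest."""
    bad = []
    for label, entry in manifest["artifacts"].items():
        path = Path(root) / entry["path"]
        if not path.is_file() or sha256_of(path) != entry["sha256"]:
            bad.append(label)
    return bad


def dump_manifest(manifest):
    """Serialize deterministically so the manifest itself can be diffed or hashed."""
    return json.dumps(manifest, ensure_ascii=False, indent=2, sort_keys=True)
```

The build machine would call `build_release_manifest` once per bundle; the VPS deploy would refuse to restart while `mismatched_artifacts` returns anything.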
Before any future update, read this document and explicitly confirm:
- target state: local vs production
- affected layer: DB / FAISS / supplemental / page images / frontend / AI gateway
- affected artifacts to rebuild
- required version/file sync
- release verification queries
- rollback artifact or previous release anchor
If those six items are not written down, the change is not ready.
- Public CDN/R2 verification: representative high-school geography pages such as `https://img.rdfzer.com/pages/890c235d20c4/p4.webp` and `https://img.rdfzer.com/pages/5d7ca2682888/p103.webp` return `HTTP 200`; the page images already exist remotely
- Live production symptom remains reproducible under frontend version `2026.03.26-r33`: `curl -sS 'https://sun.bdfz.net/api/search?q=自然灾害&source=textbook&subject=地理&phase=高中&limit=10'` returns high-school geography supplemental rows with `page_url=null`
- Root cause: the deployed image is missing the `10` `suppbook:*` entries for the high-school geography `人教版` (People's Education Press edition) textbook+atlas set, even though the OCR rows and CDN pages already exist
- Local verified target state after the mapping restore:
  - `primary_db_books=117`
  - `book_map_books=127`
  - `book_map_primary_books=117`
  - `book_map_supplemental_books=10`
  - `supplemental_manifest.pages=2843`
  - `supplemental_jsonl.book_map_key_rows=714`
- Release class: this is a page-mapping rollout, not a DB rebuild, FAISS rebuild, or R2 upload rollout
- Production access warning: workstation SSH to `sun.bdfz.net` / `23.19.231.173` is currently not healthy for manual inspection from this environment; the server closes the connection before shell. Do not assume manual VPS login works until SSH access is repaired or revalidated
- Deploy-path warning: GitHub Actions uses repository secret `VPS_SSH_KEY`; that trust path is separate from the workstation SSH config and separate from any VPS root password change. Verify the active deploy key before relying on an automatic rollout
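The `book_map` counts in the target state above (117 primary + 10 supplemental = 127) can be re-derived with a sketch like the following. It assumes `book_map.json` is a top-level JSON object keyed by book key, with supplemental keys starting with `suppbook:`; that layout is an assumption and should be checked against the real file.

```python
def book_map_counts(book_map):
    """Count primary vs supplemental keys in a parsed book_map.json object."""
    keys = list(book_map)
    supplemental = [k for k in keys if k.startswith("suppbook:")]
    return {
        "book_map_books": len(keys),
        "book_map_primary_books": len(keys) - len(supplemental),
        "book_map_supplemental_books": len(supplemental),
    }
```

Running it against the `book_map.json` inside the built image, not only the local checkout, is what distinguishes a restored mapping from a mapping that never reached production.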
- Incident: a VPS-side manual release was built from the stale runtime repo instead of a clean local release source, and the resulting image lost `frontend/assets/pages/book_map.json`
- User-facing symptom: live search results degraded to `page_url=null`, so main-site “查看原文” (“view original”) disappeared even though the frontend button code still existed
- Misleading rollback detail: the previously accepted rollback image also lacked `book_map.json`, so “roll back to the accepted image” was not enough to restore page images
- Fix that worked: ship a clean off-box release bundle containing the current frontend plus `frontend/assets/pages/book_map.json`, then deploy from that temporary release checkout
- Guardrail for future manual work: when `latest` may have drifted, tag the running container image digest explicitly before cutover and verify `book_map.json` inside the built image before calling the release good