- `just check` - Check code (runs `cargo check --all-targets --all-features`)
- `just test` - Run all tests (runs `cargo test --all-targets --all-features`)
- `just fix` - Auto-fix code (runs `cargo fix --allow-dirty --allow-staged`)
- `just clean` - Clean build artifacts
- Edition: Rust 2024
- Formatting:
  - ALWAYS inline format args (e.g. `info!("inline {display} {debug:?}")`)
  - ALWAYS prefer guard clauses and early returns/continues/breaks to nested if/else
- Imports: Use workspace dependencies when available
- Serialization: Serde with camelCase JSON naming via `#[serde(rename_all = "camelCase")]`
- Error Handling: Use wherror for custom error types and the `Result<T, WritingsError>` alias
- Documentation: Comprehensive doc comments with examples
- Tests: Inline in modules with `#[cfg(test)]`; use `#[test]` attributes
- Features: Use feature flags for optional functionality (embed-all, poem, utoipa, etc.)
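A minimal illustration of the two formatting rules above (guard clauses and inline format args); the function name is hypothetical:

```rust
// Hypothetical helper illustrating the style rules above: a guard clause
// with an early return, and inline format args ({count}, not "{}", count).
fn describe_items(count: usize) -> String {
    // Guard clause: handle the empty case up front instead of nesting if/else.
    if count == 0 {
        return "no items".to_string();
    }
    format!("{count} item(s)")
}
```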
The writings are parsed from bahai.org HTML sources using a visitor pattern. Each writing type (HiddenWords, Gleanings, Prayers, etc.) implements the `WritingsVisitor` trait:
- Core Trait: `WritingsVisitor` with a `visit()` method returning `VisitorAction` (`VisitChildren`, `SkipChildren`, `Stop`)
- HTML Parsing: Uses the `scraper` crate with CSS selectors to navigate the DOM structure
- Class-based Matching: Visitors match HTML elements by CSS classes (e.g., `zd hb` for prologues, `dd zd` for content)
- State Management: Each visitor maintains parsing state (current section, paragraph count, etc.)
- Citation Handling: Extracts and resolves footnotes/endnotes using `CitationText` and `resolve_citations()`
- Text Extraction: The `ElementExt` trait provides `trimmed_text()` methods for clean text extraction with citation support
- Validation: Each visitor defines `EXPECTED_COUNT` and validates parsed content against known totals
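The trait contract described above might be sketched roughly as follows. This is a simplified stand-in, not the crate's real API: the element type here is a toy struct (the real trait operates on `scraper` element references), and the toy visitor, its URL, and its counts are placeholders.

```rust
// Simplified sketch of the visitor contract described above.
#[derive(Debug, PartialEq)]
enum VisitorAction {
    VisitChildren, // descend into this element's children
    SkipChildren,  // ignore the subtree, continue with siblings
    Stop,          // terminate traversal entirely
}

struct Element {
    classes: String, // the element's class attribute, e.g. "dd zd"
    text: String,
}

trait WritingsVisitor {
    const URL: &'static str;
    const EXPECTED_COUNT: usize;
    fn visit(&mut self, element: &Element) -> VisitorAction;
}

// A toy visitor that collects "dd zd" content and stops at "wf".
struct ToyVisitor {
    texts: Vec<String>,
}

impl WritingsVisitor for ToyVisitor {
    const URL: &'static str = "https://www.bahai.org/"; // placeholder
    const EXPECTED_COUNT: usize = 0; // placeholder
    fn visit(&mut self, element: &Element) -> VisitorAction {
        if element.classes == "wf" {
            return VisitorAction::Stop;
        }
        if element.classes == "dd zd" {
            self.texts.push(element.text.clone());
            return VisitorAction::SkipChildren;
        }
        VisitorAction::VisitChildren
    }
}
```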
- `"wf"` - Footer/stop condition (Gleanings, Meditations)
- `"c q"` - Roman numeral section headers (Gleanings, Meditations)
- `"hb"` - Author/header elements (appears in Hidden Words, Prayers, CDB)
- `"zd"` - Content sections (Hidden Words: `"zd hb"`, `"dd zd hb"`, `"dd zd"`)
- `"ub"` - Section headers (Prayers: `"ub c l"`, CDB: `"ub w kf"`)
Hidden Words:
- `"w"` - Top invocation text
- `"zd hb"` - Prologue/epilogue sections
- `"dd zd hb"` - Prelude text (special prefaces)
- `"dd zd"` - Main hidden word content
Gleanings & Meditations (Similar Structure):
- `"wf"` - Footer (stop parsing)
- `"c q"` - Roman numeral section headers (I, II, III, etc.)
Prayers (Most Complex Structure):
- `"hb.ac"` / `"hb ac"` - Author attribution
- `"bf wf"` - Endnotes (stop condition)
- `"e"` - Title elements
- `"g c"` - Prayer kind/category
- `"ub c l"` - Section headers
- `"xc jb c kf z nb zd ub"` - Subsection headers
- `"c kf z nb zd ub"` - Teaching sections
- `"cb"` / `"z"` - Instructional text
Call of Divine Beloved (Poetry Structure):
- `"ic .g"` - Work titles
- `"ic .hb"` / `"ic .j"` - Work subtitles
- `"a.td"` - Paragraph numbers
- `"ub w kf"` - Invocation text
- `"span.dd"` - Poetry containers
- `"span.ce"` - Poetry lines
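Class combinations like these are typically captured as `LazyLock<ClassList>` constants (see the implementation steps below). A rough stdlib-only sketch, where `ClassList` is just a `Vec<&str>` standing in for the crate's real type and the constant names are assumptions:

```rust
use std::sync::LazyLock;

// Stand-in for the crate's ClassList type: a list of class names.
type ClassList = Vec<&'static str>;

// Hypothetical constants for two of the patterns listed above.
static HIDDEN_WORD_CONTENT: LazyLock<ClassList> =
    LazyLock::new(|| "dd zd".split_whitespace().collect());
static PRAYER_SECTION_HEADER: LazyLock<ClassList> =
    LazyLock::new(|| "ub c l".split_whitespace().collect());
```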
- Section Headers: Look for `"c q"` (roman numerals) or `"ub"` variants
- Content Paragraphs: Usually `p` elements with specific class combinations
- Stop Conditions: Typically `"wf"` (footer) or `"bf wf"` (endnotes)
- Author Attribution: `"hb"` variants, often combined with `"ac"`
- Special Text: `"zd"` variants for prologues/epilogues, `"w"` for invocations
- Poetry: Look for `"span.dd"` containers with `"span.ce"` lines
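Matching these patterns generally means comparing class sets rather than raw attribute strings, since an element may carry extra classes. A small stdlib sketch of that idea (the function name is an assumption):

```rust
use std::collections::HashSet;

// True if every wanted class appears in the element's class attribute,
// regardless of order or extra classes.
fn has_classes(class_attr: &str, wanted: &[&str]) -> bool {
    let present: HashSet<&str> = class_attr.split_whitespace().collect();
    wanted.iter().all(|c| present.contains(c))
}
```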
- Download HTML: Fetch the target document from bahai.org and save it to `writings/html/`
- Extract CSS Classes: Use `grep -o 'class="[^"]*"' file.html | sort | uniq` to identify all classes
- Map Document Structure: Identify key structural elements:
  - Navigation/TOC: Usually `nav.gc` with a `ul` structure
  - Main Content: Look for `div.dd` containers or similar content wrappers
  - Content Start: Beginning of actual content (excluding preface, introductions, etc.)
  - Section Headers: Search for patterns like `"c q"`, `"ub c l"`, etc.
  - Content Paragraphs: Find `p` elements with meaningful class combinations
  - Stop Conditions: End of content (usually just before endnotes/footers)
  - Reference IDs: Extract `a.sf` elements with `id` attributes for paragraph references
  - Numbering Systems: Document-specific numbering (roman numerals, paragraph numbers, etc.)
  - Special Elements: Identify citations (`sup` with `a.sf`), poetry structures, etc.
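The class-extraction step can also be done in Rust while iterating during development. A rough stdlib-only equivalent of the grep command above (a real pass would parse the HTML properly, e.g. with `scraper`):

```rust
// Collect every distinct class="..." value from raw HTML, sorted,
// roughly mirroring: grep -o 'class="[^"]*"' file.html | sort | uniq
fn extract_class_attrs(html: &str) -> Vec<String> {
    let mut out = Vec::new();
    let mut rest = html;
    while let Some(start) = rest.find("class=\"") {
        rest = &rest[start + 7..]; // skip past `class="`
        let Some(end) = rest.find('"') else { break };
        out.push(rest[..end].to_string());
        rest = &rest[end + 1..];
    }
    out.sort();
    out.dedup();
    out
}
```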
- Create Struct: Define the Rust struct that will hold the parsed data
- Implement WritingsVisitor Trait:
  - Set `URL` and `EXPECTED_COUNT`
  - Implement the `visit()` method with pattern matching
  - Add state-management fields (counters, current section, etc.)
- CSS Class Constants: Define `LazyLock<ClassList>` constants for each pattern
- State Machine Logic: Handle document flow:
  - Start Conditions: When to begin parsing (after the title, first section, etc.)
  - Content Extraction: How to extract text and metadata
  - Transition Logic: When to move between sections/works
  - Stop Conditions: When to terminate parsing
- Reference ID Extraction: Always use `self.get_ref_id(element)` for `a.sf` elements
- Text Extraction: Use `element.trimmed_text(depth, strip_newlines)` with an appropriate depth
- Citation Handling: For complex documents, implement citation extraction and resolution
- Validation: Include `EXPECTED_COUNT` validation and test against known text samples
- Error Handling: Use `panic!` for parsing errors during development; refine for production
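Putting the state-machine steps above together, a stripped-down flow might look like this. It is a sketch over plain `(class, text)` pairs, with class names taken from the tables above and every other name hypothetical:

```rust
// Hypothetical state machine mirroring the start/extract/transition/stop
// logic described above, for a Gleanings-style document.
#[derive(Default)]
struct SectionParser {
    current_section: Option<String>,
    paragraphs: Vec<(String, String)>, // (section, text)
    stopped: bool,
}

impl SectionParser {
    fn feed(&mut self, classes: &str, text: &str) {
        // Stop condition: footer reached; ignore everything after it.
        if self.stopped || classes == "wf" {
            self.stopped = true;
            return;
        }
        // Transition: a "c q" element starts a new roman-numeral section.
        if classes == "c q" {
            self.current_section = Some(text.to_string());
            return;
        }
        // Start condition / content extraction: only collect once a
        // section has begun.
        let Some(section) = &self.current_section else { return };
        self.paragraphs.push((section.clone(), text.to_string()));
    }
}
```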
- Unit Tests: Create a `#[tokio::test]` with `test_visitor::<Visitor>(EXPECTED_TEXTS).await`
- Expected Texts: Include 5-10 representative text samples from the document
- Count Validation: Ensure `EXPECTED_COUNT` matches the number of parsed items
- Integration: Add to the workspace and test with `just test`
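The essence of the `test_visitor` pattern, minus the async plumbing, is checking parsed output against representative samples. A synchronous sketch of that check (the function name is an assumption):

```rust
// True if every expected sample text occurs somewhere in the parsed output.
fn contains_expected(parsed: &[String], expected: &[&str]) -> bool {
    expected.iter().all(|s| parsed.iter().any(|p| p.contains(s)))
}
```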
- Gleanings/Meditations: Simple structure with roman numeral sections and paragraph counting
- Hidden Words: Complex state management with prelude/invocation tracking and part transitions
- Prayers: Most complex, with nested sections, author detection, and citation resolution
- CDB: Poetry-specific, with line-by-line parsing and work title detection
- Common Pattern: All visitors use `VisitorAction` to control traversal flow
- Text Processing: `trimmed_text()` handles citation extraction and clean text normalization
- State Management: Each visitor maintains parsing state specific to its document structure