feat(accent): reading override layer + streaming MarkAccent endpoint #47
torrid-fish wants to merge 3 commits into
…endaku fold

The greedy aligner had two failure modes that cascaded across whole sentences: a numeric anchor that over-consumed when Yahoo and OJAD disagreed on the phrase boundary, and a +1 fallback path that turned a single mismatch into type-0 fallback for every downstream token.

Replaces it with a global DP over (yahoo_token, ojad_entry) pairs: each Yahoo token consumes k ∈ [0, K_MAX] contiguous OJAD entries, with per-token cost computed from shape (punct/numeric/kana) and, for kana tokens, edit distance over rendaku-folded strings. Sub cost (0.4) is lower than ins/del (1.0), so the DP prefers same-length spans with substitutions over shorter spans with deletions. This fixes the case where OJAD's `う` from `等→とう` leaked onto the next token.

Adds a voicing-fold table so Yahoo's dictionary-form readings (ふんかん) align against OJAD's pronounced readings with rendaku (ぷんかん). All comparisons run under this fold; ぱ/ば/ぷ/ぶ all alias to は/ふ.

Refs #47.
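A minimal sketch of the span DP described in this commit message, not the code in `api/accent/align.py`. The fold table here covers every dakuten/handakuten row (the real table may be narrower), `K_MAX = 3` is an assumed value, and the shape cost for punctuation and numeric tokens is left out, so only the kana edit-distance term is shown. The names `fold`, `edit_cost`, and `align` are illustrative.

```python
# Assumed fold table: voiced / semi-voiced kana collapse to their unvoiced base.
_VOICED = "がぎぐげござじずぜぞだぢづでどばびぶべぼぱぴぷぺぽ"
_BASE = "かきくけこさしすせそたちつてとはひふへほはひふへほ"
_FOLD = str.maketrans(_VOICED, _BASE)

SUB, INDEL = 0.4, 1.0   # costs quoted in the commit message
K_MAX = 3               # assumption: max OJAD entries one Yahoo token may consume


def fold(kana: str) -> str:
    """Strip voicing so ふんかん and ぷんかん compare equal."""
    return kana.translate(_FOLD)


def edit_cost(a: str, b: str) -> float:
    """Weighted edit distance over folded strings (sub=0.4, ins/del=1.0)."""
    a, b = fold(a), fold(b)
    prev = [j * INDEL for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * INDEL]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + INDEL,                        # delete from a
                           cur[j - 1] + INDEL,                     # insert into a
                           prev[j - 1] + (0.0 if ca == cb else SUB)))
        prev = cur
    return prev[-1]


def align(yahoo: list[str], ojad: list[str]) -> list[tuple[int, int]]:
    """Return, per Yahoo token, the [start, end) span of OJAD entries it consumes."""
    n, m = len(yahoo), len(ojad)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[0] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(m + 1):
            for k in range(min(K_MAX, j) + 1):
                if cost[i - 1][j - k] == inf:
                    continue
                c = cost[i - 1][j - k] + edit_cost(yahoo[i - 1], "".join(ojad[j - k:j]))
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, k
    if cost[n][m] == inf:
        raise ValueError("no alignment within K_MAX")
    spans, j = [], m
    for i in range(n, 0, -1):       # walk the back-pointers from the end
        k = back[i][j]
        spans.append((j - k, j))
        j -= k
    return spans[::-1]


print(align(["さん", "ふんかん"], ["さん", "ぷん", "かん"]))  # → [(0, 1), (1, 3)]
```

Because the optimum is chosen over the whole sentence, a single mismatched token raises the total cost by a bounded amount instead of shifting every later span.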
Add api/accent/reading_overrides.py: a context-blind correction layer sitting between Yahoo Furigana and OJAD alignment. Each override is a regex on the concatenated surface text plus the replacement tokens that should appear instead.

Covers:
- 曜日 brackets: (月)/(月) → げつ, (土) → ど, etc. for all 7 weekdays.
- All 31 day-of-month readings: 1日 → ついたち (atamadaka), 5日 → いつか, 14日 → じゅうよっか, 20日 → はつか, etc.
- N日間 durations 1-31: 1日間 → いちにちかん (NOT ついたちかん, since the 1st-of-month reading is impossible for a duration), 7日間 → しちにちかん (modern technical writing preference over なのかかん).
- 20歳 / 二十歳 / 20才 → はたち (the only irregular age reading).

Patterns accept arabic / full-width / kanji numeral variants of the same N, so `3月5日(土)` / `3月5日(土)` / `三月五日(土)` all trigger the same overrides.

Order of overrides matters: the duration list precedes the date list so `N日間` wins over `N日` at the same start (longer match breaks ties in _collect_matches).

apply_furigana_overrides runs BEFORE align_accent so merged spans like `5日→いつか` reach OJAD as a single token whose furigana matches OJAD's phrase reading (the numeric-anchor logic in align_accent otherwise cascade-fails because numeric tokens lack any Yahoo furigana). apply_accent_overrides runs AFTER align to re-stamp both furigana and accent on the same matched spans, so the response is consistent.

Adds URL preprocessing: each https?:// URL is swapped for the placeholder "URLPLACEHOLDER" before the pipeline runs (Yahoo fragments URLs across several alphabet tokens, and OJAD's phrasing scraper produces noise for Latin punctuation runs; both drag alignment off the rails). Placeholders are swapped back for the originals, in order, after alignment. The URL body stops at whitespace, any Japanese char, or `,()<>[]"'` so embedded URLs strip cleanly.

Adds a non-Japanese short-circuit: if (after URL stripping) the chunk contains no hiragana / katakana / CJK ideograph, skip Yahoo + OJAD entirely and echo the chunk back as a single token.
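The URL swap and the short-circuit could look roughly like the sketch below. It is an illustration under assumptions, not the shipped code: "Japanese char" is approximated by the hiragana, katakana, and CJK ideograph blocks, `mark_chunk` is a made-up name, and `pipeline` is a stand-in callable that takes the stripped text and returns a list of surface strings.

```python
import re

PLACEHOLDER = "URLPLACEHOLDER"

# Assumption: hiragana + katakana + CJK ideograph blocks stand in for "Japanese char".
_JP = "\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff"
# Characters that end a URL body: whitespace and ,()<>[]"'
_STOP = r"\s,()<>\[\]" + "\"'"

URL_RE = re.compile("https?://[^" + _STOP + _JP + "]+")
JP_RE = re.compile("[" + _JP + "]")


def strip_urls(text: str) -> tuple[str, list[str]]:
    """Swap each URL for the placeholder and remember the originals in order."""
    urls: list[str] = []

    def _swap(match: re.Match) -> str:
        urls.append(match.group(0))
        return PLACEHOLDER

    return URL_RE.sub(_swap, text), urls


def has_japanese(text: str) -> bool:
    return JP_RE.search(text) is not None


def mark_chunk(chunk: str, pipeline) -> list[str]:
    """Run `pipeline` only when the URL-stripped chunk still contains Japanese."""
    stripped, urls = strip_urls(chunk)
    if not has_japanese(stripped):
        return [chunk]                      # echo back as a single token
    pending = iter(urls)                    # placeholders are restored in order
    return [re.sub(PLACEHOLDER, lambda _m: next(pending), token)
            for token in pipeline(stripped)]


# str.split plays the role of the real Yahoo + OJAD pipeline here.
print(mark_chunk("これは https://a.test/x です", str.split))
# → ['これは', 'https://a.test/x', 'です']
```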
This short-circuit lets pure-URL / pure-English lines stream through cheaply.

Also adds stream_accent_chunks() to pipeline.py as a helper used by the streaming endpoint added in the next commit. It splits the input on \n, then on full-width sentence terminators (。!?.): long paragraphs degrade OJAD's phrasing predictor, and parallelising across sentences caps the latency. In-flight work is bounded by a semaphore (concurrency=4) because OJAD's u-tokyo backend falls over with 30+ parallel scrapes.

main.py docstring updated to reflect /MarkAccent/stream/.

Refs #47.
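As a rough illustration of the fan-out described above (not the actual `stream_accent_chunks`), the sketch below splits text into (line, sentence) chunks and runs a worker over them under a bounded semaphore, yielding results in input order. `split_chunks`, `stream_chunks`, and the `worker` coroutine are assumed names, and the terminator set is assumed to be the full-width characters 。!?..

```python
import asyncio
import re

# Assumption: a sentence is a run of non-terminators plus its trailing terminators.
_SENTENCE = re.compile(r"[^。!?.]+[。!?.]*")


def split_chunks(text: str) -> list[tuple[int, int, str]]:
    """Split on newlines, then on sentence terminators: (line_idx, sub_idx, sentence)."""
    return [
        (line_idx, sub_idx, sentence)
        for line_idx, line in enumerate(text.split("\n"))
        for sub_idx, sentence in enumerate(_SENTENCE.findall(line))
    ]


async def stream_chunks(text: str, worker, concurrency: int = 4):
    """Yield (line_idx, sub_idx, result) in input order with bounded parallelism."""
    sem = asyncio.Semaphore(concurrency)

    async def guarded(sentence: str):
        async with sem:                      # at most `concurrency` workers in flight
            return await worker(sentence)

    chunks = split_chunks(text)
    tasks = [asyncio.create_task(guarded(sentence)) for _, _, sentence in chunks]
    for (line_idx, sub_idx, _), task in zip(chunks, tasks):
        yield line_idx, sub_idx, await task  # awaiting in order keeps document order


async def _demo() -> None:
    async def fake_worker(sentence: str) -> int:
        await asyncio.sleep(0.01)            # stands in for the Yahoo + OJAD round trip
        return len(sentence)

    async for item in stream_chunks("今日は晴れ。明日は?\n以上", fake_worker):
        print(item)


asyncio.run(_demo())
```

All tasks start immediately but wait on the semaphore, so later sentences are computed while earlier ones are being emitted.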
Add a streaming variant of /MarkAccent/ that processes the input as a
sequence of (line, sentence) chunks and emits one NDJSON object per
chunk in input order. Each line carries `{"chunk": line_idx,
"subchunk": sub_idx, ...AccentResponse}` so clients can render output
incrementally while keeping document position. Underlying chunk-fanout
and concurrency limiting live in pipeline.stream_accent_chunks; the
route is a thin StreamingResponse wrapper.
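A small sketch of the wire format this implies, using only the standard library. The `status` / `result` / `error` fields follow the shape given in the PR description; the helper names `to_ndjson_line` and `parse_ndjson` are illustrative and do not come from the codebase.

```python
import json
from typing import Any, Iterator


def to_ndjson_line(line_idx: int, sub_idx: int, response: dict[str, Any]) -> str:
    """Serialise one chunk result as a single newline-terminated JSON object."""
    payload = {"chunk": line_idx, "subchunk": sub_idx, **response}
    # ensure_ascii=False keeps kana readable on the wire
    return json.dumps(payload, ensure_ascii=False) + "\n"


def parse_ndjson(body: str) -> Iterator[dict[str, Any]]:
    """Client side: decode each non-empty line independently as it arrives."""
    for raw in body.splitlines():
        if raw.strip():
            yield json.loads(raw)


body = (
    to_ndjson_line(0, 0, {"status": 200, "result": [], "error": None})
    + to_ndjson_line(0, 1, {"status": 500, "result": [], "error": "timeout"})
)
for obj in parse_ndjson(body):
    print(obj["chunk"], obj["subchunk"], obj["status"])
```

Because every line is a complete JSON document, a client can render each chunk as soon as its line arrives and use `chunk` / `subchunk` to place it in the document.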
Streaming benefits compound: OJAD's phrasing predictor degrades on
long inputs (a single misaligned mora cascades across the paragraph),
so per-sentence chunks both stay short enough for OJAD to handle and
fan out under the bounded semaphore.
Also adds test.sh — a small bash smoke-test helper that POSTs a sample
text to either /MarkAccent/ or /MarkFurigana/ and pretty-prints the
per-moji (surface|furigana|accent_marking_type) rows. STREAM=1 switches
to the streaming endpoint, ENDPOINT= picks which router. Useful while
iterating on overrides; not wired into CI.
.gitignore adds data/ and output/ for ad-hoc test fixtures we don't
want committed.
Refs #47.
b87ebe9 to 801aff4
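Rebased onto PR #52 (refactor/accent-package). The previous monolithic api/accent_marker.py content is redistributed across the new package layout.
3 commits on top of refactor/accent-package. Previous tip b87ebe9 reachable via reflog if needed.
Verified locally: ruff format --check, ruff check, and mypy pass; the date and weekday overrides fire on 3月5日(土); /MarkAccent/stream/ returns NDJSON.
PR body still describes the old monolithic structure; will refresh if you want.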
Tighten the no-behavior-change refactor based on Copilot's review on PR #52.

Active findings:
- pipeline.py: rename unused `ojad_surface` to `_ojad_surface` so F841 catches it if Ruff's unused-binding rule ever lands; the OJAD echo string isn't consumed here.
- furigana.py: wrap `response.json()` in try/except. The docstring promised malformed payloads would surface via the FuriganaResponse envelope, but an invalid Content-Type / non-JSON body would have raised through. Catch ValueError and return a 500 envelope.
- ojad.py: switch `raise e` -> bare `raise` and `logger.error(f"...")` -> `logger.exception(...)` to preserve the original traceback.
- models.py: "describe" -> "describes" (x3 occurrences); "givent" -> "given".

Low-confidence findings also addressed:
- align.py module docstring used to claim `punctuation_marks`, `skip_marks`, and `clean_query` are consumed by alignment. They aren't; they're carried over from the pre-refactor module for the downstream PR #47 to use. Reword to reflect that.
- clean_query docstring overclaimed punctuation stripping; it only filters ASCII letters. Reword + rename the local comprehension var from `chr` to `char` to stop shadowing the builtin.

Verified:
  uv run ruff check api/accent/ main.py            # all passed
  uv run ruff format --check api/accent/ main.py   # 8/8 formatted
  uv run mypy api/accent/ main.py                  # 8 files, no issues
  POST /api/MarkAccent/ + /MarkFurigana/ routes still register

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two cleanups from Copilot's follow-up review on the fix commit:
- README mermaid diagram labels `align_accent` as "DP alignment", but the surrounding prose (lines 99/107) correctly calls it "single-pass greedy". DP is a future PR (#47). Rename the node to "Greedy alignment" so the diagram matches the implementation.
- `align.py:122` had a comment typo "regard as punchutation" → "regarded as punctuation".

No code changes; docstring / comment / diagram only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
801aff4 to c9b321d
🛡️ PR Quality Check Summary
✅ PR Title: Passed (Length: 68/75, Format: OK).
Closing in favour of #51, which now targets |
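Purpose
Adds a regex-based override layer to correct Yahoo Furigana / OJAD errors on context-sensitive readings such as dates, weekdays, durations, and ages. Also upgrades `/api/MarkAccent/` with a streaming variant and merges the furigana endpoint in so both are maintained in one place.

Rebase update (2026-05-21)
Rebased onto PR #52 (`refactor/accent-package`). The content that used to be piled into `api/accent_marker.py` has been redistributed according to the new package layout:
- Reading override layer → `api/accent/reading_overrides.py`
- Streaming endpoint (`/MarkAccent/stream/`) → `api/accent/routes.py` + helper in `api/accent/pipeline.py`
- DP alignment → `api/accent/align.py`
- URL / non-Japanese preprocessing → `api/accent/pipeline.py`

3 commits on top of `refactor/accent-package`. Previous tip `b87ebe9` reachable via reflog if needed. Old base (`feat/docker-compose`) replaced with `refactor/accent-package`.

Verified locally:
- `ruff format --check`, `ruff check`, `mypy` all pass on 9 source files
- `3月5日(土)` → date override fires (いつか), weekday bracket override fires (ど)
- `/MarkAccent/stream/` returns NDJSON, one object per line

The original description below still documents the design intent (algorithm, rule tables, streaming chunking strategy); for file paths, refer to the new layout above.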
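Method / implementation notes

1. Reading override layer (now `api/accent/reading_overrides.py`)
Yahoo Furigana has no contextual judgement for dates and weekdays (`5日` → にち, `(土)` → つち), and OJAD often misaligns numeric tokens. This adds a regex override layer that runs twice: once after Yahoo and once after OJAD alignment.
- `FuriganaOverride(pattern, replacements, description)` + `ReplacementToken(furigana, surface, accent)`; the engine is fully generic and domain-agnostic.
- `_collect_matches` collects all hits, sorts them by `(start, -length)`, and drops overlaps; "longer match wins at the same position" naturally lets `N日間` (3-4 chars) override `N日` (2-3 chars).

Rules covered:
- Dates (`_date_overrides`): 1-10日, 14日, 20日, 24日 get the correct irregular readings (ついたち, ふつか, …, はつか); 11-31日 get fixed regular readings, which sidesteps Yahoo returning no furigana for numeric tokens.
- Weekdays (`_day_of_week_overrides`): a single weekday kanji inside brackets.
- Durations (`_duration_overrides`): prevents Yahoo from splitting `1日間` into `[1, 日間]`, which leaves the surface reading as `1にちかん`. `7日間` uses しちにちかん (modern preference); `1日間` uses いちにちかん (it must not collide with ついたち).
- Age (`_age_overrides`): はたち (atamadaka).

2. Numeral-variant helper
`_int_to_kanji(n)` + `_numeric_pattern(n)` expand an integer into its three written forms (arabic, full-width, kanji) automatically, so a new N-prefixed rule only needs `(n, reading)` instead of a hand-written `(?:kanji|full-width|half-width)` alternation. This also honors `feedback_japanese_text_variants`: every JP regex covers all three variants.

3. Alignment algorithm upgrade (now `api/accent/align.py`)
`align_accent` replaces the original greedy alignment, fixing the problem where one misalignment in a long paragraph contaminated every subsequent token.

4. Streaming endpoint `/api/MarkAccent/stream/`
- Input is split into chunks on `\n` and full-width sentence terminators (。/!/?); each chunk calls Yahoo + OJAD independently.
- Each NDJSON line has the shape `{"chunk": N, "subchunk": M, "status": ..., "result": [...], "error": ...}`.
- A failed chunk is returned with `status=500` and the stream continues.

5. furigana → accent merge
- Merges `api/furigana_marker.py` and the `/api/MarkFurigana/` route into the accent package. All internal callers that need furigana now go through `_fetch_yahoo_raw` + `apply_furigana_overrides` (now under `api/accent/`).
- `_fetch_yahoo_raw` now returns `list[WordResult] | _YahooFetchError` (a frozen dataclass sentinel), so an outer try/except no longer swallows a 408 timeout into an opaque 500.
- `config/furigana_overrides.py` → `api/accent/reading_overrides.py` (`git mv`, history preserved): `config/` should hold real config only, not transformation logic.

6. Developer tooling
- `test.sh`: local API smoke test; `STREAM=1 ./test.sh` exercises the streaming endpoint, with per-line `(surface|furigana|accent_marking_type)` output.
- `.gitignore`: excludes `data/` and `output/` so test fixtures stay out of git.

Diff scope
3 commits ahead of `refactor/accent-package` (#52), mainly touching:
- `api/accent/align.py` (DP align + rendaku fold)
- `api/accent/reading_overrides.py` (new file, renamed from `config/furigana_overrides.py` and substantially expanded)
- `api/accent/pipeline.py` (streaming chunk orchestrator, URL/non-JP preprocess)
- `api/accent/routes.py` (`/MarkAccent/stream/` route)
- `test.sh` (new)

Related
- Built on top of #52 (`refactor/accent-package`).

Notes
- Duration-signal suffixes such as 日後 / 日前 / 日中 are deliberately left unhandled for now; they will be added when real cases appear.
- `apply_*_overrides` already covers both the furigana and accent paths, so adding a rule does not require touching the main pipeline flow.

🤖 Generated with Claude Code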
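To illustrate why a new rule needs only `(n, reading)`, here is a hedged re-implementation of the numeral-variant expansion and the `(start, -length)` overlap resolution. It is a sketch, not the contents of `api/accent/reading_overrides.py`: the `Override` dataclass, `rule`, and the lookbehind guard against matching `5日` inside `15日` are assumptions, and only readings quoted in this PR are used.

```python
import re
from dataclasses import dataclass

_KANJI = "〇一二三四五六七八九"
_TO_FULLWIDTH = str.maketrans("0123456789", "0123456789")
# Assumed guard: do not start a match right after another numeral (15日 vs 5日).
_NOT_AFTER_NUMERAL = "(?<![0-90-9〇一二三四五六七八九十])"


def int_to_kanji(n: int) -> str:
    """Kanji numeral for 1..99, e.g. 5 → 五, 14 → 十四, 20 → 二十."""
    tens, ones = divmod(n, 10)
    out = ""
    if tens:
        out += ("" if tens == 1 else _KANJI[tens]) + "十"
    if ones:
        out += _KANJI[ones]
    return out


def numeric_pattern(n: int) -> str:
    """Alternation over the arabic, full-width, and kanji spellings of n."""
    arabic = str(n)
    variants = (arabic, arabic.translate(_TO_FULLWIDTH), int_to_kanji(n))
    return "(?:" + "|".join(re.escape(v) for v in variants) + ")"


@dataclass(frozen=True)
class Override:          # simplified stand-in for FuriganaOverride
    pattern: re.Pattern
    reading: str


def rule(n: int, suffix: str, reading: str) -> Override:
    return Override(re.compile(_NOT_AFTER_NUMERAL + numeric_pattern(n) + re.escape(suffix)), reading)


# Durations are listed before dates, mirroring the ordering described in the PR.
OVERRIDES = [
    rule(1, "日間", "いちにちかん"), rule(7, "日間", "しちにちかん"),
    rule(1, "日", "ついたち"), rule(5, "日", "いつか"), rule(20, "日", "はつか"),
]


def collect_matches(text: str, overrides: list[Override]) -> list[tuple[int, int, str]]:
    """Sort hits by (start, -length) and drop any hit overlapping an earlier kept one."""
    hits = [(m.start(), m.end(), ov.reading) for ov in overrides for m in ov.pattern.finditer(text)]
    hits.sort(key=lambda h: (h[0], -(h[1] - h[0])))
    kept, cursor = [], 0
    for start, end, reading in hits:
        if start >= cursor:
            kept.append((start, end, reading))
            cursor = end
    return kept


print(collect_matches("1日間", OVERRIDES))   # → [(0, 3, 'いちにちかん')]
```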