feat(accent): reading override layer + streaming MarkAccent endpoint#47

Closed
torrid-fish wants to merge 3 commits into main from feat/reading-overrides

torrid-fish commented May 20, 2026

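Purpose

Adds a regex-based override layer to correct Yahoo Furigana / OJAD errors on context-sensitive readings (dates, weekdays, durations, ages). Also upgrades /api/MarkAccent/ with a streaming variant and merges the furigana endpoint into it so both are maintained in one place.

Base = refactor/accent-package (#52). Once #52 merges, the base will be changed to main.

Rebase update (2026-05-21)

Rebased onto PR #52 (refactor/accent-package). The changes that were previously piled into api/accent_marker.py have been redistributed across the new package layout: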

  • regex override engine → api/accent/reading_overrides.py
  • streaming endpoint (/MarkAccent/stream/) → api/accent/routes.py + helper in api/accent/pipeline.py
  • DP aligner upgrade (greedy → Needleman-Wunsch with rendaku fold) → api/accent/align.py
  • URL / non-JP preprocessing + sentence splitting → api/accent/pipeline.py

3 commits on top of refactor/accent-package:

801aff4 feat(accent): add /MarkAccent/stream/ NDJSON endpoint + dev helpers
27d9da9 feat(accent): regex reading-override layer + URL/non-JP preprocessing
d659feb refactor(accent): replace greedy aligner with Needleman-Wunsch DP + rendaku fold

Previous tip b87ebe9 reachable via reflog if needed. Old base (feat/docker-compose) replaced with refactor/accent-package.

Verified locally:

  • ruff format --check, ruff check, mypy all pass on 9 source files
  • 3月5日(土) → date override fires (いつか), weekday bracket override fires ()
  • /MarkAccent/stream/ returns NDJSON, one object per line

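The original description below still reflects the design intent (algorithms, rule tables, stream chunking strategy); for file paths, refer to the new layout above.

Approach / implementation notes

1. Reading override layer (now api/accent/reading_overrides.py)

Yahoo Furigana has no contextual handling for dates or weekdays (5日 → にち, (土) → つち), and OJAD often misaligns numeric tokens. This adds a regex override layer that runs twice: once after Yahoo and once after OJAD alignment:

  • Data structure: FuriganaOverride(pattern, replacements, description) + ReplacementToken(furigana, surface, accent). The engine is fully generic and domain-agnostic.
  • Matching algorithm: _collect_matches gathers every hit, sorts by (start, -length), and drops overlaps. Because the longer match wins at the same start position, N日間 (3-4 chars) naturally takes precedence over N日 (2-3 chars).
  • Boundary check: a match must land on Yahoo token boundaries; if it does not align, the engine logs a warning and skips it, so Yahoo's original token list is never corrupted.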
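
The matching and boundary rules above can be sketched roughly as follows. This is a minimal illustration, not the code in api/accent/reading_overrides.py: the field types, the integer accent placeholder, and the helper names collect_matches and on_token_boundaries are assumptions made for the example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplacementToken:
    furigana: str
    surface: str
    accent: int  # placeholder accent code; the real encoding is not shown in this PR


@dataclass(frozen=True)
class FuriganaOverride:
    pattern: "re.Pattern[str]"
    replacements: tuple
    description: str


def collect_matches(text, overrides):
    """Return non-overlapping (start, end, override) hits.

    Hits are sorted by (start, -length), so at the same start position
    the longest match is seen first and shorter overlapping ones are dropped.
    """
    hits = [
        (m.start(), m.end(), ov)
        for ov in overrides
        for m in ov.pattern.finditer(text)
    ]
    hits.sort(key=lambda h: (h[0], -(h[1] - h[0])))
    kept = []
    last_end = 0
    for start, end, ov in hits:
        if start >= last_end:  # skip anything overlapping an accepted hit
            kept.append((start, end, ov))
            last_end = end
    return kept


def on_token_boundaries(start, end, token_surfaces):
    """True when [start, end) begins and ends exactly on token edges."""
    edges = {0}
    pos = 0
    for surface in token_surfaces:
        pos += len(surface)
        edges.add(pos)
    return start in edges and end in edges


date = FuriganaOverride(
    re.compile("1日"), (ReplacementToken("ついたち", "1日", 0),), "date"
)
duration = FuriganaOverride(
    re.compile("1日間"), (ReplacementToken("いちにちかん", "1日間", 0),), "duration"
)
hits = collect_matches("1日間と1日", [date, duration])
print([(s, e, ov.description) for s, e, ov in hits])
# [(0, 3, 'duration'), (4, 6, 'date')]
```

A hit that fails on_token_boundaries would be logged and skipped rather than applied, which is what keeps the upstream token list intact.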

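Rules covered:

  • Dates 1-31日 (_date_overrides): 1-10日, 14日, 20日, and 24日 get their correct irregular readings (ついたち, ふつか, …, はつか); 11-31日 get explicit regular readings, which works around Yahoo returning no furigana for numeric tokens.
  • Weekdays (月)-(日) (_day_of_week_overrides): single-kanji weekdays inside brackets.
  • Durations 1-31日間 (_duration_overrides): prevents the surface from becoming 1にちかん after Yahoo splits 1日間 into [1, 日間]. 7日間 uses しちにちかん (the modern preference) and 1日間 uses いちにちかん (it must not collide with ついたち).
  • Age 20歳/才 (_age_overrides): はたち (atamadaka).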

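2. Numeric variant helper

_int_to_kanji(n) + _numeric_pattern(n) expand an integer into its three written forms (arabic, full-width, kanji) automatically, so a new N-prefixed rule only needs (n, reading) rather than a hand-written (?:kanji|full-width|half-width) alternation. This also honors feedback_japanese_text_variants: every JP regex covers all three variants.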
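
A possible shape for these helpers is sketched below. It is an assumption-laden illustration: the real _int_to_kanji and _numeric_pattern may differ, and this version supports only 1-99, which is enough for days of the month and ages.

```python
import re

_KANJI_DIGITS = "〇一二三四五六七八九"
# ASCII digit -> full-width digit (U+FF10..U+FF19 sit 0xFEE0 above ASCII).
_TO_FULL_WIDTH = {ord(d): ord(d) + 0xFEE0 for d in "0123456789"}


def int_to_kanji(n: int) -> str:
    """Kanji numeral for 1 <= n <= 99, e.g. 14 -> 十四, 20 -> 二十."""
    if not 1 <= n <= 99:
        raise ValueError("this sketch only supports 1-99")
    tens, ones = divmod(n, 10)
    out = ""
    if tens:
        # 10-19 are written 十X, not 一十X.
        out += ("" if tens == 1 else _KANJI_DIGITS[tens]) + "十"
    if ones:
        out += _KANJI_DIGITS[ones]
    return out


def numeric_pattern(n: int) -> str:
    """Regex alternation matching n in arabic, full-width, or kanji form."""
    arabic = str(n)
    variants = [arabic, arabic.translate(_TO_FULL_WIDTH), int_to_kanji(n)]
    return "(?:" + "|".join(re.escape(v) for v in variants) + ")"


age_twenty = re.compile(numeric_pattern(20) + "[歳才]")
print(bool(age_twenty.fullmatch("二十歳")))  # True
```

With a helper like this, a rule table can stay a plain list of (n, reading) pairs and the three spellings are generated once per rule.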

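3. Alignment algorithm upgrade (now api/accent/align.py)

  • Needleman-Wunsch DP: align_accent replaces the original greedy alignment, which fixes the problem where a single misalignment in a long paragraph cascaded into every subsequent token.
  • Rendaku-tolerant: the DP cost tolerates voiced/unvoiced swaps such as が↔か and だ↔た.
  • Long-vowel words are no longer skipped: the old alignment skipped words containing long vowels, such as 動画 and 映像; they now align correctly.
  • The furigana override is applied before OJAD alignment, so the alignment step sees already-corrected tokens.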
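
To make the rendaku-tolerant cost concrete, here is a small sketch of a voicing fold combined with a weighted edit distance. The 0.4 substitution and 1.0 insert/delete costs come from the commit message in this PR; the fold table and the function names are assumptions, and this is only the per-token cost, not the full span DP in align.py.

```python
# Voiced and semi-voiced kana folded onto their unvoiced base kana.
_VOICED = "がぎぐげござじずぜぞだぢづでどばびぶべぼぱぴぷぺぽ"
_BASE = "かきくけこさしすせそたちつてとはひふへほはひふへほ"
_FOLD = str.maketrans(_VOICED, _BASE)


def fold(reading: str) -> str:
    """Strip voicing so rendaku variants compare equal."""
    return reading.translate(_FOLD)


def weighted_edit_distance(a: str, b: str, sub: float = 0.4, indel: float = 1.0) -> float:
    """Edit distance over folded strings with cheap substitutions.

    Because sub < indel, equal-length spans with a substitution are
    preferred over shorter spans that need a deletion.
    """
    a, b = fold(a), fold(b)
    prev = [j * indel for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * indel]
        for j, cb in enumerate(b, 1):
            cur.append(
                min(
                    prev[j] + indel,  # delete ca
                    cur[j - 1] + indel,  # insert cb
                    prev[j - 1] + (0.0 if ca == cb else sub),  # match / substitute
                )
            )
        prev = cur
    return prev[-1]


print(weighted_edit_distance("ふんかん", "ぷんかん"))  # 0.0
print(weighted_edit_distance("とう", "と"))  # 1.0
```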

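4. Streaming endpoint /api/MarkAccent/stream/

  • Splits the input into chunks on \n and the full-width period (。); each chunk calls Yahoo + OJAD independently.
  • Returns NDJSON, one line per chunk: {"chunk": N, "subchunk": M, "status": ..., "result": [...], "error": ...}
  • Non-Japanese chunks and chunks that are only a URL or only punctuation are skipped without calling the external APIs.
  • A single failed chunk does not abort the stream; that chunk is returned with status=500 and processing continues.
  • Long-paragraph stability fix: both the DP table size and the subchunk split points have corresponding safety bounds.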
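
The per-chunk NDJSON framing and failure isolation described above might look like the generator below. It is a framework-free sketch: ndjson_stream and the mark callback are hypothetical names, and status=200 for success is an assumption (the description only specifies status=500 for a failed chunk).

```python
import json


def ndjson_stream(chunks, mark):
    """Yield one NDJSON line per (chunk_idx, subchunk_idx, text) triple.

    A failure in one chunk is reported on its own line and does not
    stop the remaining chunks from being processed.
    """
    for chunk_idx, subchunk_idx, text in chunks:
        record = {"chunk": chunk_idx, "subchunk": subchunk_idx}
        try:
            record.update(status=200, result=mark(text), error=None)
        except Exception as exc:  # isolate the failure to this chunk
            record.update(status=500, result=[], error=str(exc))
        # ensure_ascii=False keeps kana and kanji readable on the wire.
        yield json.dumps(record, ensure_ascii=False) + "\n"
```

A client can then parse each line with json.loads as it arrives and place the result using the chunk / subchunk indices.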

5. furigana → accent merge

  • Removes the old api/furigana_marker.py and the /api/MarkFurigana/ route. All internal callers that need furigana now go through _fetch_yahoo_raw + apply_furigana_overrides (now under api/accent/).
  • _fetch_yahoo_raw now returns list[WordResult] | _YahooFetchError (a frozen dataclass sentinel), so an outer try/except no longer swallows a 408 timeout into an opaque 500.
  • The override module moves from config/furigana_overrides.py to api/accent/reading_overrides.py (git mv, history preserved); config/ should hold only real configuration, not transformation logic.
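
The sentinel-return idea in the second bullet can be illustrated as follows. This is a sketch under assumptions: the field names on the error type, the transport callback, and to_response are invented for the example and are not the actual signatures in api/accent/.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass(frozen=True)
class YahooFetchError:
    """Sentinel carrying the upstream failure instead of raising it."""

    status: int
    message: str


def fetch_raw(
    text: str, transport: Callable[[str], List[str]]
) -> Union[List[str], YahooFetchError]:
    # Convert the timeout into a value so callers cannot accidentally
    # swallow it in a broad except block and report a generic 500.
    try:
        return transport(text)
    except TimeoutError:
        return YahooFetchError(408, "upstream timed out")


def to_response(outcome: Union[List[str], YahooFetchError]) -> dict:
    if isinstance(outcome, YahooFetchError):
        return {"status": outcome.status, "result": [], "error": outcome.message}
    return {"status": 200, "result": outcome, "error": None}
```

The caller branches on isinstance once, so the original status code survives all the way to the response.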

6. Developer tooling

  • test.sh: local API smoke test; STREAM=1 ./test.sh exercises the streaming endpoint and prints per-line (surface|furigana|accent_marking_type) output.
  • .gitignore: excludes data/ and output/ so test fixtures stay out of git.

Diff scope

3 commits ahead of refactor/accent-package (#52), mainly touching:

  • api/accent/align.py (DP align + rendaku fold)
  • api/accent/reading_overrides.py (new file, renamed from config/furigana_overrides.py and substantially expanded)
  • api/accent/pipeline.py (streaming chunk orchestrator, URL/non-JP preprocessing)
  • api/accent/routes.py (/MarkAccent/stream/ route)
  • test.sh (new)

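Related

Notes

  • Draft status: the streaming endpoint's stability fix and the override table may both still iterate (for example, duration signals such as 日後, 日前, and 日中 are deliberately left unhandled for now and will be added when real cases appear).
  • apply_*_overrides already covers both the furigana and accent paths, so adding a rule does not require touching the main pipeline flow.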

🤖 Generated with Claude Code

torrid-fish added a commit that referenced this pull request May 21, 2026
…endaku fold

The greedy aligner had two failure modes that cascaded across whole
sentences: a numeric anchor that over-consumed when Yahoo and OJAD
disagreed on phrase boundary, and a +1 fallback path that turned a
single mismatch into type-0 fallback for every downstream token.

Replaces it with a global DP over (yahoo_token, ojad_entry) pairs:
each Yahoo token consumes k ∈ [0, K_MAX] contiguous OJAD entries,
with per-token cost computed via shape (punct/numeric/kana) and edit
distance over rendaku-folded strings for kana tokens. Sub cost
(0.4) is lower than ins/del (1.0) so the DP prefers same-length
spans with substitutions over shorter spans with deletions — fixes
the case where OJAD's `う` from `等→とう` leaked onto the next token.

Adds a voicing-fold table so Yahoo's dictionary-form readings
(ふんかん) align against OJAD's pronounced readings with rendaku
(ぷんかん). All comparisons run under this fold; ぱ/ば/ぷ/ぶ all alias
to は/ふ.

Refs #47.
torrid-fish added a commit that referenced this pull request May 21, 2026
Add api/accent/reading_overrides.py — a context-blind correction layer
sitting between Yahoo Furigana and OJAD alignment. Each override is a
regex on the concatenated surface text plus the replacement tokens that
should appear instead. Covers:

- 曜日 brackets: (月)/(月) → げつ, (土) → ど, etc. for all 7 weekdays.
- All 31 day-of-month readings: 1日 → ついたち (atamadaka), 5日 → いつか,
  14日 → じゅうよっか, 20日 → はつか, etc.
- N日間 durations 1-31: 1日間 → いちにちかん (NOT ついたちかん since
  the 1st-of-month reading is impossible for a duration), 7日間 →
  しちにちかん (modern technical writing preference over なのかかん).
- 20歳 / 二十歳 / 20才 → はたち (the only irregular age reading).

Patterns accept arabic / full-width / kanji numeral variants of the
same N so `3月5日(土)` / `3月5日(土)` / `三月五日(土)` all trigger
the same overrides. Order-of-overrides matters: duration list precedes
date list so `N日間` wins over `N日` at the same start (longer match
breaks ties in _collect_matches).

apply_furigana_overrides runs BEFORE align_accent so merged spans like
`5日→いつか` reach OJAD as a single token whose furigana matches OJAD's
phrase reading (the numeric-anchor logic in align_accent otherwise
cascade-fails because numeric tokens lack any Yahoo furigana).
apply_accent_overrides runs AFTER align to re-stamp both furigana and
accent on the same matched spans, so the response is consistent.

Adds URL preprocessing: each https?:// is swapped for the placeholder
"URLPLACEHOLDER" before the pipeline runs (Yahoo fragments URLs across
several alphabet tokens; OJAD's phrasing scraper produces noise for
Latin punctuation runs — both drag alignment off-rail). Placeholders
are walked back to the originals in order after alignment. URL body
stops at whitespace, any Japanese char, or `,()<>[]"'` so embedded
URLs strip cleanly.
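
A rough sketch of the placeholder round trip described in this paragraph. The exact regex and the helper names strip_urls / restore_urls are assumptions; the Unicode ranges are one reasonable reading of "any Japanese char", not the ranges used in pipeline.py.

```python
import re

PLACEHOLDER = "URLPLACEHOLDER"

# URL body stops at whitespace, the listed ASCII delimiters, or a character
# in the CJK punctuation / kana / ideograph / full-width blocks.
_URL_RE = re.compile(
    r'https?://[^\s,()<>\[\]"\'\u3000-\u30ff\u4e00-\u9fff\uff00-\uffef]+'
)
_PLACEHOLDER_RE = re.compile(re.escape(PLACEHOLDER))


def strip_urls(text):
    """Swap every URL for the placeholder; return (text, urls in order)."""
    urls = _URL_RE.findall(text)
    return _URL_RE.sub(PLACEHOLDER, text), urls


def restore_urls(surfaces, urls):
    """Walk placeholders back to the original URLs, in order."""
    pending = iter(urls)
    return [_PLACEHOLDER_RE.sub(lambda _m: next(pending), s) for s in surfaces]


stripped, found = strip_urls("詳細はhttps://example.com/a?b=1を参照")
print(stripped)  # 詳細はURLPLACEHOLDERを参照
print(found)  # ['https://example.com/a?b=1']
```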

Adds a non-Japanese short-circuit: if (after URL stripping) the chunk
contains no hiragana / katakana / CJK ideograph, skip Yahoo + OJAD
entirely and echo the chunk back as a single token. Lets pure-URL /
pure-English lines stream through cheaply.
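
The short-circuit could be approximated like this. The Unicode ranges, the token dict shape, and the names has_japanese / mark_or_echo are illustrative assumptions rather than the code in pipeline.py.

```python
import re

# Hiragana, katakana, and the main CJK ideograph block.
_JAPANESE_RE = re.compile(r"[\u3040-\u309f\u30a0-\u30ff\u4e00-\u9fff]")


def has_japanese(text: str) -> bool:
    return _JAPANESE_RE.search(text) is not None


def mark_or_echo(chunk, mark):
    """Skip the external calls entirely when there is nothing to annotate."""
    if not has_japanese(chunk):
        # Echo the chunk back as a single token (hypothetical token shape).
        return [{"surface": chunk, "furigana": "", "accent": []}]
    return mark(chunk)
```

Running this after URL stripping matters: a line that was only a URL has already become the ASCII placeholder, so it takes the cheap path.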

Also adds stream_accent_chunks() to pipeline.py as a helper used by
the streaming endpoint added in the next commit. Splits the input on
\n then on full-width sentence terminators (。!?.) — long
paragraphs degrade OJAD's phrasing predictor and parallelising across
sentences caps the latency. In-flight work is bounded by a semaphore
(concurrency=4) because OJAD's u-tokyo backend falls over with 30+
parallel scrapes.
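
The split-then-fan-out behaviour can be sketched as below. split_chunks and run_bounded are hypothetical names, the terminator set assumes 。 plus full-width ! ? ., and mark stands in for the per-sentence Yahoo + OJAD call; only the concurrency limit of 4 comes from this commit message.

```python
import asyncio
import re

# Split after 。 or full-width ! ? . while keeping the terminator attached.
_TERMINATORS = re.compile("(?<=[。\uff01\uff1f\uff0e])")


def split_chunks(text):
    """Return (line_idx, sub_idx, sentence) triples, skipping blank pieces."""
    out = []
    for line_idx, line in enumerate(text.split("\n")):
        subs = [s for s in _TERMINATORS.split(line) if s.strip()]
        out.extend((line_idx, sub_idx, s) for sub_idx, s in enumerate(subs))
    return out


async def run_bounded(chunks, mark, concurrency=4):
    """Process all chunks concurrently, at most `concurrency` in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def guarded(text):
        async with sem:
            return await mark(text)

    # gather() returns results in input order regardless of finish order.
    return await asyncio.gather(*(guarded(t) for _, _, t in chunks))
```

Since gather preserves input order, results can be paired back with their (line_idx, sub_idx) positions without extra bookkeeping.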

main.py docstring updated to reflect /MarkAccent/stream/.

Refs #47.
torrid-fish added a commit that referenced this pull request May 21, 2026
Add a streaming variant of /MarkAccent/ that processes the input as a
sequence of (line, sentence) chunks and emits one NDJSON object per
chunk in input order. Each line carries `{"chunk": line_idx,
"subchunk": sub_idx, ...AccentResponse}` so clients can render output
incrementally while keeping document position. Underlying chunk-fanout
and concurrency limiting live in pipeline.stream_accent_chunks; the
route is a thin StreamingResponse wrapper.

Streaming benefits compound: OJAD's phrasing predictor degrades on
long inputs (a single misaligned mora cascades across the paragraph),
so per-sentence chunks both stay short enough for OJAD to handle and
fan out under the bounded semaphore.

Also adds test.sh — a small bash smoke-test helper that POSTs a sample
text to either /MarkAccent/ or /MarkFurigana/ and pretty-prints the
per-moji (surface|furigana|accent_marking_type) rows. STREAM=1 switches
to the streaming endpoint, ENDPOINT= picks which router. Useful while
iterating on overrides; not wired into CI.

.gitignore adds data/ and output/ for ad-hoc test fixtures we don't
want committed.

Refs #47.
@torrid-fish torrid-fish force-pushed the feat/reading-overrides branch from b87ebe9 to 801aff4 Compare May 21, 2026 06:41
@torrid-fish torrid-fish changed the base branch from feat/docker-compose to refactor/accent-package May 21, 2026 06:42
@torrid-fish
Member Author

Rebased onto PR #52 (refactor/accent-package). Previous monolithic api/accent_marker.py changes redistributed across the new package layout:

  • regex override engine → api/accent/reading_overrides.py
  • streaming endpoint (/MarkAccent/stream/) → api/accent/routes.py + helper in api/accent/pipeline.py
  • DP aligner upgrade (greedy → Needleman-Wunsch with rendaku fold) → api/accent/align.py
  • URL/non-JP preprocessing + sentence splitting → api/accent/pipeline.py

3 commits on top of refactor/accent-package:

801aff4 feat(accent): add /MarkAccent/stream/ NDJSON endpoint + dev helpers
27d9da9 feat(accent): regex reading-override layer + URL/non-JP preprocessing
d659feb refactor(accent): replace greedy aligner with Needleman-Wunsch DP + rendaku fold

Previous tip b87ebe9 reachable via reflog if needed. Old base (feat/docker-compose) replaced with refactor/accent-package.

Verified locally:

  • ruff format --check, ruff check, mypy all pass on 9 source files
  • 3月5日(土) → date override fires (いつか), weekday bracket override fires ()
  • /MarkAccent/stream/ returns NDJSON, one object per line

PR body still describes the old monolithic structure — will refresh if you want.

torrid-fish added a commit that referenced this pull request May 21, 2026
Tighten the no-behavior-change refactor based on Copilot's review on
PR #52.

Active findings:
- pipeline.py: rename unused `ojad_surface` to `_ojad_surface` so F841
  catches it if Ruff's unused-binding rule ever lands; the OJAD echo
  string isn't consumed here.
- furigana.py: wrap `response.json()` in try/except. The docstring
  promised malformed payloads would surface via the FuriganaResponse
  envelope, but an invalid Content-Type / non-JSON body would have
  raised through. Catch ValueError and return a 500 envelope.
- ojad.py: switch `raise e` -> bare `raise` and `logger.error(f"...")`
  -> `logger.exception(...)` to preserve the original traceback.
- models.py: "describe" -> "describes" (x3 occurrences); "givent"
  -> "given".

Low-confidence findings also addressed:
- align.py module docstring used to claim `punctuation_marks`,
  `skip_marks`, and `clean_query` are consumed by alignment. They
  aren't — they're carried over from the pre-refactor module for the
  downstream PR #47 to use. Reword to reflect that.
- clean_query docstring overclaimed punctuation stripping; it only
  filters ASCII letters. Reword + rename the local comprehension var
  from `chr` to `char` to stop shadowing the builtin.

Verified:
  uv run ruff check api/accent/ main.py        # all passed
  uv run ruff format --check api/accent/ main.py  # 8/8 formatted
  uv run mypy api/accent/ main.py              # 8 files, no issues
  POST /api/MarkAccent/ + /MarkFurigana/ routes still register

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
torrid-fish added a commit that referenced this pull request May 21, 2026
Two cleanups from Copilot's follow-up review on the fix commit:

- README mermaid diagram labels `align_accent` as "DP alignment", but
  the surrounding prose (lines 99/107) correctly calls it "single-pass
  greedy". DP is a future PR (#47). Rename the node to "Greedy
  alignment" so the diagram matches the implementation.
- `align.py:122` had a comment typo "regard as punchutation" →
  "regarded as punctuation".

No code changes; docstring / comment / diagram only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Base automatically changed from refactor/accent-package to main May 27, 2026 10:40
@torrid-fish torrid-fish force-pushed the feat/reading-overrides branch from 801aff4 to c9b321d Compare May 27, 2026 10:51
@github-actions

🛡️ PR Quality Check Summary

PR Title: Passed (Length: 68/75, Format: OK). feat(accent): reading override layer + streaming MarkAccent endpoint
Branch Name: Follows naming convention (feat/reading-overrides)
Commit Messages: 1 of 3 commit(s) failed validation
Conflicts: No merge conflict markers found
Python Quality: All checks passed.

📋 Click for detailed commit validation report
Expected format: `type(scope): description` (max 75 chars)
Valid types: build|chore|ci|docs|feat|fix|hotfix|perf|refactor|revert|style|test

Failed commits:
- [`c84bc65`] `refactor(accent): replace greedy aligner with Needleman-Wunsch DP + rendaku fold`
  ↳ Title is too long (is **80** chars, max is **75**)


⚠️ Please fix the failing checks (❌) before merging.

@torrid-fish
Member Author

Closing in favour of #51, which now targets main directly. The reading-override layer, Needleman-Wunsch DP aligner, and /MarkAccent/stream/ endpoint from this branch are all included in #51 as the foundation of the local fugashi + UniDic migration — the Yahoo-backed intermediate is a stepping stone and won't ship independently, so it's folded into the single migration PR rather than merged separately.

torrid-fish added a commit that referenced this pull request May 27, 2026
torrid-fish added a commit that referenced this pull request May 27, 2026
torrid-fish added a commit that referenced this pull request May 27, 2026