feat(accent): POS-driven ます/たい patches via Yahoo MA-UniDic by torrid-fish · Pull Request #49 · sessatakuma/API-tools

torrid-fish · 2026-05-20T14:32:12Z

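Purpose

Closes #48. PR #47 handles closed-class cases such as dates, durations, and ages with regex-on-surface rules. This PR swaps Yahoo Furigana v2 for Yahoo MA v2 (UniDic variant), threads POS information through the pipeline, and adds an "in-place word-ending patch" layer that handles the pitch fall on ま / た in the verb ます form and the adjective-like たい form.

Base = feat/reading-overrides (PR #47). The base will be retargeted to main once #47 is merged.

Approach / Implementation notes

Phase 1: Spike (done, code deleted)

Wrote a throwaway script, scripts/spike_ma_vs_furigana.py (deleted before committing), to compare MA-IPADic, MA-UniDic, and Furigana:

Surface concat parity: test_0.txt (36 lines) and test_1.txt (57 lines) both match 100%
Override boundary check: all 10 sentences that trigger existing overrides hit on MA tokens
POS disambiguation probe: all 13 boundary-case sentences are classified correctly (食べます vs. 励ます, 升 as a noun vs. ます as an auxiliary verb, ません split into [ませ, ん], たかった / たくない as distinct conjugations)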
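
The surface concat parity check can be sketched as follows. This is a minimal illustration, not the deleted spike script: the function names and the sample tokenisations are hypothetical, and the only idea taken from the description is that two tokenisers agree when their token surfaces concatenate to the same string.

```python
def surface_concat_matches(tokens_a: list[str], tokens_b: list[str]) -> bool:
    """Return True when both tokenisations cover exactly the same text."""
    return "".join(tokens_a) == "".join(tokens_b)


def parity_rate(pairs: list[tuple[list[str], list[str]]]) -> float:
    """Fraction of lines whose two tokenisations concatenate identically."""
    if not pairs:
        return 1.0
    hits = sum(surface_concat_matches(a, b) for a, b in pairs)
    return hits / len(pairs)


# Hypothetical sample: token boundaries differ, but the covered text is the same.
sample = [
    (["食べ", "ます"], ["食べます"]),
    (["行き", "ませ", "ん"], ["行き", "ません"]),
]
rate = parity_rate(sample)
```

Token boundaries are allowed to differ between the two sources; only dropped or altered characters count as a mismatch.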

Phase 2: Furigana → MA-UniDic migration

api/accent_marker.py:
- _YAHOO_FURIGANA_URL → _YAHOO_MA_URL = "https://jlp.yahooapis.jp/MAService/V2/parse"
- method changed from jlp.furiganaservice.furigana to jlp.maservice.parse.unidic
- Parses the new positional array response: [surface, reading, base, pos, pos1, conjugation_type, conjugation_form], with the * null marker mapped to Python None
WordResult / WordAccentResult gain 5 Optional POS fields (base / pos / pos1 / conjugation_type / conjugation_form), all defaulting to None, so override-constructed tokens (which have no MA backing) stay backward compatible with no extra work
_build_word_result carries POS metadata from the input token to the alignment output via **pos_meta unpacking
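
A minimal sketch of the positional-array parsing described above. The `MaToken` dataclass and `parse_ma_token` helper are hypothetical names (the PR itself stores these fields on WordResult), and the sample row is illustrative rather than a captured API response. One assumption worth flagging: the sketch leaves `surface` untouched so that a literal asterisk in the input text is not mistaken for the null marker.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Field order taken from the PR description.
_FIELDS = (
    "surface", "reading", "base", "pos", "pos1",
    "conjugation_type", "conjugation_form",
)


@dataclass(frozen=True)
class MaToken:  # hypothetical container, not the PR's WordResult
    surface: str
    reading: Optional[str] = None
    base: Optional[str] = None
    pos: Optional[str] = None
    pos1: Optional[str] = None
    conjugation_type: Optional[str] = None
    conjugation_form: Optional[str] = None


def parse_ma_token(row: Sequence[str]) -> MaToken:
    """Map one 7-field positional row to a MaToken, turning '*' into None."""
    if len(row) != len(_FIELDS):
        raise ValueError(f"expected {len(_FIELDS)} fields, got {len(row)}")
    surface, *rest = row
    # Assumption: only the metadata fields use '*' as a null marker.
    cleaned = [None if value == "*" else value for value in rest]
    return MaToken(surface, *cleaned)


# Illustrative row for the auxiliary ます.
token = parse_ma_token(
    ["ます", "ます", "ます", "助動詞", "*", "助動詞-マス", "終止形-一般"]
)
```

Defaulting every POS field to None mirrors the backward-compatibility point above: a token built without MA backing can be constructed from its surface alone.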

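Why UniDic rather than IPADic: both use exactly the same 7-field schema, but UniDic tags ます/たい as pos=助動詞 + conjugation_type=助動詞-マス/-タイ, whereas IPADic tags them as pos=接尾辞 + pos1=動詞性接尾辞. 助動詞 is the standard textbook classification and is used throughout the linguistics literature; conjugation_type=助動詞-マス provides an independent second axis for the self-check. The only known regression is that UniDic mis-tags 升 as 固有名詞-人名-姓 (reading=ます). This does not affect our patch (the surface is the kanji 升, so it fails the surface prefix check), but when 升 appears on its own, its reading is displayed as ます instead of IPADic's しょう. It is a low-frequency noun; a regex override can fix it later if needed.

Phase 3: POS predicate for the override engine (infrastructure)

FuriganaOverride gains pos_match: Callable[[list[Any]], bool] | None. The filter runs inside _apply after boundary resolution; pos_match=None preserves the original pure-regex behaviour. This PR adds no override that uses pos_match. It is pure infrastructure, reserved for future cases (e.g. choosing the reading of 升 by POS).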
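
The optional predicate can be sketched like this. It is an illustration under stated assumptions: `OverrideSketch` and `override_applies` are hypothetical stand-ins for FuriganaOverride and the filter inside _apply, tokens are modelled as plain dicts, and the 升 rule is the future case mentioned above, not something this PR ships.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class OverrideSketch:  # hypothetical stand-in for FuriganaOverride
    pattern: re.Pattern
    reading: str
    # None keeps the pure-regex behaviour; otherwise the matched tokens must pass.
    pos_match: Optional[Callable[[list[Any]], bool]] = None


def override_applies(rule: OverrideSketch, text: str, tokens: list[Any]) -> bool:
    """Regex match first, then the optional POS filter on the matched tokens."""
    if rule.pattern.search(text) is None:
        return False
    if rule.pos_match is None:
        return True
    return rule.pos_match(tokens)


def all_nouns(tokens: list[Any]) -> bool:
    """Example predicate: every token in the span is tagged 名詞."""
    return all(tok.get("pos") == "名詞" for tok in tokens)


regex_only = OverrideSketch(re.compile("升"), "しょう")
pos_gated = OverrideSketch(re.compile("升"), "しょう", pos_match=all_nouns)
```

With pos_match left as None the rule behaves exactly like a surface regex, which is why existing overrides need no changes.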

Phase 4: `apply_accent_patches` (core feature)

Adds an in-place patch pass, kept separate from the full-span replacement done by apply_accent_overrides:

def apply_accent_patches(words: list[WordAccentResult]) -> list[WordAccentResult]:
    """POS-driven in-place accent patches on aligned MA tokens."""

Call order: align_accent → apply_accent_overrides (existing full-span overrides) → apply_accent_patches (new POS patches) → _restore_urls.
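
The ordering can be pictured as a simple chain of passes. The sketch below is only a toy: `run_passes` and `_tag` are hypothetical, and the stub passes merely record their own name instead of doing any accent work.

```python
from typing import Callable

Words = list[str]  # stand-in for list[WordAccentResult]
Pass = Callable[[Words], Words]


def run_passes(words: Words, passes: list[Pass]) -> Words:
    """Apply each pass in order, feeding every output into the next pass."""
    for apply_pass in passes:
        words = apply_pass(words)
    return words


def _tag(label: str) -> Pass:
    """Hypothetical stub pass that appends its own name to the word list."""
    return lambda words: words + [label]


trace = run_passes(
    [],
    [
        _tag("align_accent"),
        _tag("apply_accent_overrides"),
        _tag("apply_accent_patches"),
        _tag("_restore_urls"),
    ],
)
```

Because each pass consumes the previous output, the POS patches in this arrangement see tokens after the full-span overrides have already been applied.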

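Five-axis self-check (a mis-tag on any single axis cannot fire the patch):

pos == "助動詞" (filters out 励ます and other tokens tagged pos=動詞)
conjugation_type == "助動詞-マス" / "助動詞-タイ" (separates ます from other auxiliaries such as 助動詞-ヌ)
base == "ます" / "たい"
surface.startswith("ま") / "た" (filters out the case where UniDic mis-tags 升 as a personal name with reading=ます; the surface is the kanji 升)
conjugation_form ∈ {終止形*, 連用形*} (load-bearing: the ませ of ません has cform=未然形-一般, where the kernel naturally falls on せ rather than ま, which OJAD already predicts correctly; without this gate the patch would break the correct ません)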
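
The five axes above can be combined into a single predicate along these lines. This is a sketch, not the PR's implementation: `Tok`, `_TARGETS`, and `should_patch` are hypothetical names, the `*` in 終止形* / 連用形* is read as a prefix match, and the conjugation-type value used for 励ます in the usage note is illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tok:  # hypothetical minimal token; the PR uses WordAccentResult
    surface: str
    base: Optional[str] = None
    pos: Optional[str] = None
    conjugation_type: Optional[str] = None
    conjugation_form: Optional[str] = None


# base -> (expected conjugation_type, expected first character of the surface)
_TARGETS = {
    "ます": ("助動詞-マス", "ま"),
    "たい": ("助動詞-タイ", "た"),
}
# Prefix match, so 終止形-一般 and 連用形-促音便 both qualify.
_ALLOWED_FORM_PREFIXES = ("終止形", "連用形")


def should_patch(tok: Tok) -> bool:
    """Return True only when all five axes agree."""
    if tok.pos != "助動詞":                                # axis 1: pos
        return False
    target = _TARGETS.get(tok.base or "")                  # axis 3: base
    if target is None:
        return False
    expected_ctype, first_char = target
    if tok.conjugation_type != expected_ctype:             # axis 2: conjugation_type
        return False
    if not tok.surface.startswith(first_char):             # axis 4: surface prefix
        return False
    form = tok.conjugation_form or ""
    return form.startswith(_ALLOWED_FORM_PREFIXES)         # axis 5: conjugation_form
```

For example, `should_patch(Tok("ませ", "ます", "助動詞", "助動詞-マス", "未然形-一般"))` is False because only the last axis fails, which is the ません case the gate exists for. A token with every POS field left as None, as an override-constructed token would have, is never patched.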

Patch content: first mora = FALL, all others = HEIBAN. Because base is invariant across conjugated forms, one rule covers several variants: ます / まし / たい / たく / たかっ share the same helper.
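
A sketch of the patch shape. Several things here are assumptions: the numeric codes FALL = 2 and HEIBAN = 1 are inferred from the "ま=2, す=1" notation in the verification table, and `split_moras` and `patch_first_mora_fall` are hypothetical helpers, not the PR's code.

```python
# Assumed encoding, inferred from the verification table (ま=2, す=1).
HEIBAN = 1
FALL = 2

# Small kana that attach to the preceding kana to form one mora.
# The sokuon っ is deliberately absent: it counts as a mora of its own.
_SMALL_KANA = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")


def split_moras(kana: str) -> list[str]:
    """Split a kana string into moras, merging small kana into the previous one."""
    moras: list[str] = []
    for ch in kana:
        if ch in _SMALL_KANA and moras:
            moras[-1] += ch
        else:
            moras.append(ch)
    return moras


def patch_first_mora_fall(surface: str) -> list[tuple[str, int]]:
    """Mark the first mora FALL and every following mora HEIBAN."""
    return [
        (mora, FALL if index == 0 else HEIBAN)
        for index, mora in enumerate(split_moras(surface))
    ]


masu = patch_first_mora_fall("ます")
takatta = patch_first_mora_fall("たかっ")
```

The helper depends only on the token surface, so the same call serves ます, まし, たい, たく, and たかっ once the self-check has accepted the token.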

_PATCH_EXCEPTIONS: frozenset[str] is the escape hatch for an exception list. It starts empty (the spike found no MA mis-tags) and is to be populated when real cases show up.

Verification results (11-sentence disambiguation probe, all pass)

| Input | Token | Expected | Actual |
| --- | --- | --- | --- |
| 飲みます | ます (終止形) | ま↓す | ま=2, す=1 ✓ |
| 食べたい | たい (終止形) | た↓い | た=2, い=1 ✓ |
| 見ました | まし (連用形) | ま↓し | ま=2, し=1 ✓ |
| 行きません | ませ (未然形) | ま-↓せ-ん | ま=1, せ=2 ✓ (patch SKIPPED, gate in effect) |
| 食べませんでした | ませ + でし + た | ま-せ-んでした | same as above ✓ |
| 彼を励ます | 励ます (動詞, one token) | existing OJAD prediction | not patched ✓ |
| 升は容器です | 升 (名詞) | not patched | not patched ✓ |
| 怠けたかった | たかっ (連用形-促音便) | た↓かっ | た=2, か=1, っ=1 ✓ |
| 泣きたくない | たく (連用形) | た↓く | た=2, く=1 ✓ |
| 3月5日(土)、7日間と1日 | (regression) | byte-identical to #47 | ✓ |
| 彼女は20歳です | (regression) | byte-identical to #47 | ✓ |

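Diff scope

api/accent_marker.py: Furigana→MA endpoint switch, 5 Optional fields added to WordResult + WordAccentResult, _build_word_result passes POS metadata through, _process_accent_chunk wired to apply_accent_patches
api/reading_overrides.py: FuriganaOverride.pos_match, _apply filter logic, new apply_accent_patches and self-check predicates

Out of scope (left for follow-up issues / PRs)

Multi-token rule for the i-adjective た form spanning stem + auxiliary (高かった = 高 + かっ + た), which needs a different multi-token matcher
Loanword 3-mora rule (バナナ / ピアノ family)
Irregular numeral + counter readings (4時 / 7時 / 9時 / 1分 / 1人 / 2人): these belong on the regex side, separate from this PR's POS side, and should get their own PR later
升 → しょう noun reading override, a side effect of UniDic mis-tagging 升 as a personal name

Notes

The POS predicate (Phase 3) is infrastructure; this PR adds no user of it, and it exists purely for future cases. The criterion: add one when a case shows up that regex cannot express. Piling on POS-aware overrides such as 升→しょう now would be premature.
The spike script was deleted before committing, but the design decisions from the spike (use UniDic, conjugation_form as a third axis, an initially empty exception list) are all recorded in docstrings and the commit message.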

🤖 Generated with Claude Code

Yahoo Furigana v2 → MA v2 (UniDic variant) gives us POS metadata (pos / pos1 / conjugation_type / conjugation_form / base) on every token, threaded through WordResult, align_accent, and WordAccentResult as Optional fields. Same host, auth, and request limit; no extra API call.

POS-driven accent patches (apply_accent_patches in reading_overrides):
- ます (pos=助動詞, ctype=助動詞-マス, base=ます, surface starts ま, cform ∈ {終止形,連用形}) → first-mora FALL, rest HEIBAN.
- たい (pos=助動詞, ctype=助動詞-タイ, base=たい, surface starts た, cform ∈ {終止形,連用形}) → same shape.

Self-check predicate combines five axes so a single MA mis-tag cannot fire the patch. The conjugation_form gate is the load-bearing one: ません's ませ token has cform=未然形-一般, where the kernel falls on せ before ん, NOT on ま; without the gate the patch would un-fix OJAD's correct prediction.

Also extends FuriganaOverride with an optional pos_match predicate (threaded through _apply's boundary check) so future overrides can mix regex matching with POS filtering. Currently unused; backward compat preserved for all existing date/duration/age/weekday rules.

Refs issue #48.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

torrid-fish · 2026-05-21T04:04:24Z

Superseded by the PR from spike/local-unidic (will open shortly).

Rationale: Issue #50 explicitly framed itself as an alternative to #48, with the spike outcome deciding which architecture ships. The spike on spike/local-unidic is now GO (see commit 2ed0f10's measurement report), and the local fugashi + UniDic 3.1.0 swap is implemented in commit bff43a9 on top of this PR's POS-rule infrastructure. Merging this PR first would land the Yahoo MA HTTP endpoint only to have the next PR rip it out — pure churn.

The new consolidated PR keeps everything this PR introduced (POS-driven apply_accent_patches, pos_match override predicate, the 5 POS columns on WordResult/WordAccentResult) — they all carry over identically to the local UniDic tokeniser. Only the upstream data source changes.

This was referenced May 20, 2026

Evaluate local UniDic (fugashi/Sudachi) for in-process accent + lower latency #50

Open

Rule-based post accent correction via Yahoo MA POS metadata #48

Open

torrid-fish closed this May 21, 2026

torrid-fish mentioned this pull request May 21, 2026

feat(accent): local UniDic + POS-driven patches (closes #48, #50) #51

Closed

torrid-fish deleted the feat/yahoo-ma-migration branch May 21, 2026 04:06

torrid-fish mentioned this pull request May 27, 2026

feat(accent): local UniDic + POS-driven patches #53

Draft

torrid-fish wants to merge 1 commit into feat/reading-overrides from feat/yahoo-ma-migration
