feat(accent): POS-driven ます/たい patches via Yahoo MA-UniDic #49
torrid-fish wants to merge 1 commit
Yahoo Furigana v2 → MA v2 (UniDic variant) gives us POS metadata
(pos / pos1 / conjugation_type / conjugation_form / base) on every
token, threaded through WordResult, align_accent, and WordAccentResult
as Optional fields. Same host, auth, and request limit — no extra
API call.
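A minimal sketch of how a 7-field MA token might be mapped onto a result object with Optional POS fields. The field order and the `*` null marker come from the PR description. The `WordResult` shape shown here, and the names `word_from_ma_token`, `_none_if_star`, and `_POS_KEYS`, are assumptions for illustration, not the actual code in api/accent_marker.py.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class WordResult:
    """Hypothetical stand-in for the real WordResult; POS fields default to None."""
    surface: str
    reading: str
    base: Optional[str] = None
    pos: Optional[str] = None
    pos1: Optional[str] = None
    conjugation_type: Optional[str] = None
    conjugation_form: Optional[str] = None


# Order of the five POS fields after surface and reading in an MA token.
_POS_KEYS = ("base", "pos", "pos1", "conjugation_type", "conjugation_form")


def _none_if_star(value: Optional[str]) -> Optional[str]:
    # The MA response uses "*" as a null marker; map it to None.
    return None if value == "*" else value


def word_from_ma_token(token: Sequence[str]) -> WordResult:
    """Build a WordResult from [surface, reading, base, pos, pos1, ctype, cform]."""
    surface, reading, *rest = token
    pos_meta = {key: _none_if_star(value) for key, value in zip(_POS_KEYS, rest)}
    # POS metadata is threaded through via keyword unpacking.
    return WordResult(surface=surface, reading=reading, **pos_meta)
```

Because every POS field defaults to `None`, a token built without MA backing (for example by an override) can still be constructed as `WordResult(surface, reading)`.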
POS-driven accent patches (apply_accent_patches in reading_overrides):
- ます (pos=助動詞, ctype=助動詞-マス, base=ます, surface starts ま,
cform ∈ {終止形,連用形}) → first-mora FALL, rest HEIBAN.
- たい (pos=助動詞, ctype=助動詞-タイ, base=たい, surface starts た,
cform ∈ {終止形,連用形}) → same shape.
Self-check predicate combines five axes so a single MA mis-tag cannot
fire the patch. The conjugation_form gate is the load-bearing one —
ません's ませ token has cform=未然形-一般, where the kernel falls on
せ before ん, NOT on ま; without the gate the patch would un-fix
OJAD's correct prediction.
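The five-axis check and the patch shape described above could be sketched as follows. This is an illustration under assumptions, not the PR's apply_accent_patches: the `Token` class, the `FALL` / `HEIBAN` string labels, the per-mora `accent` list, and the helper names are hypothetical, and conjugation forms are matched by prefix (so `終止形-一般` counts as 終止形).

```python
from dataclasses import dataclass, field
from typing import List, Optional

FALL, HEIBAN = "FALL", "HEIBAN"  # hypothetical per-mora labels


@dataclass
class Token:
    """Hypothetical token: POS metadata plus one accent label per mora."""
    surface: str
    pos: Optional[str] = None
    conjugation_type: Optional[str] = None
    conjugation_form: Optional[str] = None
    base: Optional[str] = None
    accent: List[str] = field(default_factory=list)


# base -> (expected conjugation_type, expected first character of surface)
_RULES = {"ます": ("助動詞-マス", "ま"), "たい": ("助動詞-タイ", "た")}
_ALLOWED_CFORMS = ("終止形", "連用形")


def should_patch(tok: Token) -> bool:
    """True only when all five axes agree, so one mis-tag cannot fire the patch."""
    rule = _RULES.get(tok.base or "")            # axis: base
    if rule is None:
        return False
    ctype, prefix = rule
    return (
        tok.pos == "助動詞"                        # axis: pos
        and tok.conjugation_type == ctype         # axis: conjugation_type
        and tok.surface.startswith(prefix)        # axis: surface prefix
        and (tok.conjugation_form or "").startswith(_ALLOWED_CFORMS)  # axis: cform
    )


def patch_in_place(tok: Token) -> None:
    """Rewrite the accent to first-mora FALL, rest HEIBAN, when the check passes."""
    if should_patch(tok) and tok.accent:
        tok.accent = [FALL] + [HEIBAN] * (len(tok.accent) - 1)
```

In this sketch the ませ token of ません (cform=未然形-一般) fails the conjugation_form axis and keeps whatever accent the upstream prediction assigned.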
Also extends FuriganaOverride with an optional pos_match predicate
(threaded through _apply's boundary check) so future overrides can
mix regex matching with POS filtering. Currently unused — backward
compat preserved for all existing date/duration/age/weekday rules.
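A rough sketch of how an optional POS predicate can sit next to a regex rule. Only the `pos_match` name and its `Callable[[list[Any]], bool] | None` type come from the PR. The other fields of `FuriganaOverride`, the `override_applies` helper, and the dict-shaped tokens are hypothetical, and the real boundary resolution in _apply is not modelled here.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass(frozen=True)
class FuriganaOverride:
    """Hypothetical shape: a regex rule with an optional POS filter."""
    pattern: "re.Pattern[str]"
    reading: str
    pos_match: Optional[Callable[[List[Any]], bool]] = None


def override_applies(rule: FuriganaOverride, text: str, tokens: List[Any]) -> bool:
    """Regex must match; the POS predicate, if present, must also accept the tokens."""
    if rule.pattern.search(text) is None:
        return False
    if rule.pos_match is None:
        # No predicate: pure regex behavior, as for the existing rules.
        return True
    return rule.pos_match(tokens)


# Usage: the same pattern with and without a POS filter (illustrative values only).
plain_rule = FuriganaOverride(re.compile("三日"), "みっか")
noun_only = FuriganaOverride(
    re.compile("三日"),
    "みっか",
    pos_match=lambda toks: all(t.get("pos") == "名詞" for t in toks),
)
```

Defaulting `pos_match` to `None` is what keeps rules written before the predicate existed working unchanged.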
Refs issue #48.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Superseded by the PR from Rationale: Issue #50 explicitly framed itself as an alternative to #48, with the spike outcome deciding which architecture ships. The spike on The new consolidated PR keeps everything this PR introduced (POS-driven |
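Purpose
Closes #48. PR #47 handled closed-class cases such as dates, durations, and ages with regex-on-surface matching. This PR replaces Yahoo Furigana v2 with Yahoo MA v2 (UniDic variant), wires the POS information into the pipeline, and adds an in-place word-ending patch layer that handles the pitch fall on ま / た in verb ます forms and adjective-type たい forms.
Method / implementation notes
Phase 1: Spike (done, code removed)
Wrote scripts/spike_ma_vs_furigana.py (throwaway, deleted before the commit) to compare MA-IPADic, MA-UniDic, and Furigana.
Phase 2: Furigana → MA-UniDic migration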
api/accent_marker.py:
- _YAHOO_FURIGANA_URL → _YAHOO_MA_URL = "https://jlp.yahooapis.jp/MAService/V2/parse"
- method changes from jlp.furiganaservice.furigana to jlp.maservice.parse.unidic
- Tokens are [surface, reading, base, pos, pos1, conjugation_type, conjugation_form]; the * null marker maps to Python None
- WordResult / WordAccentResult gain 5 Optional POS fields (base / pos / pos1 / conjugation_type / conjugation_form), all defaulting to None, so override-constructed tokens (no MA backing) stay backward compatible
- _build_word_result carries POS metadata from the input token to the alignment output via **pos_meta unpacking
Why UniDic rather than IPADic: both use the same 7-field schema, but UniDic tags ます/たい as pos=助動詞 + conjugation_type=助動詞-マス/-タイ, while IPADic tags them as pos=接尾辞 + pos1=動詞性接尾辞. 助動詞 is the standard textbook classification used throughout the linguistics literature, and conjugation_type=助動詞-マス gives the self-check an independent second axis. The only known regression is that UniDic mis-tags 升 as 固有名詞-人名-姓 (reading=ます). This does not affect our patch (the surface is the kanji 升, so it fails the surface-prefix check), but a standalone 升 will be shown with the reading ます instead of IPADic's しょう. It is a low-frequency noun; a regex override can correct it later if needed.
Phase 3: Override engine gains a POS predicate (infrastructure)
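FuriganaOverride gains pos_match: Callable[[list[Any]], bool] | None. It is applied as a filter inside _apply after boundary resolution; pos_match=None keeps the original pure-regex behavior. This PR adds no override that uses pos_match. It is pure infrastructure for future cases (e.g. choosing the reading of 升 by POS).
Phase 4: apply_accent_patches (core feature)
Adds an in-place patch pass, kept separate from the full-span replacement done by apply_accent_overrides.
Call order: align_accent → apply_accent_overrides (existing full-span overrides) → apply_accent_patches (new POS patches) → _restore_urls.
Self-check on five axes (a mis-tag on any single axis cannot fire the patch):
- pos == "助動詞" (filters out pos=動詞 words such as 励ます)
- conjugation_type == "助動詞-マス" / "助動詞-タイ" (separates ます from other auxiliaries such as 助動詞-ヌ)
- base == "ます" / "たい"
- surface.startswith("ま") / "た" (filters out the case where UniDic mis-tags 升 as a person name with reading=ます; the surface is the kanji 升)
- conjugation_form ∈ {終止形*, 連用形*} (load-bearing: the ませ of ません has cform=未然形-一般, where the kernel naturally falls on せ rather than ま and OJAD already predicts it correctly; without this gate the patch would break a correct ません)
Patch content: first mora = FALL, the rest = HEIBAN. Because base stays the same across conjugated forms, one rule covers several variants: ます / まし / たい / たく / たかっ share the same helper.
_PATCH_EXCEPTIONS: frozenset[str] is an escape hatch for an exception list. It starts empty (the spike found no MA mis-tags) and is left to be populated when real cases appear.
Validation results (11-sentence disambiguation probe, all pass)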
Diff scope
- api/accent_marker.py: Furigana → MA endpoint switch; WordResult + WordAccentResult gain 5 Optional fields; _build_word_result passes POS metadata through; _process_accent_chunk calls apply_accent_patches
- api/reading_overrides.py: FuriganaOverride.pos_match, _apply filter logic, new apply_accent_patches and self-check predicates
Out of scope (left for follow-up issues / PRs)
Notes
POS-aware overrides such as 升 → しょう count as premature. conjugation_form as the third axis, initial exception list empty) are all recorded in the docstring + commit message.
🤖 Generated with Claude Code