Chinese Hanzi Typo Checker

Introduction

Typos are common in Chinese text because most characters are entered through pinyin-based input methods, where it is easy to pick a homophone or a near-homophone by mistake. In addition, many Chinese characters are phono-semantic compounds (形声字) that share components and therefore look or sound alike, which makes such mistakes easy to make and hard to spot.

Therefore, techniques for detecting and correcting typos in Chinese text are highly valuable and in demand. This repository aims to build a Chinese typo checker by leveraging the capabilities of masked language models like BERT.

Method

graph LR
    A@{ shape: lean-r, label: "Input Chinese Text" } --> B@{ shape: rect, label: "Tokenizer" }
    B -->|"input_ids<br>attention_mask"| C@{ shape: procs, label: "Stacked Encoder Layers" }
    subgraph Model
    C -->|"last hidden state"| D@{ shape: rect, label: "MaskedLM Head" }
    C -->|"last hidden state"| E@{ shape: rect, label: "Token CLS Head" }
    end
    D --> F(["Typo Correction"])
    E --> G(["Typo Detection"])
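The diagram can be read as follows: the detection head assigns a label to each token, and the MaskedLM head proposes a character for each position. The sketch below shows one plausible way to combine the two outputs, keeping the original token wherever detection marks it as correct. It assumes the label convention visible in the example output of this README (1 for a correct token, 0 for a suspected typo). The function name `merge_heads` and its arguments are hypothetical and are not part of this repository's API; the actual combination logic in `checker.py` may differ.

```python
from typing import List


def merge_heads(tokens: List[str], cls_labels: List[int], mlm_tokens: List[str]) -> List[str]:
    """Hypothetical helper: keep tokens labeled 1, replace tokens labeled 0.

    tokens     -- original tokens from the tokenizer
    cls_labels -- per-token detection labels (assumed: 1 = correct, 0 = typo)
    mlm_tokens -- per-token predictions from the MaskedLM head
    """
    if not (len(tokens) == len(cls_labels) == len(mlm_tokens)):
        raise ValueError("all inputs must have the same length")
    return [
        tok if label == 1 else pred
        for tok, label, pred in zip(tokens, cls_labels, mlm_tokens)
    ]


# Toy inputs, made up for illustration only.
tokens = ["[CLS]", "忧", "质", "[SEP]"]
cls_labels = [1, 0, 1, 1]
mlm_tokens = ["[CLS]", "优", "质", "[SEP]"]

merged = merge_heads(tokens, cls_labels, mlm_tokens)
print(merged)
```

Gating the replacement on the detection label means that a MaskedLM prediction which differs from the input at an unflagged position is ignored, which is one way to limit over-correction.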

How to Use

Download the model weights from here

from checker import HZTypoChecker

tokenizer_name = "data/bert"

model_name = "where you downloaded the model weights"


checker = HZTypoChecker(model_name, tokenizer_name)

Provide a text that may contain typos:

txt = "忧 质 的 产 品 和 服 务 实 际 上 是 最 好 的 晶 牌 推 厂 方 式 。"
# Typos (wrong → intended): 忧 → 优, 晶 → 品, 厂 → 广

Call the checker to detect and correct typos:

ck_out = checker.check(txt)

print(ck_out.raw_tokens)
print(ck_out.check_cls)
print(ck_out.mod_tokens)

['[CLS]', '忧', '质', '的', '产', '品', '和', '服', '务', '实', '际', '上', '是', '最', '好', '的', '晶', '牌', '推', '厂', '方', '式', '。', '[SEP]']
[1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1]
['[CLS]', '优', '质', '的', '产', '品', '和', '服', '务', '实', '际', '上', '是', '最', '好', '的', '品', '牌', '推', '广', '方', '式', '。', '[SEP]']
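To turn the three lists into a readable summary, a small helper can pair each flagged position with its original and suggested character. This is only a sketch: `list_corrections` is a hypothetical name, not a function from this repository, and it assumes that the three fields are plain Python lists of equal length, with 0 marking a suspected typo, as in the printed output above. The lists are copied from that output so the snippet runs without the model.

```python
def list_corrections(raw_tokens, check_cls, mod_tokens):
    """Return (position, original, suggestion) for every token labeled 0."""
    return [
        (i, raw, mod)
        for i, (raw, label, mod) in enumerate(zip(raw_tokens, check_cls, mod_tokens))
        if label == 0
    ]


# Values copied from the example output; in practice they would come from
# ck_out.raw_tokens, ck_out.check_cls and ck_out.mod_tokens.
raw_tokens = ["[CLS]"] + list("忧质的产品和服务实际上是最好的晶牌推厂方式。") + ["[SEP]"]
check_cls = [1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1]
mod_tokens = ["[CLS]"] + list("优质的产品和服务实际上是最好的品牌推广方式。") + ["[SEP]"]

corrections = list_corrections(raw_tokens, check_cls, mod_tokens)
for position, original, suggestion in corrections:
    print(f"{position}: {original} -> {suggestion}")
# 1: 忧 -> 优
# 16: 晶 -> 品
# 19: 厂 -> 广
```

Positions are token indices, so index 0 is the `[CLS]` token and the first character of the text is at index 1.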

References

BERT paper code

ELECTRA paper code
