Romeo-CC/chinese_typo_checker

Chinese Hanzi Typo Checker

Introduction

Typos are common in Chinese text because most characters are entered through pinyin input methods, where many characters share the same pronunciation. In addition, most Chinese characters are phono-semantic compounds (形声字), so characters that share a component often look or sound alike, which makes typos even more likely.

Therefore, techniques for detecting and correcting typos in Chinese text are highly valuable and in demand. This repository aims to build a Chinese typo checker by leveraging the capabilities of masked language models like BERT.

Method

graph LR
    A@{ shape: lean-r, label: "Input Chinese Text" } --> B@{ shape: rect, label: "Tokenizer" }
    B -->|"input_ids<br>attention_mask"| C@{ shape: procs, label: "Stacked Encoder Layers" }
    subgraph Model
    C -->|"last hidden state"| D@{ shape: rect, label: "MaskedLM Head" }
    C -->|"last hidden state"| E@{ shape: rect, label: "Token CLS Head" }
    end
    D --> F(["Typo Correction"])
    E --> G(["Typo Detection"]) 
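
The diagram shows the two heads producing separate outputs. One plausible way to combine them is to keep every token the detection head leaves unflagged and substitute the MaskedLM head's top prediction wherever a token is flagged. The sketch below illustrates that idea only; `apply_corrections`, its parameters, and the sample values are hypothetical and are not taken from this repository's code.

```python
from typing import List


def apply_corrections(
    tokens: List[str], flagged: List[bool], mlm_top1: List[str]
) -> List[str]:
    """Keep unflagged tokens; replace flagged ones with the MLM top-1 prediction.

    Hypothetical helper: the repository may combine the two heads differently.
    """
    if not (len(tokens) == len(flagged) == len(mlm_top1)):
        raise ValueError("all inputs must have one entry per token")
    return [
        pred if is_typo else tok
        for tok, is_typo, pred in zip(tokens, flagged, mlm_top1)
    ]


# Illustrative values only (assumed, not real model output).
tokens = ["推", "厂", "方", "式"]
flagged = [False, True, False, False]  # stand-in for the detection head
mlm_top1 = ["推", "广", "方", "式"]      # stand-in for the MaskedLM head
corrected = apply_corrections(tokens, flagged, mlm_top1)
print(corrected)
```

Gating the correction on the detection flag means a correctly written character is never overwritten by the language model's guess.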

How to Use

Download the model weights from here.

from checker import HZTypoChecker

tokenizer_name = "data/bert"

model_name = "where you downloaded the model weights"


checker = HZTypoChecker(model_name, tokenizer_name)

Provide a text that may contain typos:

txt = "忧 质 的 产 品 和 服 务 实 际 上 是 最 好 的 晶 牌 推 厂 方 式 。"
# typos: 忧 (优), 晶 (品), 厂 (广)

Call the checker to detect typos:

ck_out = checker.check(txt)
print(ck_out.raw_tokens)
print(ck_out.check_cls)
print(ck_out.mod_tokens)
['[CLS]', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '[SEP]']
[1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1]
['[CLS]', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '广', '', '', '', '[SEP]']
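
In the `check_cls` output above, the three `0` entries sit at indices 1, 16, and 19, which line up with the three known typos in `txt` once `[CLS]` is counted as index 0. Assuming that `0` marks a suspected typo and that the tokenizer emits one token per character, the flagged characters can be listed as follows. The helper `flagged_positions` is hypothetical, and `raw_tokens` is rebuilt here from `txt` rather than taken from the checker.

```python
from typing import List, Tuple

txt = "忧 质 的 产 品 和 服 务 实 际 上 是 最 好 的 晶 牌 推 厂 方 式 。"

# Assumption: one token per character, wrapped in [CLS] and [SEP].
raw_tokens = ["[CLS]"] + txt.split() + ["[SEP]"]

# Labels copied from the printed check_cls output above.
check_cls = [
    1, 0, 1, 1, 1, 1, 1, 1,  # indices 0-7
    1, 1, 1, 1, 1, 1, 1, 1,  # indices 8-15
    0, 1, 1, 0, 1, 1, 1, 1,  # indices 16-23
]


def flagged_positions(
    tokens: List[str], labels: List[int], typo_label: int = 0
) -> List[Tuple[int, str]]:
    """Return (index, token) pairs whose label equals typo_label.

    Assumption: label 0 means "suspected typo", inferred from the example.
    """
    return [
        (i, tok)
        for i, (tok, lab) in enumerate(zip(tokens, labels))
        if lab == typo_label
    ]


suspects = flagged_positions(raw_tokens, check_cls)
print(suspects)  # [(1, '忧'), (16, '晶'), (19, '厂')]
```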

References

BERT: paper, code

ELECTRA: paper, code
