Merged
120 changes: 120 additions & 0 deletions docs/plugins/analysis-kuromoji.asciidoc
@@ -624,3 +624,123 @@ Which results in:
} ]
}
--------------------------------------------------

[[analysis-kuromoji-hiragana-uppercase]]
==== `hiragana_uppercase` token filter

The `hiragana_uppercase` token filter normalizes small letters (捨て仮名) in hiragana into normal letters.

@leemthompo (Contributor), Sep 4, 2024:

Is "normal letters" accepted phrasing?

Maybe "The `hiragana_uppercase` token filter normalizes small Hiragana letters (捨て仮名) into full-size Hiragana letters"?

Contributor Author:

Good point, maybe standard (or regular) would be better. The word "Full-size" sounds like "full-width" (multi-byte), which is not the case here. Let me change "normal" to "standard".

Contributor:

Sounds good! Glad I was able to communicate that despite my ignorance of linguistic terms :)

This filter is useful if you want to search old-style Japanese text, such as
patents, legal documents, and contract policies.

For example:

[source,console]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "hiragana_uppercase"
            ]
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "ちょっとまって"
}
--------------------------------------------------

Which results in:

[source,console-result]
--------------------------------------------------
{
  "tokens": [
    {
      "token": "ちよつと",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "まつ",
      "start_offset": 4,
      "end_offset": 6,
      "type": "word",
      "position": 1
    },
    {
      "token": "て",
      "start_offset": 6,
      "end_offset": 7,
      "type": "word",
      "position": 2
    }
  ]
}
--------------------------------------------------
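
To make the normalization concrete, here is a minimal Python sketch of the per-character substitution the `hiragana_uppercase` description implies. It is not the plugin's implementation: the mapping table and the function name `hiragana_uppercase` are assumptions chosen for illustration, and the real filter may cover more characters. It only shows how each token from the tokenizer would change.

```python
# Illustrative table: small hiragana -> standard-size hiragana.
# Assumption: this list is a sketch and may not match the plugin's table exactly.
SMALL_TO_STANDARD_HIRAGANA = str.maketrans(
    "ぁぃぅぇぉっゃゅょゎ",
    "あいうえおつやゆよわ",
)


def hiragana_uppercase(token: str) -> str:
    """Replace each small hiragana character with its standard-size form."""
    return token.translate(SMALL_TO_STANDARD_HIRAGANA)


# Apply the substitution to the tokens produced by the tokenizer in the example.
for token in ["ちょっと", "まっ", "て"]:
    print(token, "->", hiragana_uppercase(token))
```

Because the substitution is one character for one character, token lengths and offsets stay the same, which is consistent with the offsets in the result above.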

[[analysis-kuromoji-katakana-uppercase]]
==== `katakana_uppercase` token filter

The `katakana_uppercase` token filter normalizes small letters (捨て仮名) in katakana into normal letters.

Contributor:

Same question as above.

This filter is useful if you want to search old-style Japanese text, such as
patents, legal documents, and contract policies.

For example:

[source,console]
--------------------------------------------------
PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "katakana_uppercase"
            ]
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "ストップウォッチ"
}
--------------------------------------------------

Which results in:

[source,console-result]
--------------------------------------------------
{
  "tokens": [
    {
      "token": "ストツプウオツチ",
      "start_offset": 0,
      "end_offset": 8,
      "type": "word",
      "position": 0
    }
  ]
}
--------------------------------------------------
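
The katakana case can be sketched the same way. The Python snippet below is a hypothetical illustration, not the plugin's code: the mapping table and the name `katakana_uppercase` are assumptions, and the real filter may handle additional small katakana characters.

```python
# Illustrative table: small katakana -> standard-size katakana.
# Assumption: this list is a sketch and may not match the plugin's table exactly.
SMALL_TO_STANDARD_KATAKANA = str.maketrans(
    "ァィゥェォッャュョヮ",
    "アイウエオツヤユヨワ",
)


def katakana_uppercase(token: str) -> str:
    """Replace each small katakana character with its standard-size form."""
    return token.translate(SMALL_TO_STANDARD_KATAKANA)


# The single token from the example above.
print(katakana_uppercase("ストップウォッチ"))
```

Characters outside the table, including the standard-size katakana already in the token, pass through unchanged.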