Ranges in DerivedNames for Rugrep #19

@noraj

The issue is that DerivedName.txt contains several ranges:

➜ cat data/DerivedName.txt | grep '\.\.'
3400..4DBF    ; CJK UNIFIED IDEOGRAPH-*
4E00..9FFF    ; CJK UNIFIED IDEOGRAPH-*
F900..FA6D    ; CJK COMPATIBILITY IDEOGRAPH-*
FA70..FAD9    ; CJK COMPATIBILITY IDEOGRAPH-*
17000..187F7  ; TANGUT IDEOGRAPH-*
18B00..18CD5  ; KHITAN SMALL SCRIPT CHARACTER-*
18D00..18D08  ; TANGUT IDEOGRAPH-*
1B170..1B2FB  ; NUSHU CHARACTER-*
20000..2A6DF  ; CJK UNIFIED IDEOGRAPH-*
2A700..2B739  ; CJK UNIFIED IDEOGRAPH-*
2B740..2B81D  ; CJK UNIFIED IDEOGRAPH-*
2B820..2CEA1  ; CJK UNIFIED IDEOGRAPH-*
2CEB0..2EBE0  ; CJK UNIFIED IDEOGRAPH-*
2EBF0..2EE5D  ; CJK UNIFIED IDEOGRAPH-*
2F800..2FA1D  ; CJK COMPATIBILITY IDEOGRAPH-*
30000..3134A  ; CJK UNIFIED IDEOGRAPH-*
31350..323AF  ; CJK UNIFIED IDEOGRAPH-*

Currently, the code converts the hexadecimal code point string to an integer:

cp_int = cp_int.chomp.to_i(16)

which ignores ranges, because String#to_i stops parsing at the first character that is not a valid hexadecimal digit:

irb(main):001:0> '2CEB0..2EBE0'.to_i(16)
=> 183984
irb(main):002:0> '2CEB0'.to_i(16)
=> 183984

So each range is displayed as a single code point (the first one of the range):

➜ unisec grep '' | grep 'NUSHU'
U+16FE1 𖿡    NUSHU ITERATION MARK
U+1B170 𛅰    NUSHU CHARACTER-*
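
One way to stop losing the range end would be to split the code point field on `..` before converting it. This is only a minimal sketch: `parse_code_points` is a hypothetical helper, not an existing unisec method, and it assumes the field has already been separated from the name at the `;`.

```ruby
# Hypothetical helper, not part of unisec: turn the first field of a
# DerivedName.txt line into a Range of integer code points.
def parse_code_points(field)
  first, last = field.strip.split('..', 2)
  first = Integer(first, 16)
  # A single code point becomes a one-element range.
  last = last ? Integer(last, 16) : first
  first..last
end

range = parse_code_points('1B170..1B2FB')
puts format('U+%04X..U+%04X (%d code points)', range.first, range.last, range.size)
```

Using `Integer(str, 16)` instead of `String#to_i` makes malformed input raise an `ArgumentError` rather than being silently truncated.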

Possible solutions:

  1. Parse the file better and display ranges with a horizontal ellipsis
    • Pros: keeps one command
    • Cons: adds code complexity, output is inconsistent (bad for piping to other commands)
  2. Add a sub-command named ranges
    • Pros: keeps the output of the grep command consistent
    • Cons: splits the feature across several commands
  3. Append the range end to the name, e.g. U+1B170 𛅰 NUSHU CHARACTER-* (up to U+1B2FB)
    • Pros: keeps one command, code point column stays consistent
    • Cons: name column becomes unreliable (information appended to the name)
  4. Expand the name dynamically
    • Pros: no inconsistency, no unreliable column
    • Cons: for matching results, the output would be very large for little added value and become unreadable
  5. Add a third field for comments
    • Cons: introduces new behavior just for a few exceptions
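
To make options 3 and 4 more concrete, here is a rough sketch. Both helpers (`format_padded`, `expand_name`) and their output formats are hypothetical and not part of unisec; the character glyph column is left out for brevity. Option 4 relies on the `*` in DerivedName.txt standing for the code point in uppercase hexadecimal.

```ruby
# Hypothetical formatters, not part of unisec.

# Option 3: keep one line per range and append the range end to the name.
def format_padded(range, name)
  format('U+%04X %s (up to U+%04X)', range.first, name, range.last)
end

# Option 4: expand the name for a single code point by replacing the
# trailing "*" with the code point in uppercase hexadecimal.
def expand_name(code_point, name)
  name.sub(/\*\z/, format('%04X', code_point))
end

nushu = 0x1B170..0x1B2FB
puts format_padded(nushu, 'NUSHU CHARACTER-*')
puts expand_name(nushu.first, 'NUSHU CHARACTER-*')
```

With option 4, a search would have to call something like `expand_name` for every code point in a matching range, which is where the large output comes from.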
