[TOC]
UnicodeSets use regular-expression syntax to allow for arbitrary set operations
(Union, Intersection, Difference) on sets of Unicode characters. The base sets
can be specified explicitly, such as [a-m w-z], or using a combination of
Unicode Properties such as the following, for the Arabic script characters
that have a canonical decomposition:
[[:script=arabic:]&[:decompositiontype=canonical:]]
Enter a UnicodeSet into the Input box, and hit Show Set. You can also choose certain combinations of options for display, such as abbreviated or not.
The values you use are encapsulated into a URL for reference, such as
https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=\p{sc:Greek}
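When such a URL is built programmatically, characters like \, { and } need percent-encoding. The following is a minimal Python sketch, assuming only the `a` query parameter shown in the URL above; the helper name `unicodeset_url` is hypothetical and not part of the demo.

```python
from urllib.parse import quote

# Base address taken from the example URL above.
BASE = "https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp"

def unicodeset_url(expr: str) -> str:
    """Return a list-unicodeset URL for a UnicodeSet expression (hypothetical helper)."""
    # Leave ':' and '/' readable; percent-encode everything else that is unsafe.
    return f"{BASE}?a={quote(expr, safe=':/')}"

print(unicodeset_url(r"\p{sc:Greek}"))
print(unicodeset_url("[[:name=/CJK/:]-[:ideographic:]]"))
```

Using : instead of = inside the expression keeps the resulting URL shorter, since = is percent-encoded by this sketch while : is not.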
If you add properties to the Group By box, you can sort the results by
property values. For example, if you set it to General_Category Numeric_Value
(or the short form gc nv), you'll see the results sorted first by the general
category of the characters, and then by the numeric value.
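The effect of grouping by gc nv can be approximated offline. This is only a sketch using Python's `unicodedata` module: it orders General_Category values alphabetically by short alias, which may not be the order the demo uses, and `-1.0` is an arbitrary stand-in for "no numeric value".

```python
import unicodedata

def group_key(ch: str):
    """Sort key: general category first, then numeric value."""
    gc = unicodedata.category(ch)        # short alias such as "Nd", "No", "Lu"
    nv = unicodedata.numeric(ch, -1.0)   # -1.0 is a placeholder for "no numeric value"
    return (gc, nv)

# SUPERSCRIPT TWO, DIGIT ONE, b, A, ARABIC-INDIC DIGIT THREE, VULGAR FRACTION ONE HALF
sample = ["\u00b2", "1", "b", "A", "\u0663", "\u00bd"]
ordered = sorted(sample, key=group_key)

for ch in ordered:
    gc, nv = group_key(ch)
    print(f"U+{ord(ch):04X} gc={gc} nv={nv}")
```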
UnicodeSets are defined according to the description in UTS #35: Locale Data Markup Language (LDML), but have some useful extensions in these online demos.
Properties can be specified either with Perl-style notation
(\p{script=arabic}) or with POSIX-style notation ([:script=arabic:]).
Properties and values can either use a long form (like script) or a short form
(like sc).
A property given with no value is equivalent to the value "Yes"; this is mostly
useful with binary properties, like \p{isLowercase}.
The following examples illustrate the syntax with a particular property-value
pair: the property age and the value 3.2:
The : can be used in place of =. (Mostly because : doesn't require
percent-encoding in URLs.)
\p{age:3.2} and [:age:3.2:]
The Perl and POSIX syntaxes for negation are \P{...} and [:^...:],
respectively. The characters ≠ and ! are added for convenience:
- \p{age≠3.2} and [:age≠3.2:]
- \p{age!=3.2} and [:age!=3.2:]
- \p{age!:3.2} and [:age!=3.2:]
For the name property, regular expressions can be used for the value, enclosed
in /.../. For example, in the first expression below, the first term selects
all Unicode characters whose names contain "CJK". The rest of the expression
then subtracts the ideographic characters, showing that these can be used in
arbitrary combinations.
- [[[:name=/CJK/:]-[:ideographic:]]](https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%5B:name=/CJK/:%5D-%5B:ideographic:%5D%5D) - the set of all characters with names that contain CJK that are not Ideographic
- [[:name=/\bDOT$/:]](https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B:name=/%5CbDOT$/:%5D) - the set of all characters with names that end with the word DOT
- [[:block=/(?i)arab/:]](https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B:block=/(?i)arab/:%5D) - the set of all characters in blocks that contain the sequence of letters "arab" (case-insensitive)
- [[:toNFKC=/\./:]](https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B:toNFKC=/%5C./:%5D) - the set of all characters with toNFKC values that contain a literal period
Some particularly useful regex features are:
- \b means a word break, ^ means the start of the string, and $ means the end. So /^DOT\b/ means the word DOT at the start.
- (?i) means case-insensitive matching.
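To experiment with a name pattern outside the demo, a rough equivalent of [:name=/\bDOT$/:] can be sketched in Python. This is an approximation under stated assumptions: it uses Python's `re` rather than Java's Pattern (the two agree on \b, ^, $ and (?i) for patterns like these), it searches for the pattern anywhere in the name, and it covers whichever Unicode version the local Python ships. The helper name is hypothetical.

```python
import re
import unicodedata

SKIPPED = {"Cn", "Cs", "Co"}  # Unassigned, Surrogate, Private Use

def chars_with_name_matching(pattern: str) -> list[str]:
    """Characters whose name contains a match for the pattern (hypothetical helper)."""
    rx = re.compile(pattern)
    found = []
    for cp in range(0x110000):
        ch = chr(cp)
        if unicodedata.category(ch) in SKIPPED:
            continue
        # Characters without a name (such as controls) get "" here.
        if rx.search(unicodedata.name(ch, "")):
            found.append(ch)
    return found

dots = chars_with_name_matching(r"\bDOT$")
print(len(dots), [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in dots[:5]])
```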
Caveats:
- The regex uses the standard Java Pattern. In particular, it does not have the
  extended functions in UnicodeSet, nor is it up-to-date with the latest
  Unicode. So be aware that you shouldn't depend on properties inside of the
  /.../ pattern.
- If you do use properties, then use [:...:] syntax on the outside, as in
  [[:name=/CJK/:]-[:ideographic:]] above.
- The Unassigned, Surrogate, and Private Use code points are skipped in the
  regex comparison, so [:Block=/Aegean_Numbers/:] returns a different number of
  characters than [:Block=Aegean_Numbers:], because it skips Unassigned code
  points.
- None of the normal "loose matching" is enabled. So [:Block=aegeannumbers:]
  works, but [:Block=/aegeannumbers/:] fails; you have to use
  [:Block=/Aegean_Numbers/:] or [:Block=/(?i)aegean_numbers/:].
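The difference between loose matching and regex matching can be illustrated with a small Python sketch. The `loose` function is a simplified stand-in for Unicode's loose matching of property values (it ignores case, whitespace, underscores and hyphens only) and is an assumption made for illustration, not the demo's code.

```python
import re

def loose(value: str) -> str:
    """Simplified loose matching: ignore case, whitespace, '_' and '-'."""
    return re.sub(r"[\s_-]", "", value).lower()

block = "Aegean_Numbers"

# Plain property value: compared loosely, so this matches.
plain_ok = loose("aegeannumbers") == loose(block)

# Regex values are matched against the exact value string instead.
regex_loose = re.search(r"aegeannumbers", block) is not None
regex_exact = re.search(r"Aegean_Numbers", block) is not None
regex_nocase = re.search(r"(?i)aegean_numbers", block) is not None

print(plain_ok, regex_loose, regex_exact, regex_nocase)  # True False True True
```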
Property values can be compared to those for other properties, using the syntax
@...@. For example:
- Find the characters for which IDNA2003 is not the same as UTS46:
  \p{idna2003!=@uts46@}
- The same thing, but limited to Unicode 3.2:
  \p{idna2003!=@uts46@}&\p{age=3.2}
There is a special property "cp" that returns the code point itself. For example:
- Find the characters whose lowercase is different:
  \p{toLowercase!=@cp@}
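A comparison between two properties can be thought of as a filter that keeps every character for which the two property functions disagree. The sketch below mimics the lowercase example, assuming Python's `str.lower` as a stand-in for toLowercase and an identity function for cp; all function names are hypothetical, and results depend on the Unicode version of the local Python.

```python
import unicodedata

SKIPPED = {"Cn", "Cs", "Co"}  # Unassigned, Surrogate, Private Use

def cp(ch: str) -> str:
    """The code point itself."""
    return ch

def to_lowercase(ch: str) -> str:
    """Stand-in for toLowercase, using Python's lowercase mapping."""
    return ch.lower()

def chars_where_different(prop_a, prop_b) -> set[str]:
    """Characters X with prop_a(X) != prop_b(X)."""
    return {
        ch
        for ch in map(chr, range(0x110000))
        if unicodedata.category(ch) not in SKIPPED and prop_a(ch) != prop_b(ch)
    }

changed = chars_where_different(to_lowercase, cp)
print(len(changed), "A" in changed, "a" in changed)
```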
You can see a full listing of the possible properties on
https://util.unicode.org/UnicodeJsps/properties.jsp. The standard Unicode
properties are supported, plus the extra ICU properties. There are some
additional properties just in this demo. The easiest way to see the properties
for a range of characters is to use a set like [:Greek:] in the Input, and
then set the Group By box to the property name.
- International Domain Names:
  - idna2003, uts46, idna2008, idna2008c, plus the transforms:
  - toIdna2003, toUTS46t (transitional form), toUTS46n (the normal form).
- Security:
  - usage (the xmod properties from the security mechanisms),
  - idr (identifier restrictions),
  - confusables
- HanType:
  - Hans (Simplified-only), Hant (Traditional-only), or Han (both) (based on Unicode properties)
- Casing:
  - toCaseFold, toLowerCase, toUpperCase, toTitleCase, toNFKC_CF
  - isCaseFolded, isUppercase, isLowercase, isTitlecase, isCased, isNFKC_CF
- Normalization:
  - toNFC, toNFD, toNFKD, toNFKC;
  - isNFC, isNFKC, isNFD, isNFKD
- Informational:
  - subhead (the subhead from the Unicode charts, simplified slightly to remove variations like plurals and use of terms like "Additional")
- Misc:
  - ASCII, ANY (matches any code point), BMP,
  - emoji (the emoji characters, both new and old)
- Scripts:
  - scs (the script extensions in Unicode 6.0 -- also adds HanType)
- Encodings:
  - enc_GBK, is_enc_GBK
  - (and a few other common encodings)
- Sorting:
  - uca (the primary UCA weight -- after the CLDR transforms),
  - uca2 (the primary and secondary weights)
Normally, \p{isX} is equivalent to \p{toX=@cp@}. There are some exceptions and
missing cases.
Note: The Unassigned, Surrogate, and Private Use code points are skipped in the generation of some of these sets.
The following provides details for some cases.
Unicode defines a number of string casing functions in Section 3.13 Default
Case Algorithms. These string functions can also be applied to single
characters.
Warning: the first three sets may be somewhat misleading:
isLowercase means that the character is the same as its lowercase version, which
includes all uncased characters. To get those characters that are cased
characters and lowercase, use
[[:isLowercase:]&[:isCased:]]
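The distinction can be checked with a short Python sketch. It assumes `str.lower` as a stand-in for toLowercase and approximates isCased through Python's `islower`, `isupper` and the Lt general category; the function names are hypothetical.

```python
import unicodedata

def is_lowercase(ch: str) -> bool:
    """isLowercase in the sense above: the character equals its lowercase version."""
    return ch.lower() == ch

def is_cased(ch: str) -> bool:
    """Approximation of isCased: lowercase, uppercase, or a titlecase letter."""
    return ch.islower() or ch.isupper() or unicodedata.category(ch) == "Lt"

# "1" is uncased, yet it equals its own lowercase version.
for ch in ["a", "A", "1", "\u01c5"]:
    print(f"U+{ord(ch):04X}", is_lowercase(ch), is_cased(ch),
          is_lowercase(ch) and is_cased(ch))
```

Only "a" passes the combined test in this sample, which corresponds to intersecting the two sets.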
- The binary testing operations take no argument, as in [:isLowercase:] above.
- The string functions are also provided, and require an argument. For example:
  - [:toLowercase=a:] - the set of all characters X such that toLowercase(X) = a
  - [:toCaseFold=a:]
  - [:toUppercase=A:]
  - [:toTitlecase=A:]
Note: The Unassigned, Surrogate, and Private Use code points are skipped in generation of the sets.
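A set defined by a string function and an argument is the inverse image of that argument: every character that the function maps to it. A sketch in Python, assuming the built-in `str` case methods as stand-ins for the Unicode casing functions; the helper name is hypothetical and results depend on the local Python's Unicode version.

```python
import unicodedata

SKIPPED = {"Cn", "Cs", "Co"}  # Unassigned, Surrogate, Private Use

def chars_mapping_to(func, target: str) -> list[str]:
    """Characters X with func(X) == target (hypothetical helper)."""
    return [
        ch
        for ch in map(chr, range(0x110000))
        if unicodedata.category(ch) not in SKIPPED and func(ch) == target
    ]

print(chars_mapping_to(str.lower, "a"))     # compare with [:toLowercase=a:]
print(chars_mapping_to(str.casefold, "a"))  # compare with [:toCaseFold=a:]
print(chars_mapping_to(str.upper, "A"))     # compare with [:toUppercase=A:]
print(chars_mapping_to(str.title, "A"))     # compare with [:toTitlecase=A:]
```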
Unicode defines a number of string normalization functions in UAX #15. These string functions can also be applied to single characters.
- [:toNFC=a:] - the set of all characters X such that toNFC(X) = a
- [:toNFD=A\u0300:]
- [:toNFKC=A:]
- [:toNFKD=A\u0300:]
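These sets can be approximated with `unicodedata.normalize` from the Python standard library. The sketch below is an illustration rather than the demo's implementation; the helper name is hypothetical and the results follow the Unicode version of the local Python.

```python
import unicodedata

SKIPPED = {"Cn", "Cs", "Co"}  # Unassigned, Surrogate, Private Use

def chars_normalizing_to(form: str, target: str) -> list[str]:
    """Characters X whose normalization under `form` equals target."""
    return [
        ch
        for ch in map(chr, range(0x110000))
        if unicodedata.category(ch) not in SKIPPED
        and unicodedata.normalize(form, ch) == target
    ]

nfd = chars_normalizing_to("NFD", "A\u0300")   # compare with [:toNFD=A\u0300:]
nfkc = chars_normalizing_to("NFKC", "A")       # compare with [:toNFKC=A:]
print([f"U+{ord(c):04X}" for c in nfd])
print([f"U+{ord(c):04X}" for c in nfkc])
```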