| 1 | +# 📚 utf8.h |
| 2 | + |
| 3 | +[](https://github.com/sheredom/utf8.h/actions) |
| 4 | +[](https://ci.appveyor.com/project/sheredom/utf8-h) |
| 5 | +[](https://github.com/sponsors/sheredom) |
| 6 | + |
| 7 | +A simple one-header solution for supporting utf8 strings in C and C++. |
| 8 | + |
| 9 | +Functions provided from the C header string.h but with a utf8* prefix instead of the str* prefix: |
| 10 | + |
| 11 | +[API function docs](#api-function-docs) |
| 12 | + |
| 13 | +string.h | utf8.h | complete | C++14 constexpr |
| 14 | +---------|--------|---------|--------- |
| 15 | +strcat | utf8cat | ✔ | |
| 16 | +strchr | utf8chr | ✔ | ✔ |
| 17 | +strcmp | utf8cmp | ✔ | ✔ |
| 18 | +strcoll | utf8coll | | |
| 19 | +strcpy | utf8cpy | ✔ | |
| 20 | +strcspn | utf8cspn | ✔ | ✔ |
| 21 | +strdup | utf8dup | ✔ | |
| 22 | +strfry | utf8fry | | |
| 23 | +strlen | utf8len | ✔ | ✔ |
| 24 | +strnlen | utf8nlen | ✔ | ✔ |
| 25 | +strncat | utf8ncat | ✔ | |
| 26 | +strncmp | utf8ncmp | ✔ | ✔ |
| 27 | +strncpy | utf8ncpy | ✔ | |
| 28 | +strndup | utf8ndup | ✔ | |
| 29 | +strpbrk | utf8pbrk | ✔ | ✔ |
| 30 | +strrchr | utf8rchr | ✔ | ✔ |
| 31 | +strsep | utf8sep | | |
| 32 | +strspn | utf8spn | ✔ | ✔ |
| 33 | +strstr | utf8str | ✔ | ✔ |
| 34 | +strtok | utf8tok | | |
| 35 | +strxfrm | utf8xfrm | | |
| 36 | + |
| 37 | +Functions provided from the C header strings.h but with a utf8* prefix instead of the str* prefix: |
| 38 | + |
| 39 | +strings.h | utf8.h | complete | C++14 constexpr |
| 40 | +----------|--------|---------|--------- |
| 41 | +strcasecmp | utf8casecmp | ~~✔~~ | ✔ |
| 42 | +strncasecmp | utf8ncasecmp | ~~✔~~ | ✔ |
| 43 | +strcasestr | utf8casestr | ~~✔~~ | ✔ |
| 44 | + |
| 45 | +Functions provided that are unique to utf8.h: |
| 46 | + |
| 47 | +utf8.h | complete | C++14 constexpr |
| 48 | +-------|---------|--------- |
| 49 | +utf8codepoint | ✔ | ✔ |
| 50 | +utf8rcodepoint | ✔ | ✔ |
| 51 | +utf8size | ✔ | ✔ |
| 52 | +utf8size\_lazy | ✔ | ✔ |
| 53 | +utf8nsize\_lazy | ✔ | ✔ |
| 54 | +utf8valid | ✔ | ✔ |
| 55 | +utf8nvalid | ✔ | ✔ |
| 56 | +utf8makevalid | ✔ | |
| 57 | +utf8codepointsize | ✔ | ✔ |
| 58 | +utf8catcodepoint | ✔ | |
| 59 | +utf8isupper | ~~✔~~ | ✔ |
| 60 | +utf8islower | ~~✔~~ | ✔ |
| 61 | +utf8lwr | ~~✔~~ | |
| 62 | +utf8upr | ~~✔~~ | |
| 63 | +utf8lwrcodepoint | ~~✔~~ | ✔ |
| 64 | +utf8uprcodepoint | ~~✔~~ | ✔ |
| 65 | + |
| 66 | +## Usage ## |
| 67 | + |
| 68 | +Just `#include "utf8.h"` in your code! |
| 69 | + |
| 70 | +The current supported platforms are Linux, macOS and Windows. |
| 71 | + |
| 72 | +The current supported compilers are gcc, clang, MSVC's cl.exe, and clang-cl.exe. |
| 73 | + |
| 74 | +## Design ## |
| 75 | + |
| 76 | +The utf8.h API matches the string.h API as much as possible by design. There are a few major differences though. |
| 77 | + |
| 78 | +utf8.h uses `char8_t*` in C++20 instead of `char*`. |
| 79 | + |
| 80 | +Anywhere in the string.h or strings.h documentation where it refers to 'bytes' I have changed that to utf8 codepoints. For instance, utf8len will return the number of utf8 codepoints in a utf8 string - which does not necessarily equate to the number of bytes. |
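
To make the byte/codepoint distinction concrete, here is a minimal standalone sketch. It does not use utf8.h; `count_codepoints` is a hypothetical helper written only for illustration, and it assumes the input is already valid utf8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h> /* strlen, for comparing against the byte count */

/* Hypothetical helper, not part of utf8.h: counts the codepoints in a
   null-terminated string that is assumed to be valid utf8. */
static size_t count_codepoints(const char *s) {
  size_t count = 0;
  assert(s != NULL);
  for (; *s != '\0'; ++s) {
    /* Continuation bytes have the form 10xxxxxx; every other byte
       starts a new codepoint. */
    if (((unsigned char)*s & 0xC0) != 0x80) {
      ++count;
    }
  }
  return count;
}
```

For the string `"h\xc3\xa9llo"` ("héllo"), `strlen` reports 6 bytes while `count_codepoints` reports 5 codepoints, because `é` occupies two bytes.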
| 81 | + |
| 82 | +## API function docs ## |
| 83 | + |
| 84 | +```c |
| 85 | +int utf8casecmp(const void *src1, const void *src2); |
| 86 | +``` |
| 87 | +Return less than 0, 0, greater than 0 if `src1 < src2`, `src1 == src2`, |
| 88 | +`src1 > src2` respectively, case insensitive. |
| 89 | + |
| 90 | +```c |
| 91 | +void *utf8cat(void *dst, const void *src); |
| 92 | +``` |
| 93 | +Append the utf8 string `src` onto the utf8 string `dst`. |
| 94 | + |
| 95 | +```c |
| 96 | +void *utf8chr(const void *src, utf8_int32_t chr); |
| 97 | +``` |
| 98 | +Find the first match of the utf8 codepoint `chr` in the utf8 string `src`. |
| 99 | + |
| 100 | +```c |
| 101 | +int utf8cmp(const void *src1, const void *src2); |
| 102 | +``` |
| 103 | +Return less than 0, 0, greater than 0 if `src1 < src2`, |
| 104 | +`src1 == src2`, `src1 > src2` respectively. |
| 105 | + |
| 106 | +```c |
| 107 | +void *utf8cpy(void *dst, const void *src); |
| 108 | +``` |
| 109 | +Copy the utf8 string `src` onto the memory allocated in `dst`. |
| 110 | + |
| 111 | +```c |
| 112 | +size_t utf8cspn(const void *src, const void *reject); |
| 113 | +``` |
| 114 | +Number of utf8 codepoints in the utf8 string `src` that consists entirely |
| 115 | +of utf8 codepoints not from the utf8 string `reject`. |
| 116 | + |
| 117 | +```c |
| 118 | +void *utf8dup(const void *src); |
| 119 | +``` |
| 120 | +Duplicate the utf8 string `src` by getting its size, `malloc`ing a new buffer, |
| 121 | +copying over the data, and returning that. Returns 0 if `malloc` failed. |
| 122 | + |
| 123 | +```c |
| 124 | +size_t utf8len(const void *str); |
| 125 | +``` |
| 126 | +Number of utf8 codepoints in the utf8 string `str`, |
| 127 | +**excluding** the null terminating byte. |
| 128 | + |
| 129 | +```c |
| 130 | +size_t utf8nlen(const void *str, size_t n); |
| 131 | +``` |
| 132 | +Similar to `utf8len`, except that at most `n` bytes of `str` are examined. |
| 133 | + |
| 134 | +```c |
| 135 | +int utf8ncasecmp(const void *src1, const void *src2, size_t n); |
| 136 | +``` |
| 137 | +Return less than 0, 0, greater than 0 if `src1 < src2`, `src1 == src2`, |
| 138 | +`src1 > src2` respectively, case insensitive. Checking at most `n` |
| 139 | +bytes of each utf8 string. |
| 140 | + |
| 141 | +```c |
| 142 | +void *utf8ncat(void *dst, const void *src, size_t n); |
| 143 | +``` |
| 144 | +Append the utf8 string `src` onto the utf8 string `dst`, |
| 145 | +writing at most `n+1` bytes. Can produce an invalid utf8 |
| 146 | +string if `n` falls partway through a utf8 codepoint. |
| 147 | + |
| 148 | +```c |
| 149 | +int utf8ncmp(const void *src1, const void *src2, size_t n); |
| 150 | +``` |
| 151 | +Return less than 0, 0, greater than 0 if `src1 < src2`, |
| 152 | +`src1 == src2`, `src1 > src2` respectively. Checking at most `n` |
| 153 | +bytes of each utf8 string. |
| 154 | + |
| 155 | +```c |
| 156 | +void *utf8ncpy(void *dst, const void *src, size_t n); |
| 157 | +``` |
| 158 | +Copy the utf8 string `src` onto the memory allocated in `dst`. |
| 159 | +Copies at most `n` bytes. If `n` falls partway through a utf8 |
| 160 | +codepoint, or if `dst` doesn't have enough room for a null |
| 161 | +terminator, the final string will be cut short to preserve |
| 162 | +utf8 validity. |
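
The idea of cutting a copy short so that it never ends partway through a codepoint can be sketched without the library. The helper below, `sketch_safe_prefix`, is hypothetical and is not utf8.h's implementation; it assumes the source is valid utf8 and only computes how many bytes could be copied safely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of utf8.h: returns the largest byte count
   <= n that ends on a codepoint boundary of the valid utf8 string s. */
static size_t sketch_safe_prefix(const char *s, size_t n) {
  size_t len = 0;
  assert(s != NULL);
  /* Take up to n bytes, stopping early at the null terminator. */
  while (len < n && s[len] != '\0') {
    ++len;
  }
  /* If the byte just past the cut is a continuation byte (10xxxxxx), the
     cut splits a codepoint: back up to the start of that codepoint. */
  while (len > 0 && ((unsigned char)s[len] & 0xC0) == 0x80) {
    --len;
  }
  return len;
}
```

With `"a\xe2\x82\xac"` ("a€", 4 bytes), a limit of 2 or 3 yields 1, since the three-byte `€` does not fit whole, while a limit of 4 or more yields 4.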
| 163 | + |
| 164 | +```c |
| 165 | +void *utf8pbrk(const void *str, const void *accept); |
| 166 | +``` |
| 167 | +Locates the first occurrence in the utf8 string `str` of any byte in the |
| 168 | +utf8 string `accept`, or 0 if no match was found. |
| 169 | + |
| 170 | +```c |
| 171 | +void *utf8rchr(const void *src, utf8_int32_t chr); |
| 172 | +``` |
| 173 | +Find the last match of the utf8 codepoint `chr` in the utf8 string `src`. |
| 174 | + |
| 175 | +```c |
| 176 | +size_t utf8size(const void *str); |
| 177 | +``` |
| 178 | +Number of bytes in the utf8 string `str`, |
| 179 | +including the null terminating byte. |
| 180 | + |
| 181 | +```c |
| 182 | +size_t utf8size_lazy(const void *str); |
| 183 | +``` |
| 184 | +Similar to `utf8size`, except that the null terminating byte is **excluded**. |
| 185 | + |
| 186 | +```c |
| 187 | +size_t utf8nsize_lazy(const void *str, size_t n); |
| 188 | +``` |
| 189 | +Similar to `utf8size`, except that at most `n` bytes of `str` are examined and |
| 190 | +the null terminating byte is **excluded**. |
| 191 | + |
| 192 | +```c |
| 193 | +size_t utf8spn(const void *src, const void *accept); |
| 194 | +``` |
| 195 | +Number of utf8 codepoints in the utf8 string `src` that consists entirely |
| 196 | +of utf8 codepoints from the utf8 string `accept`. |
| 197 | + |
| 198 | +```c |
| 199 | +void *utf8str(const void *haystack, const void *needle); |
| 200 | +``` |
| 201 | +The position of the utf8 string `needle` in the utf8 string `haystack`. |
| 202 | + |
| 203 | +```c |
| 204 | +void *utf8casestr(const void *haystack, const void *needle); |
| 205 | +``` |
| 206 | +The position of the utf8 string `needle` in the utf8 string `haystack`, |
| 207 | +case insensitive. |
| 208 | + |
| 209 | +```c |
| 210 | +void *utf8valid(const void *str); |
| 211 | +``` |
| 212 | +Return 0 on success, or the position of the invalid utf8 codepoint on failure. |
| 213 | + |
| 214 | +```c |
| 215 | +void *utf8nvalid(const void *str, size_t n); |
| 216 | +``` |
| 217 | +Similar to `utf8valid`, except that at most `n` bytes of `str` are examined. |
| 218 | + |
| 219 | +```c |
| 220 | +int utf8makevalid(void *str, utf8_int32_t replacement); |
| 221 | +``` |
| 222 | +Return 0 on success. Makes `str` valid by replacing invalid sequences with |
| 223 | +the 1-byte `replacement` codepoint. |
| 224 | + |
| 225 | +```c |
| 226 | +void *utf8codepoint(const void *str, utf8_int32_t *out_codepoint); |
| 227 | +``` |
| 228 | +Sets `out_codepoint` to the current utf8 codepoint in `str`, and returns the |
| 229 | +address of the next utf8 codepoint after the current one in `str`. |
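
The "decode one codepoint, return a pointer to the next" pattern can be illustrated with a standalone sketch. `sketch_decode` is a hypothetical function, not the library's code; it assumes valid utf8 input and performs no error checking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of utf8.h: decodes the codepoint at s
   (assumed valid utf8) into *out and returns the address just past it. */
static const char *sketch_decode(const char *s, int32_t *out) {
  const unsigned char *p = (const unsigned char *)s;
  assert(s != NULL && out != NULL);
  if ((p[0] & 0xF8) == 0xF0) { /* 11110xxx: 4-byte sequence */
    *out = ((int32_t)(p[0] & 0x07) << 18) | ((int32_t)(p[1] & 0x3F) << 12) |
           ((int32_t)(p[2] & 0x3F) << 6) | (int32_t)(p[3] & 0x3F);
    return s + 4;
  }
  if ((p[0] & 0xF0) == 0xE0) { /* 1110xxxx: 3-byte sequence */
    *out = ((int32_t)(p[0] & 0x0F) << 12) | ((int32_t)(p[1] & 0x3F) << 6) |
           (int32_t)(p[2] & 0x3F);
    return s + 3;
  }
  if ((p[0] & 0xE0) == 0xC0) { /* 110xxxxx: 2-byte sequence */
    *out = ((int32_t)(p[0] & 0x1F) << 6) | (int32_t)(p[1] & 0x3F);
    return s + 2;
  }
  *out = (int32_t)p[0]; /* 0xxxxxxx: single byte */
  return s + 1;
}
```

Calling it repeatedly until the decoded value is 0 walks a string one codepoint at a time, which is the iteration style the function above is described as supporting.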
| 230 | + |
| 231 | +```c |
| 232 | +void *utf8rcodepoint(const void *str, utf8_int32_t *out_codepoint); |
| 233 | +``` |
| 234 | +Sets `out_codepoint` to the current utf8 codepoint in `str`, and returns the |
| 235 | +address of the previous utf8 codepoint before the current one in `str`. |
| 236 | + |
| 237 | +```c |
| 238 | +size_t utf8codepointsize(utf8_int32_t chr); |
| 239 | +``` |
| 240 | +Returns the size of the given codepoint in bytes. |
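
As a rough illustration of how a codepoint's encoded size follows from its value, here is a standalone sketch. `sketch_codepoint_size` is a hypothetical name and this is not necessarily how utf8.h computes it; the thresholds are the standard UTF-8 range boundaries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of utf8.h: number of bytes needed to
   encode a non-negative codepoint in utf8. */
static size_t sketch_codepoint_size(int32_t cp) {
  assert(cp >= 0);
  if (cp < 0x80) {
    return 1; /* U+0000..U+007F */
  }
  if (cp < 0x800) {
    return 2; /* U+0080..U+07FF */
  }
  if (cp < 0x10000) {
    return 3; /* U+0800..U+FFFF */
  }
  return 4; /* U+10000 and above */
}
```

For example, `A` (U+0041) needs 1 byte, `é` (U+00E9) needs 2, `€` (U+20AC) needs 3, and U+1F600 needs 4.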
| 241 | + |
| 242 | +```c |
| 243 | +void *utf8catcodepoint(void *utf8_restrict str, utf8_int32_t chr, size_t n); |
| 244 | +``` |
| 245 | +Write a codepoint to the given string, and return the address of the next |
| 246 | +place after the written codepoint. Pass the number of bytes left in the |
| 247 | +buffer as `n`. If there is not enough space for the codepoint, this function |
| 248 | +returns null. |
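
The bounded-write contract described here (write if it fits, otherwise return null) can be sketched in isolation. `sketch_append` is a hypothetical stand-in, not utf8.h's implementation; it assumes a non-negative codepoint and does not write a null terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of utf8.h: encodes cp at dst if at least
   the required number of bytes (out of n) is available. Returns the
   address just past the written bytes, or NULL if cp does not fit. */
static char *sketch_append(char *dst, int32_t cp, size_t n) {
  unsigned char *p = (unsigned char *)dst;
  assert(dst != NULL && cp >= 0);
  if (cp < 0x80) {
    if (n < 1) return NULL;
    p[0] = (unsigned char)cp;
    return dst + 1;
  }
  if (cp < 0x800) {
    if (n < 2) return NULL;
    p[0] = (unsigned char)(0xC0 | (cp >> 6));
    p[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return dst + 2;
  }
  if (cp < 0x10000) {
    if (n < 3) return NULL;
    p[0] = (unsigned char)(0xE0 | (cp >> 12));
    p[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    p[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return dst + 3;
  }
  if (n < 4) return NULL;
  p[0] = (unsigned char)(0xF0 | (cp >> 18));
  p[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
  p[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
  p[3] = (unsigned char)(0x80 | (cp & 0x3F));
  return dst + 4;
}
```

Checking the return value against null before continuing is what keeps a sequence of appends from running past the end of the buffer.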
| 249 | + |
| 250 | +```c |
| 251 | +int utf8islower(utf8_int32_t chr); |
| 252 | +``` |
| 253 | +Returns 1 if the given character is lowercase, or 0 if it is not. |
| 254 | + |
| 255 | +```c |
| 256 | +int utf8isupper(utf8_int32_t chr); |
| 257 | +``` |
| 258 | +Returns 1 if the given character is uppercase, or 0 if it is not. |
| 259 | + |
| 260 | +```c |
| 261 | +void utf8lwr(void *utf8_restrict str); |
| 262 | +``` |
| 263 | +Transform the given string into all lowercase codepoints. |
| 264 | + |
| 265 | +```c |
| 266 | +void utf8upr(void *utf8_restrict str); |
| 267 | +``` |
| 268 | +Transform the given string into all uppercase codepoints. |
| 269 | + |
| 270 | +```c |
| 271 | +utf8_int32_t utf8lwrcodepoint(utf8_int32_t cp); |
| 272 | +``` |
| 273 | +Make a codepoint lower case if possible. |
| 274 | + |
| 275 | +```c |
| 276 | +utf8_int32_t utf8uprcodepoint(utf8_int32_t cp); |
| 277 | +``` |
| 278 | +Make a codepoint upper case if possible. |
| 279 | + |
| 280 | +## Codepoint Case |
| 281 | + |
| 282 | +Various functions provided will do case insensitive compares, or transform utf8 |
| 283 | +strings from one case to another. Given the vastness of unicode, and the author's |
| 284 | +lack of understanding beyond latin codepoints on whether case means anything, |
| 285 | +the following categories are the only ones that will be checked in case |
| 286 | +insensitive code: |
| 287 | + |
| 288 | +* [ASCII](https://en.wikipedia.org/wiki/Basic_Latin_(Unicode_block)) |
| 289 | +* [Latin-1 Supplement](https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block)) |
| 290 | +* [Latin Extended-A](https://en.wikipedia.org/wiki/Latin_Extended-A) |
| 291 | +* [Latin Extended-B](https://en.wikipedia.org/wiki/Latin_Extended-B) |
| 292 | +* [Greek and Coptic](https://en.wikipedia.org/wiki/Greek_and_Coptic) |
| 293 | +* [Cyrillic](https://en.wikipedia.org/wiki/Cyrillic_(Unicode_block)) |
| 294 | + |
| 295 | +## Todo ## |
| 296 | + |
| 297 | +- Implement utf8coll (akin to strcoll). |
| 298 | +- Implement utf8fry (akin to strfry). |
| 299 | +- Investigate adding dst buffer sizes for utf8cpy and utf8cat to catch overwrites (as suggested by [@FlohOfWoe](https://twitter.com/FlohOfWoe) in https://twitter.com/FlohOfWoe/status/618669237771608064) |
| 300 | + |
| 301 | +## License ## |
| 302 | + |
| 303 | +This is free and unencumbered software released into the public domain. |
| 304 | + |
| 305 | +Anyone is free to copy, modify, publish, use, compile, sell, or |
| 306 | +distribute this software, either in source code form or as a compiled |
| 307 | +binary, for any purpose, commercial or non-commercial, and by any |
| 308 | +means. |
| 309 | + |
| 310 | +In jurisdictions that recognize copyright laws, the author or authors |
| 311 | +of this software dedicate any and all copyright interest in the |
| 312 | +software to the public domain. We make this dedication for the benefit |
| 313 | +of the public at large and to the detriment of our heirs and |
| 314 | +successors. We intend this dedication to be an overt act of |
| 315 | +relinquishment in perpetuity of all present and future rights to this |
| 316 | +software under copyright law. |
| 317 | + |
| 318 | +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, |
| 319 | +EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF |
| 320 | +MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. |
| 321 | +IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR |
| 322 | +OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, |
| 323 | +ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR |
| 324 | +OTHER DEALINGS IN THE SOFTWARE. |
| 325 | + |
| 326 | +For more information, please refer to <http://unlicense.org/> |