Commit 15fc3dd

Integrate the utf8 library; code diagnostics are now computed using utf8 length
1 parent c6dc71f commit 15fc3dd

7 files changed, with 2047 additions and 3 deletions.

3rd/utf8/include/LICENSE

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
This is free and unencumbered software released into the public domain.

Anyone is free to copy, modify, publish, use, compile, sell, or
distribute this software, either in source code form or as a compiled
binary, for any purpose, commercial or non-commercial, and by any
means.

In jurisdictions that recognize copyright laws, the author or authors
of this software dedicate any and all copyright interest in the
software to the public domain. We make this dedication for the benefit
of the public at large and to the detriment of our heirs and
successors. We intend this dedication to be an overt act of
relinquishment in perpetuity of all present and future rights to this
software under copyright law.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.

For more information, please refer to <http://unlicense.org/>

3rd/utf8/include/README.md

Lines changed: 326 additions & 0 deletions
@@ -0,0 +1,326 @@
# 📚 utf8.h

[![Actions Status](https://github.com/sheredom/utf8.h/workflows/CMake/badge.svg)](https://github.com/sheredom/utf8.h/actions)
[![Build status](https://ci.appveyor.com/api/projects/status/phfjjahhs9j4gxvs?svg=true)](https://ci.appveyor.com/project/sheredom/utf8-h)
[![Sponsor](https://img.shields.io/badge/💜-sponsor-blueviolet)](https://github.com/sponsors/sheredom)

A simple single-header solution for supporting utf8 strings in C and C++.

Functions provided from the C header string.h but with a utf8* prefix instead of the str* prefix:

[API function docs](#api-function-docs)

string.h | utf8.h | complete | C++14 constexpr
---------|--------|---------|---------
strcat | utf8cat | &#10004; |
strchr | utf8chr | &#10004; | &#10004;
strcmp | utf8cmp | &#10004; | &#10004;
strcoll | utf8coll | |
strcpy | utf8cpy | &#10004; |
strcspn | utf8cspn | &#10004; | &#10004;
strdup | utf8dup | &#10004; |
strfry | utf8fry | |
strlen | utf8len | &#10004; | &#10004;
strnlen | utf8nlen | &#10004; | &#10004;
strncat | utf8ncat | &#10004; |
strncmp | utf8ncmp | &#10004; | &#10004;
strncpy | utf8ncpy | &#10004; |
strndup | utf8ndup | &#10004; |
strpbrk | utf8pbrk | &#10004; | &#10004;
strrchr | utf8rchr | &#10004; | &#10004;
strsep | utf8sep | |
strspn | utf8spn | &#10004; | &#10004;
strstr | utf8str | &#10004; | &#10004;
strtok | utf8tok | |
strxfrm | utf8xfrm | |

Functions provided from the C header strings.h but with a utf8* prefix instead of the str* prefix:

strings.h | utf8.h | complete | C++14 constexpr
----------|--------|---------|---------
strcasecmp | utf8casecmp | ~~&#10004;~~ | &#10004;
strncasecmp | utf8ncasecmp | ~~&#10004;~~ | &#10004;
strcasestr | utf8casestr | ~~&#10004;~~ | &#10004;

Functions provided that are unique to utf8.h:

utf8.h | complete | C++14 constexpr
-------|---------|---------
utf8codepoint | &#10004; | &#10004;
utf8rcodepoint | &#10004; | &#10004;
utf8size | &#10004; | &#10004;
utf8size\_lazy | &#10004; | &#10004;
utf8nsize\_lazy | &#10004; | &#10004;
utf8valid | &#10004; | &#10004;
utf8nvalid | &#10004; | &#10004;
utf8makevalid | &#10004; |
utf8codepointsize | &#10004; | &#10004;
utf8catcodepoint | &#10004; |
utf8isupper | ~~&#10004;~~ | &#10004;
utf8islower | ~~&#10004;~~ | &#10004;
utf8lwr | ~~&#10004;~~ |
utf8upr | ~~&#10004;~~ |
utf8lwrcodepoint | ~~&#10004;~~ | &#10004;
utf8uprcodepoint | ~~&#10004;~~ | &#10004;

## Usage ##

Just `#include "utf8.h"` in your code!

The currently supported platforms are Linux, macOS and Windows.

The currently supported compilers are gcc, clang, MSVC's cl.exe, and clang-cl.exe.

## Design ##

The utf8.h API matches the string.h API as much as possible by design. There are a few major differences though.

utf8.h uses `char8_t*` in C++20 instead of `char*`.

Anywhere the string.h or strings.h documentation refers to 'bytes', I have changed that to utf8 codepoints. For instance, `utf8len` returns the number of utf8 codepoints in a utf8 string, which does not necessarily equal the number of bytes.
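
To make the byte versus codepoint distinction concrete, here is a minimal sketch in plain C that does not use utf8.h at all. `count_codepoints` is a hypothetical helper written for illustration (it is not the library's implementation), and it assumes the input is already valid utf8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of utf8.h: counts codepoints in a
 * null-terminated, valid utf8 string. Every codepoint has exactly one
 * byte that is NOT a continuation byte (10xxxxxx), so counting those
 * bytes counts codepoints. */
size_t count_codepoints(const char *s) {
  size_t n = 0;
  for (; *s != '\0'; ++s) {
    if (((unsigned char)*s & 0xC0) != 0x80) {
      ++n;
    }
  }
  return n;
}
```

For the string `"h\xC3\xA9llo"` ("héllo"), `strlen` reports 6 bytes while `count_codepoints` reports 5, because "é" occupies two bytes.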

## API function docs ##

```c
int utf8casecmp(const void *src1, const void *src2);
```
Return less than 0, 0, greater than 0 if `src1 < src2`, `src1 == src2`,
`src1 > src2` respectively, case insensitive.

```c
void *utf8cat(void *dst, const void *src);
```
Append the utf8 string `src` onto the utf8 string `dst`.

```c
void *utf8chr(const void *src, utf8_int32_t chr);
```
Find the first match of the utf8 codepoint `chr` in the utf8 string `src`.

```c
int utf8cmp(const void *src1, const void *src2);
```
Return less than 0, 0, greater than 0 if `src1 < src2`,
`src1 == src2`, `src1 > src2` respectively.

```c
void *utf8cpy(void *dst, const void *src);
```
Copy the utf8 string `src` onto the memory allocated in `dst`.

```c
size_t utf8cspn(const void *src, const void *reject);
```
Number of utf8 codepoints at the start of the utf8 string `src` that consist
entirely of utf8 codepoints not from the utf8 string `reject`.

```c
void *utf8dup(const void *src);
```
Duplicate the utf8 string `src` by getting its size, `malloc`ing a new buffer,
copying over the data, and returning that buffer. Returns 0 if `malloc` failed.

```c
size_t utf8len(const void *str);
```
Number of utf8 codepoints in the utf8 string `str`,
**excluding** the null terminating byte.

```c
size_t utf8nlen(const void *str, size_t n);
```
Similar to `utf8len`, except that at most `n` bytes of `str` are examined.

```c
int utf8ncasecmp(const void *src1, const void *src2, size_t n);
```
Return less than 0, 0, greater than 0 if `src1 < src2`, `src1 == src2`,
`src1 > src2` respectively, case insensitive. Checking at most `n`
bytes of each utf8 string.

```c
void *utf8ncat(void *dst, const void *src, size_t n);
```
Append the utf8 string `src` onto the utf8 string `dst`,
writing at most `n+1` bytes. Can produce an invalid utf8
string if `n` falls partway through a utf8 codepoint.

```c
int utf8ncmp(const void *src1, const void *src2, size_t n);
```
Return less than 0, 0, greater than 0 if `src1 < src2`,
`src1 == src2`, `src1 > src2` respectively. Checking at most `n`
bytes of each utf8 string.

```c
void *utf8ncpy(void *dst, const void *src, size_t n);
```
Copy the utf8 string `src` onto the memory allocated in `dst`.
Copies at most `n` bytes. If `n` falls partway through a utf8
codepoint, or if `dst` doesn't have enough room for a null
terminator, the final string will be cut short to preserve
utf8 validity.
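
The idea of cutting a copy short so that no codepoint is split can be sketched in plain C as follows. Both `boundary_prefix` and `copy_truncated` are hypothetical names made up for this illustration; they assume valid utf8 input and are not claimed to match `utf8ncpy`'s exact behaviour (for example, this sketch always reserves one byte of `n` for the terminator).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: the largest length <= limit at which the valid
 * utf8 string `s` can be cut without splitting a codepoint. */
size_t boundary_prefix(const char *s, size_t limit) {
  size_t len = strlen(s);
  if (limit >= len) {
    return len;
  }
  /* If the first dropped byte is a continuation byte (10xxxxxx), the cut
   * is inside a codepoint, so back up to that codepoint's lead byte. */
  while (limit > 0 && ((unsigned char)s[limit] & 0xC0) == 0x80) {
    --limit;
  }
  return limit;
}

/* Hypothetical helper: copy `src` into a buffer of `n` bytes, always
 * null-terminating and never leaving a partial codepoint behind. */
char *copy_truncated(char *dst, const char *src, size_t n) {
  size_t keep;
  if (n == 0) {
    return dst;
  }
  keep = boundary_prefix(src, n - 1); /* reserve one byte for '\0' */
  memcpy(dst, src, keep);
  dst[keep] = '\0';
  return dst;
}
```

With a 3-byte buffer and the source `"a\xC3\xA9"` ("aé"), only `"a"` is kept: the two bytes of "é" would not both fit next to the terminator.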

```c
void *utf8pbrk(const void *str, const void *accept);
```
Locates the first occurrence in the utf8 string `str` of any byte in the
utf8 string `accept`, or 0 if no match was found.

```c
void *utf8rchr(const void *src, utf8_int32_t chr);
```
Find the last match of the utf8 codepoint `chr` in the utf8 string `src`.

```c
size_t utf8size(const void *str);
```
Number of bytes in the utf8 string `str`,
including the null terminating byte.

```c
size_t utf8size_lazy(const void *str);
```
Similar to `utf8size`, except that the null terminating byte is **excluded**.
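
The two size conventions matter mostly when allocating buffers. The sketch below uses only the C standard library; `size_with_terminator`, `size_without_terminator` and `dup_string` are hypothetical names chosen to mirror what `utf8size`, `utf8size_lazy` and `utf8dup` are documented to do, not the library's code.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Byte count including the null terminator (the utf8size convention). */
size_t size_with_terminator(const char *s) { return strlen(s) + 1; }

/* Byte count excluding the null terminator (the utf8size_lazy convention). */
size_t size_without_terminator(const char *s) { return strlen(s); }

/* Duplicate a string: the allocation must use the size that includes the
 * terminator. Returns NULL if malloc fails. */
char *dup_string(const char *s) {
  size_t bytes = size_with_terminator(s);
  char *copy = (char *)malloc(bytes);
  if (copy != NULL) {
    memcpy(copy, s, bytes);
  }
  return copy;
}
```

For the single codepoint `"\xE2\x82\xAC"` ("€"), the first function yields 4 and the second yields 3, even though the string holds just one codepoint.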

```c
size_t utf8nsize_lazy(const void *str, size_t n);
```
Similar to `utf8size`, except that at most `n` bytes of `str` are examined and
the null terminating byte is **excluded**.

```c
size_t utf8spn(const void *src, const void *accept);
```
Number of utf8 codepoints at the start of the utf8 string `src` that consist
entirely of utf8 codepoints from the utf8 string `accept`.

```c
void *utf8str(const void *haystack, const void *needle);
```
The position of the utf8 string `needle` in the utf8 string `haystack`.

```c
void *utf8casestr(const void *haystack, const void *needle);
```
The position of the utf8 string `needle` in the utf8 string `haystack`,
case insensitive.

```c
void *utf8valid(const void *str);
```
Return 0 on success, or the position of the invalid utf8 codepoint on failure.
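
As a rough illustration of what validating utf8 involves, here is a self-contained sketch. `first_invalid` is a hypothetical function, not utf8.h's implementation; it follows the same "null on success, pointer to the problem on failure" convention, and the specific checks (overlong encodings, surrogates, values above U+10FFFF) are this sketch's own choices.

```c
#include <stddef.h>

/* Hypothetical validator. Returns NULL if `s` is well-formed utf8,
 * otherwise a pointer to the first byte of the offending sequence. */
const char *first_invalid(const char *s) {
  const unsigned char *p = (const unsigned char *)s;
  while (*p != 0) {
    size_t extra;      /* continuation bytes expected after the lead byte */
    unsigned long cp;  /* decoded value, used for the range checks */
    unsigned long min; /* smallest value that needs this many bytes */
    size_t i;
    if (*p < 0x80) { /* ASCII */
      ++p;
      continue;
    } else if ((*p & 0xE0) == 0xC0) {
      extra = 1; cp = *p & 0x1F; min = 0x80;
    } else if ((*p & 0xF0) == 0xE0) {
      extra = 2; cp = *p & 0x0F; min = 0x800;
    } else if ((*p & 0xF8) == 0xF0) {
      extra = 3; cp = *p & 0x07; min = 0x10000;
    } else {
      return (const char *)p; /* stray continuation or bad lead byte */
    }
    for (i = 1; i <= extra; ++i) {
      /* A terminator fails this test too, so we never read past it. */
      if ((p[i] & 0xC0) != 0x80) {
        return (const char *)p;
      }
      cp = (cp << 6) | (unsigned long)(p[i] & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
      return (const char *)p; /* overlong, out of range, or surrogate */
    }
    p += extra + 1;
  }
  return NULL;
}
```

A caller would typically treat a non-null result as an error offset, for example `first_invalid(buf) - buf`.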

```c
void *utf8nvalid(const void *str, size_t n);
```
Similar to `utf8valid`, except that at most `n` bytes of `str` are examined.

```c
int utf8makevalid(void *str, utf8_int32_t replacement);
```
Return 0 on success. Makes the `str` valid by replacing invalid sequences with
the 1-byte `replacement` codepoint.

```c
void *utf8codepoint(const void *str, utf8_int32_t *out_codepoint);
```
Sets `out_codepoint` to the current utf8 codepoint in `str`, and returns the
address of the next utf8 codepoint after the current one in `str`.
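
The "decode one codepoint, return a pointer to the next" pattern can be sketched without the library as below. `next_codepoint` is a hypothetical stand-in with the same shape as `utf8codepoint`; it assumes the input is valid utf8 and uses `int32_t` where utf8.h uses `utf8_int32_t`.

```c
#include <stdint.h>

/* Hypothetical decoder: stores the codepoint starting at `s` in `*out`
 * and returns the address of the byte after it. Assumes valid utf8. */
const char *next_codepoint(const char *s, int32_t *out) {
  const unsigned char *p = (const unsigned char *)s;
  if ((p[0] & 0xF8) == 0xF0) { /* 11110xxx: four bytes */
    *out = ((int32_t)(p[0] & 0x07) << 18) | ((int32_t)(p[1] & 0x3F) << 12) |
           ((int32_t)(p[2] & 0x3F) << 6) | (int32_t)(p[3] & 0x3F);
    return s + 4;
  }
  if ((p[0] & 0xF0) == 0xE0) { /* 1110xxxx: three bytes */
    *out = ((int32_t)(p[0] & 0x0F) << 12) | ((int32_t)(p[1] & 0x3F) << 6) |
           (int32_t)(p[2] & 0x3F);
    return s + 3;
  }
  if ((p[0] & 0xE0) == 0xC0) { /* 110xxxxx: two bytes */
    *out = ((int32_t)(p[0] & 0x1F) << 6) | (int32_t)(p[1] & 0x3F);
    return s + 2;
  }
  *out = (int32_t)p[0]; /* 0xxxxxxx: one byte */
  return s + 1;
}
```

Iterating over a string is then a loop of the form `for (p = next_codepoint(p, &cp); cp != 0; p = next_codepoint(p, &cp)) { ... }`, which visits each codepoint once and stops at the terminator.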

```c
void *utf8rcodepoint(const void *str, utf8_int32_t *out_codepoint);
```
Sets `out_codepoint` to the current utf8 codepoint in `str`, and returns the
address of the previous utf8 codepoint before the current one in `str`.
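
Stepping backwards relies on the fact that continuation bytes are recognisable on their own. The sketch below shows only that stepping logic; `prev_start` is a hypothetical helper that, unlike `utf8rcodepoint`, takes an explicit `begin` pointer so it can stop at the start of the buffer.

```c
#include <stddef.h>

/* Hypothetical helper: given a pointer `p` to a codepoint boundary inside
 * the valid utf8 string starting at `begin`, return the start of the
 * codepoint immediately before `p`, or NULL if `p` is already at `begin`. */
const char *prev_start(const char *begin, const char *p) {
  if (p == begin) {
    return NULL;
  }
  /* Step back one byte, then keep going while we are on a continuation
   * byte (10xxxxxx); the first non-continuation byte is the lead byte. */
  do {
    --p;
  } while (p > begin && ((unsigned char)*p & 0xC0) == 0x80);
  return p;
}
```

Starting from the end of `"a\xC3\xA9z"` and calling `prev_start` repeatedly visits "z", then "é" (skipping over its continuation byte), then "a".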

```c
size_t utf8codepointsize(utf8_int32_t chr);
```
Returns the size of the given codepoint in bytes.
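
The size of an encoded codepoint follows directly from the utf8 value ranges. A minimal sketch, assuming the argument lies in the range 0 to 0x10FFFF (`encoded_size` is a hypothetical name, not the library function):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes utf8 needs for codepoint `cp`.
 * Assumes 0 <= cp <= 0x10FFFF. */
size_t encoded_size(int32_t cp) {
  if (cp < 0x80) {
    return 1; /* 7 bits of payload */
  }
  if (cp < 0x800) {
    return 2; /* 11 bits of payload */
  }
  if (cp < 0x10000) {
    return 3; /* 16 bits of payload */
  }
  return 4; /* 21 bits of payload */
}
```

This is handy for sizing a buffer before writing codepoints into it one at a time.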

```c
void *utf8catcodepoint(void *utf8_restrict str, utf8_int32_t chr, size_t n);
```
Write a codepoint to the given string, and return the address of the next
place after the written codepoint. Pass the number of bytes left in the buffer
as `n`. If there is not enough space for the codepoint, this function returns
null.
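
The write-and-advance contract described above can be sketched as follows. `append_codepoint` is a hypothetical function with the same shape as `utf8catcodepoint`; it assumes `cp` is in the range 0 to 0x10FFFF and, as a choice of this sketch, does not write a null terminator.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoder: writes `cp` at `dst` if at least the required
 * number of bytes (out of `n` remaining) is available, and returns the
 * address just past the written bytes; returns NULL if it does not fit. */
char *append_codepoint(char *dst, int32_t cp, size_t n) {
  unsigned char *p = (unsigned char *)dst;
  if (cp < 0x80) {
    if (n < 1) return NULL;
    p[0] = (unsigned char)cp;
    return dst + 1;
  }
  if (cp < 0x800) {
    if (n < 2) return NULL;
    p[0] = (unsigned char)(0xC0 | (cp >> 6));
    p[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return dst + 2;
  }
  if (cp < 0x10000) {
    if (n < 3) return NULL;
    p[0] = (unsigned char)(0xE0 | (cp >> 12));
    p[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    p[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return dst + 3;
  }
  if (n < 4) return NULL;
  p[0] = (unsigned char)(0xF0 | (cp >> 18));
  p[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
  p[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
  p[3] = (unsigned char)(0x80 | (cp & 0x3F));
  return dst + 4;
}
```

A caller building a string would keep a running pointer and a remaining-bytes count, subtract the distance advanced after each call, and append the terminator itself at the end.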

```c
int utf8islower(utf8_int32_t chr);
```
Returns 1 if the given character is lowercase, or 0 if it is not.

```c
int utf8isupper(utf8_int32_t chr);
```
Returns 1 if the given character is uppercase, or 0 if it is not.

```c
void utf8lwr(void *utf8_restrict str);
```
Transform the given string into all lowercase codepoints.

```c
void utf8upr(void *utf8_restrict str);
```
Transform the given string into all uppercase codepoints.

```c
utf8_int32_t utf8lwrcodepoint(utf8_int32_t cp);
```
Make a codepoint lower case if possible.

```c
utf8_int32_t utf8uprcodepoint(utf8_int32_t cp);
```
Make a codepoint upper case if possible.

## Codepoint Case

Various functions provided will do case insensitive compares, or transform utf8
strings from one case to another. Given the vastness of Unicode, and the author's
lack of understanding of whether case means anything beyond Latin codepoints,
the following categories are the only ones that will be checked in case
insensitive code:

* [ASCII](https://en.wikipedia.org/wiki/Basic_Latin_(Unicode_block))
* [Latin-1 Supplement](https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block))
* [Latin Extended-A](https://en.wikipedia.org/wiki/Latin_Extended-A)
* [Latin Extended-B](https://en.wikipedia.org/wiki/Latin_Extended-B)
* [Greek and Coptic](https://en.wikipedia.org/wiki/Greek_and_Coptic)
* [Cyrillic](https://en.wikipedia.org/wiki/Cyrillic_(Unicode_block))
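
To show what block-limited case handling looks like, here is a sketch covering only the first two blocks in the list above. `lower_latin` is a hypothetical function, not the library's `utf8lwrcodepoint`, and it deliberately leaves every codepoint outside ASCII and Latin-1 Supplement unchanged.

```c
#include <stdint.h>

/* Hypothetical helper: lowercase a codepoint if it is an uppercase letter
 * in ASCII or Latin-1 Supplement; return it unchanged otherwise. */
int32_t lower_latin(int32_t cp) {
  /* ASCII: 'A'..'Z' map to 'a'..'z', 32 positions higher. */
  if (cp >= 'A' && cp <= 'Z') {
    return cp + 32;
  }
  /* Latin-1 Supplement: U+00C0..U+00DE map to U+00E0..U+00FE, except
   * U+00D7 (multiplication sign), which has no case. */
  if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7) {
    return cp + 32;
  }
  return cp;
}
```

A case insensitive comparison restricted to these blocks can then compare `lower_latin(a)` with `lower_latin(b)` codepoint by codepoint; codepoints outside the covered blocks only match themselves.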

## Todo ##

- Implement utf8coll (akin to strcoll).
- Implement utf8fry (akin to strfry).
- Investigate adding dst buffer sizes for utf8cpy and utf8cat to catch overwrites (as suggested by [@FlohOfWoe](https://twitter.com/FlohOfWoe) in https://twitter.com/FlohOfWoe/status/618669237771608064)

## License ##

This is free and unencumbered software released into the public domain.

Anyone is free to copy, modify, publish, use, compile, sell, or
distribute this software, either in source code form or as a compiled
binary, for any purpose, commercial or non-commercial, and by any
means.

In jurisdictions that recognize copyright laws, the author or authors
of this software dedicate any and all copyright interest in the
software to the public domain. We make this dedication for the benefit
of the public at large and to the detriment of our heirs and
successors. We intend this dedication to be an overt act of
relinquishment in perpetuity of all present and future rights to this
software under copyright law.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.

For more information, please refer to <http://unlicense.org/>
