Commit b4507a9

Add vignette about locale sensitive functions. (#589)
Fixes #404
1 parent 2ebd55e commit b4507a9

4 files changed: +99 -4 lines changed

.vscode/settings.json

Lines changed: 0 additions & 4 deletions
@@ -2,9 +2,5 @@
   "[r]": {
     "editor.formatOnSave": true,
     "editor.defaultFormatter": "Posit.air-vscode"
-  },
-  "[quarto]": {
-    "editor.formatOnSave": true,
-    "editor.defaultFormatter": "quarto.quarto"
   }
 }

NEWS.md

Lines changed: 1 addition & 0 deletions
@@ -1,5 +1,6 @@
 # stringr (development version)
 
+* New `vignette("locale-sensitive")` about locale sensitive functions (@kylieainslie, #404)
 * New `str_ilike()` that follows the conventions of the SQL ILIKE operator (@edward-burn, #543).
 * `str_like(ignore_case)` is deprecated, with `str_like()` now always case sensitive to better follow the conventions of the SQL LIKE operator (@edward-burn, #543).
 * `str_sub<-` now gives a more informative error if `value` is not the correct length.

vignettes/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
/.quarto/

vignettes/locale-sensitive.Rmd

Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
---
title: "Locale sensitive functions"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Locale sensitive functions}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r}
#| include: FALSE
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(stringr)
```

A locale is a set of parameters that define a user's language, region, and cultural preferences. It determines language-specific rules for text processing, including how to:

- Convert between uppercase and lowercase letters
- Sort text alphabetically
- Format dates, numbers, and currency
- Handle character encoding and display

In stringr, you can control the locale using the `locale` argument, which takes language codes like "en" (English), "tr" (Turkish), or "es_MX" (Mexican Spanish). In general, a locale is a lower-case language abbreviation, optionally followed by an underscore (_) and an upper-case region identifier. You can see which locales are supported in stringr by running `stringi::stri_locale_list()`.
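
As a minimal sketch of the check described above, the snippet below tests whether a few locale identifiers appear in the list returned by `stringi::stri_locale_list()`. The `wanted` vector is a hypothetical example, and the result depends on the ICU version that stringi was built against.

```r
library(stringi)

# All locale identifiers known to the ICU library behind stringi
available <- stri_locale_list()

# Hypothetical set of locales we would like to use later
wanted <- c("en", "tr", "es_MX")

# Named logical vector: TRUE where the identifier is in the list
setNames(wanted %in% available, wanted)
```

A check like this is only a convenience for spotting a mistyped identifier before passing it to a `locale` argument.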

This vignette describes locale-sensitive stringr functions, i.e. functions with a `locale` argument. These functions fall into two broad categories:

1. Case conversion
2. Sorting and ordering

## Case conversion

`str_to_lower()`, `str_to_upper()`, `str_to_title()`, and `str_to_sentence()` all change the case of their inputs. But while most languages that use the Latin alphabet (like English) have upper and lower case, the rules for converting between the two aren't always the same. For example, Turkish has two forms of the letter "I": alongside the dotted lowercase "i" and the dotless uppercase "I", it also has "ı", the dotless lowercase i, and "İ", the dotted uppercase I. This means the rules for converting "i" to upper case and "I" to lower case differ from English:

```{r}
# English
str_to_upper("i")
str_to_lower("I")

# Turkish
str_to_upper("i", locale = "tr")
str_to_lower("I", locale = "tr")
```

Another example is Dutch, where "ij" is a digraph treated as a single letter. This means that `str_to_sentence()` will incorrectly capitalize only the first letter of "ij" at the start of a sentence unless you use a Dutch locale:

```{r}
#| warning: false
dutch_sentence <- "ijsland is een prachtig land in Noord-Europa."

# Incorrect
str_to_sentence(dutch_sentence)
# Correct
str_to_sentence(dutch_sentence, locale = "nl")
```

Case conversion also comes up in another situation: case-insensitive comparison. This is relevant in two contexts. First, `str_equal()` and `str_unique()` can optionally ignore case, so it's important to also supply `locale` when working with non-English text. For example, imagine we're searching for a Turkish name, ignoring case:

```{r}
turkish_names <- c("İpek", "Işık", "İbrahim")
search_name <- "ipek"

# incorrect
str_equal(turkish_names, search_name, ignore_case = TRUE)

# correct
str_equal(turkish_names, search_name, ignore_case = TRUE, locale = "tr")
```
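
The same consideration applies to `str_unique()`. The sketch below is illustrative only: `names_mixed` is a hypothetical vector, and the expectation is that Turkish rules treat "İpek" and "ipek" as the same name, whereas the default locale may keep both.

```r
library(stringr)

# Hypothetical input: one Turkish name in two cases, plus a second name
names_mixed <- c("İpek", "ipek", "Işık")

# Case-insensitive de-duplication with the default (English) locale
en_unique <- str_unique(names_mixed, ignore_case = TRUE)

# Case-insensitive de-duplication with Turkish rules
tr_unique <- str_unique(names_mixed, ignore_case = TRUE, locale = "tr")

# Compare how many distinct names each locale reports
length(en_unique)
length(tr_unique)
```

Comparing the two lengths is a quick way to see whether the locale changes which strings count as duplicates.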

Second, case-insensitive comparison comes up in pattern matching functions like `str_detect()`. You might be accustomed to using `ignore_case = TRUE` with `regex()` or `fixed()`, but if you want locale-sensitive comparison you need to use `coll()` instead:

```{r}
# incorrect
str_detect(turkish_names, fixed(search_name, ignore_case = TRUE))

# correct
str_detect(turkish_names, coll(search_name, ignore_case = TRUE, locale = "tr"))
```

## Sorting and ordering

`str_sort()`, `str_order()`, and `str_rank()` all rely on the alphabetical ordering of letters. But not every language uses the same ordering as English. For example, Lithuanian places "y" between "i" and "k", and Czech treats "ch" as a single compound letter that sorts after "h", so words beginning with "ch" come after all words beginning with "h". This means that to correctly sort words in these languages, you must provide the appropriate locale:

```{r}
czech_words <- c("had", "chata", "hrad", "chůze")
lithuanian_words <- c("ąžuolas", "ėglė", "šuo", "yra", "žuvis")

# incorrect
str_sort(czech_words)
str_sort(lithuanian_words)

# correct
str_sort(czech_words, locale = "cs")
str_sort(lithuanian_words, locale = "lt")
```
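
`str_order()` and `str_rank()` accept the same `locale` argument. As a small sketch reusing the Czech words from the example above, indexing the input by the result of `str_order()` should reproduce what `str_sort()` returns under the same locale:

```r
library(stringr)

czech_words <- c("had", "chata", "hrad", "chůze")

# Indices that would put the words in Czech alphabetical order
ord <- str_order(czech_words, locale = "cs")

# Rank of each word under Czech ordering
rnk <- str_rank(czech_words, locale = "cs")

# Indexing by the order should match str_sort() with the same locale
identical(czech_words[ord], str_sort(czech_words, locale = "cs"))
```

This is useful when you need to reorder another vector or the rows of a data frame by a locale-aware ordering rather than sort the strings themselves.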
