Commit 24463fe

ext/standard: speed up php_url_parse_ex2 by ~12%
Three related changes to ext/standard/url.c targeting the ctype macros on the parse_url hot path. On a 17-URL mix (17M parses per run, CPU pinned, same-session A/B), median wall time drops from 1.90s to 1.68s, a ~12% reduction and ~13% throughput increase (8.94M/s to 10.10M/s).

1. php_replace_controlchars replaces its iscntrl() call with an inline `c < 0x20 || c == 0x7f` comparison. Callgrind showed iscntrl at ~14% of total instructions on a realistic URL workload; glibc's iscntrl goes through __ctype_b_loc() per byte for a TLS lookup and table deref, which defeats auto-vectorization. URL components are bytes, not locale-dependent text, so C/POSIX semantics are what we want regardless of the process locale. The Zend language scanner uses the same pattern (yych <= 0x1F). php_replace_controlchars runs once per component per parse, up to 7 times.

2. The scheme-validation walk uses isalpha/isdigit, which pay the same __ctype_b_loc tax. I extracted the check into php_url_is_scheme_char with an inline ASCII test: `((c | 0x20) - 'a' < 26u) || (c - '0' < 10u)` for the letter/digit half, plus three literal comparisons for '+', '-' and '.'. The scheme loop runs once per byte of the scheme on every parse. A helper php_url_is_ascii_digit covers the two isdigit call sites in the port-scan loops (one in the mailto-branch port probe, one in the parse_port fallback).

3. The three branches that allocate ret->scheme all followed zend_string_init with a php_replace_controlchars call. The scheme loop above has already rejected any byte that isn't in [a-zA-Z0-9+.-], so the control-char scan on the scheme is dead work. Removed from all three sites.

No behavior change: the inline comparisons behave identically to the ctype macros in the C/POSIX locale, and URL bytes are never locale-dependent. I checked that contaminated inputs like http://ex\x7fample.com/p\x1fath still get their control bytes replaced with underscores.
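One way to sanity-check the equivalence claim is to compare the inline tests against the ctype functions for all 256 byte values. The sketch below is not part of the commit. The helper names (is_scheme_char, is_ascii_digit, is_ascii_cntrl, count_ctype_mismatches) are hypothetical standalone copies, the explicit unsigned cast is added only to make the wraparound trick visible, and the check assumes the process stays in the default "C" locale (no setlocale call).

```c
#include <assert.h> /* for the caller's assert() */
#include <ctype.h>
#include <stdbool.h>

/* Same tests as the patch, written as plain `static inline` helpers so the
 * sketch builds outside the PHP tree. A byte below the range start wraps to
 * a large unsigned value, so one comparison covers both bounds. */
static inline bool is_scheme_char(unsigned char c)
{
	return ((unsigned) ((c | 0x20) - 'a') < 26u)
		|| ((unsigned) (c - '0') < 10u)
		|| c == '+' || c == '-' || c == '.';
}

static inline bool is_ascii_digit(unsigned char c)
{
	return (unsigned) (c - '0') < 10u;
}

static inline bool is_ascii_cntrl(unsigned char c)
{
	return c < 0x20 || c == 0x7f;
}

/* Counts disagreements between the inline tests and the ctype functions
 * across every byte value. Assumes the default "C" locale. */
int count_ctype_mismatches(void)
{
	int mismatches = 0;

	for (int i = 0; i < 256; i++) {
		unsigned char c = (unsigned char) i;
		bool old_scheme = isalpha(c) || isdigit(c)
			|| c == '+' || c == '-' || c == '.';

		if (is_scheme_char(c) != old_scheme) {
			mismatches++;
		}
		if (is_ascii_digit(c) != (isdigit(c) != 0)) {
			mismatches++;
		}
		if (is_ascii_cntrl(c) != (iscntrl(c) != 0)) {
			mismatches++;
		}
	}
	return mismatches;
}
```

Calling `assert(count_ctype_mismatches() == 0);` from a test harness is enough to exercise it. In a locale where isalpha() accepts bytes above 0x7F the two sides can differ, which is the case the commit deliberately pins to ASCII.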
1 parent 8ad79e1 commit 24463fe

ext/standard/url.c

Lines changed: 29 additions & 8 deletions
@@ -47,16 +47,35 @@ PHPAPI void php_url_free(php_url *theurl)
 }
 /* }}} */
 
+/* ASCII-only scheme-char test. Used to be `isalpha(c) || isdigit(c) || c=='+'
+ * || c=='-' || c=='.'`, but those ctype macros hit __ctype_b_loc() on glibc
+ * and become a measurable portion of parse_url on short schemes. */
+static zend_always_inline bool php_url_is_scheme_char(unsigned char c)
+{
+	return ((c | 0x20) - 'a' < 26u) || (c - '0' < 10u)
+		|| c == '+' || c == '-' || c == '.';
+}
+
+static zend_always_inline bool php_url_is_ascii_digit(unsigned char c)
+{
+	return c - '0' < 10u;
+}
+
 static void php_replace_controlchars(char *str, size_t len)
 {
 	unsigned char *s = (unsigned char *)str;
 	unsigned char *e = (unsigned char *)str + len;
 
 	ZEND_ASSERT(str != NULL);
 
+	/* Replace ASCII C0 control chars (0x00..0x1F) and DEL (0x7F). An inline
+	 * comparison is used instead of iscntrl() because URL components are
+	 * bytes, not locale-dependent text, and the ctype macros force a lookup
+	 * through __ctype_b_loc() per byte that measurably dominates parsing of
+	 * short components. */
 	while (s < e) {
-		if (iscntrl(*s)) {
-			*s='_';
+		if (UNEXPECTED(*s < 0x20 || *s == 0x7f)) {
+			*s = '_';
 		}
 		s++;
 	}
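To see the patched loop in isolation, here is a minimal standalone sketch. It is not the PHP function itself: replace_controlchars is a hypothetical name, ZEND_ASSERT is swapped for the standard assert, and the UNEXPECTED branch hint is dropped since it only affects code layout.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Overwrites C0 control bytes (0x00..0x1F) and DEL (0x7F) with '_'.
 * Bytes >= 0x80 are left alone, so UTF-8 sequences pass through untouched. */
void replace_controlchars(char *str, size_t len)
{
	unsigned char *s = (unsigned char *) str;
	unsigned char *e = s + len;

	assert(str != NULL);

	while (s < e) {
		if (*s < 0x20 || *s == 0x7f) {
			*s = '_';
		}
		s++;
	}
}
```

When writing test inputs as C string literals, note that a hex escape consumes every following hex digit, so `"ex\x7fample"` would be read as the out-of-range escape `\x7fa`. Splitting the literal, as in `"ex\x7f" "ample.com"`, avoids that.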
@@ -103,7 +122,7 @@ PHPAPI php_url *php_url_parse_ex2(char const *str, size_t length, bool *has_port
 		p = s;
 		while (p < e) {
 			/* scheme = 1*[ lowalpha | digit | "+" | "-" | "." ] */
-			if (!isalpha(*p) && !isdigit(*p) && *p != '+' && *p != '.' && *p != '-') {
+			if (!php_url_is_scheme_char((unsigned char) *p)) {
 				if (e + 1 < ue && e < binary_strcspn(s, ue, "?#")) {
 					goto parse_port;
 				} else if (s + 1 < ue && *s == '/' && *(s + 1) == '/') { /* relative-scheme URL */
@@ -118,8 +137,10 @@ PHPAPI php_url *php_url_parse_ex2(char const *str, size_t length, bool *has_port
 		}
 
 		if (e + 1 == ue) { /* only scheme is available */
+			/* scheme is guaranteed to contain only [a-zA-Z0-9+.-] per the
+			 * validation loop above, so there are no control characters to
+			 * replace. Skip the scan. */
 			ret->scheme = zend_string_init(s, (e - s), 0);
-			php_replace_controlchars(ZSTR_VAL(ret->scheme), ZSTR_LEN(ret->scheme));
 			return ret;
 		}
 
@@ -132,22 +153,22 @@ PHPAPI php_url *php_url_parse_ex2(char const *str, size_t length, bool *has_port
 			 * correctly parse things like a.com:80
 			 */
 			p = e + 1;
-			while (p < ue && isdigit(*p)) {
+			while (p < ue && php_url_is_ascii_digit((unsigned char) *p)) {
 				p++;
 			}
 
 			if ((p == ue || *p == '/') && (p - e) < 7) {
 				goto parse_port;
 			}
 
+			/* scheme is pre-validated above to contain only [a-zA-Z0-9+.-] */
 			ret->scheme = zend_string_init(s, (e-s), 0);
-			php_replace_controlchars(ZSTR_VAL(ret->scheme), ZSTR_LEN(ret->scheme));
 
 			s = e + 1;
 			goto just_path;
 		} else {
+			/* scheme is pre-validated above to contain only [a-zA-Z0-9+.-] */
 			ret->scheme = zend_string_init(s, (e-s), 0);
-			php_replace_controlchars(ZSTR_VAL(ret->scheme), ZSTR_LEN(ret->scheme));
 
 			if (e + 2 < ue && *(e + 2) == '/') {
 				s = e + 3;
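The argument for dropping the scan at these sites is that no byte accepted by the scheme walk is also a byte the control-char scan would rewrite. That can be checked exhaustively. The sketch below is illustrative only: the helper and function names are hypothetical, and the unsigned casts are added for readability.

```c
#include <assert.h>
#include <stdbool.h>

/* Standalone copy of the scheme-char test from the patch. */
static inline bool is_scheme_char(unsigned char c)
{
	return ((unsigned) ((c | 0x20) - 'a') < 26u)
		|| ((unsigned) (c - '0') < 10u)
		|| c == '+' || c == '-' || c == '.';
}

/* The condition php_replace_controlchars uses after the patch. */
static inline bool is_control_char(unsigned char c)
{
	return c < 0x20 || c == 0x7f;
}

/* Number of distinct byte values the scheme walk accepts. */
int count_scheme_bytes(void)
{
	int n = 0;

	for (int i = 0; i < 256; i++) {
		if (is_scheme_char((unsigned char) i)) {
			n++;
		}
	}
	return n;
}

/* Number of byte values that pass scheme validation and would still be
 * rewritten by the control-char scan. Zero means the scan is dead work. */
int count_scheme_bytes_needing_replacement(void)
{
	int n = 0;

	for (int i = 0; i < 256; i++) {
		unsigned char c = (unsigned char) i;

		if (is_scheme_char(c) && is_control_char(c)) {
			n++;
		}
	}
	return n;
}
```

The accepted set is 26 lowercase letters, 26 uppercase letters, 10 digits and 3 punctuation bytes, so count_scheme_bytes() should return 65, all of them at or above 0x2B and below 0x7F.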
@@ -172,7 +193,7 @@ PHPAPI php_url *php_url_parse_ex2(char const *str, size_t length, bool *has_port
 		p = e + 1;
 		pp = p;
 
-		while (pp < ue && pp - p < 6 && isdigit(*pp)) {
+		while (pp < ue && pp - p < 6 && php_url_is_ascii_digit((unsigned char) *pp)) {
 			pp++;
 		}
 
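The bounded digit scan in this hunk can be tried on its own. The following is a rough sketch rather than the parser's code: scan_port_digits is a hypothetical wrapper around the same loop condition, and it returns only the number of digits consumed, leaving out whatever the parser does with the result afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* ASCII digit test as in the patch, with the wraparound made explicit. */
static inline bool is_ascii_digit(unsigned char c)
{
	return (unsigned) (c - '0') < 10u;
}

/* Walks at most 6 ASCII digits starting at p, never reading at or past ue.
 * Returns how many digits were consumed. */
size_t scan_port_digits(const char *p, const char *ue)
{
	const char *pp = p;

	while (pp < ue && pp - p < 6 && is_ascii_digit((unsigned char) *pp)) {
		pp++;
	}
	return (size_t) (pp - p);
}
```

For the input "8080/x" the scan stops at the slash after 4 digits, and for "1234567" it stops at the 6-digit cap.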