parser.c: various optimizations and simplifications #888

Conversation
ext/json/ext/parser/parser.c

```diff
 }

-static inline int rstring_cache_cmp(const char *str, const long length, VALUE rstring)
+static inline FORCE_INLINE int rstring_cache_cmp(const char *str, const long length, VALUE rstring)
```
This method is called in a hot loop in rstring_cache_fetch. Forcing it inline helps a bit.
This is just an idea: does reducing the number of rstring_cache_cmp calls improve performance? Something like a fixed-size hash table, which only needs a single hash calculation and a single compare.
The drawback is that it doesn't guarantee the first 63 keys (JSON_RVALUE_CACHE_CAPA) will always stay cached.
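If I'm reading the suggestion right, a direct-mapped cache along these lines is what's being described. This is a standalone sketch with plain C strings rather than Ruby `VALUE`s, and every name in it (`cache_lookup`, `CACHE_SLOTS`, `fnv1a`) is made up for illustration, not taken from the actual patch:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS 64  /* fixed size, power of two so we can mask the hash */

typedef struct {
    char key[56];
    long len;
} cache_entry;

static cache_entry slots[CACHE_SLOTS];

/* FNV-1a, the hash mentioned later in this thread */
static uint64_t fnv1a(const char *str, long len)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    for (long i = 0; i < len; i++) {
        h ^= (unsigned char)str[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* One hash, one compare: returns 1 on hit, 0 on miss (and caches the key). */
static int cache_lookup(const char *str, long len)
{
    cache_entry *e = &slots[fnv1a(str, len) & (CACHE_SLOTS - 1)];
    if (e->len == len && memcmp(e->key, str, len) == 0) {
        return 1; /* hit */
    }
    /* miss: evict whatever was there; as noted above, this means the
       first 63 keys are not guaranteed to stay cached */
    e->len = len;
    memcpy(e->key, str, len);
    return 0;
}
```

A hit costs one hash plus one `memcmp`; the price is that two keys hashing to the same slot keep evicting each other.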
@tompng I'm not sure I exactly understand what you are suggesting.
I used a sorted array over a hash table so that in most cases we can avoid hashing the string. But I'm open to other implementations if it's not too complex and it performs better.
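For reference, the sorted-array lookup can be sketched roughly like this (standalone, with plain C strings instead of `VALUE`s; `cache_find` is a hypothetical name). The binary search is driven entirely by `memcmp`, so the incoming string is never hashed:

```c
#include <assert.h>
#include <string.h>

/* Binary search over a lexicographically sorted array of keys.
   Returns the index on a hit, -1 on a miss (a real cache would
   insert the new key at position `low` to keep the array sorted). */
static int cache_find(const char **sorted, int count, const char *str, long len)
{
    int low = 0, high = count;
    while (low < high) {
        int mid = low + (high - low) / 2;
        const char *candidate = sorted[mid];
        long clen = (long)strlen(candidate);
        long min = len < clen ? len : clen;
        int cmp = memcmp(str, candidate, (size_t)min);
        if (cmp == 0) cmp = (len > clen) - (len < clen); /* shorter sorts first */
        if (cmp == 0) return mid;  /* hit */
        if (cmp < 0) high = mid; else low = mid + 1;
    }
    return -1;  /* miss */
}
```

At 63 entries that is at most six compares per lookup, and for typical JSON keys the first differing byte ends most compares early.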
I see, that was my misunderstanding, and my benchmark strings (length 4..10) were too short...
I did partially implement a hashtable on a different branch. See rstring_ht_fetch.
This was a very simple hashtable using FNV1a to hash the string and open addressing. Note that FNV1a was faster than rb_memhash on my machine for these strings. My testing wasn't exhaustive. Edit: My guess is FNV1a was faster because the strings were short enough and/or the FNV1a code was inlined and rb_memhash wasn't.
I limited the probe length so as to prevent any maliciously crafted inputs from degrading the performance to O(n).
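A rough standalone sketch of that scheme, with illustrative names (`ht_fetch`, `MAX_PROBES`) and plain C strings instead of `VALUE`s:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HT_SLOTS 64
#define MAX_PROBES 4  /* cap the probe sequence so crafted input can't degrade to O(n) */

typedef struct {
    char key[56];
    long len;  /* 0 means the slot is empty (zero-length keys never cache in this sketch) */
} ht_entry;

static ht_entry table[HT_SLOTS];

static uint64_t ht_fnv1a(const char *str, long len)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    for (long i = 0; i < len; i++) {
        h ^= (unsigned char)str[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Open addressing with linear probing, inspecting at most MAX_PROBES slots.
   Returns 1 on a hit; on a miss, inserts into the first empty probed slot. */
static int ht_fetch(const char *str, long len)
{
    uint64_t idx = ht_fnv1a(str, len) & (HT_SLOTS - 1);
    for (int probe = 0; probe < MAX_PROBES; probe++) {
        ht_entry *e = &table[(idx + probe) & (HT_SLOTS - 1)];
        if (e->len == 0) {  /* empty: cache the key here */
            e->len = len;
            memcpy(e->key, str, len);
            return 0;
        }
        if (e->len == len && memcmp(e->key, str, len) == 0) return 1;
    }
    return 0;  /* probe budget exhausted: treat as uncached */
}
```

Capping the probe count bounds the worst-case work per lookup, at the cost of some keys never being cached when their probe window is full.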
```diff
-static inline int rstring_cache_cmp(const char *str, const long length, VALUE rstring)
+static inline FORCE_INLINE int rstring_cache_cmp(const char *str, const long length, VALUE rstring)
 {
+#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__) && defined(__has_builtin) && __has_builtin(__builtin_bswap64)
```
Looking at the profile in samply shows we spend a bit of time in memcmp. Comparing 8 bytes at a time helps, assuming memcmp isn't already doing that. And even if memcmp is optimized, we're only comparing strings shorter than JSON_RVALUE_CACHE_MAX_ENTRY_LENGTH (55 bytes), and this removes the function call overhead.
I would have expected compilers to essentially inline memcmp, like they do with memcpy etc, but perhaps not?
NB: I refactored that code a bit.
```c
if (a != b) {
    a = __builtin_bswap64(a);
    b = __builtin_bswap64(b);
    return (a < b) ? -1 : 1;
}
```
The strings are ordered lexicographically in the cache. We need to reverse the bytes in a and b to ensure this is compatible with a lexicographic ordering of byte-by-byte comparisons.
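For anyone following along, here is a tiny standalone illustration of that point (`chunk_cmp` is a hypothetical name, not from the patch). Without the swap, a little-endian integer comparison would rank "abzzzzzz" above "bazzzzzz", the opposite of memcmp:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Compare two 8-byte chunks so that the sign of the result matches a
   byte-by-byte (lexicographic) memcmp. On little-endian targets the byte
   at the lowest address lands in the least significant position of the
   loaded integer, so both values must be byte-swapped before the integer
   comparison agrees with memcmp. */
static int chunk_cmp(const char *x, const char *y)
{
    uint64_t a, b;
    memcpy(&a, x, 8);
    memcpy(&b, y, 8);
    if (a == b) return 0;
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
    a = __builtin_bswap64(a);
    b = __builtin_bswap64(b);
#endif
    return (a < b) ? -1 : 1;
}
```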
I don't think we care, do we? As long as the ordering is consistent, it's fine.
ext/json/ext/parser/parser.c
```c
if (RB_UNLIKELY(memchr(str, '\\', length))) {
    // We assume the overwhelming majority of names don't need to be escaped.
    // But if they do, we have to fallback to the slow path.
    return Qfalse;
}
```
I don't think this can ever evaluate to true. This is called from two places: json_string_fastpath and json_string_unescape. The only place both of those are called is in json_decode_string.
```c
static inline VALUE json_decode_string(JSON_ParserState *state, JSON_ParserConfig *config, const char *start, const char *end, bool escaped, bool is_name)
{
    VALUE string;
    bool intern = is_name || config->freeze;
    bool symbolize = is_name && config->symbolize_names;
    if (escaped) {
        string = json_string_unescape(state, start, end, is_name, intern, symbolize);
    } else {
        string = json_string_fastpath(state, start, end, is_name, intern, symbolize);
    }

    return string;
}
```
Since we know that json_string_fastpath is only called when there are no escapes, this shouldn't be necessary.
Additionally, we should remove the rstring_cache_fetch call in json_string_unescape as we know those strings shouldn't be cached.
ext/json/ext/parser/parser.c
```c
if (is_name && state->in_array) {
    VALUE cached_key;
    if (RB_UNLIKELY(symbolize)) {
        cached_key = rsymbol_cache_fetch(&state->name_cache, string, bufferSize);
    } else {
        cached_key = rstring_cache_fetch(&state->name_cache, string, bufferSize);
    }

    if (RB_LIKELY(cached_key)) {
        return cached_key;
    }
}
```
The rstring_cache_fetch used to check for escapes and return Qfalse if a \ was found. I believe that means it would only ever return Qfalse in this method.
It doesn't have to be multiple PRs, but at least multiple commits would be good. It would allow us to describe each change individually, and to measure the gains as well.
NB: I force pushed your branch with a couple of cleanups.
```diff
 }

 static inline int rstring_cache_cmp(const char *str, const long length, VALUE rstring)
+#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__) && defined(__has_builtin) && __has_builtin(__builtin_bswap64)
```
This also applies to the previous SWAR optimizations:
We gated these optimizations to little endian, which is fine, but they also assume a 64-bit arch; perhaps we should skip them on 32-bit?
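One way such a combined gate could look (a sketch: `JSON_ENABLE_SWAR` is a hypothetical macro name, while the `__BYTE_ORDER__` checks mirror the ones already used in this PR) is to require both little endianness and a pointer width above 32 bits:

```c
#include <assert.h>
#include <stdint.h>

/* Enable the SWAR paths only on little-endian targets whose pointers are
   wider than 32 bits; UINTPTR_MAX is a reasonable proxy for the native
   word size here. */
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__) && \
    defined(UINTPTR_MAX) && (UINTPTR_MAX > 0xFFFFFFFFu)
# define JSON_ENABLE_SWAR 1
#else
# define JSON_ENABLE_SWAR 0
#endif
```

On a 32-bit build the macro evaluates to 0 and the scalar fallback is compiled instead.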
```c
for (; i+8 <= length; i += 8) {
    uint64_t a, b;
    memcpy(&a, str + i, 8);
    memcpy(&b, rptr + i, 8);
    if (a != b) {
        a = __builtin_bswap64(a);
        b = __builtin_bswap64(b);
        return (a < b) ? -1 : 1;
    }
}
```
Looking at godbolt ASM, it seems like this is exactly the code clang generates when memcmp is called with a constant len.
So I think we could just replace that code with something like:

```c
for (; i+8 <= length; i += 8) {
    int cmp = memcmp(str + i, rptr + i, 8);
    if (cmp) {
        return cmp;
    }
}
```
Agreed.
Interestingly, GCC doesn't: https://godbolt.org/z/7fTT6GzMs
Closes: ruby#888

- Mark it as `inline`.
- Use `RSTRING_GETMEM`, instead of `RSTRING_LEN` and `RSTRING_PTR`.
- Use an inlinable version of `memcmp`.

```
== Parsing activitypub.json (58160 bytes)
ruby 3.4.6 (2025-09-16 revision dbd83256b1) +YJIT +PRISM [arm64-darwin24]
Comparison:
  before:  11766.6 i/s
  after:   12272.1 i/s - 1.04x faster

== Parsing twitter.json (567916 bytes)
ruby 3.4.6 (2025-09-16 revision dbd83256b1) +YJIT +PRISM [arm64-darwin24]
Comparison:
  before:   1333.2 i/s
  after:    1422.0 i/s - 1.07x faster

== Parsing citm_catalog.json (1727030 bytes)
ruby 3.4.6 (2025-09-16 revision dbd83256b1) +YJIT +PRISM [arm64-darwin24]
Comparison:
  before:    656.3 i/s
  after:     673.1 i/s - 1.03x faster

== Parsing float parsing (2251051 bytes)
ruby 3.4.6 (2025-09-16 revision dbd83256b1) +YJIT +PRISM [arm64-darwin24]
Comparison:
  before:    276.8 i/s
  after:     276.4 i/s - same-ish: difference falls within error
```

Co-Authored-By: Scott Myron <[email protected]>
Thanks, all these small optimizations have been merged in some reworked form.

In the common case where we have an array of similar objects:

```json
[
  {"foo": 1, "bar": 2},
  {"foo": 3, "bar": 4}
]
```

I'm wondering if we could keep the first parsed Hash around and, before going to the string cache, first check whether the key we're currently parsing matches the first object's key at the same position. This however is contingent on having an efficient way to access the Hash keys by position, because if we need to call
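To make the idea concrete, here's a toy version of the positional check with plain C strings instead of `VALUE`s (all names here, `first_object_keys`, `positional_match`, etc., are hypothetical): keep the first object's keys in parse order, and when parsing key number `i` of a later object, try a single `memcmp` against position `i` before falling back to the general cache.

```c
#include <assert.h>
#include <string.h>

#define MAX_TRACKED_KEYS 32

typedef struct {
    const char *keys[MAX_TRACKED_KEYS];  /* keys of the first object, in order */
    long lens[MAX_TRACKED_KEYS];
    int count;
} first_object_keys;

/* Returns 1 if the key being parsed matches the first object's key at the
   same position, meaning the already-built key object could be reused
   without any cache lookup. */
static int positional_match(const first_object_keys *fo, int pos,
                            const char *str, long len)
{
    if (pos >= fo->count) return 0;
    return fo->lens[pos] == len && memcmp(fo->keys[pos], str, len) == 0;
}
```

On a hit this is one compare and zero hashing; on a mismatch the parser would simply stop consulting the positional table for the rest of that object.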
Actually, we could get them for the
Looking at

```ruby
>> data["statuses"].map(&:keys).tally.each { |a, c| p a; p c }; nil
["metadata", "created_at", "id", "id_str", "text", "source", "truncated", "in_reply_to_status_id", "in_reply_to_status_id_str", "in_reply_to_user_id", "in_reply_to_user_id_str", "in_reply_to_screen_name", "user", "geo", "coordinates", "place", "contributors", "retweet_count", "favorite_count", "entities", "favorited", "retweeted", "lang"]
20
["metadata", "created_at", "id", "id_str", "text", "source", "truncated", "in_reply_to_status_id", "in_reply_to_status_id_str", "in_reply_to_user_id", "in_reply_to_user_id_str", "in_reply_to_screen_name", "user", "geo", "coordinates", "place", "contributors", "retweeted_status", "retweet_count", "favorite_count", "entities", "favorited", "retweeted", "possibly_sensitive", "lang"]
8
["metadata", "created_at", "id", "id_str", "text", "source", "truncated", "in_reply_to_status_id", "in_reply_to_status_id_str", "in_reply_to_user_id", "in_reply_to_user_id_str", "in_reply_to_screen_name", "user", "geo", "coordinates", "place", "contributors", "retweeted_status", "retweet_count", "favorite_count", "entities", "favorited", "retweeted", "lang"]
65
["metadata", "created_at", "id", "id_str", "text", "source", "truncated", "in_reply_to_status_id", "in_reply_to_status_id_str", "in_reply_to_user_id", "in_reply_to_user_id_str", "in_reply_to_screen_name", "user", "geo", "coordinates", "place", "contributors", "retweet_count", "favorite_count", "entities", "favorited", "retweeted", "possibly_sensitive", "lang"]
7
```

However, the first 17 keys are always the same, so I think there's potential there. We can stop checking after the first mismatch; that would still be 17 out of 23-25 keys (68-73%) that could be obtained with a single

As for

```ruby
>> data["orderedItems"].map(&:keys).tally
=> {["id", "type", "actor", "published", "to", "cc", "object"] => 20}
```

And for

```ruby
>> data["performances"].map(&:keys).tally
=> {["eventId", "id", "logo", "name", "prices", "seatCategories", "seatMapImage", "start", "venueCode"] => 243}
```

What I'm less clear on is nested hashes, e.g. in

```json
{
  "statuses": [
    {
      "metadata": {
        "result_type": "recent",
        "iso_language_code": "ja"
      },
      "created_at": "Sun Aug 31 00:29:15 +0000 2014",
      "id": 505874924095815700,
```

Here I'm not sure how we could provide the keys for
I had a similar idea but I wasn't sure how to implement it, so I filed it in the "things to think about while I workout" bucket. Sketching out my current idea...

Define a new type to keep track of the index keys in their insertion order. Limit this to a relatively small number.

```c
#define JSON_INDEX_KEYS_CAPA 32

typedef struct _json_object_keys {
    VALUE entries[JSON_INDEX_KEYS_CAPA];
    int len;
} JSON_ObjectKeys;
```

Add the following to

```c
JSON_ObjectKeys *object_keys;
```

Since

When we hit a

```c
case '[': {
    state->cursor++;
    json_eat_whitespace(state);
    long stack_head = state->stack->head;

    JSON_ObjectKeys current;
    current.len = 0;
    JSON_ObjectKeys *previous_keys = state->object_keys;

    if (peek(state) == ']') {
        state->cursor++;
        return json_push_value(state, config, json_decode_array(state, config, 0));
    } else {
        state->current_nesting++;
        if (RB_UNLIKELY(config->max_nesting && (config->max_nesting < state->current_nesting))) {
            rb_raise(eNestingError, "nesting of %d is too deep", state->current_nesting);
        }
        state->in_array++;
        // Only set the object_keys if we have at least one element.
        state->object_keys = &current;
        json_parse_any(state, config);
    }

    while (true) {
        json_eat_whitespace(state);
        const char next_char = peek(state);
        if (RB_LIKELY(next_char == ',')) {
            state->cursor++;
            if (config->allow_trailing_comma) {
                json_eat_whitespace(state);
                if (peek(state) == ']') {
                    continue;
                }
            }
            json_parse_any(state, config);
            continue;
        }
        if (next_char == ']') {
            state->cursor++;
            long count = state->stack->head - stack_head;
            state->current_nesting--;
            state->in_array--;
            state->object_keys = previous_keys;
            return json_push_value(state, config, json_decode_array(state, config, count));
        }
        raise_parse_error("expected ',' or ']' after array value", state);
    }
    break;
}
```

We then probably need to keep track of the array index we're currently parsing.. if it's

The object parsing code needs to change too, if we're on

That's about as far as I've made it...
This PR has a collection of small optimizations and simplifications. I'm happy to split this into multiple PRs if you'd prefer.
I will leave comments inline to describe the changes.
Additionally, it fixes a build issue: power_assert 3.0.0 was released, which breaks CI as it drops support for Ruby < 3.1.

Benchmark
M1 Macbook Air:
Run 1
Run 2
Macbook Pro M4 Pro
Run 1
Run 2
Intel(R) Core(TM) i7-8850H laptop
Run 1
Run 2