Conversation

@ChrisBenua (Contributor)

In short: `T.self is _JSONStringDictionaryDecodableMarker.Type`, `value as? _JSONStringDictionaryEncodableMarker`, and `value as? _JSONDirectArrayEncodable` are really slow operations, and the bigger the binary gets, the slower these checks run the first time for each pair of class/struct/enum and protocol.

But checking whether the current type conforms to `_JSONStringDictionaryDecodableMarker` or `_JSONStringDictionaryEncodableMarker` is only needed when a custom keyDecodingStrategy/keyEncodingStrategy is used. The comments in the code confirm this:

```swift
/// A marker protocol used to determine whether a value is a `String`-keyed `Dictionary`
/// containing `Decodable` values (in which case it should be exempt from key conversion strategies).
///
/// The marker protocol also provides access to the type of the `Decodable` values,
/// which is needed for the implementation of the key conversion strategy exemption.
private protocol _JSONStringDictionaryDecodableMarker {
    static var elementType: Decodable.Type { get }
}

/// A marker protocol used to determine whether a value is a `String`-keyed `Dictionary`
/// containing `Encodable` values (in which case it should be exempt from key conversion strategies).
private protocol _JSONStringDictionaryEncodableMarker { }
```

So we can easily skip these checks when the default keyDecodingStrategy/keyEncodingStrategy is used.
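
To make the shape of the change concrete, here is a minimal sketch (not the exact PR diff; the marker protocol is private to the coder, so the gate inside the decoder is shown as a comment):

```swift
import Foundation

extension JSONDecoder.KeyDecodingStrategy {
    // Sketch of the `isDefault` computed property this PR adds inside the
    // coder implementation: true only for the default strategy.
    var isDefault: Bool {
        if case .useDefaultKeys = self { return true }
        return false
    }
}

// Inside the decoder, the expensive conformance cast is then gated on it:
//
//   if !options.keyDecodingStrategy.isDefault,
//      let marker = type as? _JSONStringDictionaryDecodableMarker.Type {
//       // apply the key-conversion exemption for [String: Decodable]
//   }
//
// With the default strategy no keys are converted, so the exemption check
// (and its swift_conformsToProtocol call) can be skipped entirely.
```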

For details, see issue #1480.

@ChrisBenua (Contributor, Author)

Benchmarking results

Collected using this command: `swift package --allow-writing-to-package-directory benchmark baseline compare new_coders --target JSONBenchmarks --format markdown`

Comparing results between 'old_coders' and 'Current_run'

Host 'Christians-MacBook-Pro.local' with 8 'arm64' processors with 16 GB memory, running:
Darwin Kernel Version 24.6.0: Mon Jul 14 11:30:34 PDT 2025; root:xnu-11417.140.69~1/RELEASE_ARM64_T8103

JSONBenchmarks

Canada-decodeFromJSON metrics

Time (total CPU): results within specified thresholds, fold down for details.

| Time (total CPU) (μs) * | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|---|---|---|---|---|---|---|---|---|
| old_coders | 27 | 27 | 27 | 27 | 27 | 27 | 28 | 112 |
| Current_run | 27 | 27 | 28 | 28 | 28 | 28 | 28 | 109 |
| Δ | 0 | 0 | 1 | 1 | 1 | 1 | 0 | -3 |
| Improvement % | 0 | 0 | -4 | -4 | -4 | -4 | 0 | -3 |

Throughput (# / s): results within specified thresholds, fold down for details.

| Throughput (# / s) (K) | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|---|---|---|---|---|---|---|---|---|
| old_coders | 38 | 38 | 37 | 37 | 37 | 37 | 36 | 112 |
| Current_run | 37 | 36 | 36 | 36 | 36 | 36 | 36 | 109 |
| Δ | -1 | -2 | -1 | -1 | -1 | -1 | 0 | -3 |
| Improvement % | -3 | -5 | -3 | -3 | -3 | -3 | 0 | -3 |

Canada-encodeToJSON metrics

Time (total CPU): results within specified thresholds, fold down for details.

| Time (total CPU) (μs) * | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|---|---|---|---|---|---|---|---|---|
| old_coders | 54 | 54 | 54 | 54 | 55 | 55 | 55 | 56 |
| Current_run | 54 | 54 | 54 | 54 | 55 | 55 | 55 | 56 |
| Δ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Improvement % | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

Throughput (# / s): results within specified thresholds, fold down for details.

| Throughput (# / s) (K) | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|---|---|---|---|---|---|---|---|---|
| old_coders | 19 | 18 | 18 | 18 | 18 | 18 | 18 | 56 |
| Current_run | 19 | 18 | 18 | 18 | 18 | 18 | 18 | 56 |
| Δ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Improvement % | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

Twitter-decodeFromJSON metrics

Time (total CPU): results within specified thresholds, fold down for details.

| Time (total CPU) (ns) * | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|---|---|---|---|---|---|---|---|---|
| old_coders | 2150 | 2163 | 2169 | 2198 | 2224 | 2292 | 2448 | 1374 |
| Current_run | 2152 | 2165 | 2167 | 2175 | 2204 | 2247 | 2477 | 1379 |
| Δ | 2 | 2 | -2 | -23 | -20 | -45 | 29 | 5 |
| Improvement % | 0 | 0 | 0 | 1 | 1 | 2 | -1 | 5 |

Throughput (# / s): results within specified thresholds, fold down for details.

| Throughput (# / s) (K) | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|---|---|---|---|---|---|---|---|---|
| old_coders | 465 | 463 | 461 | 456 | 450 | 435 | 389 | 1374 |
| Current_run | 465 | 463 | 462 | 460 | 454 | 446 | 404 | 1379 |
| Δ | 0 | 0 | 1 | 4 | 4 | 11 | 15 | 5 |
| Improvement % | 0 | 0 | 0 | 1 | 1 | 3 | 4 | 5 |

Twitter-encodeToJSON metrics

Time (total CPU): results within specified thresholds, fold down for details.

| Time (total CPU) (ns) * | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|---|---|---|---|---|---|---|---|---|
| old_coders | 1078 | 1083 | 1085 | 1089 | 1094 | 1155 | 1314 | 2752 |
| Current_run | 1085 | 1091 | 1093 | 1095 | 1100 | 1135 | 1280 | 2738 |
| Δ | 7 | 8 | 8 | 6 | 6 | -20 | -34 | -14 |
| Improvement % | -1 | -1 | -1 | -1 | -1 | 2 | 3 | -14 |

Throughput (# / s): results within specified thresholds, fold down for details.

| Throughput (# / s) (K) | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|---|---|---|---|---|---|---|---|---|
| old_coders | 928 | 924 | 922 | 920 | 915 | 864 | 629 | 2752 |
| Current_run | 922 | 918 | 917 | 914 | 910 | 882 | 782 | 2738 |
| Δ | -6 | -6 | -5 | -6 | -5 | 18 | 153 | -14 |
| Improvement % | -1 | -1 | -1 | -1 | -1 | 2 | 24 | -14 |

@ChrisBenua (Contributor, Author)

Updated benchmarks after the last commit:

Host 'Christians-MacBook-Pro.local' with 8 'arm64' processors with 16 GB memory, running:
Darwin Kernel Version 24.6.0: Mon Jul 14 11:30:34 PDT 2025; root:xnu-11417.140.69~1/RELEASE_ARM64_T8103

JSONBenchmarks

Canada-decodeFromJSON metrics

Time (total CPU): results within specified thresholds, fold down for details.

| Time (total CPU) (μs) * | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|---|---|---|---|---|---|---|---|---|
| old_coders | 27 | 27 | 27 | 27 | 27 | 27 | 28 | 112 |
| Current_run | 26 | 26 | 26 | 26 | 27 | 27 | 27 | 114 |
| Δ | -1 | -1 | -1 | -1 | 0 | 0 | -1 | 2 |
| Improvement % | 4 | 4 | 4 | 4 | 0 | 0 | 4 | 2 |

Throughput (# / s): results within specified thresholds, fold down for details.

| Throughput (# / s) (K) | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|---|---|---|---|---|---|---|---|---|
| old_coders | 38 | 38 | 37 | 37 | 37 | 37 | 36 | 112 |
| Current_run | 38 | 38 | 38 | 38 | 37 | 37 | 37 | 114 |
| Δ | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 2 |
| Improvement % | 0 | 0 | 3 | 3 | 0 | 0 | 3 | 2 |

Canada-encodeToJSON metrics

Time (total CPU): results within specified thresholds, fold down for details.

| Time (total CPU) (μs) * | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|---|---|---|---|---|---|---|---|---|
| old_coders | 54 | 54 | 54 | 54 | 55 | 55 | 55 | 56 |
| Current_run | 54 | 54 | 54 | 54 | 54 | 56 | 56 | 56 |
| Δ | 0 | 0 | 0 | 0 | -1 | 1 | 1 | 0 |
| Improvement % | 0 | 0 | 0 | 0 | 2 | -2 | -2 | 0 |

Throughput (# / s): results within specified thresholds, fold down for details.

| Throughput (# / s) (K) | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|---|---|---|---|---|---|---|---|---|
| old_coders | 19 | 18 | 18 | 18 | 18 | 18 | 18 | 56 |
| Current_run | 19 | 18 | 18 | 18 | 18 | 18 | 18 | 56 |
| Δ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Improvement % | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

Twitter-decodeFromJSON metrics

Time (total CPU): results within specified thresholds, fold down for details.

| Time (total CPU) (ns) * | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|---|---|---|---|---|---|---|---|---|
| old_coders | 2150 | 2163 | 2169 | 2198 | 2224 | 2292 | 2448 | 1374 |
| Current_run | 2168 | 2175 | 2179 | 2181 | 2191 | 2255 | 2451 | 1374 |
| Δ | 18 | 12 | 10 | -17 | -33 | -37 | 3 | 0 |
| Improvement % | -1 | -1 | 0 | 1 | 1 | 2 | 0 | 0 |

Throughput (# / s): results within specified thresholds, fold down for details.

| Throughput (# / s) (K) | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|---|---|---|---|---|---|---|---|---|
| old_coders | 465 | 463 | 461 | 456 | 450 | 435 | 389 | 1374 |
| Current_run | 462 | 460 | 459 | 459 | 457 | 444 | 408 | 1374 |
| Δ | -3 | -3 | -2 | 3 | 7 | 9 | 19 | 0 |
| Improvement % | -1 | -1 | 0 | 1 | 2 | 2 | 5 | 0 |

Twitter-encodeToJSON metrics

Time (total CPU): results within specified thresholds, fold down for details.

| Time (total CPU) (ns) * | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|---|---|---|---|---|---|---|---|---|
| old_coders | 1078 | 1083 | 1085 | 1089 | 1094 | 1155 | 1314 | 2752 |
| Current_run | 1090 | 1096 | 1099 | 1102 | 1106 | 1134 | 1265 | 2723 |
| Δ | 12 | 13 | 14 | 13 | 12 | -21 | -49 | -29 |
| Improvement % | -1 | -1 | -1 | -1 | -1 | 2 | 4 | -29 |

Throughput (# / s): results within specified thresholds, fold down for details.

| Throughput (# / s) (K) | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|---|---|---|---|---|---|---|---|---|
| old_coders | 928 | 924 | 922 | 920 | 915 | 864 | 629 | 2752 |
| Current_run | 918 | 913 | 911 | 908 | 906 | 883 | 791 | 2723 |
| Δ | -10 | -11 | -11 | -12 | -9 | 19 | 162 | -29 |
| Improvement % | -1 | -1 | -1 | -1 | -1 | 2 | 26 | -29 |

@ChrisBenua (Contributor, Author)

@jmschonfeld can you please run CI tests? I don't have permission for that.

Thanks in advance!

@jmschonfeld (Contributor)

@swift-ci please test

@ChrisBenua (Contributor, Author)

@jmschonfeld all CI checks were successful.

Do you find the solution satisfactory? If so, could you please advise on the next steps for getting the PR merged?

@ChrisBenua (Contributor, Author)

@jmschonfeld I've removed the unnecessary diff and added an identical `isDefault` computed property in both files.

Can we run CI checks again?

Does this approach meet your expectations? If so, I'd appreciate your guidance on how we can proceed with merging the PR.

@jmschonfeld (Contributor)

@swift-ci please test

@jmschonfeld (Contributor) left a comment:

Change seems ok to me on a read through, but I'll let @kperryua do a more in-depth review since he's had more experience investigating performance in this area

@kperryua (Contributor) left a comment:

In theory this looks fine, but I'm concerned about the not-spectacular benchmark results. You mentioned your benchmark performs around 35% better. Do you have any explanation for the discrepancy?

```diff
     return .number(decimal.description)
-} else if let encodable = value as? _JSONStringDictionaryEncodableMarker {
+} else if !options.keyEncodingStrategy.isDefault, let encodable = value as? _JSONStringDictionaryEncodableMarker {
     return try self.wrap(encodable as! [String:Encodable], for: additionalKey)
```

@kperryua (Contributor):

Are we able to use _specializingCast here as well?

@ChrisBenua (Contributor, Author):

Unfortunately, no, we can't use it here. `_specializingCast` internally performs a type equality check, whereas here we use `as?` to check protocol conformance, so `_specializingCast` is inapplicable.

@ChrisBenua (Contributor, Author), Sep 30, 2025:

```swift
@inline(__always)
internal func _specializingCast<Input, Output>(_ value: Input, to type: Output.Type) -> Output? {
  guard Input.self == Output.self else { return nil } // this check will always fail when we check for protocol conformance
  return _identityCast(value, to: type)
}
```
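
For illustration, here is a self-contained toy version (with hypothetical stand-ins for the private runtime helpers) showing why the metatype equality check can never succeed when `Output` is a protocol existential:

```swift
@inline(__always)
func specializingCast<Input, Output>(_ value: Input, to type: Output.Type) -> Output? {
    // Exact metatype equality: a concrete type is never equal to the
    // existential type of a protocol it conforms to.
    guard Input.self == Output.self else { return nil }
    return value as! Output
}

protocol Marker {}
struct Payload: Marker {}

let payload = Payload()
print(specializingCast(payload, to: Payload.self) != nil) // true: Payload.self == Payload.self
print(specializingCast(payload, to: Marker.self) != nil)  // false: Payload.self != (any Marker).self
```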

@ChrisBenua (Contributor, Author)

> In theory this looks fine, but I'm concerned about the not-spectacular benchmark results. You mentioned your benchmark performs around 35% better. Do you have any explanation for the discrepancy?

@kperryua The reason we can't observe the boost I mentioned in my thread on forums.swift.org is quite simple: when running benchmarks, we run the same decoding/encoding 1,000,000 times without relaunching. The catch is that the swift_conformsToProtocol method is really slow only the first time for each pair of arguments (class/enum/struct and protocol), so its first-iteration overhead is barely noticeable when averaged over 1,000,000 iterations.

That's why I created my own benchmark to illustrate how massive the overhead caused by swift_conformsToProtocol can be; I've described it in my thread on forums.swift.org. In short, I generated 10k codable classes, united into 2500 groups of 4 classes each. The benchmark is simple: I decode the same 320-byte JSON 2500 times, but into a different class each time (A1, A5, ..., A9997), so each type is decoded exactly once and swift_conformsToProtocol never hits its internal in-memory cache.
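
Schematically, the benchmark does the following (a toy two-type version; the real one iterates over the 10k generated classes):

```swift
import Foundation

// Stand-ins for the generated codable classes A1, A5, ..., A9997.
struct A1: Codable { let id: Int; let name: String }
struct A5: Codable { let id: Int; let name: String }

let data = Data(#"{"id": 1, "name": "x"}"#.utf8)
let decoder = JSONDecoder()

let clock = ContinuousClock()
let elapsed = try clock.measure {
    // Each type is decoded exactly once, so every decode pays the
    // first-time swift_conformsToProtocol cost for its type/protocol
    // pairs instead of hitting the runtime's conformance cache.
    _ = try decoder.decode(A1.self, from: data)
    _ = try decoder.decode(A5.self, from: data)
}
print("cold-cache decode time:", elapsed)
```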

The benchmark results are available there; I'll also paste them here:

JSONDecoder

In this benchmark I've measured performance in 4 variations:

  • standard JSONDecoder
  • standard JSONDecoder + String as CodingKey
  • optimized JSONDecoder
  • optimized JSONDecoder + String as CodingKey

| quantile | 0.25 | 0.5 | 0.75 |
|---|---|---|---|
| standard JSONDecoder | 5.81 s | 5.826 s | 5.86 s |
| standard JSONDecoder + String as CodingKey | 3.24 s (↑44%) | 3.26 s (↑44%) | 3.29 s (↑43.9%) |
| optimized JSONDecoder | 2.64 s (↑55%) | 2.65 s (↑55%) | 2.66 s (↑54.6%) |
| optimized JSONDecoder + String as CodingKey | 0.113 s (↑98%) | 0.114 s (↑98%) | 0.116 s (↑98%) |

JSONEncoder

In this benchmark I've measured performance in 4 variations:

  • standard JSONEncoder
  • standard JSONEncoder + String as CodingKey
  • optimized JSONEncoder
  • optimized JSONEncoder + String as CodingKey

| quantile | 0.25 | 0.5 | 0.75 |
|---|---|---|---|
| standard JSONEncoder | 8.06 s | 8.08 s | 8.12 s |
| standard JSONEncoder + String as CodingKey | 5.49 s (↑32%) | 5.52 s (↑32%) | 5.55 s (↑32%) |
| optimized JSONEncoder | 2.67 s (↑67%) | 2.68 s (↑67%) | 2.69 s (↑67%) |
| optimized JSONEncoder + String as CodingKey | 0.148 s (↑98.1%) | 0.149 s (↑98.2%) | 0.151 s (↑98.1%) |

So you can see how devastating the overhead from swift_conformsToProtocol is when decoding or encoding for the first time.

As for the 35% boost: I was comparing the latest version of this PR against the previous one, which used `as?` instead of `_specializingCast` in the `_asDirectArrayEncodable` function, and I was using String as CodingKey to remove any other Swift runtime overhead. The `_specializingCast` version performed 35% better than the `as?` version.

Also, I've merged the same optimisation into ZippyJSON and ReerJSON.

@kperryua (Contributor)

@ChrisBenua
Thank you for the summary.

I think this is another interesting tradeoff point, similar to the direct-array optimization (though to a lesser degree). The question becomes whether the swift_conformsToProtocol cache is worthwhile or not. Of course, if a client is using many different types, or only performing a small number of encode or decode operations per process lifetime, the cache is pure unwanted overhead. However, there will definitely be clients that run for long periods of time for which the cost of the cache is amortized away into irrelevance.

It seems clear that in the average long-running case, the benchmarks suggest that the swift_conformsToProtocol solution is preferred. However, it does seem like the regression for this case is minimal, and the benefit for clients that run for short lifetimes could be more significant.

In your opinion, does this change achieve the optimal balance between these two use cases?

@ChrisBenua (Contributor, Author)

> @ChrisBenua Thank you for the summary.
>
> I think this is another interesting tradeoff point, similar to the direct-array optimization (though to a lesser degree). The question becomes whether the swift_conformsToProtocol cache is worthwhile or not. Of course, if a client is using many different types, or only performing a small number of encode or decode operations per process lifetime, the cache is pure unwanted overhead. However, there will definitely be clients that run for long periods of time for which the cost of the cache is amortized away into irrelevance.
>
> It seems clear that in the average long-running case, the benchmarks suggest that the swift_conformsToProtocol solution is preferred. However, it does seem like the regression for this case is minimal, and the benefit for clients that run for short lifetimes could be more significant.
>
> In your opinion, does this change achieve the optimal balance between these two use cases?

@kperryua Thanks for sharing your thoughts on this subject!

You're absolutely right! There is minimal regression in the benchmarks, and, for sure, it can affect users of JSONDecoder/JSONEncoder who decode/encode the same types many times over, but I'm sure the difference will be barely noticeable in that case.

But if we optimise the first decoding/encoding, we can achieve better performance in mobile app startup scenarios! For example, my team and I measured the performance of Foundation.JSONDecoder against our version, which is very similar to this PR, and we saw massive improvements: the total time spent in JSONDecoder.decode was reduced by 50%, and the same for JSONEncoder.encode. Detailed measurements below:

Measurements

We have 80k measurements from different devices: ~40k with the optimized JSONDecoder and JSONEncoder, and ~40k with the standard JSONDecoder and JSONEncoder, both with duration logging.

| quantile | 0.1 | 0.25 | 0.5 | 0.75 | 0.9 |
|---|---|---|---|---|---|
| standard JSONDecoder | 198 ms | 282 ms | 422 ms | 667 ms | 1017 ms |
| optimized JSONDecoder | 100 ms | 133 ms | 200 ms | 322 ms | 528 ms |
| Difference | ↑49.5% | ↑52.8% | ↑52.6% | ↑51.7% | ↑48.1% |

And for JSONEncoder:

| quantile | 0.1 | 0.25 | 0.5 | 0.75 | 0.9 |
|---|---|---|---|---|---|
| standard JSONEncoder | 59 ms | 94 ms | 159 ms | 289 ms | 547 ms |
| optimized JSONEncoder | 14 ms | 30 ms | 73 ms | 135 ms | 220 ms |
| Difference | ↑76% | ↑68% | ↑54% | ↑53.2% | ↑59.8% |

The custom JSONDecoder/Encoder with the changes from this PR is now used by 95% of our app's users.

But ours is a very large app with over 150k protocol conformance descriptors, so small apps will see a somewhat smaller boost.

Most application loading scenarios follow a similar pattern:

  • Executing URLSession requests
  • Parsing the results (typically involving distinct model sets for each screen)
  • Displaying the majority of the data on-screen, while caching the remainder for future use

By implementing the proposed changes, we can significantly reduce the time required to parse results during the initial load. In my view, this optimization has the potential to enhance loading performance across a wide range of applications, offering substantial benefits with minimal drawbacks.

I presented this solution at a recent local conference, and I am pleased to report that at least two companies have already adopted the optimized version of JSONDecoder/Encoder and deployed it to production. They subsequently contacted me to share comparable performance improvements.

One relatively small application achieved a 25% improvement using the modified JSONDecoder/Encoder from this pull request. Another application, similar in size to ours, reported a 40% boost.

As a performance engineer, I am genuinely excited by the extent to which these relatively minor changes can yield significant optimizations across numerous applications. I would greatly appreciate your support and alignment with this perspective.

@ChrisBenua ChrisBenua requested a review from kperryua September 30, 2025 11:18
@kperryua (Contributor) left a comment:

Ok. Thank you for the useful discussion. Your comments about application loading time are especially compelling.

@kperryua kperryua merged commit 97b4581 into swiftlang:main Sep 30, 2025
19 checks passed