-
Notifications
You must be signed in to change notification settings - Fork 7.8k
feat: Support exception tagging for HTTP response #88818
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Support exception tagging for HTTP response #88818
Conversation
|
Workflow [PR], commit [848832b] Summary: ❌
|
Fixes: ClickHouse#75175 This PR adds consistent and reliable way to return exceptions occured in mid-stream during serving HTTP request. It implements the proposal agreed on the attached issue. Everytime the exception occurs, it will be written in following form (which makes it easier for clients to parse the exception message reliably) ``` HTTP/1.1 200 OK↵ X-ClickHouse-Exception-Tag: PU1FNUFH98↵ ↵ bodybodybodybody↵ __exception__PU1FNUFH98↵ Big bam accrued right while reading the data↵ 45 PU1FNUFH98__exception__↵ ``` Here 45 is the size of the Big bam accrued right while reading the data↵ message, and PU1FNUFH98 is the exception token. Also this behavior change comes into play only when the settings `http_exception_tagging` is 1 (default 0) Otherwise it's a no-op Signed-off-by: Kaviraj <[email protected]>
d056aec to
b8ee6cc
Compare
slvrtrn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tavplubix PTAL
src/Core/Settings.cpp
Outdated
| DECLARE(UInt64, http_response_buffer_size, 0, R"( | ||
| The number of bytes to buffer in the server memory before sending a HTTP response to the client or flushing to disk (when http_wait_end_of_query is enabled). | ||
| )", 0) \ | ||
| DECLARE(Bool, http_exception_tagging, false, R"( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's also test if we can enable it for a readonly user
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am also curious: does it throw if it is set on an older CH version, e.g. do we need to track CH version on the language client side before setting it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had a chat with @slabko and @tavplubix. We decided not to have this settings at all.
Given, we made sure we have a \r\n before between exception marker and a tag. It wouldn't break any existing behavior and this feature can be enabled by default without needing any toggle.
I will remove the this setting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although it's not supposed to break any existing behavior (in the worst case, a weird string will be added to exception message), it still makes sense to use "Backward Incompatible Change" as the changelog category just for awareness (because who knows how some trash-scripts may parse the exception message)
Also, afair we were considering disabling http_write_exception_in_output_format by default (but it's also okay to keep this setting as is and disable it on the client side if the client expects new exception format for all output formats including XML/JSON*)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(but it's also okay to keep this setting as is and disable it on the client side if the client expects new exception format for all output formats including XML/JSON*)
As a side note, we strive to minimise the number of settings configured by language clients to prevent changes to the default server behaviour and potential permission issues for read-only users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an example of a setting that is supposed to be changed by language clients (like other format/protocol settings). Because if you use a language client - you don't care about how the client communicates with the server, the client hides this from you. But the client may need to change settings like this.
potential permission issues for read-only users
This is the main problem. We can make such settings changeable for readonly users (it's already possible with <changeable_in_readonly/> settings constraint) if you think that it can be useful for other format/protocol settings as well. If not - we can change the default of this setting (but old versions will still have the old default, and you will have similar problems for a year until the old versions reach EOL)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had a chat with @tavplubix about this. We have following options.
-
Disable
http_write_exception_in_output_formatby default so that language clients will have consistent exception block (as implemented in this PR) for all the formats including JSON/XML without sending any additional settings. But it will be only available in future clickhouse server versions. Language clients still have to deal with it for old clickhouse server anyway. And this would still be a problem for ReadOnly user (who cannot change this settings) for older clickhouse server. -
We keep the setting as it is now and Make this settings
http_write_exception_in_output_formatchangeable even by ReadOnly user. But again we cannot backport this and still only be available in new clickhouse server and language clients have to handle it for older clickhouse servers. -
Do both (1) and (2). Again only for future clickhouse server. No backport.
@tavplubix also mentioned, for option (2), we can backport it on the ClickHouse cloud though. But still we have to tell our both OSS users and Cloud users about this difference in our docs.
IMHO, I personally think (1) is good option without much "docs" complexity. And on the client side, we can do always pass http_write_exception_in_output_format=0 for now till older versions reach EOL. And for the ReadOnly user, if the parsing failed -> we look for __exception__ marker.
I will bring this up in today's team sync and see what everyone thinks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for option #1 to change the default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Option 1 for sure
Signed-off-by: Kaviraj <[email protected]>
After a discussion with people, we agreed this can be default behavior without needing any toggle by new settings. Signed-off-by: Kaviraj <[email protected]>
Signed-off-by: Kaviraj <[email protected]>
Signed-off-by: Kaviraj <[email protected]>
Signed-off-by: Kaviraj <[email protected]>
Signed-off-by: Kaviraj <[email protected]>
| echo "Test 2: Verify exception tag is alphanumeric" | ||
| TAG=$(${CLICKHOUSE_CURL} -sS -D - "${CLICKHOUSE_URL}" --data-binary "SELECT 1" | grep "X-ClickHouse-Exception-Tag" | cut -d':' -f2 | tr -d ' \r\n') | ||
| if [[ "$TAG" =~ ^[A-Za-z0-9]+$ ]] && [[ ${#TAG} -eq 8 ]]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've noticed that getRandomASCIIString (unlike FunctionRandomPrintableASCII) only uses [a-z], which means 26^8 = 37.6 bits of entropy, which is probably good enough, but it's much less than UUIDv4 has, for example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. given it's "per-request", do you think it may cause collision at our peek scale? happy to change it in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should we care about collisions between different requests?
The main purposeof having a random string is to be able to distinguish the exception marker from the data in the response. It's already quite unlikely to find the same random string in the data, but why not make it longer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad. @tavplubix explained the actual need for bigger tag :) . It's not to avoid any collision. It's to make sure tag is random enough not to appear in actual dataset.
updated to 16 byte (from 8 bytes previously)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just use UUID? Do we have any performance concerns in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can. It's just not that useful I think. Given we have entropy of 26^16 = 75.2 bits now, it's more than good enough.
|
|
||
| echo "Test 3: Check exception marker format on error" | ||
| # Get a fresh TAG from the error response itself | ||
| RESPONSE=$(${CLICKHOUSE_CURL} -sS -D - "${CLICKHOUSE_URL}&http_response_buffer_size=1" --data-binary "SELECT throwIf(number=3, 'there is a exception') FROM system.numbers SETTINGS max_block_size=1" 2>&1 || true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was failing because of buffering: the result is too small, so the exception happens before clickhouse-server starts sending the response (so clickhouse-server can respond 500 and just send the exception in the body)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tavplubix for this change :). Makes sense.
Also I notice the same two tests still failing now.
Digging into it, I noticed. It's always Test 3 or Test 4. It's always exception marker is missing in the response sometimes. I think reason is still sometimes the test get directly 500 from the server. Looks like we may also tweak other settings to run this deterministically?
2025-10-28 19:22:02 Test 4: Check reverse marker format (length + tag + __exception__)
2025-10-28 19:22:02 -Reverse marker found: OK
2025-10-28 19:22:02 +Reverse marker not found
025-10-29 02:31:45 Test 3: Check exception marker format on error
2025-10-29 02:31:45 -Tagged exception marker found: OK
2025-10-29 02:31:45 +Tagged exception marker not found
And for the buffering failures. It does expect the exception in the mid-stream may be?
2025-10-28 23:55:02 Reason: having exception in stdout:
2025-10-28 23:55:02 FAIL 4 max_result_bytes=4000000&http_response_buffer_size=2000000&http_wait_end_of_query=0 5000000
1. Remove unused `getExceptionTag()` 2. Add exception tag to `Access-Control-Expose-Headers` 3. Calculate the proper trim size of exception message. Making sure the whole exception block doesn't exceed MAX_EXCEPTION_SIZE 4. Update the tests accordingly Signed-off-by: Kaviraj <[email protected]>
Needed as we added exception tag header X-ClickHouse-Exception-Tag in Access-Control-Expose-Headers Signed-off-by: Kaviraj <[email protected]>
Signed-off-by: Kaviraj <[email protected]>
1. Uses the http buffering, response size, block size consistently 2. Remove any additional LIMIT clause Signed-off-by: Kaviraj <[email protected]>
|
|
||
| // 2 bytes represents - \r\n | ||
| // 1 byte represents - ' ' (space between <message_length> <TAG>) in the above exception block format | ||
| size_t size_message_excluded = 2 + EXCEPTION_MARKER.size() + 2 + EXCEPTION_TAG_LENGTH + 2 + sizeof(size_t) + 1 + EXCEPTION_TAG_LENGTH + 2 + EXCEPTION_MARKER.size() + 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sizeof(size_t) = 8, while toString(size_t_variable).size() can be up to 20 (18446744073709551615 fits into size_t). Although limited_message.size() is limited and 8 bytes is more than enough, this looks confusing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but limited_message.size() return size_t and the actual size depends on the arch no? say 4-byte in 32-bit machine and 8 byte in 64-bit machine.
That was main rationale for me to do that way.
I'm happy to hard-code it to 8 byte if you can confirm this won't be a problem :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but limited_message.size() return size_t and the actual size depends on the arch no? say 4-byte in 32-bit machine and 8 byte in 64-bit machine.
Yes, you're right, but we don't build clickhouse for 32-bit architectures, it's not supported
However, if we did - this code would be incorrect:
- sizeof(size_t) would be 4, but this is for binary representation of the number
- limited_message.size() can be, for example, 10000 (it's still way below the limit), but you need 5 bytes to write this number as string because there are 5 decimal digits, and it doesn't fit into 4 bytes
So this would be a bug
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense. Updated to 8 bytes.
1. Change exception tag size to 16 bytes 2. Hardcode the exception size to 8 bytes Signed-off-by: Kaviraj <[email protected]>
Updated settings changes history to reflect new and modified settings.
|
Could you please update docs as well? |
|
@kavirajk, @tavplubix, @mshustov: before merging this one, let's double-check the implementation with one of the language clients and see if it is ergonomic enough to use. Since 25.10 is out, we still have plenty of time before 25.11. |
Signed-off-by: Kaviraj <[email protected]>
Signed-off-by: Kaviraj <[email protected]>
|
@tavplubix when implementing the support on client side (in Golang client), I found a subtle bug. Bug: Currently during exception, CH server doesn't send Steps to reproduce with CH server built on this PR Notice there is no zero chunk
|
ClickHouse does this intentionally. See #68800 |
Signed-off-by: Kaviraj <[email protected]>
Signed-off-by: Kaviraj <[email protected]>
Signed-off-by: Kaviraj <[email protected]>
Signed-off-by: Kaviraj <[email protected]>
Signed-off-by: Kaviraj <[email protected]>
1. Hard tabs -> spaces 2. Context for shell commands Signed-off-by: Kaviraj <[email protected]>
docs/en/interfaces/http.md
Outdated
| ``` | ||
|
|
||
| Where `<TAG>` is a random tag of 16bytes and `<error message>` can be up to 16 KiB max (exact length can be found in `<message_length>`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<TAG>is a random tag
Should we mention the X-ClickHouse-Exception-Tag header here?
<error message>can be up to 16 KiB max
The whole block is up to 16 KiB max, the error message is slightly shorter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes to both 👍
Signed-off-by: Kaviraj <[email protected]>
|
@tavplubix thanks for the review ❤️ Addressed the doc comments. The failing tests looks unrelated. Let me know if needed anything before merging the PR :) |
239ac0e
ClickHouse now implements an improved format for mid-stream exceptions: ClickHouse/ClickHouse#88818 This change adopts the new format and uses it when possible. Otherwise, it falls back to the older approach of searching for the __exception__ marker at the end of the buffered data.
ClickHouse now implements an improved format for mid-stream exceptions: ClickHouse/ClickHouse#88818 This change adopts the new format and uses it when possible. Otherwise, it falls back to the older approach of searching for the __exception__ marker at the end of the buffered data.
ClickHouse now implements an improved format for mid-stream exceptions: ClickHouse/ClickHouse#88818 This change adopts the new format and uses it when possible. Otherwise, it falls back to the older approach of searching for the __exception__ marker at the end of the buffered data.
ClickHouse now implements an improved format for mid-stream exceptions: ClickHouse/ClickHouse#88818 This change adopts the new format and uses it when possible. Otherwise, it falls back to the older approach of searching for the __exception__ marker at the end of the buffered data.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Support exception tagging for HTTP results response to enable clients to parse exceptions more reliably. Resolves #75175. The setting
http_write_exception_in_output_formatis disabled by default for consistency across formats.Documentation entry for user-facing changes
Details
This PR adds a consistent and reliable way to return exceptions that occur mid-stream during HTTP request serving. The behavior is enabled by default and adds an X-ClickHouse-Exception-Tag header to every response.
When an exception occurs, it will be written in the following form which makes it easier for clients to parse the exception message reliably:
Here 45 is the size of the exception message, and PU1FNUFH98 is the exception token.
Resolves #75175