
Fix: non utf8 schema support #288

Open

dimbo4ka wants to merge 5 commits into yandex:main from dimbo4ka:fix/non-utf8-schema-support

Conversation

@dimbo4ka commented Feb 2, 2026

Summary by Sourcery

Add support for handling non-UTF-8 ClickHouse schemas and validate compatibility via integration tests.

Bug Fixes:

  • Fallback to latin-1 when reading table metadata and downloaded data that contain non-UTF-8 bytes to prevent schema loading failures.

Enhancements:

  • Extend ClickHouse test client with a raw-bytes execute method to support non-UTF-8 query payloads in tests.

Tests:

  • Add integration scenarios covering backup/restore of UTF-8 schemas with diverse characters and of tables whose schemas contain invalid UTF-8 bytes.

sourcery-ai bot commented Feb 2, 2026

Reviewer's Guide

Adds latin-1 fallback support for non-UTF-8 ClickHouse table schemas during backup/restore and introduces integration tests (including a raw-bytes ClickHouse client method) to verify behavior with both valid UTF-8 and invalid-UTF-8 metadata.

Sequence diagram for fallback decoding when loading table metadata

sequenceDiagram
    participant BackupManager
    participant TableLogic as TableLogic_table_module
    participant FileSystem

    BackupManager->>TableLogic: _load_create_statement_from_disk(table)
    TableLogic->>FileSystem: read_text(utf-8)
    alt utf-8 readable
        FileSystem-->>TableLogic: create_statement (utf-8)
        TableLogic-->>BackupManager: create_statement
    else UnicodeDecodeError
        TableLogic->>TableLogic: log warning (non UTF-8 bytes)
        TableLogic->>FileSystem: read_text(latin-1)
        FileSystem-->>TableLogic: create_statement (latin-1)
        TableLogic-->>BackupManager: create_statement
    end

Sequence diagram for fallback decoding in storage loader download_data

sequenceDiagram
    participant RestoreProcess
    participant StorageLoader
    participant PhysicalLoader as PhysicalLoader_impl

    RestoreProcess->>StorageLoader: download_data(remote_path, encoding)
    StorageLoader->>PhysicalLoader: download_data(remote_path, is_async, encryption)
    PhysicalLoader-->>StorageLoader: data (bytes)
    alt encoding is falsy
        StorageLoader-->>RestoreProcess: data (bytes)
    else encoding is truthy
        StorageLoader->>StorageLoader: data.decode(utf-8)
        alt utf-8 succeeds
            StorageLoader-->>RestoreProcess: text (utf-8)
        else UnicodeDecodeError
            StorageLoader->>StorageLoader: data.decode(latin-1)
            StorageLoader-->>RestoreProcess: text (latin-1)
        end
    end

Updated class diagram for storage loader and ClickHouse test client

classDiagram
    class StorageLoader {
        +download_data(remote_path, encoding, is_async, encryption) any
        +download_file(remote_path, local_path, is_async, encryption) void
    }

    class PhysicalLoader {
        +download_data(remote_path, is_async, encryption) bytes
    }

    StorageLoader --> PhysicalLoader : uses

    class TableModule {
        +_load_create_statement_from_disk(table) str
    }

    class Table {
        +database str
        +name str
        +metadata_path str
    }

    TableModule --> Table : reads_metadata

    class ClickHouseTestClient {
        +query(sql) str
        +query_raw_bytes(sql) bytes
    }

    ClickHouseTestClient ..> StorageLoader : used_in_integration_tests

File-Level Changes

Add latin-1 fallback when reading table metadata create statements from disk if UTF-8 decoding fails.
  • Wrap metadata file UTF-8 read in try/except around UnicodeDecodeError in _load_create_statement_from_disk
  • On UnicodeDecodeError, log a warning including database and table names about non-UTF-8 bytes
  • Re-read the metadata file using latin-1 encoding as a fallback
ch_backup/logic/table.py
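As a sketch of the behavior described above (the function name, signature, and log wording are illustrative, not the PR's exact code), the metadata read with its latin-1 fallback could look like:

```python
import logging
from pathlib import Path


def load_create_statement_from_disk(metadata_path: str, db: str, table: str) -> str:
    """Read a CREATE TABLE statement, falling back to latin-1 on invalid UTF-8.

    Illustrative sketch of the change in ch_backup/logic/table.py; the real
    helper may differ in naming and signature.
    """
    raw = Path(metadata_path).read_bytes()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        logging.warning(
            "Metadata of `%s`.`%s` contains non-UTF-8 bytes, falling back to latin-1",
            db,
            table,
        )
        # latin-1 maps every byte 0x00-0xFF to a code point, so this decode
        # cannot fail and round-trips the bytes via str.encode("latin-1").
        return raw.decode("latin-1")
```

latin-1 is a natural fallback choice precisely because it decodes any byte sequence losslessly.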
Add latin-1 fallback when decoding downloaded backup data that is not valid UTF-8.
  • Change download_data to always attempt UTF-8 decoding when an encoding is requested
  • On UnicodeDecodeError while decoding downloaded data as UTF-8, retry decoding using latin-1
  • Preserve the behavior of returning raw bytes when no encoding is supplied
ch_backup/storage/loader.py
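The decoding step described above can be sketched as a standalone helper (hedged: in the PR the logic lives inside `StorageLoader.download_data`, which also performs the actual download):

```python
from typing import Optional, Union


def decode_downloaded(data: bytes, encoding: Optional[str]) -> Union[bytes, str]:
    """Sketch of the PR's download_data decoding behavior.

    Returns raw bytes when no encoding is requested; otherwise tries UTF-8
    and falls back to latin-1, as described in the change summary above.
    """
    if not encoding:
        return data
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Never fails: latin-1 assigns a character to every byte value.
        return data.decode("latin-1")
```

Note that, as the review comment points out, this shape ignores any non-UTF-8 `encoding` the caller passes.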
Extend ClickHouse test client to support executing raw byte queries for non-UTF-8 payloads.
  • Introduce execute_raw method that posts raw bytes using the existing _query helper
  • Keep existing execute string-based method unchanged for normal queries
tests/integration/modules/clickhouse.py
Add integration steps and features to test schema encoding compatibility, including non-UTF-8 table metadata.
  • Add a step definition to create a table whose CREATE TABLE statement is sent as cp1251-encoded bytes with a Cyrillic column name
  • Insert test data into the non-UTF-8 table for backup/restore validation
  • Create a new feature spec that backs up and restores multiple UTF-8-encoded tables (ASCII, emoji, Cyrillic, Chinese) and validates schema and data equality
  • Add a scenario that backs up a table with invalid UTF-8 metadata and validates it can be restored and queried for existence
tests/integration/steps/clickhouse.py
tests/integration/features/schema_encoding_compatibility.feature
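To illustrate why the new step must send raw bytes: a cp1251-encoded CREATE statement with a Cyrillic identifier is not valid UTF-8 and cannot travel through a str-based execute. A minimal sketch (the table and column names are illustrative, not taken from the PR's test files):

```python
# Illustrative payload; the actual feature/step files may use different names.
schema = (
    "CREATE TABLE test_db.table_rus (`Привет` UInt32) "
    "ENGINE = MergeTree() ORDER BY `Привет`"
)
payload = schema.encode("cp1251")  # raw bytes, NOT valid UTF-8


def is_valid_utf8(data: bytes) -> bool:
    """Check whether a byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Here `is_valid_utf8(payload)` is False, which is exactly the case the raw-bytes execute method exists to exercise.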


@sourcery-ai bot left a comment


Hey - I've found 1 issue and left some high-level feedback:

  • In StorageLoader.download_data, the encoding argument is now ignored and UTF-8/latin-1 is hardcoded; either honor the passed encoding (including supporting non-UTF-8 encodings) or remove the parameter to avoid a misleading API.
  • Consider adding a warning log when download_data falls back from UTF-8 to latin-1 decoding, similar to _load_create_statement_from_disk, to make non-UTF-8 data easier to detect in production.

## Individual Comments

### Comment 1
<location> `ch_backup/storage/loader.py:142-147` </location>
<code_context>
             remote_path, is_async=is_async, encryption=encryption
         )
-        return data.decode(encoding) if encoding else data
+        if not encoding:
+            return data
+        try:
+            return data.decode("utf-8")
+        except UnicodeDecodeError:
+            return data.decode("latin-1")

     def download_file(
</code_context>

<issue_to_address>
**issue (bug_risk):** The `encoding` parameter is now ignored when truthy, which is a behavior change and likely a bug.

With the new logic, callers that pass `encoding="ascii"`, `"utf-16"`, or rely on strict decoding will now always get a UTF-8→latin-1 fallback instead, which can change behavior or hide errors.

If you want a UTF-8→latin-1 fallback while still respecting the caller’s choice, you could do something like:

```python
if not encoding:
    return data
try:
    return data.decode(encoding)
except UnicodeDecodeError:
    if encoding.lower() == "utf-8":
        return data.decode("latin-1")
    raise
```

so that the fallback only applies when UTF-8 was actually requested.
</issue_to_address>


Contributor

Instead of a separate execute_raw:

def execute(self, query: Union[str, bytes]) -> None:
    if isinstance(query, bytes):
        ...

@@ -0,0 +1,63 @@
Feature: Non-UTF-8 schema encoding support
Contributor

Do these tests fail without the changes?

Author

yes

ch_client.execute(table_schema)


@given("we have created non-UTF-8 test table on {node}")
Contributor

Let's make a "we execute query on {node:w} with encoding {encoding:w}" step instead of this.
It seems more versatile.

Author

We can't parameterize the encoding in the step since Gherkin doesn't support non-UTF-8 text

Contributor

@aalexfvk commented Feb 3, 2026

How about escaping like this ("Привет" in cp1251)? Most likely the steps and/or the client will need to be refined:

    """
    CREATE TABLE test_db_01.table_rus (
        EventDate DateTime,
        CounterID UInt32,
        `\xcf\xf0\xe8\xe2\xe5\xf2` UInt32
    )
    ENGINE = MergeTree()
    PARTITION BY CounterID % 10
    ORDER BY (CounterID, EventDate)
    """

def download_data(
    self, remote_path, is_async=False, encryption=False, encoding="utf-8"
):
def download_data(self, remote_path, is_async=False, encryption=False):
Contributor

Let's just replace encoding="latin-1"

Contributor

Can there be a problem (corrupted text)?

data=query.encode("utf-8"),

If we read the data as latin-1 but then send it to ClickHouse encoded as UTF-8?

Author

@dimbo4ka commented Feb 3, 2026

Let's just replace encoding="latin-1"

This change breaks the tests for valid UTF-8 data:

      ASSERT FAILED: 
      Expected: <{('test_db', 'table_ascii'): "CREATE TABLE test_db.table_ascii (`id` Int32, `name_ascii` String COMMENT 'ascii') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192", ('test_db', 'table_chinese'): "CREATE TABLE test_db.table_chinese (`id` Int32, `name_试` String COMMENT '试') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192", ('test_db', 'table_cyrillic'): "CREATE TABLE test_db.table_cyrillic (`id` Int32, `name_абвгд` String COMMENT 'абвгд') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192", ('test_db', 'table_emoji'): "CREATE TABLE test_db.table_emoji (`id` Int32, `name_😈` String COMMENT '😈') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192"}>
           but: was <{('test_db', 'table_ascii'): "CREATE TABLE test_db.table_ascii (`id` Int32, `name_ascii` String COMMENT 'ascii') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192", ('test_db', 'table_chinese'): "CREATE TABLE test_db.table_chinese (`id` Int32, `name_è¯\x95` String COMMENT 'è¯\x95') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192", ('test_db', 'table_cyrillic'): "CREATE TABLE test_db.table_cyrillic (`id` Int32, `name_абвгд` String COMMENT 'абвгд') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192", ('test_db', 'table_emoji'): "CREATE TABLE test_db.table_emoji (`id` Int32, `name_ð\x9f\x98\x88` String COMMENT 'ð\x9f\x98\x88') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192"}>

The issue is that `bytes.decode()` returns a `str` (a Unicode string), not `bytes`. When we decode UTF-8 bytes as latin-1, each byte becomes a separate character.
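A minimal demonstration of that effect, matching the garbled column names in the assertion failure above:

```python
# Decoding UTF-8 bytes as latin-1 maps each byte to its own character,
# producing the mojibake seen in the failed test output above.
emoji_utf8 = "😈".encode("utf-8")        # 4 bytes: f0 9f 98 88
mojibake = emoji_utf8.decode("latin-1")  # 4 chars: 'ð\x9f\x98\x88'
assert mojibake == "\xf0\x9f\x98\x88"
assert len(mojibake) == 4
# The original bytes remain recoverable, so the fallback round-trips
# data losslessly but is not safe for display or text comparison.
assert mojibake.encode("latin-1") == emoji_utf8
```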
