
Fix: non utf8 schema support #288

Open

dimbo4ka wants to merge 5 commits into yandex:main from dimbo4ka:fix/non-utf8-schema-support

Conversation

@dimbo4ka commented Feb 2, 2026

Summary by Sourcery

Add support for handling non-UTF-8 ClickHouse schemas and validate compatibility via integration tests.

Bug Fixes:

  • Fallback to latin-1 when reading table metadata and downloaded data that contain non-UTF-8 bytes to prevent schema loading failures.

Enhancements:

  • Extend ClickHouse test client with a raw-bytes execute method to support non-UTF-8 query payloads in tests.

Tests:

  • Add integration scenarios covering backup/restore of UTF-8 schemas with diverse characters and of tables whose schemas contain invalid UTF-8 bytes.

sourcery-ai bot commented Feb 2, 2026

Reviewer's Guide

Adds latin-1 fallback support for non-UTF-8 ClickHouse table schemas during backup/restore and introduces integration tests (including a raw-bytes ClickHouse client method) to verify behavior with both valid UTF-8 and invalid-UTF-8 metadata.

Sequence diagram for fallback decoding when loading table metadata

sequenceDiagram
    participant BackupManager
    participant TableLogic as TableLogic_table_module
    participant FileSystem

    BackupManager->>TableLogic: _load_create_statement_from_disk(table)
    TableLogic->>FileSystem: read_text(utf-8)
    alt utf-8 readable
        FileSystem-->>TableLogic: create_statement (utf-8)
        TableLogic-->>BackupManager: create_statement
    else UnicodeDecodeError
        TableLogic->>TableLogic: log warning (non UTF-8 bytes)
        TableLogic->>FileSystem: read_text(latin-1)
        FileSystem-->>TableLogic: create_statement (latin-1)
        TableLogic-->>BackupManager: create_statement
    end

Sequence diagram for fallback decoding in storage loader download_data

sequenceDiagram
    participant RestoreProcess
    participant StorageLoader
    participant PhysicalLoader as PhysicalLoader_impl

    RestoreProcess->>StorageLoader: download_data(remote_path, encoding)
    StorageLoader->>PhysicalLoader: download_data(remote_path, is_async, encryption)
    PhysicalLoader-->>StorageLoader: data (bytes)
    alt encoding is falsy
        StorageLoader-->>RestoreProcess: data (bytes)
    else encoding is truthy
        StorageLoader->>StorageLoader: data.decode(utf-8)
        alt utf-8 succeeds
            StorageLoader-->>RestoreProcess: text (utf-8)
        else UnicodeDecodeError
            StorageLoader->>StorageLoader: data.decode(latin-1)
            StorageLoader-->>RestoreProcess: text (latin-1)
        end
    end

Updated class diagram for storage loader and ClickHouse test client

classDiagram
    class StorageLoader {
        +download_data(remote_path, encoding, is_async, encryption) any
        +download_file(remote_path, local_path, is_async, encryption) void
    }

    class PhysicalLoader {
        +download_data(remote_path, is_async, encryption) bytes
    }

    StorageLoader --> PhysicalLoader : uses

    class TableModule {
        +_load_create_statement_from_disk(table) str
    }

    class Table {
        +database str
        +name str
        +metadata_path str
    }

    TableModule --> Table : reads_metadata

    class ClickHouseTestClient {
        +query(sql) str
        +query_raw_bytes(sql) bytes
    }

    ClickHouseTestClient ..> StorageLoader : used_in_integration_tests

File-Level Changes

Add latin-1 fallback when reading table metadata create statements from disk if UTF-8 decoding fails.
  • Wrap metadata file UTF-8 read in try/except around UnicodeDecodeError in _load_create_statement_from_disk
  • On UnicodeDecodeError, log a warning including database and table names about non-UTF-8 bytes
  • Re-read the metadata file using latin-1 encoding as a fallback
ch_backup/logic/table.py
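As a sketch of the behavior described above (the function name, signature, and log wording are illustrative, not the PR's exact code), the metadata read with its latin-1 fallback could look like:

```python
import logging
from pathlib import Path


def load_create_statement_from_disk(metadata_path: str, db: str, table: str) -> str:
    """Read a CREATE TABLE statement, falling back to latin-1 on invalid UTF-8.

    Illustrative sketch of the change in ch_backup/logic/table.py; the real
    helper may differ in naming and signature.
    """
    raw = Path(metadata_path).read_bytes()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        logging.warning(
            "Metadata of `%s`.`%s` contains non-UTF-8 bytes, falling back to latin-1",
            db,
            table,
        )
        # latin-1 maps every byte 0x00-0xFF to a code point, so this decode
        # cannot fail and round-trips the bytes via str.encode("latin-1").
        return raw.decode("latin-1")
```

latin-1 is a natural fallback choice precisely because it decodes any byte sequence losslessly.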
Add latin-1 fallback when decoding downloaded backup data that is not valid UTF-8.
  • Change download_data to always attempt UTF-8 decoding when an encoding is requested
  • On UnicodeDecodeError while decoding downloaded data as UTF-8, retry decoding using latin-1
  • Preserve the behavior of returning raw bytes when no encoding is supplied
ch_backup/storage/loader.py
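The decoding step described above can be sketched as a standalone helper (hedged: in the PR the logic lives inside `StorageLoader.download_data`, which also performs the actual download):

```python
from typing import Optional, Union


def decode_downloaded(data: bytes, encoding: Optional[str]) -> Union[bytes, str]:
    """Sketch of the PR's download_data decoding behavior.

    Returns raw bytes when no encoding is requested; otherwise tries UTF-8
    and falls back to latin-1, as described in the change summary above.
    """
    if not encoding:
        return data
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Never fails: latin-1 assigns a character to every byte value.
        return data.decode("latin-1")
```

Note that, as the review comment points out, this shape ignores any non-UTF-8 `encoding` the caller passes.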
Extend ClickHouse test client to support executing raw byte queries for non-UTF-8 payloads.
  • Introduce execute_raw method that posts raw bytes using the existing _query helper
  • Keep existing execute string-based method unchanged for normal queries
tests/integration/modules/clickhouse.py
Add integration steps and features to test schema encoding compatibility, including non-UTF-8 table metadata.
  • Add a step definition to create a table whose CREATE TABLE statement is sent as cp1251-encoded bytes with a Cyrillic column name
  • Insert test data into the non-UTF-8 table for backup/restore validation
  • Create a new feature spec that backs up and restores multiple UTF-8-encoded tables (ASCII, emoji, Cyrillic, Chinese) and validates schema and data equality
  • Add a scenario that backs up a table with invalid UTF-8 metadata and validates it can be restored and queried for existence
tests/integration/steps/clickhouse.py
tests/integration/features/schema_encoding_compatibility.feature
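To illustrate why the new step must send raw bytes: a cp1251-encoded CREATE statement with a Cyrillic identifier is not valid UTF-8 and cannot travel through a str-based execute. A minimal sketch (the table and column names are illustrative, not taken from the PR's test files):

```python
# Illustrative payload; the actual feature/step files may use different names.
schema = (
    "CREATE TABLE test_db.table_rus (`Привет` UInt32) "
    "ENGINE = MergeTree() ORDER BY `Привет`"
)
payload = schema.encode("cp1251")  # raw bytes, NOT valid UTF-8


def is_valid_utf8(data: bytes) -> bool:
    """Check whether a byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Here `is_valid_utf8(payload)` is False, which is exactly the case the raw-bytes execute method exists to exercise.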


@sourcery-ai bot left a comment


Hey - I've found 1 issue and left some high-level feedback:

  • In StorageLoader.download_data, the encoding argument is now ignored and UTF-8/latin-1 is hardcoded; either honor the passed encoding (including supporting non-UTF-8 encodings) or remove the parameter to avoid a misleading API.
  • Consider adding a warning log when download_data falls back from UTF-8 to latin-1 decoding, similar to _load_create_statement_from_disk, to make non-UTF-8 data easier to detect in production.

## Individual Comments

### Comment 1
<location> `ch_backup/storage/loader.py:142-147` </location>
<code_context>
             remote_path, is_async=is_async, encryption=encryption
         )
-        return data.decode(encoding) if encoding else data
+        if not encoding:
+            return data
+        try:
+            return data.decode("utf-8")
+        except UnicodeDecodeError:
+            return data.decode("latin-1")

     def download_file(
</code_context>

<issue_to_address>
**issue (bug_risk):** The `encoding` parameter is now ignored when truthy, which is a behavior change and likely a bug.

With the new logic, callers that pass `encoding="ascii"`, `"utf-16"`, or rely on strict decoding will now always get a UTF-8→latin-1 fallback instead, which can change behavior or hide errors.

If you want a UTF-8→latin-1 fallback while still respecting the caller’s choice, you could do something like:

```python
if not encoding:
    return data
try:
    return data.decode(encoding)
except UnicodeDecodeError:
    if encoding.lower() == "utf-8":
        return data.decode("latin-1")
    raise
```

so that the fallback only applies when UTF-8 was actually requested.
</issue_to_address>


Contributor

Instead of a separate execute_raw:

def execute(self, query: Union[str, bytes]) -> None:
    if isinstance(query, bytes):
        ...

@@ -0,0 +1,63 @@
Feature: Non-UTF-8 schema encoding support
Contributor

Do these tests fail without the changes?

Author

yes

ch_client.execute(table_schema)


@given("we have created non-UTF-8 test table on {node}")
Contributor

Let's make a "we execute query on {node:w} with encoding {encoding:w}" step instead of this.
It seems more versatile.

Author

We can't parameterize the encoding in the step since Gherkin doesn't support non-UTF-8 text

Contributor

@aalexfvk commented Feb 3, 2026

How about escaping like this ("Привет" in cp1251)? Most likely the steps and/or the client will need to be refined:

    """
    CREATE TABLE test_db_01.table_rus (
        EventDate DateTime,
        CounterID UInt32,
        `\xcf\xf0\xe8\xe2\xe5\xf2` UInt32
    )
    ENGINE = MergeTree()
    PARTITION BY CounterID % 10
    ORDER BY (CounterID, EventDate)
    """

def download_data(
    self, remote_path, is_async=False, encryption=False, encoding="utf-8"
):
def download_data(self, remote_path, is_async=False, encryption=False):
Contributor

Let's just replace encoding="latin-1"

Contributor

Can there be a problem (corrupted text)?

data=query.encode("utf-8"),

If we read the data as latin-1 but then send it to ClickHouse encoded as UTF-8?

Author

@dimbo4ka commented Feb 3, 2026

Let's just replace encoding="latin-1"

This change breaks the tests for valid UTF-8 data:

      ASSERT FAILED: 
      Expected: <{('test_db', 'table_ascii'): "CREATE TABLE test_db.table_ascii (`id` Int32, `name_ascii` String COMMENT 'ascii') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192", ('test_db', 'table_chinese'): "CREATE TABLE test_db.table_chinese (`id` Int32, `name_试` String COMMENT '试') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192", ('test_db', 'table_cyrillic'): "CREATE TABLE test_db.table_cyrillic (`id` Int32, `name_абвгд` String COMMENT 'абвгд') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192", ('test_db', 'table_emoji'): "CREATE TABLE test_db.table_emoji (`id` Int32, `name_😈` String COMMENT '😈') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192"}>
           but: was <{('test_db', 'table_ascii'): "CREATE TABLE test_db.table_ascii (`id` Int32, `name_ascii` String COMMENT 'ascii') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192", ('test_db', 'table_chinese'): "CREATE TABLE test_db.table_chinese (`id` Int32, `name_è¯\x95` String COMMENT 'è¯\x95') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192", ('test_db', 'table_cyrillic'): "CREATE TABLE test_db.table_cyrillic (`id` Int32, `name_абвгд` String COMMENT 'абвгд') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192", ('test_db', 'table_emoji'): "CREATE TABLE test_db.table_emoji (`id` Int32, `name_ð\x9f\x98\x88` String COMMENT 'ð\x9f\x98\x88') ENGINE = MergeTree ORDER BY id SETTINGS index_granularity = 8192"}>

The issue is that `bytes.decode()` returns a `str` (a Unicode string), not `bytes`. When we decode UTF-8 bytes as latin-1, each byte becomes a separate character.
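A minimal demonstration of that effect, matching the garbled column names in the assertion failure above:

```python
# Decoding UTF-8 bytes as latin-1 maps each byte to its own character,
# producing the mojibake seen in the failed test output above.
emoji_utf8 = "😈".encode("utf-8")        # 4 bytes: f0 9f 98 88
mojibake = emoji_utf8.decode("latin-1")  # 4 chars: 'ð\x9f\x98\x88'
assert mojibake == "\xf0\x9f\x98\x88"
assert len(mojibake) == 4
# The original bytes remain recoverable, so the fallback round-trips
# data losslessly but is not safe for display or text comparison.
assert mojibake.encode("latin-1") == emoji_utf8
```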
