Replies: 2 comments
-
Are you thinking that there would be two distinct connectors, or one connector that could execute either type of connection? We do have some common code in another Aiven-sponsored connector library that is intended to take the Kafka records, preprocess them in standard ways, and then deliver them to a backend for writing. Perhaps the same architecture would work here?
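As a very rough illustration of that kind of shared layer (the interface and class names below are hypothetical, not taken from the existing library), the common piece could be something like a preprocessing pipeline that hands prepared records to a backend-specific writer:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

import org.apache.kafka.connect.sink.SinkRecord;

// Hypothetical shared layer: records are preprocessed in a standard way and
// then handed to a backend-specific writer (streaming or batch load).
interface RecordPreprocessor {
    SinkRecord apply(SinkRecord record);
}

interface BackendWriter {
    void write(Collection<SinkRecord> records);
}

final class PreprocessingPipeline {
    private final List<RecordPreprocessor> preprocessors;
    private final BackendWriter writer;

    PreprocessingPipeline(List<RecordPreprocessor> preprocessors, BackendWriter writer) {
        this.preprocessors = preprocessors;
        this.writer = writer;
    }

    // Apply every preprocessor to every record, then let the backend decide
    // how to persist the result (Storage Write API stream, GCS staging file, ...).
    void process(Collection<SinkRecord> records) {
        List<SinkRecord> prepared = new ArrayList<>(records.size());
        for (SinkRecord record : records) {
            SinkRecord current = record;
            for (RecordPreprocessor preprocessor : preprocessors) {
                current = preprocessor.apply(current);
            }
            prepared.add(current);
        }
        writer.write(prepared);
    }
}
```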
-
Hey @Claudenw, thanks for opening this. I assume that by "other connector library" you mean cloud-storage-connectors-for-apache-kafka? I'm not from Aiven, so I don't have a strong opinion on sharing parts between the Aiven-owned codebases. In my opinion, two different connector classes with separate logic would be more appropriate. There is a lot to share in the configuration part at least, so there could be some common parent class.

The main problem I see is that the logic of the batch connector isn't straightforward. In my fork (which I'm trying to get rid of) there are some patches, for example a fix for a bug in the original Confluent connector, which I'm not sure is present in the current version of the connector. But that is only an example of a deeper issue: the process which performs the GCS-to-BigQuery load lives separately from the connector, and a commit happening in the consumer doesn't mean the record was successfully uploaded to BigQuery, only that it was saved to some file in GCS. BigQuery has limits on load jobs, so the thread managing load jobs may fail at any time. In that case, what should the connector do?

I feel there is not enough documentation on edge cases like this, and maybe we could add more configuration options for the desired behaviour. For example, we could allow the load thread to commit offsets instead of the main connector thread. That would be tricky from an implementation perspective, but it sounds like a valid alternative to the current behaviour from a user's perspective.
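One way the "load thread commits offsets" idea could be expressed is through Kafka Connect's SinkTask.preCommit hook, which lets the task report only the offsets that are actually safe to commit. A minimal sketch, assuming a hypothetical LoadJobTracker that the GCS-to-BigQuery load thread updates as jobs finish:

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

// Sketch only: LoadJobTracker is a hypothetical component that the background
// GCS -> BigQuery load thread updates whenever a load job completes.
public abstract class BatchLoadSinkTask extends SinkTask {

    public interface LoadJobTracker {
        // Highest offset per partition that is confirmed loaded into BigQuery.
        Optional<Long> lastLoadedOffset(TopicPartition partition);
    }

    protected LoadJobTracker loadJobTracker;

    @Override
    public void put(Collection<SinkRecord> records) {
        // Buffer records into GCS staging files; loading happens asynchronously.
    }

    // Report only offsets whose records are confirmed loaded into BigQuery,
    // not merely written to a staging file in GCS.
    @Override
    public Map<TopicPartition, OffsetAndMetadata> preCommit(
            Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
        Map<TopicPartition, OffsetAndMetadata> safeToCommit = new HashMap<>();
        for (TopicPartition partition : currentOffsets.keySet()) {
            loadJobTracker.lastLoadedOffset(partition)
                    .ifPresent(offset -> safeToCommit.put(partition, new OffsetAndMetadata(offset + 1)));
        }
        return safeToCommit;
    }
}
```

With something along these lines, the framework still drives the commit cycle, but committed offsets only advance once the corresponding load jobs have succeeded, which would address the "saved to GCS but not yet in BigQuery" gap.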
-
Pulled here from #73
@SamoylovMD I thought we could hash out how to do this here, rather than on the issue ticket.
The proposal is to split the current monolithic sink connector into separate artifacts: one for streaming via the Storage Write API, and another for batch loading through GCS load jobs. That would make the code much easier to understand and support.
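For illustration only (all class and option names below are hypothetical, not taken from the current codebase), the split could keep a shared configuration parent while separating the streaming and batch connectors completely:

```java
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.sink.SinkConnector;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

// Shared configuration (project, dataset, credentials, ...) lives in a common parent.
abstract class BaseBigQuerySinkConnector extends SinkConnector {
    protected Map<String, String> configProps;

    static ConfigDef baseConfigDef() {
        return new ConfigDef()
                .define("project", ConfigDef.Type.STRING, ConfigDef.Importance.HIGH, "GCP project id")
                .define("defaultDataset", ConfigDef.Type.STRING, ConfigDef.Importance.HIGH, "Target dataset");
    }

    @Override public void start(Map<String, String> props) { this.configProps = props; }
    @Override public void stop() { }
    @Override public String version() { return "0.0.0-sketch"; }
    @Override public List<Map<String, String>> taskConfigs(int maxTasks) {
        return Collections.nCopies(maxTasks, configProps);
    }
}

// Streaming artifact: rows go directly through the Storage Write API.
class BigQueryStreamingSinkConnector extends BaseBigQuerySinkConnector {
    @Override public Class<? extends Task> taskClass() { return StreamingSinkTask.class; }
    @Override public ConfigDef config() { return baseConfigDef(); } // plus streaming-only options
}

// Batch artifact: records are staged in GCS and moved with load jobs.
class BigQueryBatchSinkConnector extends BaseBigQuerySinkConnector {
    @Override public Class<? extends Task> taskClass() { return BatchSinkTask.class; }
    @Override public ConfigDef config() {
        return baseConfigDef().define("gcsBucketName", ConfigDef.Type.STRING,
                ConfigDef.Importance.HIGH, "GCS bucket used to stage load-job files");
    }
}

// Empty placeholder tasks so the sketch compiles; the real task logic would
// be entirely separate between the two connectors.
abstract class NoOpSinkTask extends SinkTask {
    @Override public String version() { return "0.0.0-sketch"; }
    @Override public void start(Map<String, String> props) { }
    @Override public void put(Collection<SinkRecord> records) { }
    @Override public void stop() { }
}

class StreamingSinkTask extends NoOpSinkTask { }

class BatchSinkTask extends NoOpSinkTask { }
```

Packaging the two connector classes as separate artifacts would let each evolve independently while the shared configuration and any common record preprocessing stay in one place.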