Replies: 2 comments
-
Are you thinking that there would be two distinct connectors, or one connector that could execute either type of connection? We do have some common code in another Aiven-sponsored connector library that is intended to take the Kafka records, preprocess them in standard ways, and then deliver them to a backend for writing. Perhaps the same architecture would work here?
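As a very rough illustration of that kind of shared layer (the interface and class names below are hypothetical, not taken from the existing library), the common piece could be something like a preprocessing pipeline that hands prepared records to a backend-specific writer:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

import org.apache.kafka.connect.sink.SinkRecord;

// Hypothetical shared layer: records are preprocessed in a standard way and
// then handed to a backend-specific writer (streaming or batch load).
interface RecordPreprocessor {
    SinkRecord apply(SinkRecord record);
}

interface BackendWriter {
    void write(Collection<SinkRecord> records);
}

final class PreprocessingPipeline {
    private final List<RecordPreprocessor> preprocessors;
    private final BackendWriter writer;

    PreprocessingPipeline(List<RecordPreprocessor> preprocessors, BackendWriter writer) {
        this.preprocessors = preprocessors;
        this.writer = writer;
    }

    // Apply every preprocessor to every record, then let the backend decide
    // how to persist the result (Storage Write API stream, GCS staging file, ...).
    void process(Collection<SinkRecord> records) {
        List<SinkRecord> prepared = new ArrayList<>(records.size());
        for (SinkRecord record : records) {
            SinkRecord current = record;
            for (RecordPreprocessor preprocessor : preprocessors) {
                current = preprocessor.apply(current);
            }
            prepared.add(current);
        }
        writer.write(prepared);
    }
}
```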
-
Hey @Claudenw, thanks for opening this. I assume that by "other connector library" you mean cloud-storage-connectors-for-apache-kafka? I'm not from Aiven, so I don't have a strong opinion on sharing parts between the Aiven-owned codebases. In my opinion, two different connector classes with separate logic would be more appropriate. There is a lot to share in the configuration part at least, so there could be some common parent class.

The main problem I see is that the logic of the batch connector isn't straightforward. In my fork (which I'm trying to get rid of) there are some patches, for example a fix for a bug in the original Confluent connector, which I'm not sure is present in the current version of the connector. But that is only an example of a deeper issue: the process which performs the GCS-to-BigQuery load lives separately from the connector, and a commit happening in the consumer doesn't mean the record was successfully uploaded to BigQuery, only that it was saved to some file in GCS. BigQuery has limits on load jobs, so the thread managing load jobs may fail at any time. In that case, what should the connector do?

I feel there is not enough documentation on edge cases like this, and maybe we could add more configuration options for the desired behaviour. For example, we could allow the load thread to commit offsets instead of the main connector thread. That would be tricky from an implementation perspective, but it sounds like a valid alternative to the current behaviour from a user's perspective.
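One way the "load thread commits offsets" idea could be expressed is through Kafka Connect's SinkTask.preCommit hook, which lets the task report only the offsets that are actually safe to commit. A minimal sketch, assuming a hypothetical LoadJobTracker that the GCS-to-BigQuery load thread updates as jobs finish:

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

// Sketch only: LoadJobTracker is a hypothetical component that the background
// GCS -> BigQuery load thread updates whenever a load job completes.
public abstract class BatchLoadSinkTask extends SinkTask {

    public interface LoadJobTracker {
        // Highest offset per partition that is confirmed loaded into BigQuery.
        Optional<Long> lastLoadedOffset(TopicPartition partition);
    }

    protected LoadJobTracker loadJobTracker;

    @Override
    public void put(Collection<SinkRecord> records) {
        // Buffer records into GCS staging files; loading happens asynchronously.
    }

    // Report only offsets whose records are confirmed loaded into BigQuery,
    // not merely written to a staging file in GCS.
    @Override
    public Map<TopicPartition, OffsetAndMetadata> preCommit(
            Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
        Map<TopicPartition, OffsetAndMetadata> safeToCommit = new HashMap<>();
        for (TopicPartition partition : currentOffsets.keySet()) {
            loadJobTracker.lastLoadedOffset(partition)
                    .ifPresent(offset -> safeToCommit.put(partition, new OffsetAndMetadata(offset + 1)));
        }
        return safeToCommit;
    }
}
```

With something along these lines, the framework still drives the commit cycle, but committed offsets only advance once the corresponding load jobs have succeeded, which would address the "saved to GCS but not yet in BigQuery" gap.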
-
Pulled here from #73
@SamoylovMD I thought we could hash out how to do this here, rather than on the issue ticket.
The proposal is to split the current monolithic sink connector into separate artifacts: one for streaming via the Storage Write API, and another for batch loading through GCS load jobs. That would make the code much easier to understand and support.
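For illustration only (all class and option names below are hypothetical, not taken from the current codebase), the split could keep a shared configuration parent while separating the streaming and batch connectors completely:

```java
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.sink.SinkConnector;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

// Shared configuration (project, dataset, credentials, ...) lives in a common parent.
abstract class BaseBigQuerySinkConnector extends SinkConnector {
    protected Map<String, String> configProps;

    static ConfigDef baseConfigDef() {
        return new ConfigDef()
                .define("project", ConfigDef.Type.STRING, ConfigDef.Importance.HIGH, "GCP project id")
                .define("defaultDataset", ConfigDef.Type.STRING, ConfigDef.Importance.HIGH, "Target dataset");
    }

    @Override public void start(Map<String, String> props) { this.configProps = props; }
    @Override public void stop() { }
    @Override public String version() { return "0.0.0-sketch"; }
    @Override public List<Map<String, String>> taskConfigs(int maxTasks) {
        return Collections.nCopies(maxTasks, configProps);
    }
}

// Streaming artifact: rows go directly through the Storage Write API.
class BigQueryStreamingSinkConnector extends BaseBigQuerySinkConnector {
    @Override public Class<? extends Task> taskClass() { return StreamingSinkTask.class; }
    @Override public ConfigDef config() { return baseConfigDef(); } // plus streaming-only options
}

// Batch artifact: records are staged in GCS and moved with load jobs.
class BigQueryBatchSinkConnector extends BaseBigQuerySinkConnector {
    @Override public Class<? extends Task> taskClass() { return BatchSinkTask.class; }
    @Override public ConfigDef config() {
        return baseConfigDef().define("gcsBucketName", ConfigDef.Type.STRING,
                ConfigDef.Importance.HIGH, "GCS bucket used to stage load-job files");
    }
}

// Empty placeholder tasks so the sketch compiles; the real task logic would
// be entirely separate between the two connectors.
abstract class NoOpSinkTask extends SinkTask {
    @Override public String version() { return "0.0.0-sketch"; }
    @Override public void start(Map<String, String> props) { }
    @Override public void put(Collection<SinkRecord> records) { }
    @Override public void stop() { }
}

class StreamingSinkTask extends NoOpSinkTask { }

class BatchSinkTask extends NoOpSinkTask { }
```

Packaging the two connector classes as separate artifacts would let each evolve independently while the shared configuration and any common record preprocessing stay in one place.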