-
Notifications
You must be signed in to change notification settings - Fork 66
Description
Problem description
In the new FLIP-27 source, we still only support a global recovery that relies on the Flink's externally induced source reader API(added in FLINK-20270) + Pravega checkpoint. This cannot make the story complete for a local/region/subgraph recovery. This would have a greater impact and takes a longer time for a job recovery, which is a concern in the production usage.
To support a local recovery, Flink would like the connector to have a "split" abstraction to have its own state(like file/kafka partition+offset) and Flink checkpoint is consist of the list of all these states. In a taskmanager failure case, it will perform a local recovery, that restart a new reader and recover all the "splits" that assigned by the faulty reader to the state(position) stored in the last checkpoint, and also recover the affected downstream operator states.
Although Pravega has a story to involve a Position for local recovery, but due to the async nature and segments may auto-scale, a group of positions in one reader group cannot make up a complete streamcut for checkpoint, and recovering with the newly-spawned reader may get data loss as the related operators rolls its state back to the last checkpoint, meanwhile the other online readers take over the segment and continue reading.
In general, it is a problem that Pravega has two different stories for global and local recovery which cannot live together to fit the Flink interface.
Problem location
Suggestions for an improvement