Skip to content

Partial recovery in FLIP-27 source #663

@crazyzhou

Description

@crazyzhou

Problem description
In the new FLIP-27 source, we still only support a global recovery that relies on the Flink's externally induced source reader API(added in FLINK-20270) + Pravega checkpoint. This cannot make the story complete for a local/region/subgraph recovery. This would have a greater impact and takes a longer time for a job recovery, which is a concern in the production usage.
To support a local recovery, Flink would like the connector to have a "split" abstraction to have its own state(like file/kafka partition+offset) and Flink checkpoint is consist of the list of all these states. In a taskmanager failure case, it will perform a local recovery, that restart a new reader and recover all the "splits" that assigned by the faulty reader to the state(position) stored in the last checkpoint, and also recover the affected downstream operator states.
Although Pravega has a story to involve a Position for local recovery, but due to the async nature and segments may auto-scale, a group of positions in one reader group cannot make up a complete streamcut for checkpoint, and recovering with the newly-spawned reader may get data loss as the related operators rolls its state back to the last checkpoint, meanwhile the other online readers take over the segment and continue reading.
In general, it is a problem that Pravega has two different stories for global and local recovery which cannot live together to fit the Flink interface.

Problem location

Suggestions for an improvement

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions