Partial recovery in FLIP-27 source

**Problem description**
In the new FLIP-27 source, we still only support a global recovery that relies on the Flink's externally induced source reader API(added in [FLINK-20270](https://issues.apache.org/jira/browse/FLINK-20270)) + Pravega checkpoint. This cannot make the story complete for a local/region/subgraph recovery. This would have a greater impact and takes a longer time for a job recovery, which is a concern in the production usage.
To support a local recovery, Flink would like the connector to have a "split" abstraction to have its own state(like file/kafka partition+offset) and Flink checkpoint is consist of the list of all these states. In a taskmanager failure case, it will perform a local recovery, that restart a new reader and recover all the "splits" that assigned by the faulty reader to the state(position) stored in the last checkpoint, and also recover the affected downstream operator states.
Although Pravega has a story to involve a `Position` for local recovery, but due to the async nature and segments may auto-scale, a group of positions in one reader group cannot make up a complete streamcut for checkpoint, and recovering with the newly-spawned reader may get data loss as the related operators rolls its state back to the last checkpoint, meanwhile the other online readers take over the segment and continue reading.
In general, it is a problem that Pravega has two different stories for global and local recovery which cannot live together to fit the Flink interface.

**Problem location**


**Suggestions for an improvement**


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Partial recovery in FLIP-27 source #663

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Partial recovery in FLIP-27 source #663

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions