Infinite Provider Push #4863
2 comments · 9 replies
-
I am struggling to understand what the issue is, but I doubt it will involve the significant changes you suggest. Can you please summarize in 3-4 sentences, with reference to specific code and without going into possible solutions, what the issue is?
-
It looks like you are trying to transform the Connector into a Message Queue. My suggestion would be to use a message queue implementation as the data plane; there are many of them out there, with different features. You can see an example of how this can be achieved in the sample repository.
-
Hey everyone,
I'd like to discuss how we might extend the Connector's capabilities to allow transferring infinite data using the Provider-PUSH flow. This discussion defines the concept and will serve as the basis for the eventual implementation.
Problem
According to the DSP specification: "Data may be finite or non-finite. This applies to either push and pull transfers. Finite data is data that is defined by a finite set, for example, machine learning data or images. After finite data transmission has finished, the TP is completed. Non-finite data is data that is defined by an infinite set or has no specified end, for example, streams or an API endpoint. With non-finite data, a TP will continue indefinitely until either the Consumer or Provider explicitly terminates the transmission." Provider-PUSH transfers transition the Transfer Processes and Data Flow to final states when the process is completed, and therefore do not allow infinite data to be exchanged.
Workarounds
Let's look at the existing workarounds for dealing with infinite data by applying them to the following example:
Company A wants to consume data from Company B, with Company B sending a discrete set of data every Monday. This represents a Provider-PUSH transfer (Company B is the Provider and wants to push data to Company A) that deals with infinite data (new data becomes available every Monday, so there is no determined end).
Use the Consumer-PULL flow
The Consumer-PULL flow allows infinite data to be transferred because EDRs remain valid across multiple data transfers, enabling Company A to request data from Company B every Monday. While this approach is suitable for many use cases, some scenarios require the Provider-PUSH flow due to its asynchronous nature, which overcomes limitations of synchronous protocols such as data size and latency constraints. The Consumer-PULL flow is also impractical when there are multiple Providers (e.g., Company B, C, D, etc.), because Company A would need to request data from each Provider individually. With Provider-PUSH, the Providers could push their data without Company A requesting it, simplifying the process.
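To illustrate why this workaround functions, the EDR issued for a Consumer-PULL transfer can simply be replayed each week. A minimal sketch in plain Java, assuming an EDR endpoint and authorization token obtained beforehand (the values are placeholders, and how you extract them depends on your setup):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class WeeklyPull {
    public static void main(String[] args) throws Exception {
        // Placeholders taken from a previously negotiated EDR.
        var endpoint = "https://provider.example.com/public/data";
        var authToken = "<edr-authorization-token>";

        var request = HttpRequest.newBuilder()
                .uri(URI.create(endpoint))
                // The EDR token remains valid across transfers, so the same
                // request can be repeated every Monday.
                .header("Authorization", authToken)
                .GET()
                .build();

        var response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```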
Transfer Data Chunks Individually
To create the illusion that infinite data is actually finite, we may treat each data chunk individually and transfer it using the existing Provider-PUSH mechanism. The issue with this approach is that the Consumer must request each chunk every Monday, creating a new Transfer Process each time. This increases the overall complexity, as a recurrent step must always be performed to ensure the correct data exchange, and it also adds computing costs, as each new Transfer Process must pass through multiple states throughout its lifecycle.
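To make the recurring overhead concrete, this is roughly what Company A would have to run every Monday against its Management API, spawning a brand-new Transfer Process each time. This is a hedged sketch only: the path, the transferType value, and the payload shape vary across EDC versions, and all ids and hosts are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RequestWeeklyChunk {
    public static void main(String[] args) throws Exception {
        var managementApi = "https://consumer.example.com/management/v3/transferprocesses";

        // One TransferRequest per chunk -- a full Transfer Process lifecycle
        // runs on both Connectors for every Monday delivery.
        var body = """
                {
                  "@context": { "@vocab": "https://w3id.org/edc/v0.0.1/ns/" },
                  "@type": "TransferRequest",
                  "counterPartyAddress": "https://provider.example.com/protocol",
                  "contractId": "<contract-agreement-id>",
                  "protocol": "dataspace-protocol-http",
                  "transferType": "HttpData-PUSH",
                  "dataDestination": {
                    "type": "HttpData",
                    "baseUrl": "https://consumer.example.com/sink"
                  }
                }
                """;

        var request = HttpRequest.newBuilder()
                .uri(URI.create(managementApi))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        var response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```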
Invert The Participants Roles
We could also enable infinite data to be transferred by reversing the participants' roles and leveraging the Consumer-PULL capabilities. From the Connector's perspective, let's treat Company A as the Data Provider and Company B as the Data Consumer, despite Company B holding the desired data. Company A creates an asset with a data address pointing to where the data must be stored, for example an API endpoint expecting a body from an HTTP request, and offers it. Company B negotiates it and starts a Consumer-PULL transfer. If Company B adds the desired data to the body of the HTTP request it sends to Company A's dataplane, this data may be proxied to the data address of the asset, and Company A will be able to store it. Although this only works for HTTP data addresses natively, the dataplane may easily be extended to support any desired technology. However, this is an anti-pattern and should not be used: since the Connector applies the usage policies to Company B instead of Company A, data sovereignty is not ensured.
Proposed Approach
Analyzing the workarounds reveals that the main obstacle to transferring infinite data is that the communication channel between the Connectors is closed after the initial data transfer. If Transfer Processes remain active after exchanging data, they may be reused, allowing the Provider to trigger new transfers as new data becomes available. The following image illustrates the expected behavior for pushing infinite data:
Note that closing Transfer Processes can be achieved by the existing terminating mechanism, so no new development is needed.
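For reference, closing such a long-lived Transfer Process would reuse the terminate endpoint that already exists in the Management API. A minimal sketch, assuming the v3 path and a TerminateTransfer payload (both the path and the payload shape may differ across EDC versions; ids and hosts are placeholders):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TerminateTransferExample {
    public static void main(String[] args) throws Exception {
        var transferProcessId = "<transfer-process-id>"; // placeholder
        var url = "https://connector.example.com/management/v3/transferprocesses/"
                + transferProcessId + "/terminate";

        // Payload shape as of recent EDC versions; verify against your version.
        var body = """
                {
                  "@context": { "@vocab": "https://w3id.org/edc/v0.0.1/ns/" },
                  "@type": "TerminateTransfer",
                  "reason": "no further data expected"
                }
                """;

        var request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        var response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode()); // expect a 2xx on success
    }
}
```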
UC1 - Identify infinite Assets
As a Provider, I want to identify my Assets that provide infinite data, so that their transfer doesn't automatically finalize the Transfer Process.
Asset
- Assets can already receive any user-typed String property, so a new property can mark an Asset as providing infinite data
- TransferProcess creation remains unchanged

DataFlowStartMessage
- A new "keepAlive" property, set at DataFlowStartMessage creation in DataPlaneSignalingFlowController
- JsonObjectFromDataFlowStartMessageTransformer and JsonObjectToDataFlowStartMessageTransformer will be updated accordingly

DataFlow
- Whether a DataFlow should be completed / terminated, or kept open after the data transfer, is determined by the DataFlowStartMessage's "keepAlive", copied onto the DataFlow at creation in DataPlaneManagerImpl
- dataplane-schema.sql must be updated to create a new column when using the SQL store

DataFlowStates
- A new AWAITING state for DataFlows that performed a successful data transfer and are awaiting a trigger to start another data transfer
- DataPlaneManagerImpl will be updated so that if the transfer was successful and "keepAlive" is true, the DataFlow transitions to AWAITING (sketched after this list)
- Instead of the data plane completing the TransferProcess via the Control API, it will remain STARTED
- On failure, the DataFlow and TransferProcesses will be terminated, populating the error details and giving visibility to what went wrong
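A deliberately simplified sketch of the UC1 data-plane decision. These are hypothetical stand-in types, not the actual EDC classes: the "keepAlive" flag and the AWAITING state come from the list above, while the names and fields are assumptions for illustration:

```java
// Hypothetical stand-in types -- not the real EDC classes.
enum DataFlowState { RECEIVED, STARTED, COMPLETED, TERMINATED, AWAITING }

class DataFlow {
    DataFlowState state = DataFlowState.STARTED;
    final boolean keepAlive; // copied from the DataFlowStartMessage at creation

    DataFlow(boolean keepAlive) {
        this.keepAlive = keepAlive;
    }
}

class DataPlaneManagerSketch {
    /** Invoked once a single data transfer has finished. */
    void onTransferFinished(DataFlow flow, boolean success) {
        if (!success) {
            // Failure: terminate and populate the error details for visibility.
            flow.state = DataFlowState.TERMINATED;
        } else if (flow.keepAlive) {
            // Infinite asset: park the flow until the Provider triggers the next
            // transfer; the TransferProcess stays STARTED instead of completing.
            flow.state = DataFlowState.AWAITING;
        } else {
            // Finite asset: unchanged behavior, the flow completes as today.
            flow.state = DataFlowState.COMPLETED;
        }
    }
}
```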
UC2 - Trigger a Transfer Process
As a Provider, I want to trigger the transfer of new data to Consumers using an active Transfer Process, so that partners may receive updates from my data source.
TransferProcessApiV3 / TransferProcessApiV3Controller
- A new endpoint to trigger transfer processes, creating a TriggerTransferCommand that receives the transfer process id

TransferProcessService / TransferProcessServiceImpl
- A new method to handle the TriggerTransferCommand and return a ServiceResult<Void>

TriggerTransferCommandHandler
- A new command handler, registered via TransferProcessCommandExtension
- Validations (sketched after this list): the TransferProcess must be of PROVIDER type, of PUSH flow type, and in STARTED state, and the Asset must be infinite
- The TransferProcess is transitioned to RESUMING, which restarts the DataFlow, creating another data transfer, which is the desired logic
- On the data plane, the new start message updates the existing DataFlow, moving it from AWAITING state to RECEIVED
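And an equally hypothetical sketch of the UC2 validations and transition. The four checks and the RESUMING transition come from the list above; the types and names are placeholders, not the real EDC classes:

```java
// Hypothetical stand-in types -- not the real EDC classes.
enum ParticipantType { PROVIDER, CONSUMER }
enum FlowType { PUSH, PULL }
enum TransferProcessState { STARTED, RESUMING, TERMINATED }

class TransferProcess {
    ParticipantType type;
    FlowType flowType;
    TransferProcessState state;
    boolean assetIsInfinite; // e.g. derived from a user-typed Asset property
}

class TriggerTransferCommandHandlerSketch {
    /** Applies the UC2 validations, then transitions to RESUMING. */
    void handle(TransferProcess process) {
        if (process.type != ParticipantType.PROVIDER) {
            throw new IllegalArgumentException("transfer process must be of PROVIDER type");
        }
        if (process.flowType != FlowType.PUSH) {
            throw new IllegalArgumentException("transfer process must be of PUSH flow type");
        }
        if (process.state != TransferProcessState.STARTED) {
            throw new IllegalArgumentException("transfer process must be in STARTED state");
        }
        if (!process.assetIsInfinite) {
            throw new IllegalArgumentException("asset must be infinite");
        }
        // RESUMING re-runs the flow-start logic: the data plane then moves the
        // existing DataFlow from AWAITING back to RECEIVED, starting a new transfer.
        process.state = TransferProcessState.RESUMING;
    }
}
```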
Future Work
I see some very interesting future topics that may derive from this initial approach to infinite data.
Nevertheless, I think it would make sense to provide an initial discussion on this issue, implement a basis for this concept, and then improve it step-by-step. I'm happy to hear your thoughts on this.