RDF Connect is a modular framework for building and executing multilingual data processing pipelines using RDF as the configuration and orchestration layer.
It enables fine-grained, reusable processor components that exchange streaming data, allowing workflows to be described declaratively across programming languages and environments.
RDF Connect is especially well suited for data transformation, integration, and linked data publication.
# Usage Paths
<div class=note>
This section is complete in terms of content, but may be reorganized or rewritten for clarity during editorial review.
</div>
Depending on your use case, you may only need a subset of this specification:
* **Pipeline Authors**: Start with [[#getting-started]] and focus on [[#pipeline]].
* **Processor Developers**: Read [[#processor]] and [[#runner]].
* **Platform Maintainers**: Read all sections, including implementation notes.
# Design Goals
<div class=note>
This section is complete in terms of content, but may be reorganized or rewritten for clarity during editorial review.
</div>
* **Language Agnosticism**: Integrate processor components written in different languages (e.g., JavaScript, Python, shell).
* **Streaming by Default**: Stream data between processors, via an orchestrator, to support large-scale and real-time workloads.
* **Semantic Configuration**: Use RDF to define processors, pipelines, inputs, and outputs.
Processors can be written in any language and executed with a runner.
A runner is an execution strategy for processors; for example, a processor written in JavaScript is a class that is started by a `NodeRunner`.
## Orchestrator
The orchestrator is the core component responsible for executing a pipeline.
67
+
It reads the configuration, initializes runners, dispatches processor instantiations, and coordinates data flow between them.
68
+
It acts as the runtime conductor that interprets RDF Connect’s declarative configuration.
69
+
46
70
## Reader / Writer
Readers and Writers are components that define how data is streamed into and out of a processor.
These provide an idiomatic way to transport streaming data between processors.
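As a rough illustration of the reader/writer pattern, the sketch below models a processor that consumes messages from a reader and emits transformed messages through a writer. The `Reader` and `Writer` types and the in-memory channel are simplified assumptions for this example, not the actual RDF Connect interfaces:

```typescript
// Hypothetical Reader/Writer shapes; the real RDF Connect interfaces
// differ per language and are provided by the runner.
type Reader = AsyncIterable<string>;

interface Writer {
  push(msg: string): Promise<void>;
  close(): Promise<void>;
}

// A toy processor: uppercase every incoming message and forward it.
async function uppercaseProcessor(reader: Reader, writer: Writer): Promise<void> {
  for await (const msg of reader) {
    await writer.push(msg.toUpperCase());
  }
  await writer.close();
}

// In-memory stand-ins for a channel, used only for this example.
async function* fromArray(items: string[]): AsyncIterable<string> {
  yield* items;
}

class CollectingWriter implements Writer {
  messages: string[] = [];
  closed = false;
  async push(msg: string): Promise<void> { this.messages.push(msg); }
  async close(): Promise<void> { this.closed = true; }
}
```

In a real pipeline, the runner wires these endpoints to the orchestrator's channels, so the processor body stays independent of the transport.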
# Getting Started
<div class=issue>
79
+
🚧 This section is a work in progress and will be expanded soon.
80
+
</div>
This section provides a high-level overview of how to define and run a pipeline in RDF Connect. The rest of the specification provides detail on how each part works.
Here's a simple example:
```turtle
ex:pipeline a rdfc:Pipeline ;
rdfc:instantiates ex:myRunner ;
rdfc:processor ex:myProcessor .
```
Once a pipeline is fully described using RDF, it is handed off to the orchestrator.
The orchestrator parses the configuration, resolves all runner and processor definitions, and initiates execution.
# SHACL as Configuration Schema
RDF Connect uses SHACL [[shacl]] not only as a data validation mechanism but also as a schema language for defining the configuration interface of components such as processors and runners.
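For illustration, a processor's configuration interface could be described with a shape along these lines; the processor, its `outputPath` parameter, and the `ex:` namespace are hypothetical, and only the SHACL and XSD terms are standard:

```turtle
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/> .

# Hypothetical shape: a processor that takes exactly one string-valued output path.
ex:MyProcessorShape a sh:NodeShape ;
  sh:targetClass ex:MyProcessor ;
  sh:property [
    sh:path ex:outputPath ;
    sh:name "outputPath" ;
    sh:datatype xsd:string ;
    sh:minCount 1 ;
    sh:maxCount 1 ;
  ] .
```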
# RDF Connect by Layer
## Orchestrator
The orchestrator is the central runtime entity in RDF Connect.
It reads the pipeline configuration, sets up the runners, initiates processors, and routes messages between them.
It ensures the dataflow graph described by the pipeline is brought to life across isolated runtimes.
The orchestrator acts as a coordinator, not an executor. Each runner is responsible for running the actual processor code, but the orchestrator ensures the pipeline as a whole behaves as intended.
Responsibilities:
* Parse the pipeline RDF.
* Load SHACL shapes for processors and runners.
* Validate and coerce configuration to structured JSON.
* Instantiate runners.
* Start processors.
* Mediate messages (data and control).
* Handle retries, streaming, and backpressure.
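As a hedged sketch of the "validate and coerce" step, a processor configuration such as the following (all `ex:` names are hypothetical):

```turtle
ex:myProcessor a ex:MyProcessor ;
  ex:outputPath "out.ttl" .
```

could, after validation against the processor's SHACL shape, be handed to the processor as a plain structured object:

```json
{ "outputPath": "out.ttl" }
```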
<div class=note>
The remainder of this section is intended for developers building custom runners or integrating RDF Connect into infrastructure.
</div>
### Protobuf Messaging Protocol
Communication between the orchestrator and the runners happens using a strongly typed protocol defined in Protocol Buffers (protobuf).
This enables language-independent and efficient communication across processes and machines.
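The exact message definitions live in the RDF Connect repositories; the fragment below is only an indicative sketch of what such a contract can look like, and the message name, field names, and field numbers are assumptions:

```protobuf
syntax = "proto3";

// Illustrative only: a data message routed between the orchestrator and a runner.
message DataMessage {
  string channel = 1;  // the reader/writer channel this payload belongs to
  bytes payload = 2;   // opaque message body
}
```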
### Startup Flow
The following diagram describes the startup sequence handled by the orchestrator. This includes validating pipeline structure, instantiating runners, and initializing processor instances.
<pre class=include>
path: ./startup.mdd
</pre>
### Message Handling Flow
Once processors are running, the orchestrator handles incoming messages and forwards them to the appropriate reader instances, based on their declared channels.
<pre class=include>
path: ./message.mdd
</pre>
### Streaming Messages
For large messages or real-time input, RDF Connect supports a streaming model.
Instead of sending an entire payload as a single message, it can be broken into chunks that are sent over time.
This is handled by the `StreamChunk` message type.
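To make the chunking model concrete, here is a minimal sketch in TypeScript; the `StreamChunk` field names used here (`sequence`, `data`, `last`) are assumptions for illustration, not the protocol's actual fields:

```typescript
// Illustrative StreamChunk shape — field names are assumptions, not the real protocol.
interface StreamChunk {
  channel: string;    // which channel this chunk belongs to
  sequence: number;   // ordering of chunks within one logical message
  data: Uint8Array;   // the chunk payload
  last: boolean;      // marks the final chunk of the message
}

// Split one payload into ordered chunks of at most chunkSize bytes.
function toChunks(channel: string, payload: Uint8Array, chunkSize: number): StreamChunk[] {
  const chunks: StreamChunk[] = [];
  for (let offset = 0, seq = 0; offset < payload.length; offset += chunkSize, seq++) {
    chunks.push({
      channel,
      sequence: seq,
      data: payload.slice(offset, offset + chunkSize),
      last: offset + chunkSize >= payload.length,
    });
  }
  return chunks;
}

// Receiver side: restore the original payload from (possibly reordered) chunks.
function reassemble(chunks: StreamChunk[]): Uint8Array {
  const sorted = [...chunks].sort((a, b) => a.sequence - b.sequence);
  const total = sorted.reduce((n, c) => n + c.data.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const c of sorted) {
    out.set(c.data, offset);
    offset += c.data.length;
  }
  return out;
}
```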
<pre class=include>
path: ./streamMessage.mdd
</pre>
## Runner
Pipelines connect processors using runners to form a processing graph.
This is the main unit most users interact with when defining workflows.
# Ontology Reference
The RDF Connect ontology provides the terms used in RDF pipeline definitions. See the full [RDF Connect Ontology](https://w3id.org/rdf-connect/ontology.ttl) for details.
# Putting It All Together: Example Flow and Use Case
<div class=issue>
390
+
🚧 This section is a work in progress and will be expanded soon.
391
+
</div>
<script type=module>
import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@11/dist/mermaid.esm.min.mjs';