<p>This program gives us a bit of a flavor for what a timely dataflow program might look like, including a bit of what Rust looks like, without getting too bogged down in weird stream processing details. Not to worry; we will do that in just a moment!</p>
<p>Why would we want to make our life so complicated? The main reason is that we can make our program <em>reactive</em>, so that we can run it without knowing ahead of time the data we will use, and it will respond as we produce new data.</p>
<p>Timely dataflow may be a different programming model than you are used to, but if you can adapt your program to it there are several benefits.</p>
<ul>
<li>
<p><strong>Data Parallelism</strong>: The operators in timely dataflow are largely "data-parallel", meaning they can operate on independent parts of the data concurrently. This allows the underlying system to distribute timely dataflow computations across multiple parallel workers. These can be threads on your computer, or even threads across computers in a cluster you have access to. This distribution typically improves the throughput of the system, and lets you scale to larger problems with access to more resources (computation, communication, and memory).</p>
</li>
<li>
<p><strong>Streaming Data</strong>: The core data type in timely dataflow is a <em>stream</em> of data, an unbounded collection of data not all of which is available right now, but which instead arrives as the computation proceeds. Streams are a helpful generalization of static data sets, which are assumed available at the start of the computation. By expressing your program as a computation on streams, you've explained both how it should respond to static input data sets (feed all the data in at once) but also how it should react to new data that might arrive later on.</p>
</li>
</ul>
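<p>As a rough illustration of the data-parallel idea (a sketch using plain OS threads rather than timely dataflow's own API; the function name here is ours), each worker processes its own shard of the input independently, and the results are combined at the end:</p>

```rust
use std::thread;

// Data-parallel sketch: each spawned worker independently reduces its own
// shard of the slice, and the per-shard results are summed at the end.
fn parallel_sum_of_squares(data: &[u64], workers: usize) -> u64 {
    let chunk = (data.len() + workers - 1) / workers;
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk)
            .map(|part| s.spawn(move || part.iter().map(|x| x * x).sum::<u64>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let data: Vec<u64> = (0..10).collect();
    println!("{}", parallel_sum_of_squares(&data, 4)); // 285
}
```

<p>With more workers (and more cores) the shards shrink, which is the same property timely dataflow exploits when it spreads operators across threads and machines.</p>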
<p>Is timely dataflow always applicable? The intent of this research project is to remove layers of abstraction fat that prevent you from expressing anything your computer can do efficiently in parallel.</p>
<p>Under the covers, your computer (the one on which you are reading this text) is a dataflow processor. When your computer <em>reads memory</em> it doesn't actually wander off to find the memory, it introduces a read request into your memory controller, an independent component that will eventually return with the associated cache line. Your computer then gets back to work on whatever it was doing, hoping the responses from the controller return in a timely fashion.</p>
<p>Academically, I treat "my computer can do this, but timely dataflow cannot" as a bug. There are degrees, of course, and timely dataflow isn't on par with the processor's custom hardware designed to handle low level requests efficiently, but <em>algorithmically</em>, the goal is that anything you can do efficiently with a computer you should be able to express in timely dataflow.</p>
<p>One could re-imagine the sorting process as moving data around, and indeed this is what happens when large clusters need to be brought to bear on such a task, but that doesn't help you at all if what you needed was to sort your single allocation. A library like <a href="https://github.com/nikomatsakis/rayon">Rayon</a> would almost surely be better suited to the task.</p>
<hr/>
<p>Dataflow systems are also fundamentally about breaking apart the execution of your program into independently operating parts. However, many programs are correct only because some things happen <em>before</em> or <em>after</em> other things. A classic example is <a href="https://en.wikipedia.org/wiki/Depth-first_search">depth-first search</a> in a graph: although there is lots of work to do on small bits of data, it is crucial that the exploration of nodes reachable along a graph edge complete before the exploration of nodes reachable along the next graph edge.</p>
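<p>To make the ordering constraint concrete, here is a minimal depth-first search in plain Rust (our own sketch, unrelated to timely's API): the recursive call for one edge must run to completion before the next edge is examined, or the visitation order changes.</p>

```rust
// Depth-first search over an adjacency list. The recursion enforces
// sequential order: exploration along one edge must fully finish
// before the next edge of the same node is even considered.
fn dfs(node: usize, adj: &Vec<Vec<usize>>, visited: &mut Vec<bool>, order: &mut Vec<usize>) {
    visited[node] = true;
    order.push(node);
    for &next in &adj[node] {
        if !visited[next] {
            // This call must complete before we look at the next edge.
            dfs(next, adj, visited, order);
        }
    }
}

fn main() {
    // 0 -> {1, 2}, 1 -> {2}: node 2 is discovered via 1, not directly from 0.
    let adj = vec![vec![1, 2], vec![2], vec![]];
    let mut visited = vec![false; 3];
    let mut order = Vec::new();
    dfs(0, &adj, &mut visited, &mut order);
    println!("{:?}", order); // [0, 1, 2]
}
```

<p>The "work" inside each call is tiny, yet none of it can be reordered or overlapped without changing the answer, which is exactly what makes such algorithms awkward as dataflow.</p>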
<p>Although there is plenty of active research on transforming algorithms from sequential to parallel, if you aren't clear on how to express your program as a dataflow program then timely dataflow may not be a great fit. At the very least, the first step would be "fundamentally re-imagine your program", which can be a fine thing to do, but is perhaps not something you would have to do with your traditional program.</p>
<hr/>
<p>Timely dataflow is in a bit of a weird space between language library and runtime system. This means that it doesn't quite have the stability guarantees a library might have (when you call <code>data.sort()</code> you don't think about "what if it fails?"), nor does it have the surrounding infrastructure of a <a href="https://www.microsoft.com/en-us/research/project/dryadlinq/">DryadLINQ</a> or <a href="https://spark.apache.org">Spark</a> style of experience. Part of this burden is simply passed to you, and this may be intolerable depending on your goals for your program.</p>
<p>The most important part of dataflow programming is the <em>independence</em> of the components. When you write a dataflow program, you provide the computer with flexibility in how it executes your program. Rather than insisting on a specific sequence of instructions the computer should follow, the computer can work on each of the components as it sees fit, perhaps even sharing the work with other computers.</p>
<p>While we want to enjoy the benefits of dataflow programming, we still need to understand whether and how our computation progresses. In traditional imperative programming we could reason that because instructions happen in some order, then once we reach a certain point all work (of a certain type) must be done. Instead, we will tag the data that move through our dataflow with <em>timestamps</em>, indicating (roughly) when they would have happened in a sequential execution.</p>
<p>Timestamps play at least two roles in timely dataflow: they allow dataflow components to make sense of the otherwise unordered inputs they see ("ah, I received the data in <em>this</em> order, but I should behave as if it arrived in <em>this</em> order"), and they allow the user (and others) to reason about whether they have seen all of the data with a certain timestamp.</p>
<p>Timestamps allow us to introduce sequential structure into our program, without requiring actual sequential execution.</p>
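<p>A minimal sketch of the first role (our own illustration, not timely's API): a component that receives timestamped data out of order can buffer it, and then behave as if the data had arrived in timestamp order.</p>

```rust
use std::collections::BTreeMap;

fn main() {
    // (timestamp, datum) pairs in arrival order: time 2 shows up first.
    let arrivals = vec![(2, "c"), (0, "a"), (1, "b")];

    // A BTreeMap keeps its keys sorted, so draining it replays the data
    // in timestamp order regardless of arrival order.
    let mut buffer: BTreeMap<u64, Vec<&str>> = BTreeMap::new();
    for (time, datum) in arrivals {
        buffer.entry(time).or_default().push(datum);
    }

    let replay: Vec<&str> = buffer.into_values().flatten().collect();
    println!("{:?}", replay); // ["a", "b", "c"]
}
```

<p>Of course, this buffering is only safe to drain once we know no earlier timestamps remain in flight, which is precisely the progress information discussed next.</p>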
<p>In a traditional imperative program, if we want to return the maximum of a set of numbers, we just scan all the numbers and return the maximum. We don't have to worry about whether we've considered <em>all</em> of the numbers yet, because the program makes sure not to provide an answer until it has consulted each number.</p>
<p>This simple task is much harder in a dataflow setting, where numbers arrive as input to a component that is tracking the maximum. Before releasing a number as output, the component must know if it has seen everything, as one more value could change its answer. But strictly speaking, nothing we've said so far about dataflow or timestamps provides any information about whether more data might arrive.</p>
<p>If we combine dataflow program structure with timestamped data in such a way that as data move along the dataflow their timestamps only increase, we are able to reason about the <em>progress</em> of our computation. More specifically, at any component in the dataflow, we can reason about which timestamps we may yet see in the future. Timestamps that are no longer possible are considered "passed", and components can react to this information as they see fit.</p>
<p>Continual information about the progress of a computation is the only basis of coordination in timely dataflow, and is the lightest touch we could think of.</p>
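<p>To sketch how progress information resolves the maximum problem above (again our own illustration, not timely's actual API), a component can hold back the maximum for each timestamp until it learns the frontier has moved past that timestamp, so no further updates to it are possible:</p>

```rust
use std::collections::BTreeMap;

// Tracks a running maximum per timestamp, but only releases results for
// timestamps the frontier has passed, since those can no longer change.
struct MaxPerTime {
    pending: BTreeMap<u64, i64>,
}

impl MaxPerTime {
    fn new() -> Self {
        MaxPerTime { pending: BTreeMap::new() }
    }

    // Record an input tagged with a timestamp.
    fn give(&mut self, time: u64, value: i64) {
        let slot = self.pending.entry(time).or_insert(i64::MIN);
        *slot = (*slot).max(value);
    }

    // The system announces that all timestamps less than `frontier` are
    // complete; it is now safe to emit maxima for those timestamps.
    fn advance_to(&mut self, frontier: u64) -> Vec<(u64, i64)> {
        let later = self.pending.split_off(&frontier);
        let done = std::mem::replace(&mut self.pending, later);
        done.into_iter().collect()
    }
}

fn main() {
    let mut max = MaxPerTime::new();
    max.give(0, 3);
    max.give(1, 7);
    max.give(0, 5); // still allowed: the frontier has not passed time 0 yet.
    println!("{:?}", max.advance_to(1)); // [(0, 5)]
    println!("{:?}", max.advance_to(2)); // [(1, 7)]
}
```

<p>The component never blocks waiting for input; it simply reacts to data and to progress announcements as they arrive, which is the light-touch coordination described above.</p>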
        if *x > 1 && (2 .. limit + 1).all(|i| x % i > 0) {
            println!("{} is prime", x);
        }
    })</code></pre>
<p>We don't really care that much about the order (we just want the results), and we have written such a simple primality test that we are going to be thrilled if we can distribute the work across multiple cores.</p>
<p>This is also a fine time to point out that dataflow programming is not religion. There is an important part of our program up above that is imperative:</p>
<pre><code class="language-ignore rust">    let limit = (*x as f64).sqrt() as u64;
    if *x > 1 && (2 .. limit + 1).all(|i| x % i > 0) {
        println!("{} is prime", x);
    }
</code></pre>
<p>This is an imperative fragment telling the <code>inspect</code> operator what to do. We <em>could</em> write this as a dataflow fragment if we wanted, but it is frustrating to do so, and less efficient. The control flow fragment lets us do something important, something that dataflow is bad at: the <code>all</code> method above <em>stops</em> as soon as it sees a factor of <code>x</code>.</p>
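<p>You can observe the short-circuiting directly by counting how many candidate divisors actually get tested (a small standalone sketch of the same fragment, with an added counter):</p>

```rust
fn main() {
    let x: u64 = 1_000_000;
    let limit = (x as f64).sqrt() as u64; // 1000
    let mut checked = 0;
    let is_prime = x > 1 && (2..limit + 1).all(|i| {
        checked += 1;
        x % i > 0
    });
    // `all` returns false at the first failing divisor (2), so only one
    // of the 999 candidate divisors is ever tested for this even input.
    println!("prime: {}, divisors tested: {}", is_prime, checked);
}
```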