Add support for streaming stdout/stderr from Child invocations #75
brianmario wants to merge 11 commits into master from …
Conversation
```ruby
end

# Exception raised when output streaming is aborted early.
class Aborted < StandardError
```

This could maybe be `CallerAborted` or something more specific?
lib/posix/spawn/child.rb
Outdated
```ruby
      @err << chunk
    end
  end
end
```

This whole block is pretty gross, but the alternative may involve being tricky (and less readable?) with Ruby.
This might be the "tricky" Ruby you're talking about, but it seems to me that when streaming is not requested, you could set `@stdout_block` anyway, to

```ruby
Proc.new do |chunk|
  @out << chunk
  false
end
```

(like you do for the tests) and do the equivalent for `@stderr_block`. Then you could avoid the inner conditionals here.

To shrink the code even further, you could set

```ruby
@blocks = { stdout => @stdout_block, stderr => @stderr_block }
```

(in which case you wouldn't even need `@stdout_block` and `@stderr_block` anymore, but you get the idea), then this whole processing code could become

```ruby
if @blocks[fd].call(chunk)
  raise Aborted
end
```
lib/posix/spawn/child.rb
Outdated
```ruby
end

if @streaming && abort
  raise Aborted
```

I don't love raising here, but it enforces proper cleanup (and killing the subprocess) upon

posix-spawn/lib/posix/spawn/child.rb, line 168 in 0c02f33
mhagger left a comment

I added some comments. It would be great to have docs, too.
lib/posix/spawn/child.rb
Outdated
```ruby
  end
end

if @streaming && abort
```

I think `@streaming &&` is redundant here, since when `@streaming` is not set, `abort` retains its initial value, `false`.
```ruby
    })
  end
end
```

There are no tests that involve reading more than one BUFSIZE worth of output, or reading from both stdout and stderr. Those might be worthwhile additions.

We should also add a test for passing in a minimal custom object, just to ensure the interface contract is maintained. May be a good time to use a spy.

No longer applicable as we are using Procs now.

This requires that the stdout and stderr stream objects passed respond to `#write` and `#string` methods.
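For illustration, a minimal custom stream satisfying that duck-typed contract might look like this (hypothetical class, not part of the gem; `#write` follows the `IO#write` convention of returning bytes written):

```ruby
# Hypothetical minimal stream object for the interface described above:
# it only needs to respond to #write and #string.
class MinimalStream
  def initialize
    @buf = String.new
  end

  # Called with each chunk of output as it is read.
  def write(chunk)
    @buf << chunk
    chunk.bytesize
  end

  # Returns everything written so far.
  def string
    @buf
  end
end

stream = MinimalStream.new
stream.write("hello ")
stream.write("world")
stream.string  # => "hello world"
```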
mclark left a comment

I'm liking the duck-typed object interface MUCH better than the two Procs. I think that was the right way to go in this case.
lib/posix/spawn/child.rb
Outdated
```diff
+ bin_encoding = Encoding::BINARY
  [stdin, stdout, stderr].each do |fd|
-   fd.set_encoding('BINARY', 'BINARY')
+   fd.set_encoding(bin_encoding, bin_encoding)
```

Can we do this on a per-fd basis?

```ruby
bin_encoding = Encoding::BINARY
[stdin, stdout, stderr].each do |fd|
  fd.set_encoding(bin_encoding, bin_encoding) if fd.respond_to?(:set_encoding)
end
```

Also, are we intentionally dropping the force_encoding calls on stdout and stderr here?
lib/posix/spawn/child.rb
Outdated
```diff
- @out.force_encoding('BINARY')
- @err.force_encoding('BINARY')
  input = input.dup.force_encoding('BINARY') if input
+ @stdout_buffer.set_encoding(bin_encoding)
```

Are these duplicate calls to the above intentional? We really can't assume these objects respond to these methods any more.
lib/posix/spawn/child.rb
Outdated
```ruby
abort = false
if chunk
  if fd == stdout
    abort = (@stdout_buffer.write(chunk) == 0)
```

I don't feel like this is a safe way to test for aborting the operation. The output object could simply be refusing to write the current chunk but not be done consuming the stream.
Why not use an exception for this test instead? If the consumer raises `POSIX::Spawn::Aborted` then we clearly know to abort.
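A rough sketch of that exception-based signal, with a stand-in `Aborted` class and a hypothetical pump loop (not the gem's actual read loop):

```ruby
# Stand-in for POSIX::Spawn::Aborted; drive loop below is hypothetical.
class Aborted < StandardError; end

def pump(chunks, consumer)
  chunks.each { |chunk| consumer.call(chunk) }
  :finished
rescue Aborted
  # The consumer opted out; this is where the subprocess would be
  # cleaned up / killed.
  :aborted
end

seen = []
consumer = proc do |chunk|
  seen << chunk
  raise Aborted if seen.size >= 2  # the consumer decides when it is done
end

pump(%w[a b c], consumer)  # => :aborted, with seen == ["a", "b"]
```

The nice property is that a `false`-ish return value stays available for ordinary flow control, and "stop now" is unambiguous.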
lib/posix/spawn/child.rb
Outdated
```diff
  # maybe we've hit our max output
- if max && ready[0].any? && (@out.size + @err.size) > max
+ if max && ready[0].any? && (@stdout_buffer.size + @stderr_buffer.size) > max
```

Again, we can't assume there is a `#size` method on these objects...
I think we should probably keep a local count of the bytes we have written instead of calling `#size` anyway. We can't trust these objects any more as they could be anything.
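The local-count idea could look something like this (sketch with stand-in names; the real loop reads from pipes via select/readpartial):

```ruby
# Stand-in for POSIX::Spawn::MaximumOutputExceeded.
class MaximumOutputExceeded < StandardError; end

# Track bytes ourselves instead of trusting the stream object's #size.
def pump_with_max(chunks, stream, max)
  bytes_seen = 0
  chunks.each do |chunk|
    bytes_seen += chunk.bytesize
    raise MaximumOutputExceeded if max && bytes_seen > max
    stream.call(chunk)
  end
  bytes_seen
end

out = String.new
pump_with_max(["1234", "5678"], out.method(:<<), 100)  # => 8
```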
```ruby
end

# Exception raised when output streaming is aborted early.
class Aborted < StandardError
```

Oh right, we also need some thorough docs on this once we've nailed down the exact 🦆 interface we are using.
lib/posix/spawn/child.rb
Outdated
```diff
  chunk = nil
  begin
-   buf << fd.readpartial(BUFSIZE)
+   chunk = fd.readpartial(BUFSIZE)
```

If I'm understanding this right, the old code would always append directly into the final buffer, whereas this one reads a chunk and then appends that chunk to the buffer. Not knowing anything about how Ruby operates under the hood, is this a potential performance problem? It should just be an extra memcpy in the worst case, but I recall that we've hit bottlenecks on reading into Ruby before. I suspect those were mostly about arrays and not buffers, though (e.g., reading each line into its own buffer can be slow).

I could be wrong here (cc @tenderlove), but I'm pretty sure the previous code would actually create a new string (down inside readpartial), then that string would have been appended to `buf`, which would require potentially resizing that buffer first, then the memcpy.
This new code just keeps that first string as a local var first, so we can later determine where to write it. In the default case we're using a StringIO, so the result is essentially the same as before (potential buffer resize, then a copy). Though IIRC we saw some pretty significant speedups by using an array to keep track of chunks, then calling join at the end when it was all needed. The reason for that is that it avoids the reallocation of the resulting buffer as we're reading, and instead allows join to allocate a buffer exactly the size that's needed, then copying all the chunks into it.
Basically this (the old way):

```ruby
buffer = ""
buffer << "one,"
buffer << "two,"
buffer << "three"
return buffer
```

vs this (the array-optimized version I just mentioned):

```ruby
buffer = []
buffer << "one,"
buffer << "two,"
buffer << "three"
# buffer is an array with 3 elements at this point, and this join call
# figures out how big all of the strings inside are, then creates a single
# buffer to append them to.
return buffer.join
```

Using that approach efficiently may change the API contract here slightly though...
@brianmario Ah, right, that sort of return value optimization would be pretty easy to implement, and would mean we end up with the same number of copies. Though if we're just counting memcpys anyway, I suspect it doesn't matter much either way.

> The reason for that is because it avoids the reallocation of the resulting buffer as we're reading, and instead allows join to allocate a buffer exactly the size that's needed then copying all the chunks in to it.

Interesting. It sounds like appending doesn't grow the buffer aggressively in that case, because you should be able to get amortized constant-time.
Anyway. We're well out of my level for intelligent discussion of Ruby internals. The interesting result is whether reading the output of a spawned `cat some-gigabyte-file` is measurably any different. Probably not, but it's presumably easy to test.
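If someone wants to measure it, a quick (machine-dependent) comparison of the two strategies might be something like:

```ruby
require 'benchmark'

chunks = Array.new(1_000) { "x" * 16_384 }  # ~16 MB in 16kB chunks

appended = nil
joined = nil

Benchmark.bm(8) do |bm|
  # Old style: grow one buffer as we read.
  bm.report("append:") { appended = chunks.inject(String.new) { |buf, c| buf << c } }
  # Array style: collect chunks, let #join size the final buffer once.
  bm.report("join:")   { joined = chunks.join }
end

appended == joined  # both strategies build the same string
```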
brianmario left a comment

So, I decided to go back to the proc-based API because the requirements on the caller are much simpler. The objects passed as streams need only respond to `call` with an arity of 1 (the current chunk) and return a boolean: `true` on success, `false` to abort (note this is opposite from how I originally had it, though I think it makes more sense).
I'll keep going on tests and the documentation changes, but wanted to give folks one last chance to review this direction.
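To make the contract concrete, here is a sketch of a conforming caller, using a hypothetical driver loop in place of the gem's internals (`true` = keep streaming, `false` = abort):

```ruby
class Aborted < StandardError; end  # stand-in for the gem's Aborted

# Hypothetical driver: a stream callback is anything responding to #call
# with one argument (the chunk); returning false aborts the read loop.
def drive(chunks, callback)
  chunks.each do |chunk|
    raise Aborted unless callback.call(chunk)
  end
  :done
rescue Aborted
  :aborted
end

out = String.new
collector = proc { |chunk| out << chunk; true }  # true => keep going
drive(%w[a b c], collector)                      # => :done, out == "abc"

drive(%w[a b c], proc { |chunk| chunk != "b" })  # => :aborted after "b"
```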
```ruby
streams = {stdout => @stdout_stream, stderr => @stderr_stream}

bytes_seen = 0
chunk_buffer = ""
```

This buffer is reused by readpartial below. Internally, so far as I can tell, it will be resized to BUFSIZE on the first call to readpartial and then that underlying buffer will be reused from then on out.

Due to the issue with appending to strings that has been mentioned, we might want to consider having `#readpartial` allocate a new string and give ownership of it to the stream, but since we're now re-using this buffer, it's probably already more efficient than what we had before, so we can probably leave it until we actually find a perf issue we can trace back to this specifically.

Another `""` that might be better as `String.new` here.
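The buffer reuse can be seen with a plain pipe: `IO#readpartial(maxlen, outbuf)` fills and returns the very object you pass in.

```ruby
r, w = IO.pipe
w.write("hello world")
w.close

chunk_buffer = String.new
chunk = r.readpartial(5, chunk_buffer)

chunk.equal?(chunk_buffer)  # => true: readpartial returned the same object
chunk_buffer                # => "hello"
r.close
```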
```ruby
  end
end

[@out, @err]
```

I decided to just drop returning these in an attempt at consistency, since one or both of these ivars are useless if we're streaming.
```ruby
begin
  Child.new('yes', :streams => {:stdout => stdout_stream}, :max => limit)
rescue POSIX::Spawn::MaximumOutputExceeded
end
```

This should probably be an `assert_raises` block so we assert that the exception was raised; as-is this would not fail the test even if we never raise the error.
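To illustrate the concern: with a bare `begin`/`rescue`, the test passes whether or not the exception fires. A self-contained sketch (stand-in error class and a hand-rolled `assert_raises`-style helper, not minitest itself):

```ruby
# Stand-in for POSIX::Spawn::MaximumOutputExceeded.
class MaximumOutputExceeded < StandardError; end

def run_child(output_bytes, max)
  raise MaximumOutputExceeded if output_bytes > max
end

# The pattern from the test above: silently passes even if nothing is raised.
begin
  run_child(10, 1_000_000)  # max never exceeded, no exception, no failure
rescue MaximumOutputExceeded
end

# An assert_raises-style check fails loudly when the exception is missing.
def assert_raised(klass)
  yield
  raise "expected #{klass} to be raised"
rescue klass => e
  e
end

assert_raised(MaximumOutputExceeded) { run_child(10_000, 1_024) }
```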
This doesn't seem to be new, but the semantics of … While I was playing around with the added test here, I noticed that all chunks are 16kB (which seems to be the default pipe buffer size here), so the …
@carlosmn went ahead and added a failing test for that. Will get things fixed up so we only ever hand the caller …
```ruby
end
@options.delete(:chdir) if @options[:chdir].nil?

@out, @err = "", ""
```

Might want to use `String.new` to avoid breaking when someone passes `--enable-frozen-string-literal`.
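For context on that suggestion (behavior depends on the frozen-string-literal setting):

```ruby
# Under --enable-frozen-string-literal (or the per-file magic comment),
# a "" literal is frozen, so appending to it raises FrozenError:
#
#   buf = ""
#   buf << "chunk"   # => FrozenError
#
# String.new always yields a mutable string, regardless of that setting:
buf = String.new
buf << "chunk"
buf  # => "chunk"
```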
```ruby
@out << chunk

true
end
```

```ruby
@stdout_stream = @out.method(:<<)
```

🚲 🏠

No I love it! Being able to use `method` was one of the main reasons for going back to the proc-based API ;)
Any plans to merge this? I'd like to use it :-D
As the title says, this adds support for passing a block for receiving chunks of output stdout/stderr as they're being read.
I'm not super happy with how the API turned out, and I could probably add a few more tests - so suggestions are definitely welcome.
@tmm1 @peff @piki @carlosmn @simonsj @vmg @scottjg @mclark @arthurschreiber @tma in case any of you have time to review this 🙏