Conversation
Let me know if this is still welcome
Hi @JAicewizard - now that the dust has settled a bit on the migration: yes, this is still welcome! Feel free to ping me once there's a review to be left here. :)
@taniabogatsch I really haven't been able to work much on this over the past couple of months, but this should work now! I also noticed some nice improvements, like projected chunks :)
mlafeldt left a comment
Thanks for this big contribution!
Here's my initial review.
In addition to the inline remarks, I also noticed that this could use more comprehensive tests, e.g.
- FillRow() returns error mid-stream
- FillChunk() returns error
- AppendDataChunk() fails (StateError)
- Multiple goroutines fail simultaneously
Some work needed before merge, but the direction is solid.
Thanks for the review! I see you also worked on getting Zig extensions to build nicely; I will definitely try that out, as my current setup is quite outdated and annoying to work with! Besides that, I applied all the suggestions locally and am still working on getting the error tests in place. With errgroup, I don't think testing multiple simultaneous errors is necessary, as the error handling itself lives in a package from the Go project and I expect it to function. We could add such a test, but it would be a maintenance burden (defining a parallel table source plus test takes up quite a bit of code) and flaky at best.
I am not entirely sure how to test "
This currently also includes the changes from marcboeker#483, but I will rebase once that is merged.
This adds multiple ways to append, using all variations of table sources.
Plain row-based appending is actually slower than the current solution, which is to be expected: it operates very similarly, just with extra steps. The real performance gains come from the parallel implementations.
The benchmarks don't show much improvement (~25% for both parallel row and parallel chunk), but that is because both are bottlenecked on appending the chunks in DuckDB itself. If the bottleneck were instead user-side data ingestion/computation (rather than a simple counter), the numbers would favor the parallel variations much more. Currently they barely register as parallel on my laptop.
Another improvement would be setting entire vectors of data at a time instead of individual values; that could be done in a future PR.
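As a loose analogy for that follow-up (plain Go slices here, not DuckDB vectors): writing one element per call versus one bulk copy per vector.

```go
package main

import "fmt"

func main() {
	src := []int32{1, 2, 3, 4}

	// Per-value: one assignment (i.e. one API call) per element,
	// analogous to the current appender behavior.
	a := make([]int32, len(src))
	for i, v := range src {
		a[i] = v
	}

	// Vector-wise: one bulk copy per vector, analogous to the
	// proposed future improvement.
	b := make([]int32, len(src))
	copy(b, src)

	fmt.Println(a[3] == b[3]) // true
}
```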