feat: support table sample #16505
Conversation
| It would be better to add more details about the PR, such as: | 
| 
 | 
| I suggest first opening an issue that describes the full syntax and semantics of this table sample feature, including the reference system (such as Postgres). Once we have reached some agreement, we can start implementing. There is another implementation that seems to have several syntax differences from this PR: #16325 @theirix. We had a previous discussion that DF can include features for Postgres syntax. However, if this is referencing other systems, it might need more discussion and wider approval. | 
| 
 Updated. This PR implements Spark-style sampling. | 
| @2010YOUY01 thank you for pointing this out. @chenkovsky, it looks like both our PRs solve the same sampling problem with different approaches. The direction of my PR is to continue improving random filtering (as in #13268) by enhancing predicate-based sampling, as previously discussed with @alamb here. The sampling logic differs between databases, and in my PR's implementation and review process we have already begun addressing some subtle semantic differences between Postgres, DuckDB, Hive, etc. | 
| 
 I considered random filtering before, but I found it hard to implement Poisson sampling and seeding that way, so I brought Spark's design here. | 
| Some comments were added to the Cargo file today: datafusion/datafusion/sql/Cargo.toml, line 49 in 20a723b. 
 That makes sense to me. Changes to the dependencies of datafusion-sql should be made carefully. | 
| Thank you @chenkovsky -- this looks really neat. However, I have been wondering recently how many more features we can / should be adding to DataFusion core. For example, as we add more features it also becomes harder to use DataFusion in WASM. I wonder if you have considered adding support for TABLESAMPLE using only DataFusion extension points, for example with a custom user-defined operator and optimizer pass? This is not to say we should or should not merge this PR -- we can evaluate that separately. But I think we should start a larger conversation about what features we should be including in the core. | 
| Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. | 
Which issue does this PR close?
Closes #16533
Rationale for this change
Table sampling is currently not supported.
What changes are included in this PR?
Adds support for table sampling.
Sampling is row-level.
Three sampling methods are supported.
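The description does not name the three methods, so the following is only a rough sketch of what seeded, row-level sampling can look like, assuming a Spark-like split between Bernoulli sampling (without replacement) and Poisson sampling (with replacement). The names `SplitMix64`, `bernoulli_sample`, `poisson_sample`, and `poisson_draw` are hypothetical and are not taken from this PR or from DataFusion; the small hand-written generator is there only so the sketch needs no external crates.

```rust
/// Tiny deterministic PRNG (SplitMix64). Illustrative only, not from the PR.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform float in [0, 1).
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Bernoulli sampling: keep each row independently with probability `fraction`.
/// Every input row appears at most once, so this behaves like a random filter.
fn bernoulli_sample<T: Clone>(rows: &[T], fraction: f64, seed: u64) -> Vec<T> {
    let mut rng = SplitMix64(seed);
    rows.iter()
        .filter(|_| rng.next_f64() < fraction)
        .cloned()
        .collect()
}

/// Draw from Poisson(lambda) with Knuth's multiplication method
/// (adequate for the small lambda values used as sampling fractions).
fn poisson_draw(rng: &mut SplitMix64, lambda: f64) -> usize {
    let limit = (-lambda).exp();
    let mut k = 0;
    let mut p = rng.next_f64();
    while p > limit {
        k += 1;
        p *= rng.next_f64();
    }
    k
}

/// Poisson sampling: emit each row k times, with k drawn from Poisson(fraction).
/// A row can appear more than once, which a boolean filter cannot express.
fn poisson_sample<T: Clone>(rows: &[T], fraction: f64, seed: u64) -> Vec<T> {
    let mut rng = SplitMix64(seed);
    let mut out = Vec::new();
    for row in rows {
        let copies = poisson_draw(&mut rng, fraction);
        for _ in 0..copies {
            out.push(row.clone());
        }
    }
    out
}
```

Calling either function twice with the same seed returns the same rows, which is the property a seed is meant to provide. The Poisson variant may duplicate rows, which is one possible reason a pure predicate-based filter is awkward for it.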
Are these changes tested?
Yes, by unit tests (UT).
Are there any user-facing changes?
Yes. Users who use a match statement on the logical plan need to add a case for sample to that match statement.
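As a minimal sketch of that impact, assuming the change surfaces as a new variant on the plan enum: the `Plan` type and `describe` function below are hypothetical stand-ins, not DataFusion's actual `LogicalPlan` or any API from this PR. A `match` without a wildcard arm stops compiling when a variant is added, until an arm for it is written.

```rust
/// Hypothetical, simplified plan enum; NOT DataFusion's real `LogicalPlan`.
#[derive(Debug)]
enum Plan {
    Scan { table: String },
    Filter { input: Box<Plan> },
    /// Stand-in for the kind of node a table sample feature would introduce.
    Sample { input: Box<Plan>, fraction: f64 },
}

/// Exhaustive match with no wildcard (`_`) arm: once `Sample` exists on the
/// enum, removing its arm below is a compile error (non-exhaustive patterns).
fn describe(plan: &Plan) -> String {
    match plan {
        Plan::Scan { table } => format!("Scan({table})"),
        Plan::Filter { input } => format!("Filter <- {}", describe(input)),
        Plan::Sample { input, fraction } => {
            format!("Sample({fraction}) <- {}", describe(input))
        }
    }
}
```

Code that already ends its match with a wildcard arm keeps compiling, but it would silently route the new node through that fallback, so such call sites are worth reviewing as well.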