Paper Writing
- Create an Overleaf project (https://www.overleaf.com/5719128663sfwhtyzcgfmq)
- Experiment section
- Discuss outline
- Main method section
- Background section
- Related work section
- Introduction section
Our Implementation
- (T1) In place of ArrayRow, we have another class for DataFrame
- Conceptually, we process partition by partition. What is a partition? A partition is a DataFrame holding a subset of rows, with a projected subset of columns.
- We need to pass a series of (DataFrame, Meta) pairs
- How can we read/construct DataFrames in partitions?
- Rust Arrow already supports iterator-based reads (we doubt it does; needs verification)
- Pre-partition a table into multiple CSV files
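The partition-by-partition reading idea above could be sketched roughly as follows. This is a pure-Python stand-in: `Meta`, `read_partitions`, and the chunk size are hypothetical names, and the real implementation would presumably yield Rust Arrow record batches rather than lists of dicts.

```python
import csv
import io
from dataclasses import dataclass

@dataclass
class Meta:
    """Per-partition metadata (hypothetical fields)."""
    partition_id: int
    num_rows: int

def read_partitions(fileobj, columns, rows_per_partition):
    """Yield (partition, meta) pairs, where a partition is a
    column-projected chunk of rows standing in for a DataFrame."""
    reader = csv.DictReader(fileobj)
    batch, pid = [], 0
    for row in reader:
        batch.append({c: row[c] for c in columns})  # column projection
        if len(batch) == rows_per_partition:
            yield batch, Meta(pid, len(batch))
            batch, pid = [], pid + 1
    if batch:  # flush the final, possibly short, partition
        yield batch, Meta(pid, len(batch))
```

For example, `read_partitions(io.StringIO("a,b,c\n1,2,3\n4,5,6\n7,8,9\n"), ["a", "b"], 2)` yields two partitions: one with two projected rows and one with the single leftover row. The same shape works whether the source is a single CSV read incrementally or a table pre-partitioned into multiple CSV files.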
- (T2, which is blocked by T1) Convert join-free TPC-H queries into a node-based structure
- (T3) Persistent hash tables (for joins, it is better to keep hash tables across partitions rather than constructing them anew for every partition)
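A minimal sketch of the keep-the-hash-table idea in T3, assuming an equi-join over dict-like rows (the class and method names here are hypothetical, not the actual runtime API): the build side grows incrementally as partitions arrive, instead of being rebuilt per partition. Note that under online aggregation, probing against a still-growing build table yields partial results until the build side is exhausted.

```python
class PartitionedHashJoin:
    """Equi-join whose build-side hash table persists across partitions."""

    def __init__(self, build_key, probe_key):
        self.build_key = build_key
        self.probe_key = probe_key
        self.table = {}  # key -> list of build-side rows, grows incrementally

    def add_build_partition(self, partition):
        """Fold one build-side partition into the persistent hash table."""
        for row in partition:
            self.table.setdefault(row[self.build_key], []).append(row)

    def probe_partition(self, partition):
        """Probe one probe-side partition against everything built so far."""
        out = []
        for row in partition:
            for match in self.table.get(row[self.probe_key], []):
                out.append({**match, **row})
        return out
```

Usage: call `add_build_partition` once per build-side partition, then `probe_partition` per probe-side partition; no per-partition rebuild cost is paid.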
- (T4, blocked by T3) Convert with-join TPC-H queries into a node-based structure
- (T5) Do something for "WHERE subquery" queries
- (T6, blocked by T5) Convert the remaining (i.e., with-subquery) TPC-H queries
- Example: https://github.com/illinoisdata/DeepOLA/blob/main/rust/runtime/examples/tpch/q1.rs
- Supawit: working on estimation logic
Experiments: end-to-end
- Setup
- Dataset: We mainly use skewed TPC-H datasets (1GB)
- Comparison I against OLA: Ours vs WanderJoin (Q3, Q7, Q10)
- Comparison II against OLA: Ours vs ProgressiveDB (Q1, Q6)
- Ours
- Convert TPC-H queries into our Nodes Tracker: Link
- Collect numbers for 1GB (toy experiments) Google Sheet: Link
- Collect numbers for 10GB / 50GB (we may use different scales)
- Query analysis document
- Others
- Traditional query engines (non OLA)
- Polars (https://github.com/pola-rs/polars)
- Try polars
- Postgres
- Arav: Presto (Hive connector, reading CSV data (not Parquet)):
- link to our dataset
- https://github.com/illinoisdata/labwiki/wiki/Setting-up-a-Presto-cluster
- Upload CSV files to HDFS
- Create (external) tables on CSV files
- Run some test queries
- Arav will discuss with Yongjoo how to run TPC-H queries
- Upload the results of 22 TPC-H queries on Presto for 10GB dataset to our Google sheet
- Create a PR containing 22 TPC-H queries runnable on Presto (Assign Nikhil for reviewer)
- To collect reliable query latencies of 22 TPC-H queries for Presto on 10 GB dataset
- Learn how to clear caches (turn Presto off and on, and measure latency again; clear the OS cache; learn whether we can clear the HDFS cache)
- Write script to measure cold-start latency
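A rough sketch of what that cold-start measurement script could look like. Assumptions to flag: a Linux host where dropping the page cache via `/proc/sys/vm/drop_caches` requires root, and a caller-supplied `run_query` callable (e.g. one that shells out to `presto-cli`); `drop_os_cache` and `measure_latency` are hypothetical names.

```python
import statistics
import subprocess
import time

def drop_os_cache():
    """Drop the Linux page cache (assumption: Linux host, run as root)."""
    subprocess.run(["sync"], check=True)
    subprocess.run(["sh", "-c", "echo 3 > /proc/sys/vm/drop_caches"], check=True)

def measure_latency(run_query, repeats=3, clear_cache=None):
    """Time `run_query` (a zero-arg callable) over several runs and
    return the median latency in seconds; optionally clear caches
    before each run to simulate a cold start."""
    latencies = []
    for _ in range(repeats):
        if clear_cache is not None:
            clear_cache()  # e.g. drop_os_cache, plus a Presto restart
        start = time.perf_counter()
        run_query()
        latencies.append(time.perf_counter() - start)
    return statistics.median(latencies)
```

Reporting the median over repeats makes the number less sensitive to one slow run; Presto restarts and HDFS-cache handling would slot into `clear_cache`.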
- OLA engines
- Arav: Wander join / XDB (http://www.cs.utah.edu/~lifeifei/papers/wanderjoin.pdf)
- Use the same Azure machines (only for this experiment)
- This is the Github repository: https://github.com/InitialDLab/XDB
- Test 3 TPC-H queries
- WanderJoin is very accurate, so we are trying to vary the distribution of the TPC-H dataset
- Measure the latency of 3 TPC-H queries (Numbers: 3, 7, 10) (https://docs.google.com/spreadsheets/d/1Qy9cytnXFpkjA1mEkU44dDxOIHsSSwKoKNxKHQAreqM/edit?usp=sharing)
- Suwen: ProgressiveDB (http://www.vldb.org/pvldb/vol12/p1814-berg.pdf)
- Check its GitHub repository (https://github.com/DataManagementLab/progressiveDB)
- See what resources we need (e.g., Ubuntu machine), Try to understand limitation (limited set of queries)
- Discuss with Yongjoo how to set things up actually
- IntelliJ -> Maven (`mvn compile` or `mvn build` -> jar files)
- Will update code to GitHub (as a new branch)
- Will check double data type support
- Converted data types from double/float to int using Pandas
- Will bulk insert CSV data to Postgres (or ProgressiveDB)
- Analyze why partitioning (or preparing) the 1GB TPC-H dataset takes so long
- Start measuring latencies (specify the exact example queries and the definition of latency)
- Google sheet
- Google sheet for skewed data
- Sample data for running polars
Experiments: estimation accuracy
- Task I: Estimate sum (for both (1) randomly shuffled data and (2) somehow sorted data)
- Use TPC-H query 12 because the query is relatively simple.
- Prepare datasets (i.e., with randomly generated `o_orderpriority` and systematically generated `o_orderpriority`)
- Ours
- Closed form (https://web.ma.utexas.edu/users/parker/sampling/woreplshort.htm)
- Bayesian (https://asterix.ics.uci.edu/pub/vldb11-oa.pdf)
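The closed-form option above (sampling without replacement, per the linked notes) could be sketched like this: scale the sample mean up to the population total, with a finite-population-corrected standard error. The function name `estimate_sum` is hypothetical.

```python
import math
import statistics

def estimate_sum(sample, population_size):
    """Closed-form sum estimate from a without-replacement sample of size n
    out of N rows: T_hat = N * mean(sample), with standard error
    N * sqrt((1 - n/N) * s^2 / n), where s^2 is the sample variance."""
    n, N = len(sample), population_size
    total = N * statistics.fmean(sample)
    s2 = statistics.variance(sample)  # unbiased sample variance (needs n >= 2)
    se = N * math.sqrt((1 - n / N) * s2 / n)
    return total, se
```

For example, `estimate_sum([1, 2, 3, 4, 5], 100)` returns an estimated total of 300 with a standard error of about 68.9. The `(1 - n/N)` correction is what shrinks the error to zero as the scan approaches the full table, which matters for randomly shuffled data; for sorted data the i.i.d. assumption behind this formula breaks down.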
- Task II: Estimate count-distinct (for both (1) randomly shuffled data and (2) somehow sorted data)
- Use TPC-H query 16 because that seems to be the only count-distinct query.
- Prepare datasets (i.e., shuffled `partsupp` and non-shuffled `partsupp`)
- Ours
- Closed form (http://vldb.org/conf/1995/P311.PDF)
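One concrete closed form from the linked paper (Haas et al., VLDB 1995) is the Guaranteed-Error Estimator (GEE): D_hat = sqrt(N/n) * f1 + sum over j >= 2 of f_j, where f_j counts the values that appear exactly j times in a size-n sample from N rows. A sketch (the function name is hypothetical, and as with Task I this assumes a uniform random sample, i.e., shuffled data):

```python
import math
from collections import Counter

def gee_count_distinct(sample, population_size):
    """GEE estimate of the number of distinct values:
    scale up the singletons (values seen exactly once) by sqrt(N/n),
    and count multiply-seen values as-is."""
    n, N = len(sample), population_size
    freq = Counter(Counter(sample).values())  # f_j: j -> how many values occur j times
    f1 = freq.get(1, 0)
    multi = sum(count for j, count in freq.items() if j >= 2)
    return math.sqrt(N / n) * f1 + multi
```

For example, with sample `[1, 1, 2, 3]` (f1 = 2, one repeated value) drawn from N = 16 rows, the estimate is sqrt(16/4) * 2 + 1 = 5. The intuition: values already seen twice are probably frequent and fully accounted for, while each singleton likely stands in for more unseen distinct values.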
Experiments: others
- Impact of batch size