Skip to content

Comments

add @pivot_wider + Cte generation improvments#125

Merged
drizk1 merged 12 commits intomainfrom
cte-gen
Apr 12, 2025
Merged

add @pivot_wider + Cte generation improvments#125
drizk1 merged 12 commits intomainfrom
cte-gen

Conversation

@drizk1
Copy link
Member

@drizk1 drizk1 commented Mar 29, 2025

ex below just need to add tidy selection.

names_from only supports one col right now

rentin = dt(db, "https://raw.githubusercontent.com/tidyverse/tidyr/refs/heads/main/data-raw/us_rent_income.csv")
julia> @chain rentin begin 
       @pivot_wider(names_from = variable, values_from = [estimate, moe])
       @aside @show_query _
       @collect
       end
WITH cte_1 AS (
SELECT  *
        FROM 'https://raw.githubusercontent.com/tidyverse/tidyr/refs/heads/main/data-raw/us_rent_income.csv' AS us_rent_income ),
cte_2 AS (
SELECT GEOID, NAME,
        ANY_VALUE(estimate) FILTER(WHERE variable = 'income') AS income_estimate,
        ANY_VALUE(moe) FILTER(WHERE variable = 'income') AS income_moe,
        ANY_VALUE(estimate) FILTER(WHERE variable = 'rent') AS rent_estimate,
        ANY_VALUE(moe) FILTER(WHERE variable = 'rent') AS rent_moe
        FROM cte_1
        GROUP BY GEOID, NAME)  
SELECT *
        FROM cte_2
52×6 DataFrame
 Row │ GEOID   NAME                  income_estimate  income_moe  rent_estimate  rent_moe 
     │ String  String                String           String      String         String   
─────┼────────────────────────────────────────────────────────────────────────────────────
   1 │ 01      Alabama               24476            136         747            3
   2 │ 02      Alaska                32940            508         1200           13
   3 │ 04      Arizona               27517            148         972            4
   4 │ 05      Arkansas              23789            165         709            5
   5 │ 06      California            29454            109         1358           3
   6 │ 08      Colorado              32401            109         1125           5
   7 │ 09      Connecticut           35326            195         1123           5
   8 │ 10      Delaware              31560            247         1076           10
   9 │ 11      District of Columbia  43198            681         1424           17
  10 │ 12      Florida               25952            70          1077           3
julia> @chain fish  begin 
           @pivot_wider(names_from = "Station", values_from = "value")
           @aside  @show_query _
           @collect
           end
WITH cte_1 AS (
SELECT  *
        FROM 'https://github.com/Myfanwy/ReproducibleExamples/raw/master/encounterhistories/fishdata.csv' AS fishdata ),
cte_2 AS (
SELECT TagID,
        ANY_VALUE(value) FILTER(WHERE Station = 'Release') AS Release_value,
        ANY_VALUE(value) FILTER(WHERE Station = 'I80_1') AS I80_1_value,
        ANY_VALUE(value) FILTER(WHERE Station = 'Lisbon') AS Lisbon_value,
        ANY_VALUE(value) FILTER(WHERE Station = 'Rstr') AS Rstr_value,
        ANY_VALUE(value) FILTER(WHERE Station = 'Base_TD') AS Base_TD_value,
        ANY_VALUE(value) FILTER(WHERE Station = 'BCE') AS BCE_value,
        ANY_VALUE(value) FILTER(WHERE Station = 'BCW') AS BCW_value,
        ANY_VALUE(value) FILTER(WHERE Station = 'BCE2') AS BCE2_value,
        ANY_VALUE(value) FILTER(WHERE Station = 'BCW2') AS BCW2_value,
        ANY_VALUE(value) FILTER(WHERE Station = 'MAE') AS MAE_value,
        ANY_VALUE(value) FILTER(WHERE Station = 'MAW') AS MAW_value
        FROM cte_1
        GROUP BY TagID)  
SELECT *
        FROM cte_2
19×12 DataFrame
 Row │ TagID  Release_value  I80_1_value  Lisbon_value  Rstr_value  Base_TD_value  BCE_value  ⋯
     │ Int64  Int64          Int64        Int64         Int64       Int64          Int64      ⋯
─────┼─────────────────────────────────────────────────────────────────────────────────────────
   1 │  4842              1            1             1           1              1          1  ⋯
   2 │  4843              1            1             1           1              1          1
   3 │  4844              1            1             1           1              1          1
   4 │  4845              1            1             1           1              1          0
   5 │  4847              1            1             1           0              0          0  ⋯
   6 │  4848              1            1             1           1              0          0
   7 │  4849              1            1             0           0              0          0
   8 │  4850              1            1             0           1              1          1
   9 │  4851              1            1             0           0              0          0

@drizk1
Copy link
Member Author

drizk1 commented Mar 31, 2025

Pivot wider is inherently eager as it needs the distinct values from the names_from column.

Right now, the macro and underlying function actively pulls them before building the equivalent sql.

It might be nice additionally to allow names_from to also accept tuples or pairs where one is the column name for names from (as it needs to know so it can be dropped), and the other part would be the distinct values (if the user for some reason already knows them)

I was running a pivot on 10 gigs of data and was like hmm I already know the distinct values might be nice to save some query time but open to ideas

@drizk1
Copy link
Member Author

drizk1 commented Apr 4, 2025

Adds lazy pivot method where a Tuple of (col_name, [:column_values] can be provided as the names_from argument, for the sql generation. If its not given, @pivot_wider will query the relevant column to pull the distinct column values needed to make the new column names.

test = (:variable, [:income, :rent]);
@eval @chain rentin begin 
       @pivot_wider(names_from = $test, 
       values_from = estimate:moe)
       @aside @show_query _
       @collect
end

@drizk1 drizk1 merged commit dec61c2 into main Apr 12, 2025
4 of 5 checks passed
@drizk1 drizk1 deleted the cte-gen branch April 12, 2025 17:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant