Commit 1163b98

drizk1, rdboyes, and kdpsingh authored
fixes unnest_wider bug w missing key (#143)
* fixes unnest_wider bug w missing key
* address an edge case, improves stacktrace with throw
* adds test to get code cov up for nest.jl
* makes names when unnesting wider more explicit so that unnesting something w name that already exists doesn't overwrite that name
* fix json example
* fix json part2
* remove json example for now
* remove json from docs toml
* copy and paste og nesting docs
* learn to read, add json ex change default for nesting name
* learn to scroll
* fixes dif length arrays with longer
* groupby nest test 4 codecov
* unnest tuples wider support
* basic logging for main verbs (#138)
* basic logging for select, mutate, and transmute
* unit testing for logs
* remove deepdiffs dependency
* adds tests for logs on the rest of the functions
* typo fix
* add mutate numbers to log
* adds join logging, fix cov x
* Fix edge case for logging with grouped data frames.
* :newsize mode logs correct type
* add detail for row_change and col_change
* add brief docs, bump v, up news
* fixes log when grouped mutate, adds fillmissing, dropmissing log support
* fixed fxn call
* fix join log if stmnt, bump cov attempt w 2tests
* add slice log support
* change slice_min_max to not use `@filter` bc of logging msg dupes
* adds unite, sep, sep_rows
* adds logging for nests
* minor docs edits for settings
* exclude log.jl from code coverage for now
---------
Co-authored-by: Daniel Rizk <[email protected]>
Co-authored-by: Karandeep Singh <[email protected]>
* fixes count n issue (#145)
* fixes count n issue
* gets rid of xs lines in conversion to improve testing
* revert type converts, add tests
* basic logging for main verbs (#138)
* basic logging for select, mutate, and transmute
* unit testing for logs
* remove deepdiffs dependency
* adds tests for logs on the rest of the functions
* typo fix
* add mutate numbers to log
* adds join logging, fix cov x
* Fix edge case for logging with grouped data frames.
* :newsize mode logs correct type
* add detail for row_change and col_change
* add brief docs, bump v, up news
* fixes log when grouped mutate, adds fillmissing, dropmissing log support
* fixed fxn call
* fix join log if stmnt, bump cov attempt w 2tests
* add slice log support
* change slice_min_max to not use `@filter` bc of logging msg dupes
* adds unite, sep, sep_rows
* adds logging for nests
* minor docs edits for settings
* exclude log.jl from code coverage for now
---------
Co-authored-by: Daniel Rizk <[email protected]>
Co-authored-by: Karandeep Singh <[email protected]>
* Updated NEWS.md
* Set `fail-on-error` to false for coveralls, removed excluded coverage from log.jl.
---------
Co-authored-by: Randall Boyes <[email protected]>
Co-authored-by: Karandeep Singh <[email protected]>
* Updated NEWS.md
---------
Co-authored-by: Randall Boyes <[email protected]>
Co-authored-by: Karandeep Singh <[email protected]>
1 parent d492599 commit 1163b98

6 files changed: +183 -40 lines changed

NEWS.md

Lines changed: 1 addition & 0 deletions
@@ -2,6 +2,7 @@
 
 ## v.0.17.0 - 2025-03-24
 - Bugfix: `@count()` can now be called multiple times. If column `n` already exists, then the new column containing the count will be `nn` (and so on).
+- Bugfix: `@unnest_wider()` now works on data where keys are missing
 - Adds logging ability to track changes to data frames with `TidierData_set("log", true)`
 - Adds docs describing logging and code printing
 
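The `@count()` entry above describes a naming rule: when `n` is already taken, the count column becomes `nn`, and so on. A minimal sketch of that rule in plain Julia follows. `next_count_name` is a hypothetical helper written for illustration only; it is not a function from TidierData, and the package may implement the rule differently.

```julia
# Hypothetical helper (not part of TidierData): choose a count-column name
# that does not collide with existing column names by appending "n".
function next_count_name(existing::Vector{String}; base::String = "n")
    name = base
    while name in existing
        name *= "n"   # n -> nn -> nnn ...
    end
    return name
end

next_count_name(["a", "b"])        # "n"
next_count_name(["a", "n"])        # "nn"
next_count_name(["n", "nn", "x"])  # "nnn"
```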

docs/Project.toml

Lines changed: 1 addition & 0 deletions
@@ -11,3 +11,4 @@ Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
 StableRNGs = "860ef19b-820b-49d6-a774-d7a799459cd3"
 TidierData = "fe2206b3-d496-4ee9-a338-6a095c4ece80"
 Unitful = "1986cc42-f94f-5a68-af5c-568840ba703d"
+JSON = "682c06a0-de6a-54ab-a142-c8b1cf79cde6"

docs/examples/UserGuide/nesting.jl

Lines changed: 33 additions & 3 deletions
@@ -11,15 +11,15 @@ nested_df = @nest(df4, n2 = starts_with("a"), n3 = y:yz)
 # To return to the original dataframe, you can unnest wider and then longer.
 
 @chain nested_df begin
-    @unnest_wider(n3:n2)
+    @unnest_wider(n3:n2, names_sep = nothing)
     @unnest_longer(y:ab)
 end
 
 # Or you can unnest longer and then wider.
 
 @chain nested_df begin
     @unnest_longer(n3:n2)
-    @unnest_wider(n3:n2)
+    @unnest_wider(n3:n2, names_sep = nothing)
 end
 
 # ## `@unnest_longer`
@@ -67,5 +67,35 @@ df3 = DataFrame(
 
 @chain df3 begin
     @unnest_wider(y)
-    @unnest_longer(a:c, keep_empty = true)
+    @unnest_longer(y_a:y_c, keep_empty = true)
+end
+
+# ## unnest JSON files
+
+using JSON
+
+json_str = """
+{
+    "name": "Chris",
+    "age": 23,
+    "address": {
+        "city": "New York",
+        "country": "America"
+    },
+    "friends": [
+        {
+            "name": "Emily",
+            "hobbies": [ "biking", "music", "gaming" ]
+        },
+        {
+            "name": "John",
+            "hobbies": [ "soccer", "gaming" ]
+        }
+    ]
+}
+""";
+json_df = DataFrame(JSON.parse(json_str))
+
+@chain json_df begin
+    @unnest_wider(address, friends)
 end

src/docstrings.jl

Lines changed: 81 additions & 29 deletions
@@ -3223,7 +3223,7 @@ Unnest specified columns of arrays or dictionaries into wider format dataframe w
 # Arguments
 - `df`: A DataFrame.
 - `columns`: Columns to be unnested. These columns should contain arrays, dictionaries, dataframes, or tuples. Dictionarys headings will be converted to column names.
-- `names_sep`: An optional string to specify the separator for creating new column names. If not provided, defaults to no separator.
+- `names_sep`: An optional string to specify the separator for creating new column names. If not provided, defaults to `_`.
 
 # Examples
 ```jldoctest
@@ -3233,11 +3233,11 @@ julia> df = DataFrame(name = ["Zaki", "Farida"], attributes = [
 
 julia> @unnest_wider(df, attributes)
 2×3 DataFrame
- Row │ name    city         age
-     │ String  String       Int64
-─────┼────────────────────────────
-   1 │ Zaki    New York        25
-   2 │ Farida  Los Angeles     30
+ Row │ name    attributes_city  attributes_age
+     │ String  String           Int64
+─────┼─────────────────────────────────────────
+   1 │ Zaki    New York                     25
+   2 │ Farida  Los Angeles                  30
 
 julia> df2 = DataFrame(a=[1, 2], b=[[1, 2], [3, 4]], c=[[5, 6], [7, 8]])
 2×3 DataFrame
@@ -3247,13 +3247,54 @@ julia> df2 = DataFrame(a=[1, 2], b=[[1, 2], [3, 4]], c=[[5, 6], [7, 8]])
    1 │     1  [1, 2]  [5, 6]
    2 │     2  [3, 4]  [7, 8]
 
-julia> @unnest_wider(df2, b:c, names_sep = "_")
+julia> @unnest_wider(df2, b:c, names_sep = "")
 2×5 DataFrame
- Row │ a      b_1    b_2    c_1    c_2
+ Row │ a      b1     b2     c1     c2
      │ Int64  Int64  Int64  Int64  Int64
 ─────┼───────────────────────────────────
    1 │     1      1      2      5      6
    2 │     2      3      4      7      8
+
+
+julia> a1=Dict("a"=>1, "b"=>Dict("c"=>1, "d"=>2)); a2=Dict("a"=>1, "b"=>Dict("c"=>1)); a=[a1;a2]; df=DataFrame(a);
+
+julia> @unnest_wider(df, b)
+2×3 DataFrame
+ Row │ a      b_c    b_d
+     │ Int64  Int64  Int64?
+─────┼───────────────────────
+   1 │     1      1        2
+   2 │     1      1  missing
+
+julia> a0=Dict("a"=>0, "b"=>0); a1=Dict("a"=>1, "b"=>Dict("c"=>1, "d"=>2)); a2=Dict("a"=>2, "b"=>Dict("c"=>2)); a3=Dict("a"=>3, "b"=>Dict("c"=>3)); a=[a0;a1;a2;a3]; df3=DataFrame(a);
+
+julia> @unnest_wider(df3, b)
+4×3 DataFrame
+ Row │ a      b_c      b_d
+     │ Int64  Int64?   Int64?
+─────┼─────────────────────────
+   1 │     0  missing  missing
+   2 │     1        1        2
+   3 │     2        2  missing
+   4 │     3        3  missing
+
+julia> df = DataFrame(x1 = ["one", "two", "three"], x2 = [(1, "a"), (2, "b"), (3, "c")])
+3×2 DataFrame
+ Row │ x1      x2
+     │ String  Tuple…
+─────┼──────────────────
+   1 │ one     (1, "a")
+   2 │ two     (2, "b")
+   3 │ three   (3, "c")
+
+julia> @unnest_wider df x2
+3×3 DataFrame
+ Row │ x1      x2_1   x2_2
+     │ String  Int64  String
+─────┼───────────────────────
+   1 │ one         1  a
+   2 │ two         2  b
+   3 │ three       3  c
 ```
 """
 
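The docstring changes above follow from a single naming expression in the dictionary branch of `src/nests.jl`. A minimal sketch of that expression in plain Julia is below. `wide_name` is a hypothetical wrapper added here for illustration; only the ternary inside it is taken from the diff.

```julia
# Hypothetical wrapper around the dictionary-branch naming expression.
# With a separator, the parent column name is prefixed; with `nothing`,
# the key alone becomes the column name.
wide_name(col, key, names_sep) =
    names_sep === nothing ? Symbol(key) : Symbol(string(col, names_sep, key))

wide_name(:attributes, "city", "_")      # :attributes_city (new default)
wide_name(:attributes, "city", nothing)  # :city (previous default)
wide_name(:b, "c", "")                   # :bc
```

Note that the array and tuple branches in the diff differ slightly: with `names_sep = nothing` they still prefix the column name and append the index directly.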

@@ -3388,7 +3429,7 @@ julia> @chain df begin
 
 julia> @chain df begin
          @nest(data = b:c_2)
-         @unnest_wider(data)
+         @unnest_wider(data, names_sep = nothing)
        end
 5×4 DataFrame
  Row │ a     b             c_1           c_2
@@ -3402,7 +3443,7 @@ julia> @chain df begin
 
 julia> @chain df begin
          @nest(data = -a)
-         @unnest_wider(data) # wider first
+         @unnest_wider(data, names_sep = nothing) # wider first
          @unnest_longer(-a) # then longer
        end
 15×4 DataFrame
@@ -3428,27 +3469,38 @@ julia> @chain df begin
 julia> @chain df begin
          @nest(data = -a)
          @unnest_longer(data) # longer first
-         @unnest_wider(-a) # then wider
+         @unnest_wider(-a) # then wider, names sep defualting to "_"
        end
 15×4 DataFrame
- Row │ a     b      c_2    c_1
-     │ Char  Int64  Int64  Int64
-─────┼───────────────────────────
-   1 │ a         1     31     16
-   2 │ a         2     32     17
-   3 │ a         3     33     18
-   4 │ b         4     34     19
-   5 │ b         5     35     20
-   6 │ b         6     36     21
-   7 │ c         7     37     22
-   8 │ c         8     38     23
-   9 │ c         9     39     24
-  10 │ d        10     40     25
-  11 │ d        11     41     26
-  12 │ d        12     42     27
-  13 │ e        13     43     28
-  14 │ e        14     44     29
-  15 │ e        15     45     30
+ Row │ a     data_b  data_c_2  data_c_1
+     │ Char  Int64   Int64     Int64
+─────┼──────────────────────────────────
+   1 │ a          1        31        16
+   2 │ a          2        32        17
+   3 │ a          3        33        18
+   4 │ b          4        34        19
+   5 │ b          5        35        20
+   6 │ b          6        36        21
+   7 │ c          7        37        22
+   8 │ c          8        38        23
+   9 │ c          9        39        24
+  10 │ d         10        40        25
+  11 │ d         11        41        26
+  12 │ d         12        42        27
+  13 │ e         13        43        28
+  14 │ e         14        44        29
+  15 │ e         15        45        30
+
+julia> @chain df @group_by(a) @nest(data = b:c_2) @ungroup()
+5×2 DataFrame
+ Row │ a     data
+     │ Char  DataFrame
+─────┼─────────────────────
+   1 │ a     3×3 DataFrame
+   2 │ b     3×3 DataFrame
+   3 │ c     3×3 DataFrame
+   4 │ d     3×3 DataFrame
+   5 │ e     3×3 DataFrame
 ```
 """
 

src/nests.jl

Lines changed: 66 additions & 7 deletions
@@ -1,4 +1,4 @@
-function unnest_wider(df::Union{DataFrame, GroupedDataFrame}, cols; names_sep::Union{String, Nothing}=nothing)
+function unnest_wider(df::Union{DataFrame, GroupedDataFrame}, cols; names_sep::Union{String, Nothing}="_")
     is_grouped = df isa GroupedDataFrame
     grouping_columns = is_grouped ? groupcols(df) : Symbol[]
     df_copy = copy(is_grouped ? parent(df) : df)
@@ -49,19 +49,56 @@ function unnest_wider(df::Union{DataFrame, GroupedDataFrame}, cols; names_sep::U
             for item in df_copy[!, col]
                 union!(keys_set, keys(item))
             end
-
+
             for key in keys_set
                 new_col_name = names_sep === nothing ? Symbol(key) : Symbol(string(col, names_sep, key))
-                df_copy[!, new_col_name] = getindex.(df_copy[!, col], key)
-            end
+                df_copy[!, new_col_name] = get.(df_copy[!, col], Ref(key), missing)
+            end
 
         elseif col_type <: Array
             n = length(first(df_copy[!, col]))
             for i in 1:n
                 new_col_name = names_sep === nothing ? Symbol(string(col, i)) : Symbol(string(col, names_sep, i))
-                df_copy[!, new_col_name] = getindex.(df_copy[!, col], i)
+                try
+                    df_copy[!, new_col_name] = getindex.(df_copy[!, col], i)
+                catch
+                    throw("Try using `@unnest_longer($col)` before `@unnest_wider(attribute)`")
+                end
+            end
+        elseif col_type <: Tuple || (col_type <: Union{Tuple, Missing})
+            nonmissing = filter(x -> x !== missing, df_copy[!, col])
+            n = length(first(nonmissing))
+            for i in 1:n
+                new_col_name = names_sep === nothing ? Symbol(string(col, i)) : Symbol(string(col, names_sep, i))
+                try
+                    df_copy[!, new_col_name] = getindex.(df_copy[!, col], i)
+                catch
+                    throw("Error unnesting tuple from column $col. Try using `@unnest_longer($col)` before `@unnest_wider(attribute)`")
+                end
+            end
+
+        elseif any(x -> x isa Dict, df_copy[!, col])
+            keys_set = Set{String}()
+            for item in df_copy[!, col]
+                if item isa Dict
+                    union!(keys_set, keys(item))
+                end
+            end
+            for key in keys_set
+                new_col_name = names_sep === nothing ? Symbol(key) : Symbol(string(col, names_sep, key))
+                df_copy[!, new_col_name] = [item isa Dict ? get(item, key, missing) : missing for item in df_copy[!, col]]
+            end
+        elseif any(x -> x isa Pair, df_copy[!, col])
+            keys_set = Set{Any}()
+            for item in df_copy[!, col]
+                if item isa Pair
+                    push!(keys_set, item.first)
+                end
+            end
+            for key in keys_set
+                new_col_name = names_sep === nothing ? Symbol(string(key)) : Symbol(string(col, names_sep, key))
+                df_copy[!, new_col_name] = [item isa Pair && item.first == key ? item.second : missing for item in df_copy[!, col]]
             end
-
         else
             error("Column $col contains neither dictionaries nor arrays nor DataFrames")
         end
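
The core of the missing-key fix in the hunk above is the switch from `getindex.` to `get.` with a `missing` default. The sketch below reproduces that difference using only Base Julia, outside of any DataFrame. The variable names are illustrative and not taken from the package.

```julia
# Two dictionaries; the second lacks the key "d".
items = [Dict("c" => 1, "d" => 2), Dict("c" => 1)]

# Old approach: broadcasting `getindex` throws a KeyError on the second item.
old_failed = try
    getindex.(items, "d")
    false
catch err
    err isa KeyError
end

# New approach: `get` with a default value. `Ref("d")` marks the key as a
# scalar so broadcasting iterates over `items` only.
new_col = get.(items, Ref("d"), missing)   # [2, missing]
```

A `String` key is already treated as a scalar by broadcasting, so `Ref` is presumably a safeguard for key types that are themselves iterable.
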
@@ -78,13 +115,14 @@ function unnest_wider(df::Union{DataFrame, GroupedDataFrame}, cols; names_sep::U
     return df_copy
 end
 
+
 """
 $docstring_unnest_wider
 """
 macro unnest_wider(df, exprs...)
     exprs = parse_blocks(exprs...)
 
-    names_sep = :(nothing)
+    names_sep = :("_")
     if length(exprs) >= 2 && isa(exprs[end], Expr) && exprs[end].head == :(=) && exprs[end].args[1] == :names_sep
         names_sep = esc(exprs[end].args[2])
         exprs = exprs[1:end-1]
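
The macro in the hunk above peels an optional trailing `names_sep = ...` expression off its argument list and otherwise falls back to the new default `"_"`. A minimal sketch of that check on plain expression tuples follows. `split_names_sep` is a hypothetical function written for illustration; the real macro also escapes the value and builds a call expression, which is omitted here.

```julia
# Hypothetical helper: separate a trailing `names_sep = value` expression
# from the column expressions, using the same test as the macro.
function split_names_sep(exprs::Tuple; default = "_")
    if length(exprs) >= 2 && exprs[end] isa Expr &&
       exprs[end].head == :(=) && exprs[end].args[1] == :names_sep
        return exprs[1:end-1], exprs[end].args[2]
    end
    return exprs, default
end

# The value is still an unevaluated expression here, so `nothing`
# appears as the Symbol `:nothing`.
cols, sep = split_names_sep((:b, :(names_sep = nothing)))
```

Because the test requires at least two expressions, a lone `names_sep = ...` with no column argument is not treated as the keyword.
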
@@ -100,6 +138,8 @@ macro unnest_wider(df, exprs...)
     return df_expr
 end
 
+using DataFrames
+
 function unnest_longer(df::Union{DataFrame, GroupedDataFrame}, cols; indices_include::Union{Nothing, Bool}=nothing, keep_empty::Bool=false)
     is_grouped = df isa GroupedDataFrame
     grouping_columns = is_grouped ? groupcols(df) : Symbol[]
@@ -116,10 +156,28 @@ function unnest_longer(df::Union{DataFrame, GroupedDataFrame}, cols; indices_inc
             x for x in df_copy[!, col]]
     end
 
+    # Pad rows if columns have different lengths.
+    for i in 1:nrow(df_copy)
+        # Collect lengths of each non-missing iterable in this row
+        current_lengths = [length(df_copy[i, col]) for col in column_symbols if !ismissing(df_copy[i, col])]
+        if !isempty(current_lengths)
+            maxlen = maximum(current_lengths)
+            for col in column_symbols
+                if !ismissing(df_copy[i, col])
+                    arr = df_copy[i, col]
+                    if length(arr) < maxlen
+                        df_copy[i, col] = vcat(arr, fill(missing, maxlen - length(arr)))
+                    end
+                end
+            end
+        end
+    end
+
     # Apply filter if keep_empty is false
     if !keep_empty
         df_copy = filter(row -> !any(ismissing, [row[col] for col in column_symbols]), df_copy)
     end
+
     # Flatten the dataframe
     flattened_df = flatten(df_copy, column_symbols)
 
@@ -139,6 +197,7 @@ function unnest_longer(df::Union{DataFrame, GroupedDataFrame}, cols; indices_inc
     end
     return flattened_df
 end
+
 
 """
 $docstring_unnest_longer
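
The padding loop added to `unnest_longer` above extends shorter arrays in a row with `missing` so that all unnested columns have equal length before flattening. The sketch below shows the same idea for a single row in plain Julia. `pad_row` is a hypothetical helper for illustration and operates on a vector of cells, not on a DataFrame.

```julia
# Hypothetical helper mirroring the padding step for one row: extend each
# non-missing vector in `cells` with `missing` up to the longest length.
function pad_row(cells::Vector)
    lens = [length(c) for c in cells if !ismissing(c)]
    isempty(lens) && return cells
    maxlen = maximum(lens)
    return map(cells) do c
        # Leave missing cells and already-long-enough vectors untouched.
        (ismissing(c) || length(c) == maxlen) && return c
        vcat(c, fill(missing, maxlen - length(c)))
    end
end

padded = pad_row([[1, 2, 3], ["a"], missing])
# padded[2] is now ["a", missing, missing]; padded[3] stays missing.
```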

test/runtests.jl

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ DocMeta.setdocmeta!(TidierData, :DocTestSetup, :(using TidierData); recursive=tr
 
     doctest(TidierData)
 
-end
+end
 
 using TidierData
 using Test
