Add support for KeyedArrays as a Tables sink #21

rofinn · 2020-09-15T02:20:28Z

We have two main use cases for this:

Wanting to construct KeyedArrays from adhoc DataFrames
Wanting to populate a pre-allocated Array from LibPQ.jl results. For example, we know how many timestamps, locations, ids, etc we expect to get back from our DB.

We've been doing this internally for AxisArrays for a while, but never got around to open sourcing it.

src/tables.jl

mcabbott

Looks good, thanks for the PR!

I scribbled a few comments, but haven't actually tried this yet. One general one is: does it behave OK with bad input? Mentioning a non-existent column name probably gives a sensible error already.

populate! will also silently over-write data. I wonder whether it can (or should) give some warning. For instance could it keep a BitSet with LinearIndex(I) as it goes, perhaps that would not cost much speed?

src/tables.jl

rofinn · 2020-09-15T15:51:59Z

does it behave OK with bad input? Mentioning a non-existent column name probably gives a sensible error already.

Yeah, most of the errors fallback on whatever the table chooses, which is why I left those tests out. For example, with a columntable providing incorrect columns names will throw a field error and dataframes will throw an ArgumentError. I could add a couple tests for those specific examples for documentation purposes if you want though?

populate! will also silently over-write data. I wonder whether it can (or should) give some warning. For instance could it keep a BitSet with LinearIndex(I) as it goes, perhaps that would not cost much speed?

That's correct. My only concern was how best to report the warning without spamming the user. Here the options I can think of:

Always overwrite and document it which is easy and inexpensive
Error on overwrite unless overwrite is true. We can include the info about the index in the error. Since we aren't throwing a warning this won't spam the user.
Warn on every overwrite condition, but this will be noisy
Warn only once, but then we either need to save information about which indices were overwritten or provide a minimal summary statement.

We could also probably do a combination of 2 and 4 if you like?

rofinn · 2020-09-15T17:34:08Z

Okay, looks like doing options (2) and (4) with a mask has minimal impact on performance. The code is just a bit uglier now.

julia> @benchmark AxisKeys.populate!($C, $df, :value)
BenchmarkTools.Trial:
  memory estimate:  16.25 MiB
  allocs estimate:  601857
  --------------
  minimum time:     29.026 ms (0.00% GC)
  median time:      30.158 ms (0.00% GC)
  mean time:        32.022 ms (5.83% GC)
  maximum time:     45.322 ms (12.11% GC)
  --------------
  samples:          157
  evals/sample:     1

julia> @benchmark populate2!($C, $df, :value)
BenchmarkTools.Trial:
  memory estimate:  19.63 MiB
  allocs estimate:  581729
  --------------
  minimum time:     27.898 ms (0.00% GC)
  median time:      30.080 ms (0.00% GC)
  mean time:        31.350 ms (6.97% GC)
  maximum time:     40.393 ms (15.46% GC)
  --------------
  samples:          160
  evals/sample:     1

mcabbott · 2020-09-15T17:38:48Z

Re errors that sounds fine, just as long as it doesn't return unhelpful "can't iterate Nothing" things too easily.

Re overwriting, maybe I'm over-thinking this, and should regard it as conceptually A[inds] .= B which is quite happy to silently overwrite... in which case 1? I guess that seems less surprising for populate! than for the constructor. But agree that 1 or 2 seem safest, rather than noise & warnings. Perhaps I leave this to your judgement, since you seem to have a use case, and revisit if anyone else trips over it.

nalimilan

Sounds cool! How did you choose the names wrapdims though? I wouldn't have guess what it does. :-) Couldn't we find something more explicit?

FWIW I agree that it would be better to throw an error instead of printing warnings by default: either you're OK with duplicates and you should pass force=true, or you're not and a warning isn't really a safe way of signalling the problem.

src/tables.jl

mcabbott · 2020-09-16T13:06:09Z

How did you choose the names wrapdims though? I wouldn't have guess what it does. :-) Couldn't we find something more explicit?

It was wrap but I thought that was too likely to collide... any suggestions? It's a bit of a sloppy do-what-I-mean function, which adds one or two wrappers, adjusts lengths of ranges & tries to sort out offset indexing, and will also try to digest AxisArrays etc. Leaving the actual type constructor to do exactly what it's told, else error. Any suggestions?

Could also mention that Type refers to the return type. In general in Julia that kind of argument goes first BTW.

It's the key-vector type, such as UniqueVector, not the overall. But going first might indeed be better; there are methods which accept arrays not tables which could change to match.

nalimilan · 2020-09-16T13:22:47Z

Actually I hadn't realized wrapdims already exists. So AFAICT it's a kind of constructor, but which handles wrapping the input into multiple layers of types if needed. My first idea would be to use the KeyedArray constructor for that. The case where people provide only dimension names to get a NamedDims array could be handled by the NamedDims constructor. At least in my mind these are quite different operations.

mcabbott · 2020-09-16T13:30:06Z

The basic constructor allows KeyedArray(M, (kv1, kv2)) and KeyedArray(M, n1=kv1, n2=kv2), with the latter making a KeyedArray(NamedDimsArray(... pair. What wrapdims adds is things like wrapdims(A, let='a':'z') will adjust the range for you, wrapdims(A, UniqueVector, kv1, kv2, kv3) will wrap the key-vectors first, and wrapdims(::NamedArray) will convert this (or at least, it will try if the fieldnames match).

nalimilan · 2020-09-16T13:36:25Z

OK. Is there any reason why these methods couldn't be KeyedArrays constructors? The automatic adjustment of the range sounds like a perfect match for a keyword argument, and others can work with multiple dispatch, right?

mcabbott · 2020-09-16T13:47:53Z

At the moment these aren't type-stable, it's possible that the first two could be made so:

using AxisKeys, OffsetArrays
@code_warntype wrapdims(rand(3), alpha='a':'z')
@code_warntype wrapdims(zeros(-2:2), 10:10:50)

using NamedArrays
@code_warntype wrapdims(NamedArray(rand(3)))

Right now they are handy things for humans, but I think getting the length / indices wrong deep inside some function should be an error, not silently corrected.

The last is a convenient hack, so as not to need Requires... although in fact NamedDims uses Requires so it will always around, maybe there's no penalty, not sure.

rofinn · 2020-09-16T16:04:09Z

@nalimilan I think discussing whether wrapdims should be called something else should maybe be left to a separate issue/PR? As it stands, I agree with @mcabbott that this extra functionality best fits the current definitions of wrapdims. That being said, maybe a middle ground would be to rename wrapdims to keyedarray similar to string(...) vs String(...) and tuple(...) vs Tuple(...)?

mcabbott · 2020-09-16T22:24:42Z

The lowercase version is a clever idea for a name. But I agree that's orthogonal, sorry.

About this PR, I think I agree with @nalimilan that a choice between overwrite & error seems like enough. Like @inbounds. The error message can still be informative. Perhaps in the end my vote would be for this:

KeyedArray(table, value, columns) is safe, errors if you pick too few columns. (Maybe reshape(A, dims) is the model, it's an error if the container is too small, but will never eat data.)
populate!(A, table, value) is unsafe, like A[inds] .= vals.
Both take a keyword, maybe force, which lets you choose.

rofinn · 2020-09-17T15:52:11Z

Okay, I've (1) updated the docstring with @nalimilan's suggestions and (2) simplified the duplicates logic to either error or not error based on the force flagged.

rofinn · 2020-09-21T19:10:28Z

Is there anything else that needs to be done on this or are we just waiting on #24 so test pass again?

mcabbott · 2020-10-04T08:38:02Z

Sorry I dropped the ball here.

I meant something slightly different about types & wrapdims, I tried to edit this branch but since I'm hopeless at git it looks like I'm making a PR to your PR... but see what you think?

It depends on a global switch to select which wrapper it outermost. This is mostly so that the tests can be run in both orders.
It accepts as an optional 2nd argument a type which then wraps the individual key-vectors. (Maybe that 2nd place, and the name, should be re-thought.)
It's not type-stable, which I think is another argument for this being a method of the sloppy wrapdims not the tight KeyedArray constructor.

rofinn · 2020-10-04T15:58:02Z

No worries. I took a look at your PR and it seems fine. Are you okay if I merge that now? I'm happy that it doesn't really change the API for my use case and it does seem more consistent with the rest of the package after reviewing a few other wrapdims functions. I also definitely agree that this should not be a KeyedArray constructor 👍 I could see a future PR adding a keyedarray/nameddimsarray function as an alias to wrapdims calls that take the output type as an argument.

mcabbott · 2020-10-04T16:15:54Z

Yes, sounds good.

Agree that re-organising wrapdims can happen later.

PR to PR#21

rofinn commented Sep 15, 2020

View reviewed changes

src/tables.jl Outdated Show resolved Hide resolved

mcabbott reviewed Sep 15, 2020

View reviewed changes

nalimilan reviewed Sep 16, 2020

View reviewed changes

src/tables.jl Outdated Show resolved Hide resolved

rofinn added 3 commits September 22, 2020 12:04

Add support for KeyedArrays as a Tables sink.

e3c37ea

Updates from code review

5b5ce44

Address Milan's comments/suggestions.

f15618e

rofinn force-pushed the rf/tables branch from c99478d to f15618e Compare September 22, 2020 17:05

edits to PR... see if this works

23fafbd

mcabbott mentioned this pull request Oct 4, 2020

PR to PR#21 invenia/AxisKeys.jl#1

Merged

Michael Abbott added 2 commits October 4, 2020 10:58

add readme note

a5c4021

tweaks

ab14395

Merge pull request #1 from mcabbott/pull21

877cc80

PR to PR#21

mcabbott merged commit ecfa122 into mcabbott:master Oct 4, 2020

mcabbott mentioned this pull request Nov 14, 2021

Cannot create KeyedArray from some tables #96

Open

Add support for KeyedArrays as a Tables sink #21

Add support for KeyedArrays as a Tables sink #21

Uh oh!

Conversation

rofinn commented Sep 15, 2020

Uh oh!

Uh oh!

mcabbott left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rofinn commented Sep 15, 2020

Uh oh!

rofinn commented Sep 15, 2020

Uh oh!

mcabbott commented Sep 15, 2020

Uh oh!

nalimilan left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mcabbott commented Sep 16, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nalimilan commented Sep 16, 2020

Uh oh!

mcabbott commented Sep 16, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nalimilan commented Sep 16, 2020

Uh oh!

mcabbott commented Sep 16, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rofinn commented Sep 16, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mcabbott commented Sep 16, 2020

Uh oh!

rofinn commented Sep 17, 2020

Uh oh!

rofinn commented Sep 21, 2020

Uh oh!

mcabbott commented Oct 4, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rofinn commented Oct 4, 2020

Uh oh!

mcabbott commented Oct 4, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

mcabbott commented Sep 16, 2020 •

edited

Loading

mcabbott commented Sep 16, 2020 •

edited

Loading

mcabbott commented Sep 16, 2020 •

edited

Loading

rofinn commented Sep 16, 2020 •

edited

Loading

mcabbott commented Oct 4, 2020 •

edited

Loading