Skip to content

Conversation

rofinn
Copy link
Collaborator

@rofinn rofinn commented Sep 15, 2020

We have two main use cases for this:

  1. Wanting to construct KeyedArrays from adhoc DataFrames
  2. Wanting to populate a pre-allocated Array from LibPQ.jl results. For example, we know how many timestamps, locations, ids, etc we expect to get back from our DB.

We've been doing this internally for AxisArrays for a while, but never got around to open sourcing it.

Copy link
Owner

@mcabbott mcabbott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks for the PR!

I scribbled a few comments, but haven't actually tried this yet. One general one is: does it behave OK with bad input? Mentioning a non-existent column name probably gives a sensible error already.

populate! will also silently over-write data. I wonder whether it can (or should) give some warning. For instance could it keep a BitSet with LinearIndex(I) as it goes, perhaps that would not cost much speed?

@rofinn
Copy link
Collaborator Author

rofinn commented Sep 15, 2020

does it behave OK with bad input? Mentioning a non-existent column name probably gives a sensible error already.

Yeah, most of the errors fallback on whatever the table chooses, which is why I left those tests out. For example, with a columntable providing incorrect columns names will throw a field error and dataframes will throw an ArgumentError. I could add a couple tests for those specific examples for documentation purposes if you want though?

populate! will also silently over-write data. I wonder whether it can (or should) give some warning. For instance could it keep a BitSet with LinearIndex(I) as it goes, perhaps that would not cost much speed?

That's correct. My only concern was how best to report the warning without spamming the user. Here the options I can think of:

  1. Always overwrite and document it which is easy and inexpensive
  2. Error on overwrite unless overwrite is true. We can include the info about the index in the error. Since we aren't throwing a warning this won't spam the user.
  3. Warn on every overwrite condition, but this will be noisy
  4. Warn only once, but then we either need to save information about which indices were overwritten or provide a minimal summary statement.

We could also probably do a combination of 2 and 4 if you like?

@rofinn
Copy link
Collaborator Author

rofinn commented Sep 15, 2020

Okay, looks like doing options (2) and (4) with a mask has minimal impact on performance. The code is just a bit uglier now.

julia> @benchmark AxisKeys.populate!($C, $df, :value)
BenchmarkTools.Trial:
  memory estimate:  16.25 MiB
  allocs estimate:  601857
  --------------
  minimum time:     29.026 ms (0.00% GC)
  median time:      30.158 ms (0.00% GC)
  mean time:        32.022 ms (5.83% GC)
  maximum time:     45.322 ms (12.11% GC)
  --------------
  samples:          157
  evals/sample:     1

julia> @benchmark populate2!($C, $df, :value)
BenchmarkTools.Trial:
  memory estimate:  19.63 MiB
  allocs estimate:  581729
  --------------
  minimum time:     27.898 ms (0.00% GC)
  median time:      30.080 ms (0.00% GC)
  mean time:        31.350 ms (6.97% GC)
  maximum time:     40.393 ms (15.46% GC)
  --------------
  samples:          160
  evals/sample:     1

@mcabbott
Copy link
Owner

Re errors that sounds fine, just as long as it doesn't return unhelpful "can't iterate Nothing" things too easily.

Re overwriting, maybe I'm over-thinking this, and should regard it as conceptually A[inds] .= B which is quite happy to silently overwrite... in which case 1? I guess that seems less surprising for populate! than for the constructor. But agree that 1 or 2 seem safest, rather than noise & warnings. Perhaps I leave this to your judgement, since you seem to have a use case, and revisit if anyone else trips over it.

Copy link

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds cool! How did you choose the names wrapdims though? I wouldn't have guess what it does. :-) Couldn't we find something more explicit?

FWIW I agree that it would be better to throw an error instead of printing warnings by default: either you're OK with duplicates and you should pass force=true, or you're not and a warning isn't really a safe way of signalling the problem.

@mcabbott
Copy link
Owner

mcabbott commented Sep 16, 2020

How did you choose the names wrapdims though? I wouldn't have guess what it does. :-) Couldn't we find something more explicit?

It was wrap but I thought that was too likely to collide... any suggestions? It's a bit of a sloppy do-what-I-mean function, which adds one or two wrappers, adjusts lengths of ranges & tries to sort out offset indexing, and will also try to digest AxisArrays etc. Leaving the actual type constructor to do exactly what it's told, else error. Any suggestions?

Could also mention that Type refers to the return type. In general in Julia that kind of argument goes first BTW.

It's the key-vector type, such as UniqueVector, not the overall. But going first might indeed be better; there are methods which accept arrays not tables which could change to match.

@nalimilan
Copy link

Actually I hadn't realized wrapdims already exists. So AFAICT it's a kind of constructor, but which handles wrapping the input into multiple layers of types if needed. My first idea would be to use the KeyedArray constructor for that. The case where people provide only dimension names to get a NamedDims array could be handled by the NamedDims constructor. At least in my mind these are quite different operations.

@mcabbott
Copy link
Owner

mcabbott commented Sep 16, 2020

The basic constructor allows KeyedArray(M, (kv1, kv2)) and KeyedArray(M, n1=kv1, n2=kv2), with the latter making a KeyedArray(NamedDimsArray(... pair. What wrapdims adds is things like wrapdims(A, let='a':'z') will adjust the range for you, wrapdims(A, UniqueVector, kv1, kv2, kv3) will wrap the key-vectors first, and wrapdims(::NamedArray) will convert this (or at least, it will try if the fieldnames match).

@nalimilan
Copy link

OK. Is there any reason why these methods couldn't be KeyedArrays constructors? The automatic adjustment of the range sounds like a perfect match for a keyword argument, and others can work with multiple dispatch, right?

@mcabbott
Copy link
Owner

mcabbott commented Sep 16, 2020

At the moment these aren't type-stable, it's possible that the first two could be made so:

using AxisKeys, OffsetArrays
@code_warntype wrapdims(rand(3), alpha='a':'z')
@code_warntype wrapdims(zeros(-2:2), 10:10:50)

using NamedArrays
@code_warntype wrapdims(NamedArray(rand(3)))

Right now they are handy things for humans, but I think getting the length / indices wrong deep inside some function should be an error, not silently corrected.

The last is a convenient hack, so as not to need Requires... although in fact NamedDims uses Requires so it will always around, maybe there's no penalty, not sure.

@rofinn
Copy link
Collaborator Author

rofinn commented Sep 16, 2020

@nalimilan I think discussing whether wrapdims should be called something else should maybe be left to a separate issue/PR? As it stands, I agree with @mcabbott that this extra functionality best fits the current definitions of wrapdims. That being said, maybe a middle ground would be to rename wrapdims to keyedarray similar to string(...) vs String(...) and tuple(...) vs Tuple(...)?

@mcabbott
Copy link
Owner

The lowercase version is a clever idea for a name. But I agree that's orthogonal, sorry.

About this PR, I think I agree with @nalimilan that a choice between overwrite & error seems like enough. Like @inbounds. The error message can still be informative. Perhaps in the end my vote would be for this:

  • KeyedArray(table, value, columns) is safe, errors if you pick too few columns. (Maybe reshape(A, dims) is the model, it's an error if the container is too small, but will never eat data.)
  • populate!(A, table, value) is unsafe, like A[inds] .= vals.
  • Both take a keyword, maybe force, which lets you choose.

@rofinn
Copy link
Collaborator Author

rofinn commented Sep 17, 2020

Okay, I've (1) updated the docstring with @nalimilan's suggestions and (2) simplified the duplicates logic to either error or not error based on the force flagged.

@rofinn
Copy link
Collaborator Author

rofinn commented Sep 21, 2020

Is there anything else that needs to be done on this or are we just waiting on #24 so test pass again?

@mcabbott
Copy link
Owner

mcabbott commented Oct 4, 2020

Sorry I dropped the ball here.

I meant something slightly different about types & wrapdims, I tried to edit this branch but since I'm hopeless at git it looks like I'm making a PR to your PR... but see what you think?

  • It depends on a global switch to select which wrapper it outermost. This is mostly so that the tests can be run in both orders.
  • It accepts as an optional 2nd argument a type which then wraps the individual key-vectors. (Maybe that 2nd place, and the name, should be re-thought.)
  • It's not type-stable, which I think is another argument for this being a method of the sloppy wrapdims not the tight KeyedArray constructor.

@rofinn
Copy link
Collaborator Author

rofinn commented Oct 4, 2020

No worries. I took a look at your PR and it seems fine. Are you okay if I merge that now? I'm happy that it doesn't really change the API for my use case and it does seem more consistent with the rest of the package after reviewing a few other wrapdims functions. I also definitely agree that this should not be a KeyedArray constructor 👍 I could see a future PR adding a keyedarray/nameddimsarray function as an alias to wrapdims calls that take the output type as an argument.

@mcabbott
Copy link
Owner

mcabbott commented Oct 4, 2020

Yes, sounds good.

Agree that re-organising wrapdims can happen later.

@mcabbott mcabbott merged commit ecfa122 into mcabbott:master Oct 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants