Gripes and anti-gripes about VarInfo

While working on #115, I came to realize that we may be too harsh in our criticism of VarInfo's complexity. This issue tries to explain some of the design choices we have in varinfo.jl, which I hope will spiral into a discussion on how to improve things. This issue came into being after a discussion with Hong on Slack regarding #118, which I am still very much in favor of.

Function redundancy is a big concern you may have if you have played around with varinfo.jl for a while. The main one I can think of is `getval` vs `getindex` and `setval!` vs `setindex!`. The way I think about it is that `getval` and `setval!` get and set the vectorized version of the value. So `getval` returns a vector and `setval!` expects a vector, even if the value was originally a scalar or matrix. `getindex` and `setindex!`, on the other hand, should reconstruct the shape of the original variable, and `link`/`invlink` the value based on whether the `VarInfo` is working in terms of the transformed or original values. Currently, `setindex!` doesn't do exactly that, but it was "fixed" in #115 to do exactly the above. So `r = vi[vn]` (or `vi[vn, dist]` in the PR) will always return a value in the domain of `vn`'s distribution with the right shape, and `vi[vn] = r` expects `r` to be in the domain of the distribution and in its natural shape.
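To make the distinction concrete, here is a minimal sketch using a toy container. `ToyVarInfo`, `toy_push!`, `toy_getval` and `toy_setval!` are hypothetical names made up for illustration, not the actual types or functions in varinfo.jl, and the `link`/`invlink` transformation step is left out entirely. It only shows the "flat vector in storage, natural shape at the indexing API" idea described above.

```julia
# Hypothetical toy container, NOT the real VarInfo from varinfo.jl.
struct ToyVarInfo
    vals::Vector{Float64}                    # all values, flattened into one vector
    ranges::Dict{Symbol,UnitRange{Int}}      # where each variable lives in `vals`
    shapes::Dict{Symbol,Tuple{Vararg{Int}}}  # original shape; () marks a scalar
end

ToyVarInfo() = ToyVarInfo(
    Float64[],
    Dict{Symbol,UnitRange{Int}}(),
    Dict{Symbol,Tuple{Vararg{Int}}}(),
)

# Flatten a scalar or array into a plain vector.
toy_flatten(x) = x isa Number ? [float(x)] : vec(float.(x))

# Register a new variable and remember its range and shape.
function toy_push!(vi::ToyVarInfo, name::Symbol, x)
    v = toy_flatten(x)
    start = length(vi.vals) + 1
    append!(vi.vals, v)
    vi.ranges[name] = start:length(vi.vals)
    vi.shapes[name] = x isa Number ? () : size(x)
    return vi
end

# Vectorized access: always a vector in, always a vector out.
toy_getval(vi::ToyVarInfo, name::Symbol) = vi.vals[vi.ranges[name]]

function toy_setval!(vi::ToyVarInfo, v::AbstractVector, name::Symbol)
    vi.vals[vi.ranges[name]] = v
    return vi
end

# Shaped access: rebuild the original scalar/array from the flat storage.
function Base.getindex(vi::ToyVarInfo, name::Symbol)
    v = toy_getval(vi, name)
    shape = vi.shapes[name]
    return isempty(shape) ? v[1] : reshape(v, shape)
end

# Shaped assignment: accept the natural shape and flatten it for storage.
function Base.setindex!(vi::ToyVarInfo, x, name::Symbol)
    toy_setval!(vi, toy_flatten(x), name)
    return vi
end

vi = ToyVarInfo()
toy_push!(vi, :s, 2.0)                  # a scalar variable
toy_push!(vi, :m, [1.0 2.0; 3.0 4.0])   # a matrix variable

flat_s = toy_getval(vi, :s)   # a 1-element vector, even though `s` is a scalar
shaped_m = vi[:m]             # a 2x2 matrix again
vi[:m] = [5.0 6.0; 7.0 8.0]   # natural shape in, flat storage underneath
```

In this toy version both pairs of functions share the same flat storage, which is roughly why I see `getval`/`setval!` and `getindex`/`setindex!` as two layers rather than duplicates.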

Another redundancy that we have comes from the need for `r = vi[vn]` and `vi[vn] = r` but also `r = vi[spl]` and `vi[spl] = r`. The first 2 are used in a model call, while the other 2 are used in the `step!` function. These need to be defined for `UntypedVarInfo` as well as `TypedVarInfo`. But imo not much can be done about the need for those functions. `VarInfo` needs to define those functions one way or another because they are the main API of `VarInfo`.

Other than the above redundancy, we have many getter and setter functions in varinfo.jl. I think each one of those functions is used at least once. For example, we need to be able to update the `gids` of a variable to assign a sampler to it. This is because new variables can pop up in dynamic models, and they need to be assigned to HMC, for example, in the next HMC call. Getters and setters for variable ranges, distributions, logp and flags are all necessary. Other functions like `empty!`, `haskey`, `syms`, `tonamedtuple`, etc. are all important utility functions to have as well.
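As a rough illustration of why a `gids` setter is needed, here is a toy sketch. The `Dict`-based bookkeeping, the integer sampler ids and `toy_assign_unclaimed!` are all assumptions made up for this example and do not mirror how varinfo.jl actually stores or updates `gids`. It assumes that a variable with an empty id set has not been claimed by any sampler yet.

```julia
# Hypothetical bookkeeping: variable name => set of sampler ids that own it.
# An empty set is assumed to mean "new variable, not yet claimed".
toy_gids = Dict{Symbol,Set{Int}}(
    :a => Set([1]),    # already owned by sampler 1
    :b => Set{Int}(),  # popped up during the last model call
    :c => Set{Int}(),  # also new, but outside the sampler's space below
)

# Give every unclaimed variable in `space` to the sampler with id `sampler_id`.
function toy_assign_unclaimed!(gids::Dict{Symbol,Set{Int}}, sampler_id::Int, space)
    for (name, ids) in gids
        if name in space && isempty(ids)
            push!(ids, sampler_id)
        end
    end
    return gids
end

# Before its next step, sampler 2 (say, an HMC sampler over :a and :b)
# picks up the variables in its space that nobody owns yet.
toy_assign_unclaimed!(toy_gids, 2, (:a, :b))
```

After this call, `:b` belongs to sampler 2 while `:a` keeps its original owner and `:c` stays unclaimed. That kind of in-place update is what the setter is for.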

Finally, regarding unit tests, while we are not exactly perfect in that department, our coverage is at 75%. The coverage of varinfo.jl is at 83%. We could definitely do better in terms of splitting Turing and DPPL tests, but completely separating the 2 has proven somewhat difficult thus far. This is mostly because we need a Gibbs-HMC sampler to test the Gibbs-HMC specific components of DPPL, a particle sampler to test the particle sampling specific components of DPPL, etc. Will it be possible one day? I hope so, but this isn't the first heavily interlinked package duo that exists in the Julia ecosystem. Sometimes the need arises for 2 separate development cycles in a heavily interlinked package, resulting in the splitting of the package into 2 heavily interlinked packages with 2 development cycles. It's not ideal, but it's arguably better than combining the 2 packages again mostly for development needs. This is exactly the situation we are in with Turing and DPPL.

Now, back to why I am opening this issue. This isn't just me defending the status quo. I understand there are a few similar issues discussing problems or proposed improvements to VarInfo, e.g. #5, #7, #16, #18 and #68. I am open to criticism if you disagree with anything I said above or think there is a better way to do things but please be specific. For example, here are some questions for you to think about:

1. Which function exactly could use a docstring but doesn't?
2. Which internal function/method exactly can be made redundant?
3. Which exported/API function can be made redundant?
4. Which additional API function would make your life easier when developing Turing?
5. How exactly do you propose to further split Turing and DynamicPPL tests?

@devmotion has brought up more than once his preference to design the `VarInfo` data structure around unvectorized values. I understand the appeal of that, but I don't think it will necessarily simplify the code by a lot. We still need to keep track of distributions, sampler gids and varnames. We still need to get a vectorized form and set a vectorized form of the values for HMC samplers. We still need to cater for variables disappearing and popping up at any time. We still need to specialize the type of VarInfo to cater for mixed variable types and automatic differentiation in a type stable way. All of this would still need to be done. Will it be simpler? Maybe, but I doubt it will be much simpler.

Anyways, this issue grew to be longer than I would have liked, but please let me know what you think, whether you agree or disagree with anything I said above. Thank you.
