- 
          
 - 
                Notifications
    
You must be signed in to change notification settings  - Fork 12
 
Add Pareto shape plot #356
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if the new helpers here should be in the public API or not. Not too sure how frequently these will be used outside of the Pareto shape plot.
          Codecov Report❌ Patch coverage is  
 Additional details and impacted files@@            Coverage Diff             @@
##             main     #356      +/-   ##
==========================================
- Coverage   86.14%   85.30%   -0.85%     
==========================================
  Files          54       55       +1     
  Lines        6078     6361     +283     
==========================================
+ Hits         5236     5426     +190     
- Misses        842      935      +93     ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to leave now, will continue the review tomorrow hopefully it is somewhat helpful already
| if isinstance(color, str) and color in distribution.dims: | ||
| pc_kwargs["aes"].setdefault("color", [color]) | ||
| color = None | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will check other plots to see how we handle similar situations, I don't think we want to use setdefault here as it would mean pc_kwargs takes priority over color argument
          
 Very useful, thank you! I will need to re-work somethings you've mentioned already and some that @aloctavodia mentioned as well over slack. Hoping to get these changes in early next week!  | 
    
| return plot_collection | ||
| 
               | 
          ||
| 
               | 
          ||
| def format_coords_as_labels(data, skip_dims=None): | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the same formatting util from legacy ArviZ. Do we need to adpat this in any way for arviz-plots? This seems to work as it is now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think ideally we'd take a labeller argument and use sel_to_str (I think it would be the adequate method for this particular case). That will allow users to configure the generated labels a bit (for instance with substitutions, using indexes instead/in addition to coordinate values) and I think the default case will stay the same.
| return np.array([f"{s}" for s in coord_labels], dtype=object) | ||
| 
               | 
          ||
| 
               | 
          ||
| def calculate_khat_bin_edges(values, thresholds, tolerance=1e-9): | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this live in arviz-stats since it's computational? Could go here if we need to move it https://github.com/arviz-devs/arviz-stats/blob/main/src/arviz_stats/utils.py.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is ok to leave here unless we are basically using the same thing already to generate the repr of the ELPDData class in which case it would be nice to use the same thing in both places.
| def hline(y, target, *, color=unset, alpha=unset, width=unset, linestyle=unset, **artist_kws): | ||
| """Interface to matplotlib for a horizontal line spanning the whole axes.""" | ||
| artist_kws.setdefault("zorder", 0) | ||
| artist_kws.setdefault("zorder", 3) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made this change to place the horizontal lines in front and khat values in the back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should set zorder to 2 in order to define what goes on top of what with the plotting order. We must have missed that zorder when reviewing the addition of hline, we do have a note about it: https://github.com/arviz-devs/arviz-plots/blob/main/src/arviz_plots/backend/matplotlib/core.py#L4. Otherwise this will only work for matplotlib, and you can see in the preview that as the lines are plotted last, bokeh and plotly already render them on top: https://arviz-plots--356.org.readthedocs.build/en/356/gallery/plot_khat.html
As an example of the opposite behaviour, when we add the the alternate row shading in plot_forest we plot the gray stripes first, then all the other elements so the stripes end up at the back in all backends.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit unsure about the things we'd want to support and the things we don't want to allow. There are several things we could integrate more with plotcollection machinery which would give more flexibility to the plot, but plotcollection also usually gives more flexibility than what is really needed which can end up meaning more work and less clear behaviour.
| default_color = khat_kwargs.get("color", color) | ||
| 
               | 
          ||
| if default_color is None and "color" not in khat_aes: | ||
| default_color = "C0" | ||
| 
               | 
          ||
| if "color" not in khat_aes and default_color is not None: | ||
| khat_kwargs.setdefault("color", default_color) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| default_color = khat_kwargs.get("color", color) | |
| if default_color is None and "color" not in khat_aes: | |
| default_color = "C0" | |
| if "color" not in khat_aes and default_color is not None: | |
| khat_kwargs.setdefault("color", default_color) | |
| if "color" not in khat_aes: | |
| khat_kwargs.setdefault("color", "C0") | 
I think this is equivalent with the exception of visuals={"khat": {"color": None}} which I think is ok to let it fail
| show_hlines=False, | ||
| show_bins=False, | ||
| hover_label=False, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not completely sure if we want to keep these as top level arguments or defer to visuals={"xyz": True/False} to activate or deactivate them
| for idx, value in enumerate(hline_values): | ||
| h_kwargs = hlines_kwargs.copy() | ||
| if "linestyle" not in hlines_aes: | ||
| h_kwargs.setdefault("linestyle", f"C{idx}") | ||
| if "color" not in hlines_aes: | ||
| h_kwargs.setdefault("color", f"C{idx + 1}") | ||
| if "alpha" not in hlines_aes: | ||
| h_kwargs.setdefault("alpha", 0.7) | ||
| 
               | 
          ||
| h_ds = xr.Dataset({"pareto_k": xr.DataArray(value)}) | ||
| plot_collection.map( | ||
| hline, | ||
| f"hline_{idx}", | ||
| data=h_ds, | ||
| ignore_aes="all", | ||
| **h_kwargs, | ||
| ) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should define a dataarray/dataset with a bin or hline related dimension (kind of like a concatenated version of the h_ds ones). Then add an overlay aesthetic to it. Otherwise if we manually loop here I don't see how it would be possible to add an aesthetic mapping to the lines.
Another option is deciding we won't allow aesthetic mappings for this element in which case we can simplify the aes_by_visual dict and docs as well as all the in hline_aes checks because we do ignore_aes="all"
| **h_kwargs, | ||
| ) | ||
| 
               | 
          ||
| if show_bins: | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar idea here.
Example edge case which might help us think if we want to do this or hardcode things to be always off. We can have multidimensional khat dataarray. Take for example the rugby example but instead of defining two observational variables for points home/away we stack that and end up with a (match, field) array for observations, khats, posterior predictive...
If we integrate all the binning with the plotcollection machinery we could do a cols/rows=["field"] to split the plot in two, share the xaxis between the two plots of the figure in order to easily see if the bad khats for the away subset happen at the same matches they do for the home subset. I think in a case like that each plot would have its own independent binning which should not be difficult if using our xarray based histogram along with filter_aes helper to get the dimensions to reduce
Closes #353
📚 Documentation preview 📚: https://arviz-plots--356.org.readthedocs.build/en/356/