-
-
Notifications
You must be signed in to change notification settings - Fork 78
Implementation of faster MCMC CSV parsing and Stan CSV utilities #799
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This class is intended to serve as an abstraction layer between the Stan fit objects (like CmdStanMCMC) and the Stan CSV output. It converts the Stan CSV into structured format that can be used to straightforwardly extract relevant info. The practically important contribution here is the implementation of a faster samples parser using pandas/pyarrow intended to replace the pure-Python implementation. Some of the structure/utilties included in this commit are intended to clean up logic elsewhere in the library by sharing functionality and making the processing of Stan CSV files more consistent across the varying inference methods.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #799 +/- ##
===========================================
+ Coverage 80.24% 80.85% +0.60%
===========================================
Files 25 25
Lines 3878 4011 +133
===========================================
+ Hits 3112 3243 +131
- Misses 766 768 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
WardBrian
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @amas0. The code looks really good.
My comments below are thoughts on a slightly different style that I think will make it more reusable across the different inference algorithms and try to consolidate the logic for each in one place. Let me know what you think!
WardBrian
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking really good to me now, thanks!
Don't let the number of comments here be discouraging, most are very small changes. Structurally I think this is really clean now :)
f332ce6 to
93bcee7
Compare
|
Okay @WardBrian, I think that's everything addressed? Thanks for the feedback, definitely improved things in a few areas. Let me know if there's anything else that stands out from the additions. All test look good (and I added some new ones). |
WardBrian
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @amas0!
I have two small questions before merging, but this looks really great
|
Thanks again @amas0! Looking forward to these utilities spreading to the other methods as well |

Submission Checklist
Summary
This is in an initial PR that addresses parts of #785. This introduces some general utilities for parsing Stan CSV files and uses some of those changes within
CmdStanMCMC.Some key ideas:
For example in the standard MCMC output, the sections are
config,warmup,adaptation,samples, andtiming. Then one only needs to write some parsing code for each section.polarsis implemented that takes inlist[bytes]corresponding to the lines in the CSV outputs that correspond to draws and produces anp.arrayas a drop in replacement for existing parsing.StanCsvMCMCdataclass which represents the parsed output of a Stan CSV file. It includes a dict representing the sampler config (which just uses the existingscan_configfunction), the warmup draws as a numpy array (if present), the step size and mass matrix (if present), the sampling draws as a numpy array, and the timing information as a dictionary. The idea being that it will be easier to write and read code with this structured representation than have to know how the CSV files are structured.I use these changes to update the
CmdStanMCMC._assemble_drawsmethod, which, with the optimized parsing, runs noticeably faster for large models.Putting this PR as a draft for now, as I want to get some feedback and am planning on writing unit tests to cover the new code, but all existing unit tests are passing on this.
My plan for next steps is to implement a corresponding
StanCsv*for the various inference methods (I've already done most of this because the logic is very similar) and then update how each of the Stan fit objects pull info from the CSV files. This will make things more consistent across the inference methods and hopefully clean up some parts of the library.Copyright and Licensing
Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): Myself
By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses: