Collaborator

@amas0 amas0 commented Jul 22, 2025

Submission Checklist

  • Run unit tests
  • Declare copyright holder and open-source license: see below

Summary

This is an initial PR that addresses parts of #785. It introduces some general utilities for parsing Stan CSV files and uses some of them within CmdStanMCMC.

Some key ideas:

  • To generalize CSV parsing across the different inference methods, this change relies on the fact that Stan CSV output is a sequence of alternating commented and uncommented blocks of lines, each corresponding to a different section. The parser is state-based: a new parsing rule takes over each time the input transitions between a commented and an uncommented section.
    For example, in the standard MCMC output the sections are config, warmup, adaptation, samples, and timing, so one only needs to write parsing code for each section.
  • An optimized parser based on polars takes the list[bytes] of CSV lines that hold the draws and produces a np.array, as a drop-in replacement for the existing parsing.
  • With these utilities, I introduce a StanCsvMCMC dataclass, which represents the parsed output of a Stan CSV file. It includes a dict holding the sampler config (which just uses the existing scan_config function), the warmup draws as a numpy array (if present), the step size and mass matrix (if present), the sampling draws as a numpy array, and the timing information as a dictionary. The idea is that code written against this structured representation will be easier to read and write than code that has to know how the CSV files are laid out.
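
To make the commented/uncommented section idea concrete, here is a minimal sketch of the grouping step. It is an illustration only, not the code in this PR: the function name split_sections and the sample lines are hypothetical, and a real Stan CSV file carries far more config lines.

```python
from __future__ import annotations

from itertools import groupby


def split_sections(lines: list[bytes]) -> list[tuple[bool, list[bytes]]]:
    """Group consecutive lines by whether they are comments (start with b'#').

    Each returned tuple is (is_comment, lines_in_that_section). A new section
    starts every time the input switches between commented and uncommented.
    """
    return [
        (is_comment, list(group))
        for is_comment, group in groupby(lines, key=lambda ln: ln.startswith(b"#"))
    ]


# Hypothetical, heavily shortened MCMC output with saved warmup draws.
example = [
    b"# method = sample",       # config
    b"# num_warmup = 1",
    b"lp__,theta",              # column header, followed by warmup draws
    b"-7.1,0.21",
    b"# Adaptation terminated", # adaptation
    b"# Step size = 0.9",
    b"-6.9,0.25",               # sampling draws
    b"-6.8,0.27",
    b"# Elapsed Time: 0.01 seconds (Total)",  # timing
]

sections = split_sections(example)
for is_comment, section_lines in sections:
    kind = "comment" if is_comment else "data"
    print(kind, len(section_lines))
```

With this grouping, each section can be handed to its own small parsing function, which is the appeal of the approach: the per-method differences reduce to which sections appear and in what order.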

I use these changes to update the CmdStanMCMC._assemble_draws method, which, with the optimized parsing, runs noticeably faster for large models.

Putting this PR as a draft for now, as I want to get some feedback and am planning on writing unit tests to cover the new code, but all existing unit tests are passing on this.

My plan for next steps is to implement a corresponding StanCsv* for the various inference methods (I've already done most of this because the logic is very similar) and then update how each of the Stan fit objects pull info from the CSV files. This will make things more consistent across the inference methods and hopefully clean up some parts of the library.

Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): Myself

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

amas0 added 6 commits July 19, 2025 15:48
This class is intended to serve as an abstraction layer between
the Stan fit objects (like CmdStanMCMC) and the Stan CSV output.
It converts the Stan CSV into a structured format that can be used
to straightforwardly extract relevant info.

The practically important contribution here is the implementation
of a faster samples parser using pandas/pyarrow intended to replace
the pure-Python implementation.

Some of the structure/utilities included in this commit are intended
to clean up logic elsewhere in the library by sharing functionality
and making the processing of Stan CSV files more consistent across
the varying inference methods.
@amas0 amas0 mentioned this pull request Jul 22, 2025
3 tasks

codecov-commenter commented Jul 22, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.85%. Comparing base (650d2bb) to head (62cfc46).
⚠️ Report is 50 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #799      +/-   ##
===========================================
+ Coverage    80.24%   80.85%   +0.60%     
===========================================
  Files           25       25              
  Lines         3878     4011     +133     
===========================================
+ Hits          3112     3243     +131     
- Misses         766      768       +2     

Member

@WardBrian WardBrian left a comment

Thanks @amas0. The code looks really good.

My comments below are thoughts on a slightly different style that I think will make it more reusable across the different inference algorithms and try to consolidate the logic for each in one place. Let me know what you think!

@amas0 amas0 changed the title from "Implementation of faster polars-based CSV parsing and structured Stan CSV output" to "Implementation of faster MCMC CSV parsing and Stan CSV utilities" Jul 26, 2025
Collaborator Author

amas0 commented Jul 26, 2025

Okay, I think this PR should be good to go for more review/feedback. In response to the previous comments, I've made the following changes from the last version:

  • Made polars an optional dependency, with fallback parsing done by numpy
  • Added unit tests against the new utils/stancsv functions (and caught some bugs)
  • Removed the ParsingRules-based parser in favor of a simpler structure that splits a CSV into comment lines and draws lines
  • Comment-section parsing functions, such as the timing and adaptation parsers, now expect to be given the full list of comment lines
  • Removed StanCsvMCMC in favor of using the utility functions within the CmdStanMCMC._assemble_draws() method
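
For readers following along, the simpler structure described in this list can be sketched roughly as below. This is a standard-library illustration under stated assumptions, not the functions added in utils/stancsv: the names split_csv_lines and parse_timing are hypothetical, the timing line format is assumed from typical CmdStan output, and the draws are converted with plain Python rather than numpy or polars.

```python
from __future__ import annotations

import re


def split_csv_lines(
    lines: list[bytes],
) -> tuple[list[bytes], bytes | None, list[bytes]]:
    """Split raw CSV lines into (comment lines, header line, draws lines)."""
    comments: list[bytes] = []
    header: bytes | None = None
    draws: list[bytes] = []
    for ln in lines:
        if ln.startswith(b"#"):
            comments.append(ln)
        elif not ln.strip():
            continue  # skip blank lines
        elif header is None:
            header = ln  # first uncommented line holds the column names
        else:
            draws.append(ln)
    return comments, header, draws


# Assumed timing format: "<number> seconds (<label>)" inside a comment line.
_TIMING = re.compile(rb"([0-9]+(?:\.[0-9]+)?)\s+seconds\s+\(([^)]+)\)")


def parse_timing(comment_lines: list[bytes]) -> dict[str, float]:
    """Scan the full list of comment lines and pick out the timing entries."""
    timing: dict[str, float] = {}
    for ln in comment_lines:
        match = _TIMING.search(ln)
        if match:
            timing[match.group(2).decode()] = float(match.group(1))
    return timing


# Hypothetical, heavily shortened output.
raw = [
    b"# method = sample",
    b"lp__,theta",
    b"# Adaptation terminated",
    b"# Step size = 0.9",
    b"-6.9,0.25",
    b"-6.8,0.27",
    b"# ",
    b"#  Elapsed Time: 0.007 seconds (Warm-up)",
    b"#                0.017 seconds (Sampling)",
    b"#                0.024 seconds (Total)",
]

comments, header, draws = split_csv_lines(raw)
timing = parse_timing(comments)
# Pure-Python stand-in for the numpy/polars draws parser.
values = [[float(x) for x in ln.decode().split(",")] for ln in draws]
```

Passing every comment line to each comment parser keeps those functions independent of where a section sits in the file, at the cost of each one scanning lines it does not care about.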

This version is simpler and cleaner than the previous. I still think there's probably some cleanup that can happen in the CmdStanMCMC._assemble_draws() method, but at least I think it's improved compared to the existing version.

I ran one last bit of performance testing comparing the new numpy/polars based parsing against the existing method:

[Image: benchmark plot comparing the new numpy/polars-based parsing with the existing method]

This is with the defaults of 4 chains, 1000 draws per chain. I didn't stretch the test out to the extremes of number of parameters, but for moderate to large model sizes, the speedup should be noticeable.

@amas0 amas0 marked this pull request as ready for review July 26, 2025 21:48
Member

@WardBrian WardBrian left a comment

This is looking really good to me now, thanks!

Don't let the number of comments here be discouraging, most are very small changes. Structurally I think this is really clean now :)

@amas0 amas0 force-pushed the faster-mcmc-csv-parsing branch from f332ce6 to 93bcee7 Compare July 28, 2025 21:35
Collaborator Author

amas0 commented Jul 30, 2025

Okay @WardBrian, I think that's everything addressed? Thanks for the feedback, definitely improved things in a few areas.

Let me know if there's anything else that stands out from the additions. All tests look good (and I added some new ones).

Member

@WardBrian WardBrian left a comment

Thanks @amas0!

I have two small questions before merging, but this looks really great.

@WardBrian WardBrian merged commit 5383424 into stan-dev:develop Jul 31, 2025
16 checks passed
@WardBrian
Member

Thanks again @amas0!

Looking forward to these utilities spreading to the other methods as well
