Implementation of faster MCMC CSV parsing and Stan CSV utilities #799

amas0 · 2025-07-22T22:06:17Z

Submission Checklist

Run unit tests
Declare copyright holder and open-source license: see below

Summary

This is an initial PR that addresses parts of #785. It introduces some general utilities for parsing Stan CSV files and uses some of them within CmdStanMCMC.

Some key ideas:

In an attempt to generalize some of the CSV parsing across different inference methods, this change uses the fact that the CSV outputs are made up of a sequence of commented/uncommented blocks of lines that correspond to different sections. The parser logic used here is state-based: a new parsing rule is applied each time a transition occurs between a commented and an uncommented section.
For example, in the standard MCMC output, the sections are config, warmup, adaptation, samples, and timing. One then only needs to write some parsing code for each section.
An optimized parser based on polars is implemented that takes a list[bytes] holding the draw lines of the CSV output and produces a np.array as a drop-in replacement for the existing parsing.
With these utilities, I introduce a StanCsvMCMC dataclass which represents the parsed output of a Stan CSV file. It includes a dict representing the sampler config (which just uses the existing scan_config function), the warmup draws as a numpy array (if present), the step size and mass matrix (if present), the sampling draws as a numpy array, and the timing information as a dictionary. The idea is that it will be easier to write and read code against this structured representation than to have to know how the CSV files are laid out.
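To illustrate the section idea, here is a minimal sketch that groups consecutive lines by whether they are commented. It is not the parser from this PR: the function name `split_sections` and the abbreviated sample lines are assumptions made for the example.

```python
from itertools import groupby
from typing import Iterable, List, Tuple


def split_sections(lines: Iterable[bytes]) -> List[Tuple[bool, List[bytes]]]:
    """Group consecutive lines into (is_comment, lines) sections.

    A new section starts each time the input switches between
    commented (leading '#') and uncommented lines.
    """
    return [
        (is_comment, list(group))
        for is_comment, group in groupby(
            lines, key=lambda ln: ln.startswith(b"#")
        )
    ]


# Hypothetical, heavily abbreviated sampler output (not real CmdStan output).
example = [
    b"# method = sample\n",
    b"lp__,theta\n",
    b"# Adaptation terminated\n",
    b"# Step size = 0.9\n",
    b"-7.1,0.25\n",
    b"-6.9,0.31\n",
    b"# Elapsed Time: 0.01 seconds (Total)\n",
]

sections = split_sections(example)
for is_comment, block in sections:
    print("comment" if is_comment else "data", len(block))
```

Each section can then be handed to its own parsing function, which is the property the state-based parser relies on.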

I use these changes to update the CmdStanMCMC._assemble_draws method, which, with the optimized parsing, runs noticeably faster for large models.

Putting this PR as a draft for now, as I want to get some feedback and am planning on writing unit tests to cover the new code, but all existing unit tests are passing on this.

My plan for next steps is to implement a corresponding StanCsv* class for each of the inference methods (I've already done most of this because the logic is very similar) and then update how each of the Stan fit objects pulls info from the CSV files. This will make things more consistent across the inference methods and hopefully clean up some parts of the library.

Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): Myself

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)

This class is intended to serve as an abstraction layer between the Stan fit objects (like CmdStanMCMC) and the Stan CSV output. It converts the Stan CSV into a structured format that can be used to straightforwardly extract relevant info. The practically important contribution here is the implementation of a faster samples parser using pandas/pyarrow, intended to replace the pure-Python implementation. Some of the structure/utilities included in this commit are intended to clean up logic elsewhere in the library by sharing functionality and making the processing of Stan CSV files more consistent across the varying inference methods.
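As a rough illustration of what such a structured representation might look like, the sketch below defines a small dataclass. The class name `ParsedMCMCCsv` and its field names are hypothetical stand-ins chosen for this example. They are not the actual StanCsvMCMC definition from the PR.

```python
from dataclasses import dataclass
from typing import Dict, Optional

import numpy as np


@dataclass
class ParsedMCMCCsv:  # hypothetical stand-in, not the PR's StanCsvMCMC
    """Structured view of one MCMC Stan CSV file."""

    config: Dict[str, object]        # sampler configuration
    sampling_draws: np.ndarray       # shape (draws, columns)
    timing: Dict[str, float]         # elapsed-time information
    warmup_draws: Optional[np.ndarray] = None   # only if warmup was saved
    step_size: Optional[float] = None           # only if adaptation ran
    mass_matrix: Optional[np.ndarray] = None    # only if adaptation ran


# Made-up values, just to show how a fit object could consume the result.
parsed = ParsedMCMCCsv(
    config={"method": "sample", "num_samples": 2},
    sampling_draws=np.array([[-7.1, 0.25], [-6.9, 0.31]]),
    timing={"total": 0.01},
    step_size=0.9,
)
print(parsed.sampling_draws.shape, parsed.warmup_draws is None)
```

Code that consumes the result then reads named fields instead of re-deriving positions within the CSV file.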

codecov-commenter · 2025-07-22T22:21:33Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.85%. Comparing base (650d2bb) to head (62cfc46).
⚠️ Report is 50 commits behind head on develop.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop     #799      +/-   ##
===========================================
+ Coverage    80.24%   80.85%   +0.60%     
===========================================
  Files           25       25              
  Lines         3878     4011     +133     
===========================================
+ Hits          3112     3243     +131     
- Misses         766      768       +2


WardBrian

Thanks @amas0. The code looks really good.

My comments below are thoughts on a slightly different style that I think will make it more reusable across the different inference algorithms and try to consolidate the logic for each in one place. Let me know what you think!

cmdstanpy/utils/stancsv.py

amas0 · 2025-07-26T21:48:31Z

Okay, I think this PR should be good to go for more review/feedback. In response to the previous comments, I've made the following changes from the last version:

Made polars an optional dependency with fallback-parsing done by numpy
Added unit tests against the new utils/stancsv functions (and caught some bugs)
Removed the ParsingRules based parser in favor of a simpler structure that splits a CSV into comments/draws lines
Comment-section parsing functions, such as timing/adaptation parsing, now expect to be provided the full comment lines
Removed StanCsvMCMC in favor of using the utility functions within the CmdStanMCMC._assemble_draws() method
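
The simpler structure described in the list above could look roughly like the following sketch: split the file into comment lines and non-comment lines, then let each comment parser scan the full set of comment lines. The function names and the abbreviated input are assumptions for illustration, not the functions in cmdstanpy/utils/stancsv.py.

```python
import re
from typing import List, Optional, Tuple

import numpy as np


def split_comments_and_rows(
    lines: List[bytes],
) -> Tuple[List[bytes], List[bytes]]:
    """Return (comment lines, non-comment lines); rows include the header."""
    comments = [ln for ln in lines if ln.startswith(b"#")]
    rows = [ln for ln in lines if ln.strip() and not ln.startswith(b"#")]
    return comments, rows


def parse_step_size(comment_lines: List[bytes]) -> Optional[float]:
    """Find a '# Step size = <value>' line among all comment lines."""
    for ln in comment_lines:
        match = re.match(rb"#\s*Step size\s*=\s*(\S+)", ln)
        if match:
            return float(match.group(1).decode())
    return None


def parse_diag_mass_matrix(comment_lines: List[bytes]) -> Optional[np.ndarray]:
    """Read the line after the diagonal mass matrix marker as a 1D array."""
    marker = b"Diagonal elements of inverse mass matrix"
    for i, ln in enumerate(comment_lines[:-1]):
        if marker in ln:
            values = comment_lines[i + 1].lstrip(b"#").decode()
            return np.array(
                [float(v) for v in values.split(",")], dtype=np.float64
            )
    return None


# Hypothetical, abbreviated output for a model with two parameters.
example = [
    b"# method = sample\n",
    b"lp__,theta,phi\n",
    b"# Adaptation terminated\n",
    b"# Step size = 0.9\n",
    b"# Diagonal elements of inverse mass matrix:\n",
    b"# 1.01, 0.98\n",
    b"-7.1,0.25,1.3\n",
]

comments, rows = split_comments_and_rows(example)
step_size = parse_step_size(comments)
mass_matrix = parse_diag_mass_matrix(comments)
print(step_size, mass_matrix)
```

Because every comment parser receives all comment lines, no parser needs to know where its section starts, at the cost of scanning the comments once per parser.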

This version is simpler and cleaner than the previous one. I still think there's probably some cleanup that can happen in the CmdStanMCMC._assemble_draws() method, but I think it's at least improved compared to the existing version.

I ran one last bit of performance testing comparing the new numpy/polars based parsing against the existing method:

This is with the defaults of 4 chains, 1000 draws per chain. I didn't stretch the test out to the extremes of number of parameters, but for moderate to large model sizes, the speedup should be noticeable.
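
For readers unfamiliar with the approach, the following is a minimal sketch of polars-with-numpy-fallback parsing, assuming the input is a list of headerless, newline-terminated draw lines. The function name `draws_lines_to_numpy` is hypothetical and the body is not the implementation from this PR. Empty input is not handled here.

```python
import io
from typing import List

import numpy as np

try:
    import polars as pl  # optional dependency

    HAVE_POLARS = True
except ImportError:
    HAVE_POLARS = False


def draws_lines_to_numpy(lines: List[bytes]) -> np.ndarray:
    """Convert headerless CSV draw lines into a 2D float64 array."""
    if HAVE_POLARS:
        # Let polars parse the joined bytes, then force float64.
        frame = pl.read_csv(io.BytesIO(b"".join(lines)), has_header=False)
        return frame.to_numpy().astype(np.float64)
    # Fallback: numpy accepts a list of str; ndmin=2 keeps a single
    # row as shape (1, n) instead of collapsing it to 1D.
    return np.loadtxt(
        [ln.decode() for ln in lines],
        delimiter=",",
        dtype=np.float64,
        ndmin=2,
    )


draws = draws_lines_to_numpy([b"1.0,2.5\n", b"3.0,4.5\n"])
single = draws_lines_to_numpy([b"1.0,2.5\n"])
print(draws.shape, single.shape)
```

Both branches return the same array shape and dtype, so callers do not need to know which backend was used.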

WardBrian

This is looking really good to me now, thanks!

Don't let the number of comments here be discouraging, most are very small changes. Structurally I think this is really clean now :)

cmdstanpy/utils/stancsv.py

cmdstanpy/stanfit/mcmc.py

test/test_stancsv.py

cmdstanpy/utils/stancsv.py

amas0 · 2025-07-30T03:39:48Z

Okay @WardBrian, I think that's everything addressed? Thanks for the feedback, definitely improved things in a few areas.

Let me know if there's anything else that stands out from the additions. All tests look good (and I added some new ones).

WardBrian

Thanks @amas0!

I have two small questions before merging, but this looks really great.

cmdstanpy/utils/stancsv.py

WardBrian · 2025-07-31T13:40:40Z

Thanks again @amas0!

Looking forward to these utilities spreading to the other methods as well

amas0 added 6 commits July 19, 2025 15:48

Filter out empty mass matrix lines

c21494d

Update _assemble_draws to use StanCsvMCMC object

69ad018

Fix code incompatible with Python 3.8

2ff7c01

Convert draws parsing to polars

9337738

Add docstrings

df1162f

amas0 mentioned this pull request Jul 22, 2025

Path forward on IO operations #785

Open

3 tasks

Add initial unit tests

75de3b5

WardBrian reviewed Jul 23, 2025

cmdstanpy/utils/stancsv.py (outdated)

cmdstanpy/utils/stancsv.py (outdated)

cmdstanpy/utils/stancsv.py (outdated)

cmdstanpy/utils/stancsv.py

amas0 added 8 commits July 23, 2025 16:52

Make polars an optional dependency

8cb1b7e

Refactor parsing to be function-based

9e4340a

Add polars to test dependencies

20fd8a0

Add single element csv parsing tests

a526c08

Add fixed_param check before assembling draws

d1838f1

Add numpy/polars equiv testing

bba3bde

Convert tests from np.array_equiv to np.array_equal

97c9ef8

Fix csv numpy parsing shape when single row

8b37adb

amas0 changed the title ~~Implementation of faster polars-based CSV parsing and structured Stan CSV output~~ Implementation of faster MCMC CSV parsing and Stan CSV utilities Jul 26, 2025

amas0 marked this pull request as ready for review July 26, 2025 21:48

WardBrian requested changes Jul 28, 2025

amas0 added 5 commits July 28, 2025 17:00

Disable pylint warning for re-raising

196da0c

Fixup csv parse typing to 'np.float64'

ada313e

Fixup typing when converting from 'polars.read_csv'

15c2711

Update stancsv tests to np.float64

20e3649

Use 'without_import' helper in 'test_stancsv'

93bcee7

amas0 force-pushed the faster-mcmc-csv-parsing branch from f332ce6 to 93bcee7 Compare July 28, 2025 21:35

amas0 added 2 commits July 28, 2025 18:02

Add more testing for non-'diag_e' metric types

d8d38f2

Clean up mass matrix construction

aa7165c

amas0 added 3 commits July 29, 2025 18:20

Allow stancsv parse function to accept filename/path

29ee368

Add exception handling to stancsv parsing in assemble_draws

810b32a

Return 1D array when parsing diagonal hmc mass matrix

62cfc46

WardBrian approved these changes Jul 30, 2025

cmdstanpy/utils/stancsv.py (outdated)

cmdstanpy/utils/stancsv.py

cmdstanpy/utils/stancsv.py (outdated)

amas0 added 4 commits July 30, 2025 15:56

Change typing from Path -> os.PathLike

8a796ba

Override polars schema inference and set to F64

7322abf

Raise exception if empty list provided to csv_bytes_list_to_numpy

c8ab2fb

Remove unused timing line parsing

4d3cbf7

WardBrian merged commit 5383424 into stan-dev:develop Jul 31, 2025
16 checks passed

WardBrian mentioned this pull request Aug 6, 2025

New CSV reading fails when there are no draws #800

Closed

amas0 mentioned this pull request Aug 7, 2025

Revert csv parsing empty data to return empty arrays #801

Merged

2 tasks

amas0 deleted the faster-mcmc-csv-parsing branch August 8, 2025 02:17

amas0 mentioned this pull request Aug 16, 2025

Additional Stan CSV IO updates and refactoring #806

Merged

2 tasks
