Additional Stan CSV IO updates and refactoring #806
Conversation
WardBrian
left a comment
This is pretty heroic but I also think it looks good at first glance.
I have a few starter questions to prompt some discussion and possibly a few small changes before diving all the way in:
Appreciate the initial glance. This turned out to be more work than I expected (it really gets into the weeds with all the CSV validation logic).

Now that I've been through much (maybe all?) of the code that interacts with the Stan CSVs, it seems to me that the overall organization and ergonomics of working with this I/O would really benefit from some sort of "Stan output schema" that could be validated. Whether that's a schema that all the inference methods adhere to or method-specific schemas, something where full validation occurs at initial load and a structured, typed representation is available for the rest of the library to use would be very valuable. (This seems like the kind of thing that would naturally go into the stanio package?)

In the code currently, validation is pretty inconsistent. The CmdStanMCMC class performs a significant amount of validation, but the other inference methods are pretty light-touch. Obviously this corresponds to their relative importance, but it'd be nice to close that gap.

Also, working with the parsed outputs ends up being awkward in practice. A lot of key info is passed around in dicts that are mostly populated from parsing the config block in the Stan CSV files (but with additional fields, like the headers and parsed headers, added on top). The result is having to do things like check for field existence and fight with the type checker (and mostly just ignore the warnings).

I think this PR is an improvement in clarity, but it still struggles with the same fundamental challenges that come with working with these CSVs. In #785 you mention newer JSON outputs and possibly future binary outputs for draws; I think changes like that would go a long way toward cleaning this up -- something like that with a defined schema would be nice.
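To make the dict-versus-typed contrast above concrete, here is a toy sketch; the field names (num_chains, iter_sampling, column_names) are invented for illustration and are not cmdstanpy's actual metadata layout.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


def num_chains_from_dict(config: Dict[str, Any]) -> int:
    # Dict-based access: the type checker only sees Any, and every caller
    # has to guard against missing keys by hand.
    if "num_chains" not in config:
        raise ValueError("config block is missing 'num_chains'")
    return int(config["num_chains"])


@dataclass(frozen=True)
class SampleMetadata:
    # Typed access: presence and types are established once, at load time.
    num_chains: int
    iter_sampling: int
    column_names: List[str]


def num_chains_from_typed(meta: SampleMetadata) -> int:
    return meta.num_chains
```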
Yeah, it would be fairly natural to define a schema for the output_config.json and use that going forward. With the current CSV files, I actually lean toward relatively little validation, which you've observed with the other algorithms added more recently. Partially this is because they only have one output file. With MCMC it is fairly important to check that the different files all came from the same run, but this is just not a category of error that exists with e.g. optimization.
That would be great, because lack of validation logic is one of the bigger obstacles to refactoring. It'd also be nice to put it in standard I/O formats rather than CSV (not really a standard) plus comments (even worse).
Implemented some changes in response to the initial comments. @WardBrian when you say:
Is that implying that there is an alternative cmdstan output where the config info (and possibly other non-draw output) is stored in JSON format? I recall looking for something like this recently but didn't see anything. Either way, defining some kind of schema for JSON (or some other format) output would be great. Even if it isn't used as a standard output for cmdstan for a while, with a defined schema we could develop against it now and clean up some key bits of the library. And, presuming the Stan CSV format doesn't change radically, we could just implement a converter that takes a given Stan CSV to structured JSON and handle backwards compatibility for as long as those output formats are in common use.
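A rough sketch of the kind of CSV-to-structured-JSON converter floated above; the parsing rules are simplified assumptions rather than the actual Stan CSV comment grammar, and the function name is hypothetical.

```python
import json
from typing import Any, Dict


def stan_csv_config_to_json(path: str) -> str:
    """Collect 'key = value' pairs from '#'-prefixed comment lines into JSON.

    Simplified on purpose: the real comment block has nested sections and
    values that need more careful typing than the string handling here.
    """
    config: Dict[str, Any] = {}
    with open(path, "r") as f:
        for line in f:
            if not line.startswith("#"):
                continue
            stripped = line.lstrip("#").strip()
            key, sep, value = stripped.partition("=")
            if not sep:
                continue
            # Drop trailing '(Default)'-style annotations, if present.
            config[key.strip()] = value.split("(")[0].strip()
    return json.dumps(config, indent=2)
```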
Yes. Because it is its own file, you do end up with the unfortunate "what happens if I mix a file from run A with a file from run B" problem, but I think that's just life. We could consider introducing the requirement that an output directory be blank before a run, which would avoid almost all 'natural' ways that problem could arise.
Believe it or not, I tried this same thing nearly 4 years ago: https://github.com/WardBrian/experimental-cmdstan-parsing There were some irregularities/inconsistencies with the headers that made converting them into JSON not super natural. The built-in support in cmdstan is much better. But I still wish it had worked out, because then cmdstanpy would have been free of this nonsense for years by now! That little experiment was using pydantic, which would probably be a natural choice if we were using the new file as well.
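For illustration, a minimal sketch of the validate-at-load idea using pydantic (v2 API), against a hypothetical output_config.json layout; the model and field names are invented, not an actual cmdstan schema.

```python
from typing import List

from pydantic import BaseModel, PositiveInt


class SamplerOutputConfig(BaseModel):
    stan_version: str
    method: str
    num_chains: PositiveInt
    num_samples: PositiveInt
    column_names: List[str]


def load_output_config(path: str) -> SamplerOutputConfig:
    with open(path, "r") as f:
        # Raises pydantic.ValidationError if fields are missing or mistyped.
        return SamplerOutputConfig.model_validate_json(f.read())
```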
WardBrian
left a comment
I really like how this is looking now! A few comments/questions:
def parse_stan_csv_comments_and_draws(
def parse_comments_header_and_draws(
I don't see an immediate way to resolve this (I think it's too subtle a pattern to encode as a generator), but I just wanted to note that I think all of the usages of this function that don't need the draws (e.g. the calls in the InferenceMetadata factory) also don't need any of the comments after the header, so they might be doing an unnecessary pass over the rest of the CSV.
Short of splitting into two functions I don't see a nice pattern to avoid this, but it's probably not expensive enough to be worth worrying about anyway.
Yeah, I've thought about this.
I think the current structure of the CSV files somewhat forces a tradeoff between passes over the files and clarity. In the current code, when going through the process of fitting a model with sample and loading the draws, each CSV is read three times (I think?): twice in the validation step after fitting and once when reading the draws. Within that, once loaded into memory, the comments are scanned through a few times, and the draws something like three times: once to validate that all draw rows have the same number of columns, once to count divergences and max-treedepth warnings, and once when actually loading into numpy.
All that is to say, we do a lot of loading things into memory that we don't need, but in practice it doesn't seem to be a big deal. In all the various testing I did, scanning through the files was very efficient. In particular, scanning through these files as lists of bytes is very lightweight compared to the relatively expensive process of loading the draws into a numpy array.
I think it's probably more prudent to design against cleaner future output formats than to try to cleverly redesign this parsing.
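For reference, a minimal sketch of the kind of header-only early exit discussed above; the helper name is hypothetical, and real Stan CSVs need more care (e.g. comment blocks interleaved with adaptation output) than shown here.

```python
from typing import List


def read_header_only(path: str) -> List[str]:
    """Return the column names, stopping at the first non-comment line.

    Callers that only need the header (e.g. metadata construction) then
    skip the draws and any trailing comment blocks entirely.
    """
    with open(path, "rb") as f:
        for line in f:
            # Skip comment and blank lines; the first remaining line is
            # assumed to be the CSV header.
            if line.startswith(b"#") or not line.strip():
                continue
            return line.decode("utf-8").strip().split(",")
    raise ValueError(f"no header line found in {path}")
```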
Ahh, I definitely missed this. Thanks for pointing it out.
I'm going to add checking some of this out to my to-do list. Since things are more stable on the cmdstan side, it might be worth a revisit.
WardBrian
left a comment
Looks good to go. Thanks again!
Submission Checklist
Summary
This is the second PR addressing #785, following #799. The main proposed changes in this PR are:
- the scan_ CSV parsing functions
- CmdStanMCMC initialization

There's quite a bit going on in these changes. I think there are still a number of areas in the codebase that could be cleaned up, but I want to avoid these PRs getting too large.
Copyright and Licensing
Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): myself
By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses: