Skip to content

Conversation

@amas0
Copy link
Collaborator

@amas0 amas0 commented Aug 16, 2025

Submission Checklist

  • Run unit tests
  • Declare copyright holder and open-source license: see below

Summary

This is the second PR addressing #785 coming after #799. The main proposed changes in this PR are:

  • Update Laplace, Pathfinder, Variational, MLE, and GQ draw parsing to use optimized parser introduced in Implementation of faster MCMC CSV parsing and Stan CSV utilities #799.
  • Add new CSV parsing logic for comment blocks that conforms to a more declarative style (and is hopefully a bit more legible)
  • Complete removal of existing side-effect oriented scan_ CSV parsing functions
  • Update the CSV validation logic that exists as part of the CmdStanMCMC initialization

There's quite a bit going on in these changes. I think there are still a number of areas in the codebase that could be cleaned up, but I want to avoid these PRs getting too large.

Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): myself

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

@amas0 amas0 marked this pull request as ready for review August 17, 2025 18:59
@amas0 amas0 changed the title Additional Stan CSV IO updates and reafactoring Additional Stan CSV IO updates and refactoring Aug 17, 2025
Copy link
Member

@WardBrian WardBrian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty heroic but I also think it looks good at first glance.

I have a few starter questions to prompt some discussion and possibly a few small changes before diving all the way in:

@amas0
Copy link
Collaborator Author

amas0 commented Aug 19, 2025

Appreciate the initial glance. This turned out to be more work than I expected (it really gets into the weeds with all the csv validation logic that ultimately happens within the stancsv.check_sampler_csv function).

Now that I've been through much (maybe all?) of the code that interacts with the Stan csvs, it seems to me that the overall organization and ergonomics of working with this I/O would really benefit from some sort of "Stan output schema" that could be validated. Whether that's a schema that all the inference methods adhere to or method-specific schemas, something where, at initial load, full validation occurs and a structured, typed-representation is available for the rest of the library to use would be very valuable. (This seems like the kind of thing that would naturally go into the stanio package?)

In the code currently, validation is pretty inconsistent. The CmdStanMCMC class performs a significant amount of validation, but other inference methods are a pretty light-touch. Obviously, this corresponds with importance, but it'd be nice to close that gap. Also, working with the parsed outputs ends up being awkward in practice. We have a lot of key info being passed around in dicts that are mostly populated from parsing the config block in the Stan CSV files (but also adding additional fields like the headers (and parsed headers)). The result of this is having to do things like check for field existence and fight with the type checker (and mostly just ignore the warnings).

I think this PR is an improvement in clarity, but it still struggles with the same fundamental challenges that working with these CSVs have. In #785, you mention newer JSON outputs and possibly future binary outputs for draws; I think changes like that would go a long way to cleaning this up -- something like that with a defined schema would be nice.

@WardBrian
Copy link
Member

Yeah, it would be fairly natural to define a schema for the output_config.json and use that going forward. With the current CSV files, I actually lean toward relatively little validation, which you've observed with the other algorithms added more recently. Partially this is because they only have one output file. With MCMC it is fairly important to check that the different files all came from the same run, but this is just not a category of error that exists with e.g. optimization

@bob-carpenter
Copy link
Member

the overall organization and ergonomics of working with this I/O would really benefit from some sort of "Stan output schema"

That would be great, because lack of validation logic is one of the bigger obstacles to refactoring. It'd also be nice to put it in standard I/O formats rather than CSV (not really a standard) plus comments (even worse).

@amas0
Copy link
Collaborator Author

amas0 commented Aug 20, 2025

Implemented some changes in response to the initial comments.

@WardBrian when you say:

it would be fairly natural to define a schema for the output_config.json

Is that implying that there is an alternative cmdstan output where the config info (and possibly other non-draw output) is stored in json format? I recall looking for something like this recently, but didn't see anything.

Either way defining some kind of schema for json (or some other format) output would be great. Even if it isn't used as a standard output for cmdstan for a while, with a defined schema we could develop against that now and clean up some key bits of the library. And, presuming the Stan CSV format doesn't really change radically, we could just implement a converter that takes a given Stan CSV to structured json and handle backwards compatibility for as long as those output formats are in common use.

@WardBrian
Copy link
Member

Is that implying that there is an alternative cmdstan output where the config info (and possibly other non-draw output) is stored in json format?

Yes, output save_cmdstan_config=true was added in CmdStan 2.34, which is a JSON version of the pre-header comment in the output CSV. Other non-sample items like the metric are available as JSON via other arguments, and the timing hopefully will be available as JSON soon

Because it is its own file, you do end up with the unfortunate "what happens if I mix a file from run A with a file from run B" problem, but I think that's just life. We could consider introducing the requirement that a output directory be blank before a run, which would avoid almost all 'natural' ways that problem could arise.

And, presuming the Stan CSV format doesn't really change radically, we could just implement a converter that takes a given Stan CSV to structured json and handle backwards compatibility for as long as those output formats are in common use.

Believe it or not, I tried this same thing nearly 4 years ago: https://github.com/WardBrian/experimental-cmdstan-parsing

There were some irregularities/inconsitencies with the headers that made converting them into a JSON not super natural. The built-in support in cmdstan is much better. But I still wish it had worked out, because then cmdstanpy would have been free of this nonsense for years by now!

That little experiment was using pydantic, which would probably be a natural choice if we were using the new file as well.

Copy link
Member

@WardBrian WardBrian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like how this is looking now! A few comments/questions:



def parse_stan_csv_comments_and_draws(
def parse_comments_header_and_draws(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see an immediate way to resolve this (I think it's too subtle a pattern to encode as a generator) but I just wanted to note that I think all of the usages of this function that don't need the draws (e.g. the calls in the InferenceMetadata factory) also don't need any of the comments after the header, so they might be doing an unnecessary pass over the rest of the CSV.

Short of splitting into two functions I don't see a nice pattern to avoid this, but it's probably not too expensive to be worth worrying about anyway.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I've though about this.

I think the current structure of the csv files somewhat forces a tradeoff between passes over the files and clarity. In the current code, when going through the process of fitting a model with sample and loading the draws, each csv is read 3 times? Twice in the validation step after fitting and once when reading the draws. Additionally within that, once loaded into memory the comments are scanned through a few times. Draws, I think 3 times? Once to validate that all draw rows have the same number of columns, once to count divergences + max treedepth warnings, and once when actually loading into numpy.

All that is to say, we do have a lot of loading things into memory that we don't need, but in practice it doesn't seem to be a big deal. In all the various testing I did, scanning through the files is very efficient. In particular scanning through these files as lists of bytes is very lightweight compared to relatively expensive process of loading draws in as a numpy array.

I think it's probably more prudent to design against cleaner future output formats than try to really cleverly redesign this parsing.

@amas0
Copy link
Collaborator Author

amas0 commented Aug 21, 2025

Yes, output save_cmdstan_config=true was added in CmdStan 2.34, which is a JSON version of the pre-header comment in the output CSV.

Ahh, I definitely missed this. Thanks for pointing it out.

Believe it or not, I tried this same thing nearly 4 years ago: https://github.com/WardBrian/experimental-cmdstan-parsing

There were some irregularities/inconsitencies with the headers that made converting them into a JSON not super natural. The built-in support in cmdstan is much better. But I still wish it had worked out, because then cmdstanpy would have been free of this nonsense for years by now!

That little experiment was using pydantic, which would probably be a natural choice if we were using the new file as well.

I'm going to add checking some of this out to my to-do list. Since things are more stable on the cmdstan side, it might be worth a re-visit.

Copy link
Member

@WardBrian WardBrian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to go. Thanks again!

@WardBrian WardBrian merged commit b3c6c47 into stan-dev:develop Aug 21, 2025
16 checks passed
@WardBrian WardBrian mentioned this pull request Aug 21, 2025
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants