Additional Stan CSV IO updates and refactoring #806
Conversation
WardBrian
left a comment
This is pretty heroic but I also think it looks good at first glance.
I have a few starter questions to prompt some discussion and possibly a few small changes before diving all the way in:
Appreciate the initial glance. This turned out to be more work than I expected (it really gets into the weeds with all the CSV validation logic).

Now that I've been through much (maybe all?) of the code that interacts with the Stan CSVs, it seems to me that the overall organization and ergonomics of working with this I/O would really benefit from some sort of "Stan output schema" that could be validated. Whether that's a schema that all the inference methods adhere to or method-specific schemas, something where full validation occurs at initial load and a structured, typed representation is available for the rest of the library to use would be very valuable. (This seems like the kind of thing that would naturally go into the stanio package?)

In the code currently, validation is pretty inconsistent. The CmdStanMCMC class performs a significant amount of validation, but the other inference methods are pretty light-touch. Obviously this corresponds to their relative importance, but it'd be nice to close that gap.

Also, working with the parsed outputs ends up being awkward in practice. A lot of key info is passed around in dicts that are mostly populated from parsing the config block in the Stan CSV files (but with additional fields, like the headers and parsed headers, added on top). The result is having to do things like check for field existence and fight with the type checker (and mostly just ignore the warnings).

I think this PR is an improvement in clarity, but it still struggles with the same fundamental challenges that come with working with these CSVs. In #785 you mention newer JSON outputs and possibly future binary outputs for draws; I think changes like that would go a long way toward cleaning this up -- something like that with a defined schema would be nice.
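To make the dict-versus-typed contrast above concrete, here is a toy sketch; the field names (num_chains, iter_sampling, column_names) are invented for illustration and are not cmdstanpy's actual metadata layout.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


def num_chains_from_dict(config: Dict[str, Any]) -> int:
    # Dict-based access: the type checker only sees Any, and every caller
    # has to guard against missing keys by hand.
    if "num_chains" not in config:
        raise ValueError("config block is missing 'num_chains'")
    return int(config["num_chains"])


@dataclass(frozen=True)
class SampleMetadata:
    # Typed access: presence and types are established once, at load time.
    num_chains: int
    iter_sampling: int
    column_names: List[str]


def num_chains_from_typed(meta: SampleMetadata) -> int:
    return meta.num_chains
```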
Yeah, it would be fairly natural to define a schema for the output_config.json and use that going forward. With the current CSV files, I actually lean toward relatively little validation, which you've observed with the other algorithms added more recently. Partially this is because they only have one output file. With MCMC it is fairly important to check that the different files all came from the same run, but this is just not a category of error that exists with e.g. optimization.
That would be great, because lack of validation logic is one of the bigger obstacles to refactoring. It'd also be nice to put it in standard I/O formats rather than CSV (not really a standard) plus comments (even worse).
Implemented some changes in response to the initial comments. @WardBrian when you say:
Is that implying that there is an alternative cmdstan output where the config info (and possibly other non-draw output) is stored in JSON format? I recall looking for something like this recently but didn't see anything. Either way, defining some kind of schema for JSON (or some other format) output would be great. Even if it isn't used as a standard output for cmdstan for a while, with a defined schema we could develop against it now and clean up some key bits of the library. And, presuming the Stan CSV format doesn't change radically, we could just implement a converter that takes a given Stan CSV to structured JSON and handle backwards compatibility for as long as those output formats are in common use.
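A rough sketch of the kind of CSV-to-structured-JSON converter floated above; the parsing rules are simplified assumptions rather than the actual Stan CSV comment grammar, and the function name is hypothetical.

```python
import json
from typing import Any, Dict


def stan_csv_config_to_json(path: str) -> str:
    """Collect 'key = value' pairs from '#'-prefixed comment lines into JSON.

    Simplified on purpose: the real comment block has nested sections and
    values that need more careful typing than the string handling here.
    """
    config: Dict[str, Any] = {}
    with open(path, "r") as f:
        for line in f:
            if not line.startswith("#"):
                continue
            stripped = line.lstrip("#").strip()
            key, sep, value = stripped.partition("=")
            if not sep:
                continue
            # Drop trailing '(Default)'-style annotations, if present.
            config[key.strip()] = value.split("(")[0].strip()
    return json.dumps(config, indent=2)
```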
Yes. Because it is its own file, you do end up with the unfortunate "what happens if I mix a file from run A with a file from run B" problem, but I think that's just life. We could consider introducing the requirement that an output directory be blank before a run, which would avoid almost all 'natural' ways that problem could arise.
Believe it or not, I tried this same thing nearly 4 years ago: https://github.com/WardBrian/experimental-cmdstan-parsing There were some irregularities/inconsistencies with the headers that made converting them into JSON not super natural. The built-in support in cmdstan is much better. But I still wish it had worked out, because then cmdstanpy would have been free of this nonsense for years by now! That little experiment was using pydantic, which would probably be a natural choice if we were using the new file as well.
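For illustration, a minimal sketch of the validate-at-load idea using pydantic (v2 API), against a hypothetical output_config.json layout; the model and field names are invented, not an actual cmdstan schema.

```python
from typing import List

from pydantic import BaseModel, PositiveInt


class SamplerOutputConfig(BaseModel):
    stan_version: str
    method: str
    num_chains: PositiveInt
    num_samples: PositiveInt
    column_names: List[str]


def load_output_config(path: str) -> SamplerOutputConfig:
    with open(path, "r") as f:
        # Raises pydantic.ValidationError if fields are missing or mistyped.
        return SamplerOutputConfig.model_validate_json(f.read())
```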
WardBrian
left a comment
I really like how this is looking now! A few comments/questions:
def parse_stan_csv_comments_and_draws(
def parse_comments_header_and_draws(
I don't see an immediate way to resolve this (I think it's too subtle a pattern to encode as a generator), but I just wanted to note that I think all of the usages of this function that don't need the draws (e.g. the calls in the InferenceMetadata factory) also don't need any of the comments after the header, so they might be doing an unnecessary pass over the rest of the CSV.
Short of splitting into two functions I don't see a nice pattern to avoid this, but it's probably not expensive enough to be worth worrying about anyway.
Yeah, I've thought about this.
I think the current structure of the CSV files somewhat forces a tradeoff between passes over the files and clarity. In the current code, when going through the process of fitting a model with sample and loading the draws, each CSV is read three times (I think?): twice in the validation step after fitting and once when reading the draws. Within that, once loaded into memory, the comments are scanned through a few times, and the draws something like three times: once to validate that all draw rows have the same number of columns, once to count divergences and max-treedepth warnings, and once when actually loading into numpy.
All that is to say, we do a lot of loading things into memory that we don't need, but in practice it doesn't seem to be a big deal. In all the various testing I did, scanning through the files was very efficient. In particular, scanning through these files as lists of bytes is very lightweight compared to the relatively expensive process of loading the draws into a numpy array.
I think it's probably more prudent to design against cleaner future output formats than to try to cleverly redesign this parsing.
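For reference, a minimal sketch of the kind of header-only early exit discussed above; the helper name is hypothetical, and real Stan CSVs need more care (e.g. comment blocks interleaved with adaptation output) than shown here.

```python
from typing import List


def read_header_only(path: str) -> List[str]:
    """Return the column names, stopping at the first non-comment line.

    Callers that only need the header (e.g. metadata construction) then
    skip the draws and any trailing comment blocks entirely.
    """
    with open(path, "rb") as f:
        for line in f:
            # Skip comment and blank lines; the first remaining line is
            # assumed to be the CSV header.
            if line.startswith(b"#") or not line.strip():
                continue
            return line.decode("utf-8").strip().split(",")
    raise ValueError(f"no header line found in {path}")
```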
Ahh, I definitely missed this. Thanks for pointing it out.
I'm going to add checking some of this out to my to-do list. Since things are more stable on the cmdstan side, it might be worth a revisit.
WardBrian
left a comment
Looks good to go. Thanks again!
Submission Checklist
Summary
This is the second PR addressing #785, following #799. The main proposed changes in this PR are:
- the scan_ CSV parsing functions
- CmdStanMCMC initialization

There's quite a bit going on in these changes. I think there are still a number of areas in the codebase that could be cleaned up, but I want to avoid these PRs getting too large.
Copyright and Licensing
Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): myself
By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses: