Create a pacer.DocketReport.from_html_file() classmethod and use it consistently#1785
Create a pacer.DocketReport.from_html_file() classmethod and use it consistently#1785johnhawkinson wants to merge 3 commits intofreelawproject:mainfrom
Conversation
d2e4484 to
3da6cee
Compare
Runs the docket parser on an html file. Use this in the __main__ invokation instead of bespoke way. This method allows removing redundant code from pacerdocket.py and the test_DocketParseTest.py test runner. All three of which did the same task in slightly different ways, leading to dev confusion.
Instead of rolling our own. Helps with consistency. This abandons the faked-up PacerSession which was never necessary anyhow. The from_html_file() method does fallback to 'cand', not 'psc' as this code used to.
Instead of rolling our own. Helps with consistency.
3da6cee to
d82d9e7
Compare
mlissner
left a comment
There was a problem hiding this comment.
Looks good. A couple little things though, please. Thank you!
| dirname, filename = os.path.split(path) | ||
| filename_sans_ext = filename.split(".")[0] | ||
| court = filename_sans_ext.split("_")[0] | ||
| # If filename doesn't begin with a valid court, just use 'cand' | ||
| # (N.D. Cal.) Historically the __main__ runner would default | ||
| # to 'cand' but the pacerdocket.py runner would default to | ||
| # 'psc', but the 'psc' fails some vaidations. | ||
| try: | ||
| _ = get_doc_id_prefix_from_court_id(court) | ||
| except KeyError: | ||
| court = "cand" |
There was a problem hiding this comment.
I like the change, but I think it might be better to just make the court a requirement. Maybe we allow it to be passed as an optional argument if we can't require its use in input?
There was a problem hiding this comment.
Your intent is not particularly clear to me, and there are a million options.
I chose the simplest, moving the filename-deriving code from the test runner into the class method, and choosing one of the two previous disparate defaults.
If by "requirement" you mean pass it in as a parameter, then you want to put the filename-based code back in the test runner? And force the user who invokes the other two mechanisms to specify it?
So you could no longer run
python pacerdocket.py filename.html
like you've always been able to, you need to run
python pacerdocket.py nysd filename.html
? If that's the proposal, it seems to be more annoying to the user than we were prior to this PR, and not particularly helpful in solving any diagnostic problem.
I'm also not sure what would happen if there were an optional argument. So if you didn't specify it, then it would default to…what?cand? psc? Or it would return no court (or None) in all the places that this is used?
I don't have a full assessment of what they all are, but somewhere in the hairy machinery of "html cleaning", juriscraper rewrites relative URLs to absolute URLs using the court name. You would just leave them as relative URLs then?
All of these choices seem more complicated, but let me know what you want.
There was a problem hiding this comment.
I was thinking it becomes:
def from_html_file(cls, path, court=None):
If the court isn't passed, it tries from the file name. If that fails, it crashes. I think that'd make it so that places that need to be explicit can be and so that the court parameter isn't necessary.
I don't like having a default to psc or cand or whatever in a class method since that could slip into a production usage, so crashing should prevent that.
| def from_html_file(cls, path): | ||
| """Run the docket parser on an HTML file. | ||
|
|
||
| This is invokved by by the test runner |
There was a problem hiding this comment.
| This is invokved by by the test runner | |
| This is invokved by the test runner |
|
Also, can you add a note to the release notes, please, so CI passes? |
We have three ways that html files can be parsed by juriscraper, and all three are invoked slightly differently, giving different results, in some cases, parser errors.
For instance:
Create a class method,
pacer.DocketReport.from_html_file()and consistently use it in all three places: the test runner,pacerdocket.py, and the__main__portion ofDocketReport.py.