Create a pacer.DocketReport.from_html_file() classmethod and use it consistently #1785

Open
johnhawkinson wants to merge 3 commits into freelawproject:main from johnhawkinson:2026.01.27.from_html

Conversation

@johnhawkinson
Contributor

Juriscraper has three ways to parse HTML files, and each is invoked slightly differently. They give different results and, in some cases, parser errors.

For instance:

(juriscraper) jhawk@lrr juriscraper % PYTHONPATH=`pwd`  python3  juriscraper/pacerdocket.py tests/examples/pacer/dockets/district/cand_7.html
Traceback (most recent call last):
  File "/Users/jhawk/src/juriscraper/juriscraper/pacerdocket.py", line 19, in <module>
    data = report.data
           ^^^^^^^^^^^
  File "/Users/jhawk/src/juriscraper/juriscraper/pacer/docket_report.py", line 511, in data
    return super().data
           ^^^^^^^^^^^^
  File "/Users/jhawk/src/juriscraper/juriscraper/pacer/docket_report.py", line 79, in data
    data["docket_entries"] = self.docket_entries
                             ^^^^^^^^^^^^^^^^^^^
  File "/Users/jhawk/src/juriscraper/juriscraper/pacer/docket_report.py", line 1374, in docket_entries
    attachments = self._get_attachments(cells[2])
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jhawk/src/juriscraper/juriscraper/pacer/docket_report.py", line 1287, in _get_attachments
    "pacer_doc_id": self._get_pacer_doc_id(row),
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jhawk/src/juriscraper/juriscraper/pacer/docket_report.py", line 1250, in _get_pacer_doc_id
    return f"{self.doc_id_prefix}0{pacer_doc_suffix}"
              ^^^^^^^^^^^^^^^^^^
  File "/Users/jhawk/src/juriscraper/juriscraper/pacer/reports.py", line 65, in doc_id_prefix
    return get_doc_id_prefix_from_court_id(self.court_id)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jhawk/src/juriscraper/juriscraper/pacer/utils.py", line 433, in get_doc_id_prefix_from_court_id
    return cid_to_prefix_map[court_id]
           ~~~~~~~~~~~~~~~~~^^^^^^^^^^
KeyError: 'psc'

Create a class method, pacer.DocketReport.from_html_file(), and use it consistently in all three places: the test runner, pacerdocket.py, and the __main__ portion of docket_report.py.

Runs the docket parser on an HTML file.
Use this in the __main__ invocation instead of the bespoke way.

This method allows removing redundant code from pacerdocket.py and the
test_DocketParseTest.py test runner.  All three did the same
task in slightly different ways, leading to dev confusion.
Instead of rolling our own. Helps with consistency.
This abandons the faked-up PacerSession, which was never necessary
anyhow.  The from_html_file() method falls back to 'cand', not
'psc' as this code used to.
Instead of rolling our own. Helps with consistency.
@johnhawkinson changed the title from "Createa a pacer.DocketReport.from_html_file() classmethod and use it consistently" to "Create a pacer.DocketReport.from_html_file() classmethod and use it consistently" on Jan 29, 2026
Member

@mlissner left a comment

Looks good. A couple little things though, please. Thank you!

Comment on lines +505 to +515
dirname, filename = os.path.split(path)
filename_sans_ext = filename.split(".")[0]
court = filename_sans_ext.split("_")[0]
# If filename doesn't begin with a valid court, just use 'cand'
# (N.D. Cal.) Historically the __main__ runner would default
# to 'cand' but the pacerdocket.py runner would default to
# 'psc', but 'psc' fails some validations.
try:
    _ = get_doc_id_prefix_from_court_id(court)
except KeyError:
    court = "cand"
Member

I like the change, but I think it might be better to just make the court a requirement. Maybe we allow it to be passed as an optional argument if we can't require its use in input?

Contributor Author

Your intent is not particularly clear to me, and there are a million options.
I chose the simplest, moving the filename-deriving code from the test runner into the class method, and choosing one of the two previous disparate defaults.

If by "requirement" you mean pass it in as a parameter, then you want to put the filename-based code back in the test runner? And force the user who invokes the other two mechanisms to specify it?

So you could no longer run

python pacerdocket.py filename.html

like you've always been able to, you need to run

python pacerdocket.py nysd filename.html

? If that's the proposal, it seems to be more annoying to the user than we were prior to this PR, and not particularly helpful in solving any diagnostic problem.

I'm also not sure what would happen if there were an optional argument. If you didn't specify it, would it default to… what? cand? psc? Or would it return no court (or None) in all the places that this is used?

I don't have a full assessment of what they all are, but somewhere in the hairy machinery of "html cleaning", juriscraper rewrites relative URLs to absolute URLs using the court name. You would just leave them as relative URLs then?

All of these choices seem more complicated, but let me know what you want.
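
To illustrate why the court matters for the URL rewriting mentioned above, here is a minimal sketch using the standard library's `urljoin`. The helper name `absolutize` and the base URL pattern `https://ecf.<court>.uscourts.gov/` are assumptions for illustration; juriscraper's actual HTML-cleaning code may work differently.

```python
from urllib.parse import urljoin


def absolutize(href, court):
    """Resolve a possibly relative link against a court-specific base URL.

    Hypothetical helper: the base URL pattern below is an assumption,
    not taken from juriscraper.
    """
    base = f"https://ecf.{court}.uscourts.gov/"
    # urljoin leaves already-absolute URLs untouched and resolves
    # relative ones against the base.
    return urljoin(base, href)
```

Without a court there is no base to resolve against, so relative links would have to stay relative.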

Member

I was thinking it becomes:

def from_html_file(cls, path, court=None):

If the court isn't passed, it tries to derive it from the file name. If that fails, it crashes. I think that'd make it so that places that need to be explicit can be, while the court parameter still isn't required.

I don't like having a default to psc or cand or whatever in a class method since that could slip into a production usage, so crashing should prevent that.
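
A minimal sketch of the resolution logic described above, not the PR's code. It assumes a stand-in `KNOWN_COURTS` set in place of juriscraper's real lookup (`get_doc_id_prefix_from_court_id`, which raises `KeyError` for unknown IDs), and the helper name `resolve_court` is hypothetical:

```python
import os

# Stand-in for juriscraper's court-ID map (assumption for illustration).
KNOWN_COURTS = {"cand", "nysd"}


def resolve_court(path, court=None):
    """Return the court ID for an HTML file, or raise if none can be found.

    An explicit `court` wins; otherwise the court is derived from the
    filename prefix (e.g. "cand_7.html" gives "cand"). There is no
    silent default, so a bad or missing court fails loudly.
    """
    if court is None:
        filename = os.path.basename(path)
        court = filename.split(".")[0].split("_")[0]
    if court not in KNOWN_COURTS:
        raise ValueError(
            f"Cannot determine a valid court for {path!r} (got {court!r})"
        )
    return court
```

Under this sketch the test runner could keep relying on filenames such as `cand_7.html`, while a caller with an arbitrary filename would pass `court="nysd"` explicitly.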

def from_html_file(cls, path):
    """Run the docket parser on an HTML file.

    This is invokved by by the test runner
Member

Suggested change
- This is invokved by by the test runner
+ This is invokved by the test runner

@mlissner
Member

Also, can you add a note to the release notes, please, so CI passes?
