    # Since pages are a number of *blocks searched* and not results, a page
    # in the middle of the result set may have nothing in it. The only way
    # to know when to stop iterating is to check how many pages there are.
    page_count = int(self.session.request('GET', CDX_SEARCH_2_URL, params={
        **query,
        'showNumPages': 'true'
    }).text)
This might not be necessary! I just discovered that sending too high a `page` value gets a 400 error with the header `x-archive-wayback-runtime-error: page must be smaller than numpages`, so in theory we can check for that and stop.
That said, that’s a very human-readable message and feels unstable. We should check with folks at the Internet Archive about what approach they’d prefer people use.
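For reference, here is a rough sketch of what the "stop on the 400" approach could look like. It is only an illustration, not what this PR does: `iter_page_bodies`, `is_past_last_page`, and the `fetch` callable (which returns a `(status, headers, body)` tuple) are hypothetical names, and the header text is matched as a substring because its exact wording is not guaranteed to stay the same.

```python
from typing import Callable, Iterator, Mapping, Tuple

# Hypothetical response shape for this sketch: (status_code, headers, body_text).
Response = Tuple[int, Mapping[str, str], str]

# Header name and message as observed in the comment above; both are
# assumptions about Wayback's current behavior, not a documented contract.
ERROR_HEADER = 'x-archive-wayback-runtime-error'
PAST_END_MESSAGE = 'page must be smaller than numpages'


def is_past_last_page(status: int, headers: Mapping[str, str]) -> bool:
    """Return True if a response looks like the 'page too high' error."""
    # Header names are case-insensitive, so normalize before looking up.
    lowered = {name.lower(): value for name, value in headers.items()}
    message = lowered.get(ERROR_HEADER, '').lower()
    return status == 400 and PAST_END_MESSAGE in message


def iter_page_bodies(fetch: Callable[[int], Response],
                     max_pages: int = 10_000) -> Iterator[str]:
    """Yield each page's body until the server reports we ran past the end.

    Empty bodies are still yielded: a page in the middle of the result
    set can legitimately contain nothing.
    """
    for page in range(max_pages):
        status, headers, body = fetch(page)
        if is_past_last_page(status, headers):
            return
        if status != 200:
            raise RuntimeError(f'Unexpected status {status} on page {page}')
        yield body
```

The `max_pages` cap is just a safety net in case the error message changes and the stop condition never fires.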
Update: the way you control output format in the new search is not with |
This adds support for the Internet Archive's new, beta CDX search endpoint at `/web/timemap/cdx`. It deals with pagination much better and is eventually slated to replace the search currently at `/cdx/search/cdx`, but is a little slower and still being tested. This commit is a start, but we still need to do more detailed testing and talk more with the Wayback Machine team about things that are unclear here. I'm also not sure if `filter`, `collapse`, `resolveRevisits`, etc. are actually supported. Fixes #8.
Some updates:
At this point, there’s a little more I can do (rate limiting, cleanup), but the main blocker is the lack of clarity on bugs/intended behavior from Wayback, which we’ll have to wait to hear back on.
🚧 Work in Progress! 🚧
This adds support for the Internet Archive's new, beta CDX search endpoint at `/web/timemap/cdx`. It deals with pagination much better and is eventually slated to replace the search currently at `/cdx/search/cdx`, but is a little slower and still being tested. Fixes #8.

There are still a bunch of things to be done before merging:
- … `output=json` working), etc.
- … (`filter`, `collapse`, `resolveRevisits`, etc) and whether there are new ones we can/should use. (Update: `resolveRevisits` is badly broken, but the rest are the same as original search and work fine. Checking w/ Wayback folks for more detail.)
- … `search_v2` is the right name, or if it should be something else (`search_beta()`? `search_next()`?)