Fetch additional data resulting from SPN2 capture_outlinks function #23

@overcast07

Ideally, this script should be able to fetch data for outlinks captured using the server-side SPN2 outlinks function.

Any implementation of this would run into a particular challenge: polling the status API endpoint for a large number of outlinks could cause the server to return 429 errors if the rate of requests is too high. The overall rate of requests would have to be controlled in some way, accounting for the additional requests made.
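One way to keep the polling rate under control is to back off whenever the status endpoint answers with 429. The following is only a minimal sketch: `next_poll_delay`, the 5-second base and the 300-second cap are hypothetical and not taken from the script.

```bash
#!/usr/bin/env bash
# Hypothetical helper: pick the next polling delay (in seconds) from the
# HTTP status code of the previous status request.
#   $1 = HTTP status code of the last request
#   $2 = delay that was used before that request
#   $3 = base delay (illustrative default: 5)
#   $4 = maximum delay (illustrative default: 300)
next_poll_delay() {
    local http_code=$1 current=$2 base=${3:-5} cap=${4:-300}
    if [[ $http_code == 429 ]]; then
        # Rate limited: double the delay, but never exceed the cap.
        local doubled=$(( current * 2 ))
        if (( doubled > cap )); then
            doubled=$cap
        fi
        echo "$doubled"
    else
        # Any other response: fall back to the base delay.
        echo "$base"
    fi
}
```

A polling loop could capture the status code with `curl -s -o "$out" -w '%{http_code}'`, pass it to this function, and `sleep` for the returned number of seconds before the next request. This only paces a single poller; the combined rate of all child processes would still need to be limited separately.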

One way to implement this would be to add a separate text file (spn2-outlinks.txt) to which outlink status IDs are added upon completion of the main capture job. A check for this file could be added at some point in the main while loops (the ones starting at lines 579, 609 and 680), and the child processes could be spawned from those loops. Importantly, this approach would allow the child processes to be immediate children of the main process, so they would be counted by jobs -p. The script would probably have to pause new job submissions while the child processes for the outlinks are spawned. A variable could be used to store remaining lines if the child processes for the outlinks are not spawned in one go.
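As a rough illustration of the queue-file approach, the check inside the main loops might look something like the sketch below. `spawn_outlink_jobs`, `poll_outlink`, `max_jobs` and the `pending` array are hypothetical names; `poll_outlink` is a placeholder for the real status polling.

```bash
#!/usr/bin/env bash
# Sketch: drain a queue file of outlink status IDs in batches, spawning the
# pollers as direct children of the main shell so that `jobs -p` sees them.

outlinks_file="spn2-outlinks.txt"
max_jobs=4      # illustrative limit on concurrent child processes
pending=()      # IDs already read from the file but not yet spawned

poll_outlink() {
    # Placeholder: the real function would poll the status API for "$1".
    sleep 1
}

spawn_outlink_jobs() {
    # Move newly queued IDs into memory, then empty the file.
    if [[ -s $outlinks_file ]]; then
        local line
        while IFS= read -r line; do
            if [[ -n $line ]]; then
                pending+=("$line")
            fi
        done < "$outlinks_file"
        : > "$outlinks_file"
    fi
    # Spawn pollers while there is free capacity; the rest stays in `pending`.
    while (( ${#pending[@]} > 0 )) && (( $(jobs -pr | wc -l) < max_jobs )); do
        poll_outlink "${pending[0]}" &
        pending=("${pending[@]:1}")
    done
}
```

Calling `spawn_outlink_jobs` once per iteration of the main loop would start as many pollers as the limit allows and carry the remainder over to the next iteration. The sketch does not handle a capture job appending to the file between the read and the truncation, so a real implementation would need some form of locking or an atomic rename.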

Alternatively, this could be done within each capture() child process immediately after the status API endpoint returns a successful capture and the list of outlinks. However, the extra polling would not be visible to the currently implemented check on the number of child processes (i.e. jobs -p), so the request rate of every part of the script would have to be reduced to compensate (unless the status API endpoint was only checked very infrequently).
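The visibility problem can be demonstrated with a small standalone sketch. `capture_with_outlinks` is a hypothetical stand-in for capture(), and the `sleep` calls stand in for per-outlink pollers.

```bash
#!/usr/bin/env bash
# Sketch: processes started inside a background child are grandchildren of
# the main shell, so the main shell's job table does not list them.

capture_with_outlinks() {
    # Stand-ins for outlink pollers started from within the capture child.
    sleep 1 &
    sleep 1 &
    wait
}

capture_with_outlinks &

# Only the capture child itself is counted, not the two pollers inside it.
visible_jobs=$(jobs -p | wc -l | tr -d ' ')
echo "jobs visible to the main shell: $visible_jobs"
wait
```

Three processes are doing work here, but a limit based on `jobs -p` only accounts for one of them.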

We would have to decide whether failed outlink captures should be retried. Presumably, the outlinks of a retried outlink capture would not themselves be collected, so failed outlink captures would have to be listed separately from the main failed.txt list. An extra variable would have to be passed to the capture function to indicate whether or not to set capture_outlinks=1.
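The extra variable could simply decide whether the parameter is added to the request. A minimal sketch, assuming the capture is submitted with curl; `build_capture_args` and the `capture_args` array are hypothetical and do not reflect how the script currently builds its request.

```bash
#!/usr/bin/env bash
# Sketch: build the curl data arguments for a capture request, adding
# capture_outlinks=1 only when the caller asks for it.
#   $1 = URL to capture
#   $2 = 1 to request outlink capture, anything else to leave it off
build_capture_args() {
    local url=$1 with_outlinks=${2:-0}
    capture_args=(--data-urlencode "url=${url}")
    if [[ $with_outlinks == 1 ]]; then
        capture_args+=(--data "capture_outlinks=1")
    fi
}
```

A main capture would call `build_capture_args "$url" 1`, while a retry of a failed outlink would call it with `0`, and the resulting array would then be expanded into the curl command as `"${capture_args[@]}"`.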

This option would also need to interface appropriately with the -o, -x and -r options.

This idea was previously listed under "Future plans" in README.md, but I've removed that section since it was outdated.

Labels: enhancement (New feature or request)