
Manage 429 Responses from Wikipedia #532

@edm00se

Description

Wikipedia is, very understandably, taking measures against scraping bots now that the age of LLM slop scraping is upon us. I should probably look at:

  1. adhering to their terms
  2. caching fetched images (I thought I was already doing this, but we need to be among the good guys here)
    Finn I am with the resistance
  3. limiting runs against live URLs (the previous point should hopefully help with this)
  4. confirming whether I should also avoid hammering other common sources; BGG is probably the next biggest one
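Points 2 and 3 above could be sketched roughly as follows. This is a minimal sketch, not how remark-lint or dead-or-alive work internally: `CacheEntry`, `retryDelayMs`, and `needsLiveCheck` are hypothetical names, and the 60-second backoff cap is an arbitrary assumption. The idea is to skip live requests for URLs that were checked recently, and to honor a `Retry-After` header (or back off exponentially) when a 429 does come back.

```typescript
// Hypothetical names for illustration; none of these come from remark-lint or dead-or-alive.
interface CacheEntry {
  status: number;      // last HTTP status seen for the URL
  checkedAtMs: number; // when it was checked, in epoch milliseconds
}

// Turn a Retry-After header (delta seconds or an HTTP date) into a wait in milliseconds.
// Falls back to capped exponential backoff when the header is absent or unparsable.
function retryDelayMs(retryAfter: string | null, attempt: number, nowMs: number): number {
  if (retryAfter !== null) {
    const trimmed = retryAfter.trim();
    if (/^\d+$/.test(trimmed)) {
      return parseInt(trimmed, 10) * 1000;
    }
    const dateMs = Date.parse(trimmed);
    if (!isNaN(dateMs)) {
      return Math.max(0, dateMs - nowMs);
    }
  }
  // Assumed policy: 1s, 2s, 4s, ... capped at 60s.
  return Math.min(60000, 1000 * Math.pow(2, attempt));
}

// Decide whether a URL needs a live request, given a cache of earlier results.
// A cached 429 is never trusted: it means "slow down", not "this link is dead".
function needsLiveCheck(
  url: string,
  cache: Record<string, CacheEntry>,
  nowMs: number,
  maxAgeMs: number,
): boolean {
  const entry = cache[url];
  if (entry === undefined) {
    return true;
  }
  if (entry.status === 429) {
    return true;
  }
  return nowMs - entry.checkedAtMs > maxAgeMs;
}
```

Persisting the cache between builds (for example as a JSON file) would be what actually reduces traffic to Wikipedia; the in-memory record here only shows the decision logic.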

Issue as discovered via build/lint:

Image

Example:

159:1-159:163   warning Unexpected dead URL `https://upload.wikimedia.org/wikipedia/en/thumb/9/92/Ticket_to_Ride_Board_Game_Box_EN.jpg/220px-Ticket_to_Ride_Board_Game_Box_EN.jpg`, expected live URL   no-dead-urls remark-lint
  [cause]:
    error   Unexpected not ok response `429` (`Use thumbnail steps listed on https://w.wiki/GHai. Please contact noc@wikimedia.org for further information (a765913)`) on `https://upload.wikimedia.org/wikipedia/en/thumb/9/92/Ticket_to_Ride_Board_Game_Box_EN.jpg/220px-Ticket_to_Ride_Board_Game_Box_EN.jpg`   dead   dead-or-alive
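
The 429 message points at "thumbnail steps", which suggests the `220px` width in the URL may not be one of the sizes Wikimedia currently serves. If that reading is right, one possible mitigation is to rewrite thumbnail URLs to a permitted width before linking to them. The sketch below assumes a list of steps (`ASSUMED_THUMB_STEPS`) that should be confirmed against https://w.wiki/GHai; `nearestStep` and `rewriteThumbUrl` are hypothetical helper names.

```typescript
// Assumed list of permitted widths; confirm against https://w.wiki/GHai before relying on it.
const ASSUMED_THUMB_STEPS: number[] = [20, 40, 60, 120, 250, 330, 500, 960, 1280, 1920, 3840];

// Smallest permitted width that is >= the requested one, or the largest step if none is.
function nearestStep(requested: number, steps: number[]): number {
  const sorted = steps.slice().sort((a, b) => a - b);
  for (const step of sorted) {
    if (step >= requested) {
      return step;
    }
  }
  return sorted[sorted.length - 1];
}

// Rewrite the "/<width>px-" prefix of a upload.wikimedia.org thumbnail URL.
// Any other URL is returned unchanged.
function rewriteThumbUrl(url: string, steps: number[]): string {
  const isWikimediaThumb =
    url.indexOf("upload.wikimedia.org/") !== -1 && url.indexOf("/thumb/") !== -1;
  if (!isWikimediaThumb) {
    return url;
  }
  return url.replace(/\/(\d+)px-([^\/]+)$/, (_match, width: string, rest: string) => {
    return "/" + nearestStep(parseInt(width, 10), steps) + "px-" + rest;
  });
}
```

Rounding up keeps the image at least as sharp as before, but a step wider than the original file may not be available, so rewritten URLs are still worth checking (sparingly) once.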
