
Fetch multiple source documents sequentially to prevent bot detection #1176

Merged
Ndpnt merged 3 commits into main from seq-request-in-combine
Jul 21, 2025

Conversation

Contributor

@Ndpnt Ndpnt commented Jul 18, 2025

Tested with the Platform Governance Archive collection; it should fix OpenTermsArchive/pga-declarations#296, OpenTermsArchive/pga-declarations#260, and OpenTermsArchive/pga-declarations#205.

@Ndpnt Ndpnt requested a review from clementbiron July 18, 2025 14:11
Member

@MattiSG MattiSG left a comment

Yes, I agree, that's also what came to my mind: in large combines, there are just too many requests to do them concurrently.
I believe we could optimize this much further by:

  1. Doing domain grouping (this only works for combines; if we track many terms for the same service, won't we have the same issue across terms?)
  2. Using a map-reduce rather than a for-await
  3. Limiting the number of concurrent calls to an arbitrary amount (3? 5?) rather than going down to 1

But that will already likely be a great improvement as is, thanks!
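
The trade-offs discussed above can be sketched roughly as follows. This is an illustrative sketch only, not the code merged in this pull request: the `Fetcher` type and the functions `fetchSequentially`, `fetchWithLimit`, and `groupByHostname` are hypothetical names introduced for the example, and the worker-pool approach is just one possible way to cap concurrency.

```typescript
// Hypothetical fetcher type: any function that resolves a URL to its content.
type Fetcher = (url: string) => Promise<string>;

// Sequential variant: only one request is in flight at any time.
async function fetchSequentially(urls: string[], fetcher: Fetcher): Promise<string[]> {
  const results: string[] = [];

  for (const url of urls) {
    results.push(await fetcher(url));
  }

  return results;
}

// Bounded variant: at most `limit` requests are in flight at any time.
// Results are returned in the same order as the input URLs.
async function fetchWithLimit(urls: string[], fetcher: Fetcher, limit: number): Promise<string[]> {
  const results: string[] = new Array(urls.length);
  let next = 0; // index of the next URL that no worker has picked up yet

  async function worker(): Promise<void> {
    while (next < urls.length) {
      const index = next++;

      results[index] = await fetcher(urls[index]);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, urls.length));

  await Promise.all(Array.from({ length: workerCount }, () => worker()));

  return results;
}

// Domain grouping: bucket URLs by hostname, so that a caller could, for
// example, stay sequential within one domain while handling domains in parallel.
function groupByHostname(urls: string[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();

  for (const url of urls) {
    const hostname = new URL(url).hostname;
    const group = groups.get(hostname) ?? [];

    group.push(url);
    groups.set(hostname, group);
  }

  return groups;
}
```

With `limit` set to 1, `fetchWithLimit` behaves like `fetchSequentially`; a value such as 3 or 5 would correspond to the third suggestion in the list.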

CHANGELOG.md Outdated

## Unreleased [minor]

> Development of this release was supported by the [French Ministry for Foreign Affairs](https://www.diplomatie.gouv.fr/fr/politique-etrangere-de-la-france/diplomatie-numerique/) through its ministerial [State Startups incubator](https://beta.gouv.fr/startups/open-terms-archive.html) under the aegis of the Ambassador for Digital Affairs.
Member

Shouldn't that be credited to ZEMKI?

@Ndpnt Ndpnt merged commit 9bd6183 into main Jul 21, 2025
11 checks passed
@Ndpnt Ndpnt deleted the seq-request-in-combine branch July 21, 2025 09:01
@MattiSG MattiSG mentioned this pull request Jul 25, 2025

Development

Successfully merging this pull request may close these issues.

Twitch Community Guidelines ‧ not tracked anymore