Getting list of all CPAN package names is broken #1961

@andrew

Paging through the MetaCPAN release search API no longer works once from + size exceeds 10,000, so results past the first 10,000 cannot be fetched with from/size paging.

Code location: https://github.com/librariesio/libraries.io/blob/master/app/models/package_manager/cpan.rb#L17

Example url:

https://fastapi.metacpan.org/v1/release/_search?fields=distribution&from=10000&q=status%3Alatest&size=5000&sort=date%3Adesc

Error:

{
"message": "[Request] ** [http://127.0.0.1:9200]-[500] {\"error\":{\"root_cause\":[{\"type\":\"query_phase_execution_exception\",\"reason\":\"Result window is too large, from + size must be less than or equal to: [10000] but was [15000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter.\"}],\"type\":\"search_phase_execution_exception\",\"reason\":\"all shards failed\",\"phase\":\"query\",\"grouped\":true,\"failed_shards\":[{\"shard\":0,\"index\":\"cpan_v1_01\",\"node\":\"euEoqisPSk68CnedNAzoZA\",\"reason\":{\"type\":\"query_phase_execution_exception\",\"reason\":\"Result window is too large, from + size must be less than or equal to: [10000] but was [15000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter.\"}}]},\"status\":500}, called from sub Search::Elasticsearch::Role::Client::Direct::__ANON__ at /home/metacpan/metacpan-api/lib/MetaCPAN/Server/Controller.pm line 125. With vars: {'request' => {'method' => 'GET','ignore' => [],'path' => '/cpan/release/_search','serialize' => 'std','qs' => {'q' => 'status:latest','fields' => 'distribution','sort' => 'date:desc','size' => 5000,'from' => 10000},'body' => undef},'status_code' => 500}\n"
}
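To make the limit in the error concrete: the request above has from=10000 and size=5000, so from + size is 15000, which is over the 10000 window. A minimal sketch of that check follows; the helper names are hypothetical and the 10,000 value is taken from the error message, not from any MetaCPAN configuration I have seen.

```ruby
# Limit quoted in the error message ([index.max_result_window]).
MAX_RESULT_WINDOW = 10_000

# True when a from/size page fits inside the result window.
# (Hypothetical helper, not part of libraries.io or MetaCPAN.)
def within_result_window?(from, size, max = MAX_RESULT_WINDOW)
  from + size <= max
end

# Offsets that can still be requested for a given page size.
def reachable_offsets(size, max = MAX_RESULT_WINDOW)
  (0..max).step(size).select { |from| within_result_window?(from, size, max) }
end

puts within_result_window?(10_000, 5_000) # the failing request from this issue
puts reachable_offsets(5_000).inspect     # offsets that still work with size=5000
```

With size=5000 only the first two pages are reachable, which matches the failure starting at from=10000.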

The MetaCPAN API docs suggest using the scroll API instead (https://github.com/metacpan/metacpan-api/blob/master/docs/API-docs.md#being-polite), but the links to the scroll docs in that section are dead.

More recent scroll API docs are at https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html but I couldn't get the search endpoint to accept scroll_id as a parameter:

{
"message": "[Param] ** Unknown param (scroll_id) in (search) request. See docs at: http://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html, called from sub Search::Elasticsearch::Role::Client::Direct::__ANON__ at /home/metacpan/metacpan-api/lib/MetaCPAN/Server/Controller.pm line 125."
}
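In stock Elasticsearch, scroll_id is not a parameter of _search itself: the first request goes to _search with a scroll keep-alive, and later requests go to the separate _search/scroll endpoint with the returned _scroll_id. That may explain the "Unknown param" error above. Below is a rough sketch of that flow. It is untested against the live API, and it assumes that fastapi.metacpan.org exposes the scroll endpoint at /v1/_search/scroll and returns the usual hits.hits and _scroll_id fields. All method names are hypothetical.

```ruby
require "json"
require "net/http"
require "uri"

# Base URL taken from the example URL in this issue.
BASE = "https://fastapi.metacpan.org/v1"

# First request: a normal search plus the `scroll` keep-alive parameter.
def initial_scroll_url(size: 5000, keep_alive: "1m")
  query = URI.encode_www_form(
    q: "status:latest", fields: "distribution", size: size, scroll: keep_alive
  )
  "#{BASE}/release/_search?#{query}"
end

# Follow-up requests. ASSUMPTION: the API proxies Elasticsearch's
# /_search/scroll endpoint at this path and accepts query parameters.
def next_scroll_url(scroll_id, keep_alive: "1m")
  query = URI.encode_www_form(scroll: keep_alive, scroll_id: scroll_id)
  "#{BASE}/_search/scroll?#{query}"
end

def get_json(url)
  JSON.parse(Net::HTTP.get(URI(url)))
end

# Collects distribution names until a page comes back with no hits.
# `fetch` is injectable so the loop can be exercised without the network.
def scroll_distributions(fetch: method(:get_json))
  names = []
  page = fetch.call(initial_scroll_url)
  loop do
    hits = page.dig("hits", "hits") || []
    break if hits.empty?

    hits.each do |hit|
      value = hit.dig("fields", "distribution")
      # Elasticsearch may return field values as arrays or scalars.
      names << (value.is_a?(Array) ? value.first : value)
    end
    page = fetch.call(next_scroll_url(page["_scroll_id"]))
  end
  names.compact.uniq
end
```

Calling scroll_distributions with no arguments would hit the real API. Passing a lambda as fetch lets the paging loop be checked against canned responses first.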
