
[INFRASTRUCTURE] Ideas for optimizing search rankings #24

@gtsueng

Description

Issue Name

Ideas for optimizing search rankings

Issue Description

The inclusion of the metadata compatibility score provided some minor improvements in the search rankings, but there are plenty of other ways to further improve. Some ideas include:

  • Down-weighting results whose descriptions don't meet a minimum length requirement, or applying some sort of scaled description weight. This is definitely something we could test out faster with more resources (a sketch of one possible weighting follows this list):

    • Penalizing records where the descriptive information is too short to be useful. For example, a dataset whose name and description are both 'Influenza A' doesn't help the user evaluate whether or not it'll be relevant to their work, yet it may still show up in a search for 'Influenza A'.
    • Penalizing records where the 'name' and 'description' values are identical (this is especially problematic when the combined character length of the two strings is low).
      • Among the GREIs, Figshare has the highest ratio of records whose name and description are exactly the same.
  • Pulling distribution.contentSize when a distribution.contentUrl is available. A 5 kb file may not be as useful as a larger file.

    • Only 6 out of 25+ repos have distribution.contentSize information available for parsing. However, if the record has a distribution.contentUrl, we could potentially pull the contentSize from the HTTP response headers. For example:

      curl -sI 'https://zenodo.org/records/13621947/files/GenCoNet_Neo4j_2018_07_10.tgz?download=1' | grep -i Content-Length
      content-length: 427403
    • Some issues with applying weights based on distribution.contentSize that should be considered carefully:
      • For repos where the data is expected to be locked (human or clinical data, e.g., ImmPort), it's unclear whether we can get the contentSize this way, and we should take care NOT to penalize repos for a '0' contentSize just because their data is required to be more protected.
      • For Zenodo, the contentUrl points to an overall zip file that could include anything from actual data to protocol PDFs, so the size is not very meaningful.
      • For DASH, the contentUrl is primarily for related documents since the underlying data is human data, and the contentUrl cannot be accessed without registration; the size is also not very meaningful in this case.
      • For Figshare -- to be updated -- it's unclear what the actual coverage would be. Figshare contentUrls are all missing right now because something went wrong with the logic.
    • Maybe some sort of combination of encodingFormat and contentSize? Penalize the record if it's a .pdf: most people would take a 5 kb .csv table any day over the same table in a 400 kb .pdf (a sketch combining size and format also follows this list).
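
As a starting point for the description-based down-weighting above, here is a minimal Python sketch. The 'name' and 'description' fields match the record fields discussed above, but the thresholds and penalty values are illustrative assumptions that would need tuning against real queries:

    # All thresholds and penalties below are illustrative assumptions.
    MIN_USEFUL_LENGTH = 100   # assumed minimum combined character length
    SHORT_PENALTY = 0.5       # assumed floor for the length-based down-weight
    DUPLICATE_PENALTY = 0.5   # assumed down-weight when name == description

    def description_quality_weight(record: dict) -> float:
        """Return a multiplier in (0, 1] to apply to a record's relevance score."""
        name = (record.get("name") or "").strip()
        description = (record.get("description") or "").strip()
        weight = 1.0

        # Penalize records whose name and description are identical
        # (e.g., both are just 'Influenza A').
        if name and name.lower() == description.lower():
            weight *= DUPLICATE_PENALTY

        # Scale the penalty with combined length, so very short descriptive
        # text is down-weighted more heavily than borderline cases.
        combined_length = len(name) + len(description)
        if combined_length < MIN_USEFUL_LENGTH:
            weight *= max(SHORT_PENALTY, combined_length / MIN_USEFUL_LENGTH)

        return weight

For the 'Influenza A'/'Influenza A' example above, this returns 0.5 × max(0.5, 22/100) = 0.25, so the record could still surface in a search for 'Influenza A' but would rank well below records with real descriptions.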

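And a rough Python equivalent of the curl example above, plus a sketch of the encodingFormat/contentSize combination. requests is an assumed dependency, the MIME-type weights and the 5 kb threshold are placeholders, and an unknown size is deliberately left unpenalized per the ImmPort caveat:

    import requests

    # Placeholder per-format multipliers: a small .csv beats the same
    # table in a large .pdf, so .pdf gets the heavier penalty.
    FORMAT_WEIGHTS = {"text/csv": 1.0, "application/pdf": 0.5}

    def fetch_content_size(content_url: str):
        """HEAD the distribution.contentUrl and return Content-Length in bytes.

        Returns None when the header is missing or the request fails (e.g.,
        access-controlled data), so 'unknown' can be treated differently
        from 'tiny'.
        """
        try:
            resp = requests.head(content_url, allow_redirects=True, timeout=10)
            size = resp.headers.get("Content-Length")
            return int(size) if size else None
        except (requests.RequestException, ValueError):
            return None

    def size_format_weight(content_size, encoding_format: str) -> float:
        """Combine encodingFormat and contentSize into one multiplier."""
        weight = FORMAT_WEIGHTS.get(encoding_format, 0.9)  # assumed default
        if content_size is None:
            return weight  # unknown size: don't punish protected repos
        if content_size < 5_000:  # assumed 'probably too small' threshold
            weight *= 0.8
        return weight

Calling fetch_content_size on the Zenodo URL above should return the same 427403 bytes the curl output shows.
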
Related WBS Task

https://github.com/NIAID-Data-Ecosystem/nde-roadmap/issues/18

Issue discussion

This issue has not yet been discussed between Scripps and NIAID ODSET.
