Open
Issue Name
Ideas for optimizing search rankings
Issue Description
The inclusion of the metadata compatibility score provided some minor improvements in the search rankings, but there are plenty of other ways to improve them further. Some ideas include:
- Down-weighting results where the descriptions don't meet a minimum length requirement, or applying some sort of scaled description weight. This is definitely something we could test out faster with more resources:
  - Penalizing records where descriptive information is too short to be useful. For example, a dataset where the name and description are both 'Influenza A' doesn't help the user evaluate whether or not it'll be relevant to their work, but it may show up in a search for 'Influenza A'
  - Penalizing records where the 'name' and 'description' values are the same (this is especially relevant when the combined character length of the name and description strings is low)
    - Figshare has the highest ratio among the GREIs of records with the exact same name and description
- Pulling `distribution.contentSize` when a `distribution.contentUrl` is available. A 5 kB file may not be as useful as a larger file.
  - Only 6 out of 25+ repos have `distribution.contentSize` information available for parsing. However, if the record has a `distribution.contentUrl`, we could potentially grep the contentSize information from the response headers:
    `curl -sI "https://zenodo.org/records/13621947/files/GenCoNet_Neo4j_2018_07_10.tgz?download=1" | grep -i Content-Length`
    `content-length: 427403`
  - Some issues with applying weights based on `distribution.contentSize` that should be considered carefully:
    - For repos where the data is expected to be locked (human or clinical data: ImmPort), it's unclear if we can get the contentSize via grep, and we should take care NOT to punish repos for having '0' contentSize just because their data is required to be more protected
    - For Zenodo, the contentUrl is for an overall zip file that could include anything from actual data to protocol pdfs, so the size is not very meaningful
    - For DASH, the contentUrl is primarily for related documents since it's human data, and that contentUrl cannot be accessed without registration, so the size is also not very meaningful in this case
    - For Figshare (to be updated): it is unclear what the actual coverage would be. Figshare contentUrls are all missing right now because something went wrong with the logic
  - Maybe some sort of combination of `encodingFormat` and `contentSize`? Penalize it if it's a .pdf. Most people would take a 5 kB .csv table any day over the same table in a 400 kB .pdf
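The ideas above could be prototyped roughly as follows. This is only a sketch to make the discussion concrete, not the ranking logic actually in use: every threshold and multiplier (`MIN_DESCRIPTION_CHARS`, the 0.5, 0.8 and 0.7 factors, `PENALIZED_FORMATS`) and every function name is a hypothetical placeholder that would need tuning against real search results. `fetch_content_size` mirrors the `curl -sI ... | grep -i Content-Length` approach using only the standard library, and a missing or zero size is deliberately treated as neutral so that protected repos are not punished.

```python
from typing import Mapping, Optional
from urllib.request import Request, urlopen

MIN_DESCRIPTION_CHARS = 50  # hypothetical minimum useful description length
PENALIZED_FORMATS = {"application/pdf", "pdf", ".pdf"}  # hypothetical list


def description_weight(name: str, description: str,
                       min_chars: int = MIN_DESCRIPTION_CHARS) -> float:
    """Return a multiplier in [0.1, 1.0]; lower means a stronger penalty."""
    name_norm = " ".join(name.split()).casefold()
    desc_norm = " ".join(description.split()).casefold()
    # Scaled weight: grows linearly with description length up to min_chars.
    weight = min(1.0, len(desc_norm) / min_chars) if min_chars > 0 else 1.0
    # Extra penalty when the description merely repeats the name.
    if desc_norm == name_norm:
        weight *= 0.5
    return max(weight, 0.1)  # floor so a record is never zeroed out


def content_length_from_headers(headers: Mapping[str, str]) -> Optional[int]:
    """Parse Content-Length from a header mapping, ignoring header case."""
    for key, value in headers.items():
        if key.lower() == "content-length":
            try:
                return int(value)
            except (TypeError, ValueError):
                return None
    return None


def fetch_content_size(url: str, timeout: float = 10.0) -> Optional[int]:
    """Send a HEAD request and return the reported size, or None on failure."""
    try:
        request = Request(url, method="HEAD")
        with urlopen(request, timeout=timeout) as response:
            return content_length_from_headers(response.headers)
    except (OSError, ValueError):  # network errors, HTTP errors, bad URLs
        return None


def size_weight(content_size: Optional[int], encoding_format: Optional[str],
                small_bytes: int = 10_000) -> float:
    """Combine contentSize and encodingFormat into a single multiplier."""
    # Unknown or zero size stays neutral: locked data must not be penalized.
    if content_size and content_size < small_bytes:
        weight = 0.8
    else:
        weight = 1.0
    if encoding_format and encoding_format.strip().lower() in PENALIZED_FORMATS:
        weight *= 0.7
    return weight
```

With these placeholder values, a 5 kB `text/csv` file gets a weight of 0.8 while a 400 kB `application/pdf` gets 0.7, which matches the intuition that the small table is preferable to the same table in a PDF. A record whose name and description are both 'Influenza A' ends up close to the floor.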
Related WBS Task
https://github.com/NIAID-Data-Ecosystem/nde-roadmap/issues/18
Issue discussion
This issue has not yet been discussed between Scripps and NIAID ODSET