
[INFRASTRUCTURE] Ideas for optimizing search rankings #24

@gtsueng

Description

Issue Name

Ideas for optimizing search rankings

Issue Description

The inclusion of the metadata compatibility score provided some minor improvements in the search rankings, but there are plenty of other ways to further improve. Some ideas include:

  • Down-weighting results whose descriptions don't meet a minimum length requirement, or applying some sort of scaled description weight. This is definitely something we could test out faster with more resources (a sketch of one possible weighting follows this list):

    • Penalizing records where the descriptive information is too short to be useful. For example, a dataset whose name and description are both 'Influenza A' doesn't help the user evaluate whether or not it'll be relevant to their work, yet it may still show up in a search for 'Influenza A'.
    • Penalizing records where the 'name' and 'description' values are identical (this is especially problematic when the combined character length of the two strings is low).
      • Among the GREIs, Figshare has the highest ratio of records whose name and description are exactly the same.
  • Pulling distribution.contentSize when a distribution.contentUrl is available. A 5 kb file may not be as useful as a larger file.

    • Only 6 out of 25+ repos have distribution.contentSize information available for parsing. However, if the record has a distribution.contentUrl, we could potentially pull the contentSize from the HTTP response headers. For example:

      curl -sI 'https://zenodo.org/records/13621947/files/GenCoNet_Neo4j_2018_07_10.tgz?download=1' | grep -i Content-Length
      content-length: 427403
    • Some issues with applying weights based on distribution.contentSize that should be considered carefully:
      • For repos where the data is expected to be locked (human or clinical data, e.g., ImmPort), it's unclear whether we can get the contentSize this way, and we should take care NOT to penalize repos for a '0' contentSize just because their data is required to be more protected.
      • For Zenodo, the contentUrl points to an overall zip file that could include anything from actual data to protocol PDFs, so the size is not very meaningful.
      • For DASH, the contentUrl is primarily for related documents since the underlying data is human data, and the contentUrl cannot be accessed without registration; the size is also not very meaningful in this case.
      • For Figshare -- to be updated -- it's unclear what the actual coverage would be. Figshare contentUrls are all missing right now because something went wrong with the logic.
    • Maybe some sort of combination of encodingFormat and contentSize? Penalize the record if it's a .pdf: most people would take a 5 kb .csv table any day over the same table in a 400 kb .pdf (a sketch combining size and format also follows this list).
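
As a starting point for the description-based down-weighting above, here is a minimal Python sketch. The 'name' and 'description' fields match the record fields discussed above, but the thresholds and penalty values are illustrative assumptions that would need tuning against real queries:

    # All thresholds and penalties below are illustrative assumptions.
    MIN_USEFUL_LENGTH = 100   # assumed minimum combined character length
    SHORT_PENALTY = 0.5       # assumed floor for the length-based down-weight
    DUPLICATE_PENALTY = 0.5   # assumed down-weight when name == description

    def description_quality_weight(record: dict) -> float:
        """Return a multiplier in (0, 1] to apply to a record's relevance score."""
        name = (record.get("name") or "").strip()
        description = (record.get("description") or "").strip()
        weight = 1.0

        # Penalize records whose name and description are identical
        # (e.g., both are just 'Influenza A').
        if name and name.lower() == description.lower():
            weight *= DUPLICATE_PENALTY

        # Scale the penalty with combined length, so very short descriptive
        # text is down-weighted more heavily than borderline cases.
        combined_length = len(name) + len(description)
        if combined_length < MIN_USEFUL_LENGTH:
            weight *= max(SHORT_PENALTY, combined_length / MIN_USEFUL_LENGTH)

        return weight

For the 'Influenza A'/'Influenza A' example above, this returns 0.5 × max(0.5, 22/100) = 0.25, so the record could still surface in a search for 'Influenza A' but would rank well below records with real descriptions.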

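And a rough Python equivalent of the curl example above, plus a sketch of the encodingFormat/contentSize combination. requests is an assumed dependency, the MIME-type weights and the 5 kb threshold are placeholders, and an unknown size is deliberately left unpenalized per the ImmPort caveat:

    import requests

    # Placeholder per-format multipliers: a small .csv beats the same
    # table in a large .pdf, so .pdf gets the heavier penalty.
    FORMAT_WEIGHTS = {"text/csv": 1.0, "application/pdf": 0.5}

    def fetch_content_size(content_url: str):
        """HEAD the distribution.contentUrl and return Content-Length in bytes.

        Returns None when the header is missing or the request fails (e.g.,
        access-controlled data), so 'unknown' can be treated differently
        from 'tiny'.
        """
        try:
            resp = requests.head(content_url, allow_redirects=True, timeout=10)
            size = resp.headers.get("Content-Length")
            return int(size) if size else None
        except (requests.RequestException, ValueError):
            return None

    def size_format_weight(content_size, encoding_format: str) -> float:
        """Combine encodingFormat and contentSize into one multiplier."""
        weight = FORMAT_WEIGHTS.get(encoding_format, 0.9)  # assumed default
        if content_size is None:
            return weight  # unknown size: don't punish protected repos
        if content_size < 5_000:  # assumed 'probably too small' threshold
            weight *= 0.8
        return weight

Calling fetch_content_size on the Zenodo URL above should return the same 427403 bytes the curl output shows.
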
Related WBS Task

https://github.com/NIAID-Data-Ecosystem/nde-roadmap/issues/18

Issue discussion

This issue has not yet been discussed between Scripps and NIAID ODSET.
