@vera
Contributor

@vera vera commented Jun 5, 2025

What this PR does / why we need it:

This PR adds a datasetCount for each collection to the search index. The count includes all datasets (published, linked or harvested).

This allows users to filter collections using the datasetCount (e.g., datasetCount:[1000 TO *]). Also, the value is returned in Dataverse search results via the Search API.
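As a rough illustration of the filter described above, the sketch below builds a Search API URL that restricts results to collections with at least a given number of datasets. The host name is a placeholder, and the class and method names are made up for this example. It assumes the Search API's `q`, `type` and `fq` parameters, and it only builds the URL without sending a request.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class DatasetCountSearchUrl {

    // Builds a Search API URL limited to collections whose datasetCount is
    // at least minDatasets. baseUrl is a placeholder for an installation URL.
    static String build(String baseUrl, int minDatasets) {
        // Solr-style open-ended range, as in the PR description.
        String filter = "datasetCount:[" + minDatasets + " TO *]";
        return baseUrl + "/api/search?q=*&type=dataverse&fq="
                + URLEncoder.encode(filter, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical host, used only for illustration.
        System.out.println(build("https://demo.example.org", 1000));
    }
}
```

The filter is URL-encoded because it contains a colon, brackets and spaces, which are not safe to send unescaped in a query string.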

As recommended by @pdurbin, I've used the fileCount feature PRs as inspiration (#6623 + #10598).

Which issue(s) this PR closes:

Special notes for your reviewer:

Regarding the points commented by @cmbz in the issue (#10190 (comment)):

  • Some subtlety here: what gets counted? (e.g., linked datasets, harvested datasets)

For simplicity, I've decided to count all datasets that would be shown when visiting the collection in the UI. This means that the count includes published, linked and harvested datasets in the collection or in any subcollections.

If there is interest in further subdividing the dataset count, more subcounts could be added in the future (e.g. publishedDatasetCount, linkedDatasetCount, harvestedDatasetCount).
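To make the counting rule above concrete, here is a minimal model of it: published, linked and harvested datasets are summed for a collection and, recursively, for its subcollections. `CollectionNode` and its fields are hypothetical stand-ins, not the Dataverse entity classes. The sketch also ignores how a dataset that appears more than once in the tree (for example, one linked into a subcollection of its owner) is treated, since the description does not say.

```java
import java.util.ArrayList;
import java.util.List;

class CollectionCountSketch {

    // Hypothetical stand-in for a collection; not the real Dataverse entity.
    static class CollectionNode {
        final int published;
        final int linked;
        final int harvested;
        final List<CollectionNode> subcollections = new ArrayList<>();

        CollectionNode(int published, int linked, int harvested) {
            this.published = published;
            this.linked = linked;
            this.harvested = harvested;
        }
    }

    // Sums all three kinds of datasets in the collection and in every
    // subcollection below it.
    static int datasetCount(CollectionNode c) {
        int total = c.published + c.linked + c.harvested;
        for (CollectionNode child : c.subcollections) {
            total += datasetCount(child);
        }
        return total;
    }

    public static void main(String[] args) {
        CollectionNode root = new CollectionNode(2, 1, 0);
        root.subcollections.add(new CollectionNode(3, 0, 4));
        System.out.println(datasetCount(root));
    }
}
```

If the subcounts mentioned above were added later, each of the three fields in this model would map to its own index field instead of being summed.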

  • Considerations: Does this add overhead to reindexing process?

I've had to add new dataverse indexing calls in the following locations:

  • After a dataset is published, the owning dataverse and any linking dataverses are reindexed. (added to this PR after feat: allow linking unpublished datasets to collections #11491)
  • After dataset destruction, previously only the owning dataverse was reindexed. Now, any linking dataverse is also reindexed.
  • After linking a dataset to a dataverse, the dataverse is reindexed.
  • After a harvest of datasets completes, the target dataverse is reindexed.
  • After a harvest client is deleted, the target dataverse is reindexed.
  • In general, if the reindexed dataverse is not the root dataverse, all of its parents up until the root are also reindexed.
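The last point in the list above, reindexing every parent up to the root, can be sketched as a simple walk along the owner chain. This is an illustration with hypothetical names (`CollectionNode`, `collectionsToReindex`), not the code added in this PR.

```java
import java.util.ArrayList;
import java.util.List;

class ReindexChainSketch {

    // Hypothetical stand-in for a collection with a link to its owner.
    static class CollectionNode {
        final String alias;
        final CollectionNode owner; // null for the root collection

        CollectionNode(String alias, CollectionNode owner) {
            this.alias = alias;
            this.owner = owner;
        }
    }

    // Returns the collection itself followed by each ancestor, ending with
    // the root. Every entry would need reindexing, because a change in a
    // subcollection changes the count of all collections above it.
    static List<String> collectionsToReindex(CollectionNode start) {
        List<String> result = new ArrayList<>();
        for (CollectionNode c = start; c != null; c = c.owner) {
            result.add(c.alias);
        }
        return result;
    }

    public static void main(String[] args) {
        CollectionNode root = new CollectionNode("root", null);
        CollectionNode lab = new CollectionNode("lab", root);
        CollectionNode project = new CollectionNode("project", lab);
        System.out.println(collectionsToReindex(project));
    }
}
```

The number of extra reindex calls per event therefore grows with the depth of the collection tree, not with the number of datasets.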

I believe this overhead is not too large, but let me know what you think.

Suggestions on how to test this:

mvn test -Dtest="SearchIT#testDataverseDatasetCounts"

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

/

Is there a release notes update needed for this change?:

I wrote a short release note.

Additional documentation:

/

@coveralls

coveralls commented Jun 5, 2025

Coverage Status

coverage: 23.209% (-0.02%) from 23.229%
when pulling 88cff95 on vera:feat/dataset-count
into fdcdba2 on IQSS:develop.


if (dataverse.isReleased()) {
    // Get all datasets published in this dataverse, including harvested datasets
    int numberOfPublishedDatasets = dataverseService.findAllDataverseDatasetChildren(dataverse.getId(), true, true).size();
Member

It might be worth testing this against a custom query at scale (or, alternatively, just testing this against the current indexing on a large database). I think this should just result in a proxy list where calling size() won't try to instantiate a real list of dataset objects, but I'm not sure. If it does do extra work compared to a plain select count(*) query for the right datasets, it could slow indexing or increase memory requirements significantly.

Contributor Author

I have been testing this on our test system, which includes a large dataverse (around 30k datasets). I think you are correct: I am seeing significantly slower API responses for calls that require reindexing that large dataverse. I will investigate further and update this PR as soon as I can.

Contributor Author

I've just pushed a commit using a NamedQuery to calculate the datasetCount. With our 30k dataset dataverse, reindexing the dataverse is now speedy again.
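The NamedQuery itself is not shown in this thread. As a rough analogy for why the change helps, the plain-Java sketch below contrasts building a list of objects just to call size() with counting matching rows directly, which is what a select count(*) query does. It uses an in-memory array as a hypothetical stand-in for the dataset table and does not involve JPA or the Dataverse classes.

```java
import java.util.ArrayList;
import java.util.List;

class CountVersusLoadSketch {

    // Hypothetical "table": one entry per dataset row, holding the owner id.
    // All 30,000 rows default to owner id 0.
    static final long[] OWNER_IDS = new long[30_000];

    // Tracks how many Dataset objects were instantiated.
    static int objectsCreated = 0;

    static class Dataset {
        final long ownerId;

        Dataset(long ownerId) {
            this.ownerId = ownerId;
            objectsCreated++;
        }
    }

    // Approach 1: materialize every matching row as an object, then size().
    static int countByLoading(long collectionId) {
        List<Dataset> children = new ArrayList<>();
        for (long owner : OWNER_IDS) {
            if (owner == collectionId) {
                children.add(new Dataset(owner));
            }
        }
        return children.size();
    }

    // Approach 2: count matching rows without creating any objects,
    // analogous to a count(*) query.
    static long countOnly(long collectionId) {
        long n = 0;
        for (long owner : OWNER_IDS) {
            if (owner == collectionId) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(countByLoading(0) + " datasets, " + objectsCreated + " objects created");
        objectsCreated = 0;
        System.out.println(countOnly(0) + " datasets, " + objectsCreated + " objects created");
    }
}
```

Both approaches return the same number; the difference is the per-row allocation, which in a real persistence layer also includes fetching and hydrating each entity.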

@pdurbin pdurbin moved this to Ready for Triage in IQSS Dataverse Project Jun 5, 2025
@ofahimIQSS ofahimIQSS added the Size: 10 A percentage of a sprint. 7 hours. label Jun 17, 2025
@ofahimIQSS
Contributor

To be discussed during next week's triage.

@ofahimIQSS ofahimIQSS moved this from Ready for Triage to Ready for Review ⏩ in IQSS Dataverse Project Jun 24, 2025
@cmbz cmbz added the FY25 Sprint 26 FY25 Sprint 26 (2025-06-18 - 2025-07-02) label Jun 24, 2025
@cmbz cmbz added the FY26 Sprint 1 FY26 Sprint 1 (2025-07-02 - 2025-07-16) label Jul 2, 2025
@stevenwinship
Contributor

Please resolve the conflicts.

@stevenwinship stevenwinship moved this from Ready for Review ⏩ to In Review 🔎 in IQSS Dataverse Project Jul 16, 2025
@cmbz cmbz added the FY26 Sprint 2 FY26 Sprint 2 (2025-07-16 - 2025-07-30) label Jul 17, 2025
@stevenwinship stevenwinship added the Status: Needs Input Applied to issues in need of input from someone currently unavailable label Jul 23, 2025
@stevenwinship stevenwinship removed their assignment Jul 23, 2025
@stevenwinship stevenwinship moved this from In Review 🔎 to Ready for Review ⏩ in IQSS Dataverse Project Jul 23, 2025
@stevenwinship stevenwinship moved this from Ready for Review ⏩ to In Review 🔎 in IQSS Dataverse Project Jul 23, 2025
@stevenwinship stevenwinship self-assigned this Jul 23, 2025
@vera
Contributor Author

vera commented Jul 30, 2025

@stevenwinship the conflicts are now resolved.

@cmbz cmbz added the FY26 Sprint 3 (2025-07-30 - 2025-08-13) label Jul 31, 2025
@stevenwinship stevenwinship removed the Status: Needs Input Applied to issues in need of input from someone currently unavailable label Aug 6, 2025
@github-project-automation github-project-automation bot moved this from In Review 🔎 to Ready for QA ⏩ in IQSS Dataverse Project Aug 6, 2025
@ofahimIQSS ofahimIQSS self-assigned this Aug 6, 2025
@ofahimIQSS ofahimIQSS moved this from Ready for QA ⏩ to QA ✅ in IQSS Dataverse Project Aug 6, 2025
bumped version number
Contributor

@ofahimIQSS ofahimIQSS left a comment

Approved the version bump; tests are passing. Merging.

@github-project-automation github-project-automation bot moved this from QA ✅ to Ready for QA ⏩ in IQSS Dataverse Project Aug 7, 2025
@ofahimIQSS ofahimIQSS merged commit 3433877 into IQSS:develop Aug 7, 2025
16 of 17 checks passed
@github-project-automation github-project-automation bot moved this from Ready for QA ⏩ to Merged 🚀 in IQSS Dataverse Project Aug 7, 2025
@ofahimIQSS ofahimIQSS removed their assignment Aug 7, 2025
@pdurbin pdurbin added this to the 6.8 milestone Aug 7, 2025
@scolapasta scolapasta moved this from Merged 🚀 to Done 🧹 in IQSS Dataverse Project Aug 11, 2025
@cmbz cmbz added the FY26 Sprint 4 FY26 Sprint 4 (2025-08-13 - 2025-08-27) label Aug 16, 2025
@pdurbin pdurbin mentioned this pull request Sep 15, 2025

Labels

  • FY25 Sprint 26 (2025-06-18 - 2025-07-02)
  • FY26 Sprint 1 (2025-07-02 - 2025-07-16)
  • FY26 Sprint 2 (2025-07-16 - 2025-07-30)
  • FY26 Sprint 3 (2025-07-30 - 2025-08-13)
  • FY26 Sprint 4 (2025-08-13 - 2025-08-27)
  • Size: 10 (A percentage of a sprint. 7 hours.)

Projects

Status: Done 🧹

Development

Successfully merging this pull request may close these issues.

Feature Request/Idea: Filter dataverses by number of datasets

7 participants