feat: datasetCount for dataverses #11555

vera · 2025-06-05T12:09:48Z

What this PR does / why we need it:

This PR adds a datasetCount for each collection to the search index. The count includes all datasets (published, linked or harvested).

This allows users to filter collections using the datasetCount (e.g., datasetCount:[1000 TO *]). Also, the value is returned in Dataverse search results via the Search API.
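As a rough illustration of the filter above, the sketch below builds a Search API URL that restricts results to collections with at least a given number of datasets. It assumes the standard `q`, `type` and `fq` query parameters of the Search API; the host name and the class and method names are placeholders invented for this example.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class DatasetCountQuery {

    // Builds a Search API URL that only returns collections (type=dataverse)
    // whose indexed datasetCount is at least minDatasets.
    // baseUrl is a placeholder for your own installation.
    static String buildSearchUrl(String baseUrl, int minDatasets) {
        String filter = "datasetCount:[" + minDatasets + " TO *]";
        return baseUrl + "/api/search?q=*&type=dataverse&fq="
                + URLEncoder.encode(filter, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical host; replace with a real installation to try it out.
        System.out.println(buildSearchUrl("https://dataverse.example.edu", 1000));
    }
}
```

The filter value is URL-encoded because the range syntax contains brackets, a colon and spaces.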

As recommended by @pdurbin I've used the fileCount feature PRs as inspiration (#6623 + #10598).

Which issue(s) this PR closes:

Closes Feature Request/Idea: Filter dataverses by number of datasets #10190

Special notes for your reviewer:

Regarding the points commented by @cmbz in the issue (#10190 (comment)):

Some subtlety here: what gets counted? (e.g., linked datasets, harvested datasets)

For simplicity, I've decided to count all datasets that would be shown when visiting the collection in the UI. This means that the count includes published, linked and harvested datasets in the collection or in any subcollections.
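To make the counting rule concrete, here is a minimal in-memory sketch, not the code in this PR. The `Collection` class is a hypothetical stand-in for a collection, and counting each dataset ID only once (even if it is reachable both as an owned and as a linked dataset) is an assumption of this example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DatasetCountSketch {

    // Hypothetical stand-in for a collection; not the Dataverse entity class.
    static final class Collection {
        final Set<Long> ownedDatasetIds = new HashSet<>();   // published or harvested datasets owned here
        final Set<Long> linkedDatasetIds = new HashSet<>();  // datasets linked into this collection
        final List<Collection> children = new ArrayList<>(); // direct subcollections
    }

    // Collects the IDs of every dataset visible from this collection,
    // descending recursively into all subcollections.
    static Set<Long> visibleDatasetIds(Collection c) {
        Set<Long> ids = new HashSet<>(c.ownedDatasetIds);
        ids.addAll(c.linkedDatasetIds);
        for (Collection child : c.children) {
            ids.addAll(visibleDatasetIds(child));
        }
        return ids;
    }

    // Assumption: a dataset reachable by several routes is counted once.
    static int datasetCount(Collection c) {
        return visibleDatasetIds(c).size();
    }
}
```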

If there is interest in further subdividing the dataset count, more subcounts could be added in the future (e.g. publishedDatasetCount, linkedDatasetCount, harvestedDatasetCount).

Considerations: Does this add overhead to reindexing process?

I've had to add new dataverse indexing calls in the following locations:

After a dataset is published, the owning dataverse and any linking dataverses are reindexed. (Added to this PR after #11491, "feat: allow linking unpublished datasets to collections".)
After dataset destruction, previously only the owning dataverse was reindexed. Now, any linking dataverse is also reindexed.
After linking a dataset to a dataverse, the dataverse is reindexed.
After a harvest of datasets completes, the target dataverse is reindexed.
After a harvest client is deleted, the target dataverse is reindexed.
In general, if the reindexed dataverse is not the root dataverse, all of its parents up until the root are also reindexed.
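The parent walk described in the last point can be sketched as follows. This is only an illustration under stated assumptions: the `Node` class and `collectionsToReindex` method are hypothetical names, and the real code would trigger indexing calls rather than return aliases.

```java
import java.util.ArrayList;
import java.util.List;

class ReindexChainSketch {

    // Hypothetical stand-in for a collection with a pointer to its parent;
    // the root collection has a null parent.
    static final class Node {
        final String alias;
        final Node parent;

        Node(String alias, Node parent) {
            this.alias = alias;
            this.parent = parent;
        }
    }

    // Returns the aliases of the collection itself and every ancestor up to
    // and including the root, in the order they would be reindexed.
    static List<String> collectionsToReindex(Node start) {
        List<String> aliases = new ArrayList<>();
        for (Node n = start; n != null; n = n.parent) {
            aliases.add(n.alias);
        }
        return aliases;
    }
}
```

The cost of a single change therefore grows with the depth of the collection in the tree, not with the total number of collections.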

I believe this overhead is not too large, but let me know what you think.

Suggestions on how to test this:

mvn test -Dtest="SearchIT#testDataverseDatasetCounts"

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

/

Is there a release notes update needed for this change?:

I wrote a short release note.

Additional documentation:

/


coveralls · 2025-06-05T12:23:42Z

coverage: 23.209% (-0.02%) from 23.229%
when pulling 88cff95 on vera:feat/dataset-count
into fdcdba2 on IQSS:develop.

src/main/java/edu/harvard/iq/dataverse/DatasetServiceBean.java

qqmyers · 2025-06-05T12:47:56Z

src/main/java/edu/harvard/iq/dataverse/search/IndexServiceBean.java

+
+        if (dataverse.isReleased()) {
+            // Get all datasets published in this dataverse, including harvested datasets
+            int numberOfPublishedDatasets = dataverseService.findAllDataverseDatasetChildren(dataverse.getId(), true, true).size();


It might be worth testing this versus a custom query at scale (or, alternatively, just testing this versus the current indexing on a large database). I think this should just result in a proxy list where calling size() won't try to instantiate a real list of dataset objects, but I'm not sure. If it does do extra work compared to a plain select count(*) query for the right datasets, it could slow indexing or increase memory requirements significantly.

vera · 2025-06-18

I have been testing this on our test system, which includes a larger dataverse (around 30k datasets). I think you are correct: I am seeing significant slowing of API responses that require reindexing that large dataverse. I will investigate further and update this PR as soon as I can.

vera · 2025-06-24

I've just pushed a commit using a NamedQuery to calculate the datasetCount. With our 30k dataset dataverse, reindexing the dataverse is now speedy again.

ofahimIQSS · 2025-06-17T15:15:31Z

To be discussed during next week's triage.

stevenwinship · 2025-07-15T18:20:25Z

Please resolve the conflicts.

vera · 2025-07-30T08:37:11Z

@stevenwinship the conflicts are now resolved.


ofahimIQSS

Approved the version bump; tests are passing. Merging.

vera added 6 commits June 4, 2025 15:42

feat: index datasetCount of dataverses

fd741c9

test: add test for filtering + retrieving datasetCount of dataverses using search API

e30d214

docs: add release note for datasetCount

dc8d680

feat: include harvested datasets in datasetCount

b5f2580

feat: don't include unlinked or destroyed datasets in datasetCount

909a738

docs: update release note for datasetCount

6d302ba

qqmyers reviewed Jun 5, 2025


src/main/java/edu/harvard/iq/dataverse/DatasetServiceBean.java (outdated, resolved)

qqmyers reviewed Jun 5, 2025


fix: improve code + add comments in findIdsByOwnerId

7eee55e

pdurbin added this to IQSS Dataverse Project Jun 5, 2025

pdurbin moved this to Ready for Triage in IQSS Dataverse Project Jun 5, 2025

ofahimIQSS added the Size: 10 label (a percentage of a sprint; 7 hours) Jun 17, 2025

vera added 2 commits June 18, 2025 17:03

feat: reindex linking dataverses when dataset is published

d8a28f3

feat: improve performance of calculating datasetCount using named query

fbc23cd

ofahimIQSS moved this from Ready for Triage to Ready for Review ⏩ in IQSS Dataverse Project Jun 24, 2025

cmbz added the FY25 Sprint 26 (2025-06-18 - 2025-07-02) label Jun 24, 2025

cmbz added the FY26 Sprint 1 (2025-07-02 - 2025-07-16) label Jul 2, 2025

feat: include datasets in sub-dataverses in datasetCount

b29972b

stevenwinship assigned vera and stevenwinship Jul 15, 2025

stevenwinship moved this from Ready for Review ⏩ to In Review 🔎 in IQSS Dataverse Project Jul 16, 2025

cmbz added the FY26 Sprint 2 (2025-07-16 - 2025-07-30) label Jul 17, 2025

stevenwinship added the Status: Needs Input label (applied to issues in need of input from someone currently unavailable) Jul 23, 2025

stevenwinship removed their assignment Jul 23, 2025

stevenwinship moved this from In Review 🔎 to Ready for Review ⏩ in IQSS Dataverse Project Jul 23, 2025

stevenwinship moved this from Ready for Review ⏩ to In Review 🔎 in IQSS Dataverse Project Jul 23, 2025

stevenwinship self-assigned this Jul 23, 2025

Merge branch 'develop' into feat/dataset-count

8b851c8

cmbz added the FY26 Sprint 3 (2025-07-30 - 2025-08-13) label Jul 31, 2025

stevenwinship removed the Status: Needs Input label (applied to issues in need of input from someone currently unavailable) Aug 6, 2025

stevenwinship approved these changes Aug 6, 2025


github-project-automation bot moved this from In Review 🔎 to Ready for QA ⏩ in IQSS Dataverse Project Aug 6, 2025

stevenwinship unassigned vera and stevenwinship Aug 6, 2025

ofahimIQSS self-assigned this Aug 6, 2025

ofahimIQSS moved this from Ready for QA ⏩ to QA ✅ in IQSS Dataverse Project Aug 6, 2025

Update pom.xml

88cff95

bumped version number

ofahimIQSS approved these changes Aug 7, 2025


github-project-automation bot moved this from QA ✅ to Ready for QA ⏩ in IQSS Dataverse Project Aug 7, 2025

ofahimIQSS merged commit 3433877 into IQSS:develop Aug 7, 2025
16 of 17 checks passed

github-project-automation bot moved this from Ready for QA ⏩ to Merged 🚀 in IQSS Dataverse Project Aug 7, 2025

ofahimIQSS removed their assignment Aug 7, 2025

pdurbin added this to the 6.8 milestone Aug 7, 2025

scolapasta moved this from Merged 🚀 to Done 🧹 in IQSS Dataverse Project Aug 11, 2025

cmbz added the FY26 Sprint 4 (2025-08-13 - 2025-08-27) label Aug 16, 2025

pdurbin mentioned this pull request Sep 15, 2025

6.8 release notes #11816

Merged
