-
Notifications
You must be signed in to change notification settings - Fork 531
feat: datasetCount for dataverses #11555
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
||
| if (dataverse.isReleased()) { | ||
| // Get all datasets published in this dataverse, including harvested datasets | ||
| int numberOfPublishedDatasets = dataverseService.findAllDataverseDatasetChildren(dataverse.getId(), true, true).size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be worth testing this versus a custom query at scale (or, alternatively, to just test this versus the current indexing on a large database). I think this should just result in a proxy list where calling size() won't try to instantiate a real list of dataset objects, but I'm not sure. If it does do extra work compared to just a select count(*) query for the right datasets, it could slow or increase memory requirements significantly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have been testing this with our test system which includes a larger dataverse (around 30k datasets) and I think that you are correct and that I am seeing some significant slowing of API responses which require reindexing of that large dataverse. I will investigate this further and update this PR as soon as I am able.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just pushed a commit using a NamedQuery to calculate the datasetCount. With our 30k dataset dataverse, reindexing the dataverse is now speedy again.
|
to be discussed during next weeks triage. |
|
Please resolve the conflicts. |
|
@stevenwinship the conflicts are now resolved. |
bumped version number
ofahimIQSS
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
approved version bump, tests are passing - merging.
What this PR does / why we need it:
This PR adds a
datasetCountfor each collection to the search index. The count includes all datasets (published, linked or harvested).This allows users to filter collections using the datasetCount (e.g.,
datasetCount:[1000 TO *]). Also, the value is returned in Dataverse search results via the Search API.As recommended by @pdurbin I've used the
fileCountfeature PRs as inspiration (#6623 + #10598).Which issue(s) this PR closes:
Special notes for your reviewer:
Regarding the points commented by @cmbz in the issue (#10190 (comment)):
For simplicity, I've decided to count all datasets that would be shown when visiting the collection in the UI. This means that the count includes published, linked and harvested datasets in the collection or in any subcollections.
If there is interest in further subdividing the dataset count, more subcounts could be added in the future (e.g.
publishedDatasetCount,linkedDatasetCount,harvestedDatasetCount).I've had to add new dataverse indexing calls in the following locations:
I believe this overhead is not too large, but let me know what you think.
Suggestions on how to test this:
mvn test -Dtest="SearchIT#testDataverseDatasetCounts"Does this PR introduce a user interface change? If mockups are available, please link/include them here:
/
Is there a release notes update needed for this change?:
I wrote a short release note.
Additional documentation:
/