@qqmyers
Member

@qqmyers qqmyers commented May 15, 2025

What this PR does / why we need it: This PR addresses two problems with full-text indexing:

  • Actively embargoed files which are not also restricted (or otherwise non-public) were still being full-text indexed
  • Files in Globus and not 'accessible to Dataverse' were causing an exception in the log when full-text indexed due to a problem with the isDataverseAccessible method.
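
The first problem comes down to a gating condition: a file should be full-text indexed only if it is neither restricted nor actively embargoed. A minimal sketch of that condition follows. The class, field, and method names are hypothetical and are not the actual Dataverse code, and it assumes an embargo counts as active while its availability date is still in the future.

```java
import java.time.LocalDate;

// Hypothetical, simplified stand-in for the file state the indexer looks at.
final class IndexedFileSketch {
    final boolean restricted;
    final LocalDate embargoAvailableOn; // null means the file has no embargo

    IndexedFileSketch(boolean restricted, LocalDate embargoAvailableOn) {
        this.restricted = restricted;
        this.embargoAvailableOn = embargoAvailableOn;
    }
}

final class FullTextGateSketch {
    // Assumption: an embargo is active while its availability date is in the future.
    static boolean isActivelyEmbargoed(IndexedFileSketch file, LocalDate today) {
        return file.embargoAvailableOn != null && file.embargoAvailableOn.isAfter(today);
    }

    // Content is indexed only when the file is neither restricted nor embargoed.
    static boolean mayIndexFullText(IndexedFileSketch file, LocalDate today) {
        return !file.restricted && !isActivelyEmbargoed(file, today);
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2025, 5, 15);
        IndexedFileSketch embargoedOnly = new IndexedFileSketch(false, LocalDate.of(2026, 1, 1));
        IndexedFileSketch open = new IndexedFileSketch(false, null);
        System.out.println("embargoed, not restricted: " + mayIndexFullText(embargoedOnly, today));
        System.out.println("open file: " + mayIndexFullText(open, today));
    }
}
```

The embargoed-but-unrestricted case is the one that was slipping through: checking the restricted flag alone would let it be indexed.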

Which issue(s) this PR closes:

  • Closes #

Special notes for your reviewer:
This PR is only a few lines, but is built on #11374 which makes many changes in this part of the code. Nominally #11374 will soon be merged, so checking the PR after that makes more sense.

The original design of the static isDataverseAccessible(String driverId) method caused the getInputStream method for Globus to throw an exception rather than return a null stream, whereas all other possible failures return null. In this PR, I changed it to return null. I also added a new isDataverseAccessible() instance method to StorageIO that defaults to true. The Globus store overrides this to call the static method, which looks up the relevant property to decide. This change is efficient for non-Globus stores and a convenience for Globus ones (in full-text indexing, we have a StorageIO object but don't have the driverId needed to call the static method).
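
The shape of that design can be sketched roughly as below. This is an illustration under stated assumptions, not the code in the PR: every class name ending in Sketch is hypothetical, the driver ids are made up, and a small map stands in for the real property lookup done by the static method.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical stand-in for StorageIO: stores are accessible by default.
abstract class StorageIOSketch {
    boolean isDataverseAccessible() {
        return true;
    }

    // Returns null when no stream can be provided.
    abstract InputStream getInputStream();
}

class LocalStoreSketch extends StorageIOSketch {
    @Override
    InputStream getInputStream() {
        return new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
    }
}

class GlobusStoreSketch extends StorageIOSketch {
    // Assumption: this map replaces the real per-driver property lookup.
    private static final Map<String, Boolean> ACCESSIBLE =
            Map.of("globusManaged", true, "globusRemote", false);
    private final String driverId;

    GlobusStoreSketch(String driverId) {
        this.driverId = driverId;
    }

    // Static variant: callers must already know the driver id.
    static boolean isDataverseAccessible(String driverId) {
        return ACCESSIBLE.getOrDefault(driverId, true);
    }

    // Instance variant: delegates to the static lookup using the store's own id.
    @Override
    boolean isDataverseAccessible() {
        return isDataverseAccessible(driverId);
    }

    @Override
    InputStream getInputStream() {
        if (!isDataverseAccessible()) {
            return null; // null, not an exception, like the other failure cases
        }
        return new ByteArrayInputStream("globus".getBytes(StandardCharsets.UTF_8));
    }
}

final class FullTextReaderSketch {
    // The indexer only holds a store object, so it uses the instance method.
    static String readText(StorageIOSketch io) {
        if (!io.isDataverseAccessible()) {
            return null;
        }
        try (InputStream in = io.getInputStream()) {
            return in == null ? null : new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(readText(new LocalStoreSketch()));
        System.out.println(readText(new GlobusStoreSketch("globusRemote")));
    }
}
```

With this arrangement the indexing code needs no Globus-specific branch and no driver id: non-Globus stores answer true without any lookup, and an inaccessible Globus store is simply skipped.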

Suggestions on how to test this: Add an embargoed file (not restricted), publish, and check whether its content is visible to search. To test the Globus part, reindex a dataset with Globus/NESE files with full-text indexing on, and verify that there's no null exception in the logs.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?: Not sure - #11374 makes broad indexing changes - its release note probably covers this.

Additional documentation:

@qqmyers qqmyers added the Size: 3 A percentage of a sprint. 2.1 hours. label May 15, 2025
@qqmyers qqmyers moved this to Ready for Triage in IQSS Dataverse Project May 15, 2025
@qqmyers qqmyers added this to the 6.7 milestone May 15, 2025
@scolapasta scolapasta moved this from Ready for Triage to Ready for Review ⏩ in IQSS Dataverse Project May 20, 2025
@cmbz cmbz added FY25 Sprint 23 FY25 Sprint 23 (2025-05-07 - 2025-05-21) FY25 Sprint 24 FY25 Sprint 24 (2025-05-21 - 2025-06-04) labels May 20, 2025
@coveralls

Coverage Status

coverage: 23.114%. remained the same
when pulling 5e6d1a7 on GlobalDataverseCommunityConsortium:indexing_fixes
into 5db10ea on IQSS:develop.

@sekmiller sekmiller moved this from Ready for Review ⏩ to In Review 🔎 in IQSS Dataverse Project Jun 3, 2025
@sekmiller sekmiller self-assigned this Jun 3, 2025
@github-project-automation github-project-automation bot moved this from In Review 🔎 to Ready for QA ⏩ in IQSS Dataverse Project Jun 3, 2025
@sekmiller sekmiller removed their assignment Jun 3, 2025
@ofahimIQSS ofahimIQSS self-assigned this Jun 3, 2025
@ofahimIQSS ofahimIQSS moved this from Ready for QA ⏩ to QA ✅ in IQSS Dataverse Project Jun 3, 2025
@ofahimIQSS
Contributor

  1. I came across some issues while testing globus in internal. When I try to upload files via globus to my dataset, I get a timeout error and the files aren't transferred over.
    https://github.com/user-attachments/assets/2eeaf1f2-1dc6-436c-8fea-48f58d346bb8

  2. For this dataset: https://dataverse-internal.iq.harvard.edu/dataset.xhtml?persistentId=doi%3A10.70122%2FFK2%2FQZQPQE&version=DRAFT --- I am unable to publish this dataset for some reason. I hit publish then refresh after a moment - dataset still showing draft status.
    https://github.com/user-attachments/assets/fa8c9772-4ea4-4d8a-bfcb-2b0d3e828b82

Adding @landreev

@qqmyers
Member Author

qqmyers commented Jun 3, 2025

  1. is easy I think - the cloud where NEESTape is is down for maintenance through Wed. I would have expected Globus to just report not being done and the Globus lock staying on the dataset until the endpoint came back up on Wed. evening.

  2. appears to be due to some files having an empty-string separator in the dvobject table (including for this dataset). As far as I can tell, there's a set of them from before ~Feb. 2025. I'm not sure why that might be - possibly a bug or config issue (I think we used to have a FAKE DOI provider using the valid DataCite test authority/shoulder, and I switched it this spring to use the Dataset test account when we had config issues)? The relevant code is in

    if (dvObject.getSeparator() == null) {
        dvObject.setSeparator(getSeparator());
    } else {
        if (!dvObject.getSeparator().equals(getSeparator())) {
            logger.warning("The separator of the DvObject (" + dvObject.getSeparator()
                    + ") does not match the configured separator (" + getSeparator() + ")");
            throw new IllegalArgumentException("The separator of the DvObject (" + dvObject.getSeparator()
                    + ") doesn't match that of the provider, id: " + getId());
        }
    }
    where you can see that an empty but non-null separator results in an error for a DOI where the separator for the provider is hardcoded to '/' - I see the warning shown in the log. If we think this is in any production db, we probably can/should have an issue to fix it.
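
To make the three cases concrete, here is a small self-contained sketch of the same check. The DvObjectSketch and DoiProviderSketch classes and the provider id are hypothetical stand-ins, not the real Dataverse classes. The provider separator is fixed to '/' as described above.

```java
// Hypothetical stand-in for a DvObject that only carries a separator.
final class DvObjectSketch {
    private String separator;

    DvObjectSketch(String separator) {
        this.separator = separator;
    }

    String getSeparator() {
        return separator;
    }

    void setSeparator(String separator) {
        this.separator = separator;
    }
}

// Hypothetical stand-in for a DOI provider whose separator is always "/".
final class DoiProviderSketch {
    String getSeparator() {
        return "/";
    }

    String getId() {
        return "sketch-provider"; // made-up id, for illustration only
    }

    // Same logic as the quoted check: null is filled in, a mismatch is rejected.
    void checkSeparator(DvObjectSketch dvObject) {
        if (dvObject.getSeparator() == null) {
            dvObject.setSeparator(getSeparator());
        } else if (!dvObject.getSeparator().equals(getSeparator())) {
            throw new IllegalArgumentException("The separator of the DvObject ("
                    + dvObject.getSeparator() + ") doesn't match that of the provider, id: " + getId());
        }
    }

    public static void main(String[] args) {
        DoiProviderSketch provider = new DoiProviderSketch();
        // null is repaired, "/" matches, and "" is non-null but different.
        for (String stored : new String[] {null, "/", ""}) {
            try {
                provider.checkSeparator(new DvObjectSketch(stored));
                System.out.println("accepted: " + stored);
            } catch (IllegalArgumentException e) {
                System.out.println("rejected: " + e.getMessage());
            }
        }
    }
}
```

Only the null case is repaired in place. An empty string is non-null, so it falls through to the equality test and fails against "/".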

In both cases, it should not be relevant to the PR and I think testing can either use new datasets (for the non-globus, embargo issue), or reindexing of earlier Globus datasets (i.e. by using the index dataset api) (to check for Globus/full text warnings).

@qqmyers
Member Author

qqmyers commented Jun 3, 2025

I created a bug report - #11546 with more analysis of issue 2)

@cmbz cmbz added the FY25 Sprint 25 FY25 Sprint 25 (2025-06-04 - 2025-06-18) label Jun 4, 2025
@ofahimIQSS
Contributor

Merging this - was able to finally test with globus up and running.

@ofahimIQSS ofahimIQSS merged commit 0f83d78 into IQSS:develop Jun 16, 2025
16 checks passed
@github-project-automation github-project-automation bot moved this from QA ✅ to Merged 🚀 in IQSS Dataverse Project Jun 16, 2025
@ofahimIQSS ofahimIQSS removed their assignment Jun 16, 2025
@pdurbin pdurbin moved this from Merged 🚀 to Done 🧹 in IQSS Dataverse Project Jun 17, 2025
@cmbz cmbz added the FY26 Sprint 4 FY26 Sprint 4 (2025-08-13 - 2025-08-27) label Aug 16, 2025
5 participants