Commit f3395cc

Merge pull request #11832 from QualitativeDataRepository/QDR-DCiteScaling
QDR-DataCite Scaling
2 parents 5e47014 + 5f51d92 commit f3395cc

File tree

6 files changed

+135
-13
lines changed

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,14 @@
1+
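This release adds retries for calls to DataCite when its server is overloaded or Dataverse has hit the DataCite rate limit (HTTP 429, 503, or 504 responses, or an I/O error). Each call is attempted up to five times, ten seconds apart.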
2+
3+
It also introduces an option to only update DataCite metadata after checking to see if the current DataCite information is out of date.
4+
(This adds a request to get information from DataCite before any potential write of new information, which is more efficient when
5+
most DOIs have not changed but results in an extra call to get information when a DOI has changed.)
6+
7+
Both of these can help when DataCite is being used heavily, e.g. when creating and publishing datasets with many datafiles while using file DOIs,
8+
or when doing bulk operations that involve DataCite across many datasets.
9+
10+
### New Settings
11+
12+
- dataverse.feature.only-update-datacite-when-needed
13+
14+
The default is false: Dataverse does not check whether DataCite's information is out of date before sending an update.
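To make the trade-off described in this release note concrete, here is a rough call-count model. It is only a sketch under stated assumptions: one metadata read per DOI when the flag is on, one metadata write per DOI that actually needs updating, and target-URL calls ignored. The class and method names are illustrative and are not part of Dataverse.

```java
public class DataCiteCallEstimate {

    /** Flag off (assumed model): every DOI is written without a prior read. Returns {reads, writes}. */
    public static long[] withoutCheck(long dois) {
        return new long[] {0L, dois};
    }

    /** Flag on (assumed model): every DOI is read once; only changed DOIs are written. Returns {reads, writes}. */
    public static long[] withCheck(long dois, long changed) {
        if (changed < 0 || changed > dois) {
            throw new IllegalArgumentException("changed must be between 0 and dois");
        }
        return new long[] {dois, changed};
    }

    public static void main(String[] args) {
        long dois = 1000;   // illustrative number of file DOIs
        long changed = 50;  // illustrative number of DOIs whose metadata differs at DataCite
        long[] off = withoutCheck(dois);
        long[] on = withCheck(dois, changed);
        System.out.println("flag off: reads=" + off[0] + " writes=" + off[1]);
        System.out.println("flag on:  reads=" + on[0] + " writes=" + on[1]);
    }
}
```

Under this model, with 1000 DOIs of which 50 changed, the flag replaces 950 writes with 1000 reads. When every DOI is new, it adds 1000 reads on top of the 1000 writes.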

doc/sphinx-guides/source/installation/config.rst

Lines changed: 10 additions & 0 deletions
@@ -556,6 +556,8 @@ dataverse.pid.*.datacite.username
556556
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
557557
dataverse.pid.*.datacite.password
558558
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
559+
dataverse.feature.only-update-datacite-when-needed
560+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
559561

560562
PID Providers of type ``datacite`` require four additional parameters that define how the provider connects to DataCite.
561563
DataCite has two APIs that are used in Dataverse:
@@ -571,6 +573,11 @@ for `Fabrica <https://doi.datacite.org/>`_ and their APIs. You need to provide
571573
the same credentials (``username``, ``password``) to Dataverse software to mint and manage DOIs for you.
572574
As noted above, you should use one of the more secure options for setting the password.
573575

576+
The ``only-update-datacite-when-needed`` feature flag is a global option that causes Dataverse to GET the latest metadata from DataCite
577+
for a DOI, compare it with the current metadata in Dataverse, and send a follow-up POST request only if needed. This potentially
578+
substitutes a read for an unnecessary write at DataCite, but results in an extra read when the metadata in Dataverse is new.
579+
Setting the flag to "true" is recommended when using DataCite file DOIs.
580+
574581
CrossRef-specific Settings
575582
^^^^^^^^^^^^^^^^^^^^^^^^^^
576583

@@ -3843,6 +3850,9 @@ please find all known feature flags below. Any of these flags can be activated u
38433850
* - role-assignment-history
38443851
- Turns on tracking/display of role assignments and revocations for collections, datasets, and files
38453852
- ``Off``
3853+
* - only-update-datacite-when-needed
3854+
- Only contact DataCite to update a DOI after checking to see if DataCite has outdated information (for efficiency, lighter load on DataCite, especially when using file DOIs).
3855+
- ``Off``
38463856

38473857
**Note:** Feature flags can be set via any `supported MicroProfile Config API source`_, e.g. the environment variable
38483858
``DATAVERSE_FEATURE_XXX`` (e.g. ``DATAVERSE_FEATURE_API_SESSION_AUTH=1``). These environment variables can be set in your shell before starting Payara. If you are using :doc:`Docker for development </container/dev-usage>`, you can set them in the `docker compose <https://docs.docker.com/compose/environment-variables/set-environment-variables/>`_ file.
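As a sketch of how the ``DATAVERSE_FEATURE_XXX`` naming in the note above relates to a property key, the following helper applies the MicroProfile Config environment-variable convention: replace every character that is neither alphanumeric nor an underscore with an underscore, then upper-case the result. The class and method names are hypothetical and do not exist in Dataverse.

```java
import java.util.Locale;

public class FeatureFlagEnvName {

    /** Derives the environment variable name for a config property key (MicroProfile Config convention). */
    public static String toEnvName(String property) {
        // Every character outside [A-Za-z0-9_] becomes '_', then the whole name is upper-cased.
        return property.replaceAll("[^A-Za-z0-9_]", "_").toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(toEnvName("dataverse.feature.api-session-auth"));
        System.out.println(toEnvName("dataverse.feature.only-update-datacite-when-needed"));
    }
}
```

Applied to the flag added in this commit, the helper yields ``DATAVERSE_FEATURE_ONLY_UPDATE_DATACITE_WHEN_NEEDED``.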

src/main/java/edu/harvard/iq/dataverse/pidproviders/doi/datacite/DOIDataCiteRegisterService.java

Lines changed: 8 additions & 1 deletion
@@ -95,7 +95,14 @@ public String reRegisterIdentifier(String identifier, Map<String, String> metada
9595
}
9696
retString = "metadata:\\r" + client.postMetadata(xmlMetadata) + "\\r";
9797
}
98-
if (!target.equals(client.getUrl(numericIdentifier))) {
98+
String currentUrl = null;
99+
try {
100+
// May get a 204 if the DOI is still in draft state
101+
currentUrl = client.getUrl(numericIdentifier);
102+
} catch (RuntimeException ex) {
103+
logger.fine("Error getting Url for " + numericIdentifier + ": " + ex.getMessage());
104+
}
105+
if (!target.equals(currentUrl)) {
99106
logger.info("Updating target URL to " + target);
100107
client.postUrl(numericIdentifier, target);
101108
retString = retString + "url:\\r" + target;
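The hunk above makes the target-URL comparison tolerant of a failed lookup. A minimal standalone sketch of that pattern follows. The `Supplier` stands in for `client.getUrl(...)`, and the class and method names are hypothetical.

```java
import java.util.function.Supplier;

public class TargetUrlCheck {

    /**
     * Returns true when the target URL should be (re)posted.
     * A failed lookup leaves currentUrl null, which never equals a non-null target,
     * so the URL is posted in that case.
     */
    public static boolean needsUrlUpdate(String target, Supplier<String> lookup) {
        String currentUrl = null;
        try {
            currentUrl = lookup.get();
        } catch (RuntimeException ex) {
            // Lookup failed (for example, the DOI is still a draft); treat the URL as unknown.
        }
        return !target.equals(currentUrl);
    }

    public static void main(String[] args) {
        String target = "https://example.org/dataset"; // illustrative URL
        System.out.println(needsUrlUpdate(target, () -> target));
        System.out.println(needsUrlUpdate(target, () -> { throw new RuntimeException("no URL yet"); }));
    }
}
```

Calling `target.equals(currentUrl)` rather than the reverse keeps the comparison null-safe, provided `target` itself is not null.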

src/main/java/edu/harvard/iq/dataverse/pidproviders/doi/datacite/DataCiteDOIProvider.java

Lines changed: 6 additions & 1 deletion
@@ -15,6 +15,7 @@
1515
import edu.harvard.iq.dataverse.FileMetadata;
1616
import edu.harvard.iq.dataverse.GlobalId;
1717
import edu.harvard.iq.dataverse.pidproviders.doi.AbstractDOIProvider;
18+
import edu.harvard.iq.dataverse.settings.FeatureFlags;
1819
import edu.harvard.iq.dataverse.util.json.JsonUtil;
1920
import jakarta.json.JsonObject;
2021

@@ -217,7 +218,11 @@ public boolean publicizeIdentifier(DvObject dvObject) {
217218
metadata.put("datacite.publicationyear", generateYear(dvObject));
218219
metadata.put("_target", getTargetUrl(dvObject));
219220
try {
220-
doiDataCiteRegisterService.registerIdentifier(identifier, metadata, dvObject);
221+
if (FeatureFlags.ONLY_UPDATE_DATACITE_WHEN_NEEDED.enabled()) {
222+
doiDataCiteRegisterService.reRegisterIdentifier(identifier, metadata, dvObject);
223+
} else {
224+
doiDataCiteRegisterService.registerIdentifier(identifier, metadata, dvObject);
225+
}
221226
return true;
222227
} catch (Exception e) {
223228
logger.log(Level.WARNING, "modifyMetadata failed: " + e.getMessage(), e);
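When the flag is enabled, the hunk above routes publication through `reRegisterIdentifier`, which the documentation describes as a read, a comparison, and a write only if needed. The sketch below illustrates that read-compare-write idea against an in-memory stand-in. `FakeRegistry` and `updateIfNeeded` are hypothetical names, and the plain string comparison is an assumption, since this diff does not show how Dataverse compares metadata.

```java
import java.util.HashMap;
import java.util.Map;

public class CompareThenWrite {

    /** Hypothetical in-memory stand-in for a remote registry; counts reads and writes. */
    public static class FakeRegistry {
        private final Map<String, String> stored = new HashMap<>();
        public int reads = 0;
        public int writes = 0;

        public String get(String doi) {
            reads++;
            return stored.get(doi); // null when the DOI is unknown
        }

        public void post(String doi, String metadata) {
            writes++;
            stored.put(doi, metadata);
        }
    }

    /** Reads the remote copy and writes only when it differs. Returns true if a write was sent. */
    public static boolean updateIfNeeded(FakeRegistry registry, String doi, String localMetadata) {
        String remote = registry.get(doi);
        if (localMetadata.equals(remote)) {
            return false; // remote copy is already current, skip the write
        }
        registry.post(doi, localMetadata);
        return true;
    }

    public static void main(String[] args) {
        FakeRegistry registry = new FakeRegistry();
        String doi = "10.5072/FK2/EXAMPLE"; // illustrative DOI
        System.out.println(updateIfNeeded(registry, doi, "<resource/>"));
        System.out.println(updateIfNeeded(registry, doi, "<resource/>"));
        System.out.println("reads=" + registry.reads + " writes=" + registry.writes);
    }
}
```

The first call costs one read and one write; every later call with unchanged metadata costs a single read.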

src/main/java/edu/harvard/iq/dataverse/pidproviders/doi/datacite/DataCiteRESTfullClient.java

Lines changed: 82 additions & 11 deletions
@@ -41,6 +41,10 @@ public class DataCiteRESTfullClient implements Closeable {
4141

4242
private static final Logger logger = Logger.getLogger(DataCiteRESTfullClient.class.getCanonicalName());
4343

44+
// Constants for retry mechanism
45+
private static final int MAX_RETRIES = 5;
46+
private static final long RETRY_DELAY_MS = 10000; // 10 seconds
47+
4448
private String url;
4549
private CloseableHttpClient httpClient;
4650
private HttpClientContext context;
@@ -59,11 +63,78 @@ public DataCiteRESTfullClient(String url, String username, String password) {
5963
public void close() {
6064
if (this.httpClient != null) {
6165
try {
62-
httpClient.close();
66+
httpClient.close();
6367
} catch (IOException io) {
64-
logger.warning("IOException closing hhtpClient: " + io.getMessage());
65-
}
68+
logger.warning("IOException closing httpClient: " + io.getMessage());
69+
}
70+
}
71+
}
72+
73+
/**
74+
* Execute HTTP request with retry mechanism for specific status codes
75+
*
76+
* @param request The HTTP request to execute
77+
* @param operationName Name of the operation for logging
78+
* @return HttpResponse The response from the server
79+
* @throws IOException If an error occurs during the request
80+
*/
81+
private HttpResponse executeWithRetry(org.apache.http.client.methods.HttpRequestBase request, String operationName) throws IOException {
82+
int attempts = 0;
83+
IOException lastException = null;
84+
85+
while (attempts < MAX_RETRIES) {
86+
try {
87+
HttpResponse response = httpClient.execute(request, context);
88+
int statusCode = response.getStatusLine().getStatusCode();
89+
90+
// If we get a retry status code, try again after delay
91+
if (statusCode == 429 || statusCode == 503 || statusCode == 504) {
92+
EntityUtils.consumeQuietly(response.getEntity());
93+
attempts++;
94+
95+
if (attempts < MAX_RETRIES) {
96+
logger.warning("DataCite API returned status " + statusCode +
97+
" for " + operationName + ". Retrying in " +
98+
(RETRY_DELAY_MS / 1000) + " seconds (attempt " + attempts + " of " + MAX_RETRIES + ")");
99+
try {
100+
Thread.sleep(RETRY_DELAY_MS);
101+
} catch (InterruptedException ie) {
102+
Thread.currentThread().interrupt();
103+
throw new IOException("Retry interrupted", ie);
104+
}
105+
} else {
106+
logger.severe("DataCite API failed with status " + statusCode +
107+
" for " + operationName + " after " + MAX_RETRIES + " attempts");
108+
return response; // Return the last failed response
109+
}
110+
} else {
111+
// Success or non-retry error code
112+
return response;
113+
}
114+
} catch (IOException ioe) {
115+
lastException = ioe;
116+
attempts++;
117+
118+
if (attempts < MAX_RETRIES) {
119+
logger.warning("IOException during " + operationName + ": " + ioe.getMessage() +
120+
". Retrying in " + (RETRY_DELAY_MS / 1000) + " seconds (attempt " +
121+
attempts + " of " + MAX_RETRIES + ")");
122+
try {
123+
Thread.sleep(RETRY_DELAY_MS);
124+
} catch (InterruptedException ie) {
125+
Thread.currentThread().interrupt();
126+
throw new IOException("Retry interrupted", ie);
127+
}
128+
} else {
129+
logger.severe("DataCite API failed for " + operationName + " after " +
130+
MAX_RETRIES + " attempts due to: " + ioe.getMessage());
131+
throw lastException;
132+
}
133+
}
66134
}
135+
136+
// This should never happen, but just in case
137+
throw new IOException("Failed to execute request after " + MAX_RETRIES + " attempts");
67138
}
68139

69140
/**
@@ -75,7 +146,7 @@ public void close() {
75146
public String getUrl(String doi) {
76147
HttpGet httpGet = new HttpGet(this.url + "/doi/" + doi);
77148
try {
78-
HttpResponse response = httpClient.execute(httpGet,context);
149+
HttpResponse response = executeWithRetry(httpGet, "getUrl");
79150
HttpEntity entity = response.getEntity();
80151
String data = null;
81152

@@ -104,7 +175,7 @@ public String postUrl(String doi, String url) throws IOException {
104175
httpPost.setHeader("Content-Type", "text/plain;charset=UTF-8");
105176
httpPost.setEntity(new StringEntity("doi=" + doi + "\nurl=" + url, "utf-8"));
106177

107-
HttpResponse response = httpClient.execute(httpPost, context);
178+
HttpResponse response = executeWithRetry(httpPost, "postUrl");
108179
String data = EntityUtils.toString(response.getEntity(), encoding);
109180
if (response.getStatusLine().getStatusCode() != 201) {
110181
String errMsg = "Response from postUrl: " + response.getStatusLine().getStatusCode() + ", " + data;
@@ -124,7 +195,7 @@ public String getMetadata(String doi) {
124195
HttpGet httpGet = new HttpGet(this.url + "/metadata/" + doi);
125196
httpGet.setHeader("Accept", "application/xml");
126197
try {
127-
HttpResponse response = httpClient.execute(httpGet,context);
198+
HttpResponse response = executeWithRetry(httpGet, "getMetadata");
128199
String data = EntityUtils.toString(response.getEntity(), encoding);
129200
if (response.getStatusLine().getStatusCode() != 200) {
130201
String errMsg = "Response from getMetadata: " + response.getStatusLine().getStatusCode() + ", " + data;
@@ -133,7 +204,7 @@ public String getMetadata(String doi) {
133204
}
134205
return data;
135206
} catch (IOException ioe) {
136-
logger.log(Level.SEVERE, "IOException when get metadata");
207+
logger.log(Level.SEVERE, "IOException when get metadata", ioe);
137208
throw new RuntimeException("IOException when get metadata", ioe);
138209
}
139210
}
@@ -147,7 +218,7 @@ public String getMetadata(String doi) {
147218
public boolean testDOIExists(String doi) throws IOException {
148219
HttpGet httpGet = new HttpGet(this.url + "/metadata/" + doi);
149220
httpGet.setHeader("Accept", "application/xml");
150-
HttpResponse response = httpClient.execute(httpGet, context);
221+
HttpResponse response = executeWithRetry(httpGet, "testDOIExists");
151222
if (response.getStatusLine().getStatusCode() != 200) {
152223
EntityUtils.consumeQuietly(response.getEntity());
153224
return false;
@@ -166,7 +237,7 @@ public String postMetadata(String metadata) throws IOException {
166237
HttpPost httpPost = new HttpPost(this.url + "/metadata");
167238
httpPost.setHeader("Content-Type", "application/xml;charset=UTF-8");
168239
httpPost.setEntity(new StringEntity(metadata, "utf-8"));
169-
HttpResponse response = httpClient.execute(httpPost, context);
240+
HttpResponse response = executeWithRetry(httpPost, "postMetadata");
170241
String data = EntityUtils.toString(response.getEntity(), encoding);
171242
if (response.getStatusLine().getStatusCode() != 201) {
172243
String errMsg = "Response from postMetadata: " + response.getStatusLine().getStatusCode() + ", " + data;
@@ -185,7 +256,7 @@ public String postMetadata(String metadata) throws IOException {
185256
public String inactiveDataset(String doi) {
186257
HttpDelete httpDelete = new HttpDelete(this.url + "/metadata/" + doi);
187258
try {
188-
HttpResponse response = httpClient.execute(httpDelete,context);
259+
HttpResponse response = executeWithRetry(httpDelete, "inactiveDataset");
189260
String data = EntityUtils.toString(response.getEntity(), encoding);
190261
if (response.getStatusLine().getStatusCode() != 200) {
191262
String errMsg = "Response code: " + response.getStatusLine().getStatusCode() + ", " + data;
@@ -194,7 +265,7 @@ public String inactiveDataset(String doi) {
194265
}
195266
return data;
196267
} catch (IOException ioe) {
197-
logger.log(Level.SEVERE, "IOException when inactive dataset");
268+
logger.log(Level.SEVERE, "IOException when inactive dataset", ioe);
198269
throw new RuntimeException("IOException when inactive dataset", ioe);
199270
}
200271
}
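The retry loop in `executeWithRetry` can be reduced to a small, testable sketch. It is a simplification, not the code from this commit: an `IntSupplier` stands in for the HTTP call and returns a status code, the I/O exception path is omitted, and the attempt limit and delay are parameters so the example can run without waiting. All names are hypothetical.

```java
import java.util.function.IntSupplier;

public class FixedDelayRetry {

    /** Final status code and the number of attempts that were made. */
    public static final class Result {
        public final int status;
        public final int attempts;

        Result(int status, int attempts) {
            this.status = status;
            this.attempts = attempts;
        }
    }

    /** Status codes treated as retryable, matching the ones checked in the diff above. */
    public static boolean isRetryable(int status) {
        return status == 429 || status == 503 || status == 504;
    }

    /** Calls until a non-retryable status is seen or maxAttempts is reached, sleeping between attempts. */
    public static Result execute(IntSupplier call, int maxAttempts, long delayMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        int attempts = 0;
        while (true) {
            int status = call.getAsInt();
            attempts++;
            if (!isRetryable(status) || attempts >= maxAttempts) {
                return new Result(status, attempts); // success, hard error, or last failed attempt
            }
            try {
                Thread.sleep(delayMs);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt(); // preserve the interrupt for callers
                throw new IllegalStateException("Retry interrupted", ie);
            }
        }
    }

    public static void main(String[] args) {
        Result r = execute(() -> 503, 5, 0L);
        System.out.println("status=" + r.status + " attempts=" + r.attempts);
    }
}
```

With the constants in the diff (five attempts, ten seconds apart), a call that keeps receiving a retryable status sleeps four times, so it takes about 40 seconds before the last failed response is returned to the caller.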

src/main/java/edu/harvard/iq/dataverse/settings/FeatureFlags.java

Lines changed: 15 additions & 0 deletions
@@ -235,6 +235,21 @@ public enum FeatureFlags {
235235
* or revoked, at what times, and by whom.
236236
*/
237237
ROLE_ASSIGNMENT_HISTORY("role-assignment-history"),
238+
239+
/**
240+
* Only update a DataCite DOI when needed (for efficiency, lighter load on DataCite).
241+
* This flag causes Dataverse to GET the latest metadata from DataCite for a DOI,
242+
* compare it with the current metadata in Dataverse, and send a follow-up POST
243+
* request only if needed. This potentially substitutes a read for an unnecessary write at DataCite,
244+
* but results in an extra read when the metadata in Dataverse is new. Setting the flag
245+
* to "true" is recommended when using DataCite file DOIs.
246+
*
247+
* @apiNote Raise flag by setting
248+
* "dataverse.feature.only-update-datacite-when-needed"
249+
* @since Dataverse 6.9
250+
*/
251+
ONLY_UPDATE_DATACITE_WHEN_NEEDED("only-update-datacite-when-needed"),
252+
238253
;
239254

240255
final String flag;
