
Commit 154466c

Merge pull request #11781 from QualitativeDataRepository/IQSS/11777-improve_MDC_Citation_api_scaling
IQSS/11777 improve MDC citation api scaling
2 parents 6cf973e + bd6e4ff commit 154466c

File tree: 9 files changed (+279 additions, -32 deletions)


conf/mdc/counter_weekly.sh

Lines changed: 92 additions & 0 deletions
@@ -0,0 +1,92 @@
#!/bin/sh
#counter_weekly.sh

# This script iterates through all published Datasets in all Dataverses and calls the Make Data Count API to update their citations from DataCite
# Note: Requires curl and jq for parsing JSON responses from curl

# A recursive function to process each Dataverse
processDV () {
  echo "Processing Dataverse ID#: $1"

  # Call the Dataverse API to get the contents of the Dataverse (without credentials, this will only list published datasets and dataverses)
  DVCONTENTS=$(curl -s http://localhost:8080/api/dataverses/$1/contents)

  # Iterate over all datasets, pulling the value of their DOIs (as part of the persistentUrl) from the JSON returned
  for subds in $(echo "${DVCONTENTS}" | jq -r '.data[] | select(.type == "dataset") | .persistentUrl'); do

    # The authority/identifier are preceded by a protocol/host, i.e. https://doi.org/
    DOI=`expr "$subds" : '.*:\/\/doi\.org\/\(.*\)'`

    # Call the Dataverse API for this dataset and capture both the response and HTTP status code
    HTTP_RESPONSE=$(curl -s -w "\n%{http_code}" -X POST "http://localhost:8080/api/admin/makeDataCount/:persistentId/updateCitationsForDataset?persistentId=doi:$DOI")

    # Extract the HTTP status code from the last line
    HTTP_STATUS=$(echo "$HTTP_RESPONSE" | tail -n1)
    # Extract the response body (everything except the last line)
    RESPONSE_BODY=$(echo "$HTTP_RESPONSE" | sed '$d')

    # Check the HTTP status code and report accordingly
    case $HTTP_STATUS in
      200)
        # Successfully queued
        # Extract status from the nested data object
        STATUS=$(echo "$RESPONSE_BODY" | jq -r '.data.status')

        # Extract message from the nested data object
        if echo "$RESPONSE_BODY" | jq -e '.data.message' > /dev/null 2>&1 && [ "$(echo "$RESPONSE_BODY" | jq -r '.data.message')" != "null" ]; then
          MESSAGE=$(echo "$RESPONSE_BODY" | jq -r '.data.message')
          echo "[SUCCESS] doi:$DOI - $STATUS: $MESSAGE"
        else
          # If message is missing or null, just show the status
          echo "[SUCCESS] doi:$DOI - $STATUS: Citation update queued"
        fi
        ;;
      400)
        # Bad request
        if echo "$RESPONSE_BODY" | jq -e '.message' > /dev/null 2>&1; then
          ERROR=$(echo "$RESPONSE_BODY" | jq -r '.message')
          echo "[ERROR 400] doi:$DOI - Bad request: $ERROR"
        else
          echo "[ERROR 400] doi:$DOI - Bad request"
        fi
        ;;
      404)
        # Not found
        if echo "$RESPONSE_BODY" | jq -e '.message' > /dev/null 2>&1; then
          ERROR=$(echo "$RESPONSE_BODY" | jq -r '.message')
          echo "[ERROR 404] doi:$DOI - Not found: $ERROR"
        else
          echo "[ERROR 404] doi:$DOI - Not found"
        fi
        ;;
      503)
        # Service unavailable (queue full)
        if echo "$RESPONSE_BODY" | jq -e '.message' > /dev/null 2>&1; then
          ERROR=$(echo "$RESPONSE_BODY" | jq -r '.message')
          echo "[ERROR 503] doi:$DOI - Service unavailable: $ERROR"
        elif echo "$RESPONSE_BODY" | jq -e '.data.message' > /dev/null 2>&1; then
          ERROR=$(echo "$RESPONSE_BODY" | jq -r '.data.message')
          echo "[ERROR 503] doi:$DOI - Service unavailable: $ERROR"
        else
          echo "[ERROR 503] doi:$DOI - Service unavailable: Queue is full"
        fi
        ;;
      *)
        # Other error
        echo "[ERROR $HTTP_STATUS] doi:$DOI - Unexpected error"
        echo "Response: $RESPONSE_BODY"
        ;;
    esac

  done

  # Now iterate over any child Dataverses and recursively process them
  for subdv in $(echo "${DVCONTENTS}" | jq -r '.data[] | select(.type == "dataverse") | .id'); do
    echo $subdv
    processDV $subdv
  done

}

# Call the function on the root dataverse to start processing
processDV 1
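As its name suggests, the script is a natural fit for a weekly cron job. A hypothetical crontab entry (the path, schedule, and log location are placeholders, not part of this commit):

# Hypothetical: run every Sunday at 03:00 and append output to a log
0 3 * * 0 /usr/local/bin/counter_weekly.sh >> /var/log/counter_weekly.log 2>&1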
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
The /api/admin/makeDataCount/{id}/updateCitationsForDataset endpoint, which allows citations for a dataset to be retrieved from DataCite, is often called periodically for all datasets. However, allowing calls for many datasets to be processed in parallel can cause performance problems in Dataverse and/or cause calls to DataCite to fail due to rate limiting. The existing implementation was also inefficient with respect to memory use when called on datasets with many (>~1K) files. This release configures Dataverse to queue calls to this API and process them serially, adds optional throttling to avoid hitting DataCite rate limits, and improves memory use.

New optional MPConfig setting:

dataverse.api.mdc.min-delay-ms - the number of milliseconds to wait between calls to DataCite. A value of ~100 should conservatively address DataCite's current limit of 3,000 requests per 5 minutes. A value of 250 may be required for their test service.

Backward compatibility: this API call is now asynchronous and will return an OK response when the call is queued, or a 503 if the queue is full.
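A minimal sketch of the new workflow from an admin's shell (the DOI below is a placeholder; note that the delay setting must be visible to the application server process, not the calling shell):

# Hypothetical: set before starting the application server, e.g. in its environment
#   export DATAVERSE_API_MDC_MIN_DELAY_MS=100

# Queue a citation update for one dataset and print the queue status
curl -s -X POST \
  "http://localhost:8080/api/admin/makeDataCount/:persistentId/updateCitationsForDataset?persistentId=doi:10.5072/FK2/EXAMPLE" \
  | jq -r '.data.status'
# Prints "queued" when the request is accepted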

doc/sphinx-guides/source/admin/make-data-count.rst

Lines changed: 2 additions & 0 deletions
@@ -166,6 +166,8 @@ The example :download:`counter_weekly.sh <../_static/util/counter_weekly.sh>` wi
 
 Citations will be retrieved for each published dataset and recorded in your Dataverse installation's database.
 
+Note that the :ref:`dataverse.api.mdc.min-delay-ms` setting can be used to avoid getting rate-limit errors from DataCite.
+
 For how to get the citations out of your Dataverse installation, see "Retrieving Citations for a Dataset" under :ref:`Dataset Metrics <dataset-metrics-api>` in the :doc:`/api/native-api` section of the API Guide.
 
 Please note that while the Dataverse Software has a metadata field for "Related Dataset", this information is not currently sent as a citation to Crossref.

doc/sphinx-guides/source/api/changelog.rst

Lines changed: 4 additions & 0 deletions
@@ -7,6 +7,10 @@ This API changelog is experimental and we would love feedback on its usefulness.
    :local:
    :depth: 1
 
+v6.9
+----
+- The POST /api/admin/makeDataCount/{id}/updateCitationsForDataset processing is now asynchronous and the response no longer includes the number of citations. The response is OK if the request is queued, or 503 if the queue is full (the default queue size is 1000).
+
 v6.8
 ----
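Based on the new code in this commit, the two response shapes look roughly as follows (the DOI is a placeholder and the exact wrapper text may differ):

curl -s -X POST "http://localhost:8080/api/admin/makeDataCount/:persistentId/updateCitationsForDataset?persistentId=doi:10.5072/FK2/EXAMPLE"
# HTTP 200, request queued:
#   {"status":"OK","data":{"status":"queued","message":"Citation update for dataset ... has been queued for processing"}}
# HTTP 503, queue full:
#   {"status":"ERROR","message":"Citation update service is currently at capacity. Please try again later."}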

doc/sphinx-guides/source/installation/config.rst

Lines changed: 16 additions & 0 deletions
@@ -3731,6 +3731,22 @@ Example:
 
 Can also be set via any `supported MicroProfile Config API source`_, e.g. the environment variable ``DATAVERSE_CORS_HEADERS_EXPOSE``.
 
+
+.. _dataverse.api.mdc.min-delay-ms:
+
+dataverse.api.mdc.min-delay-ms
+++++++++++++++++++++++++++++++
+
+The minimum delay in milliseconds between calls to DataCite made by the /api/admin/makeDataCount/{id}/updateCitationsForDataset API.
+This setting helps prevent overloading the Make Data Count (MDC) service by enforcing a minimum time interval between consecutive requests.
+If a request arrives before this interval has elapsed since the previous request, it is delayed until the interval has passed.
+
+Default: ``0`` (no delay enforced)
+
+Example: ``dataverse.api.mdc.min-delay-ms=100`` (enforces a minimum 100 ms delay between MDC requests)
+
+Can also be set via any `supported MicroProfile Config API source`_, e.g. the environment variable ``DATAVERSE_API_MDC_MIN_DELAY_MS``.
+
 .. _feature-flags:
 
 Feature Flags
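On Payara-based installations the setting can also be supplied as a JVM option, following the pattern used for other JVM settings in this guide (the asadmin path is a placeholder):

# Hypothetical: enforce a 100 ms delay between DataCite calls
./asadmin create-jvm-options "-Ddataverse.api.mdc.min-delay-ms=100"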

src/main/java/edu/harvard/iq/dataverse/api/MakeDataCountApi.java

Lines changed: 142 additions & 30 deletions
@@ -25,9 +25,15 @@
 import java.net.URL;
 import java.util.Iterator;
 import java.util.List;
+import java.util.concurrent.Future;
+import java.util.concurrent.RejectedExecutionException;
+import java.util.concurrent.atomic.AtomicLong;
 import java.util.logging.Level;
 import java.util.logging.Logger;
+
+import jakarta.annotation.Resource;
 import jakarta.ejb.EJB;
+import jakarta.enterprise.concurrent.ManagedExecutorService;
 import jakarta.json.Json;
 import jakarta.json.JsonArray;
 import jakarta.json.JsonArrayBuilder;
@@ -62,6 +68,13 @@ public class MakeDataCountApi extends AbstractApiBean {
     @EJB
     SystemConfig systemConfig;
 
+    // Inject the managed executor service provided by the container
+    @Resource(name = "concurrent/CitationUpdateExecutor")
+    private ManagedExecutorService executorService;
+
+    // Track the last execution time to implement rate limiting during Citation updates
+    private static final AtomicLong lastExecutionTime = new AtomicLong(0);
+
     /**
      * TODO: For each dataset, send the following:
      *
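The @Resource injection above assumes a managed executor service named concurrent/CitationUpdateExecutor exists in the application server; its definition is not shown in this hunk. As a purely hypothetical sketch of how such a resource could be created by hand on Payara, with a single worker thread so tasks run serially and a bounded queue matching the default size of 1000 mentioned in the changelog:

# Hypothetical asadmin commands; names and sizes are assumptions, not taken from this commit
./asadmin create-managed-executor-service \
  --corepoolsize=1 --maximumpoolsize=1 \
  --taskqueuecapacity=1000 \
  concurrent/CitationUpdateExecutor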
@@ -141,69 +154,166 @@ public Response addUsageMetricsFromSushiReportAll(@QueryParam("reportOnDisk") St
 
     @POST
     @Path("{id}/updateCitationsForDataset")
-    public Response updateCitationsForDataset(@PathParam("id") String id) throws IOException {
+    public Response updateCitationsForDataset(@PathParam("id") String id) {
         try {
-            Dataset dataset = findDatasetOrDie(id);
-            GlobalId pid = dataset.getGlobalId();
-            PidProvider pidProvider = PidUtil.getPidProvider(pid.getProviderId());
+            // First validate that the dataset exists and has a valid DOI
+            final Dataset dataset = findDatasetOrDie(id);
+            final GlobalId pid = dataset.getGlobalId();
+            final PidProvider pidProvider = PidUtil.getPidProvider(pid.getProviderId());
+
             // Only supported for DOIs and for DataCite DOI providers
-            if(!DataCiteDOIProvider.TYPE.equals(pidProvider.getProviderType())) {
+            if (!DataCiteDOIProvider.TYPE.equals(pidProvider.getProviderType())) {
                 return error(Status.BAD_REQUEST, "Only DataCite DOI providers are supported");
             }
-            String persistentId = pid.toString();
 
-            // DataCite wants "doi=", not "doi:".
-            String authorityPlusIdentifier = persistentId.replaceFirst("doi:", "");
-            // Request max page size and then loop to handle multiple pages
-            URL url = null;
+            // Submit the task to the managed executor service
+            Future<?> future;
             try {
-                url = new URI(JvmSettings.DATACITE_REST_API_URL.lookup(pidProvider.getId()) +
-                        "/events?doi=" +
-                        authorityPlusIdentifier +
-                        "&source=crossref&page[size]=1000&page[cursor]=1").toURL();
-            } catch (URISyntaxException e) {
-                //Nominally this means a config error/ bad DATACITE_REST_API_URL for this provider
-                logger.warning("Unable to create URL for " + persistentId + ", pidProvider " + pidProvider.getId());
-                return error(Status.INTERNAL_SERVER_ERROR, "Unable to create DataCite URL to retrieve citations.");
+                future = executorService.submit(() -> {
+                    try {
+                        // Apply rate limiting if enabled
+                        applyRateLimit();
+
+                        // Process the citation update
+                        boolean success = processCitationUpdate(dataset, pid, pidProvider);
+
+                        // Update the last execution time after processing
+                        lastExecutionTime.set(System.currentTimeMillis());
+
+                        if (success) {
+                            logger.fine("Successfully processed citation update for dataset " + id);
+                        } else {
+                            logger.warning("Failed to process citation update for dataset " + id);
+                        }
+                    } catch (Exception e) {
+                        logger.log(Level.SEVERE, "Error processing citation update for dataset " + id, e);
+                    }
+                });
+
+                JsonObjectBuilder output = Json.createObjectBuilder();
+                output.add("status", "queued");
+                output.add("message", "Citation update for dataset " + id + " has been queued for processing");
+                return ok(output);
+            } catch (RejectedExecutionException ree) {
+                logger.warning("Citation update for dataset " + id + " was rejected: Queue is full");
+                return error(Status.SERVICE_UNAVAILABLE,
+                        "Citation update service is currently at capacity. Please try again later.");
             }
-            logger.fine("Retrieving Citations from " + url.toString());
-            boolean nextPage = true;
-            JsonArrayBuilder dataBuilder = Json.createArrayBuilder();
+        } catch (WrappedResponse wr) {
+            return wr.getResponse();
+        }
+    }
+
+    /**
+     * Apply rate limiting by waiting if necessary
+     */
+    private void applyRateLimit() {
+        // Check if rate limiting is enabled
+        long minDelay = JvmSettings.API_MDC_UPDATE_MIN_DELAY_MS.lookupOptional(Long.class).orElse(0L);
+        if (minDelay == 0) {
+            return;
+        }
+        // Calculate how long to wait
+        long lastExecution = lastExecutionTime.get();
+        long currentTime = System.currentTimeMillis();
+        long elapsedTime = currentTime - lastExecution;
+
+        // If not enough time has passed since the last execution, wait
+        if (lastExecution > 0 && elapsedTime < minDelay) {
+            long waitTime = minDelay - elapsedTime;
+            logger.fine("Rate limiting: waiting " + waitTime + " ms before processing next citation update");
+            try {
+                Thread.sleep(waitTime);
+            } catch (InterruptedException e) {
+                Thread.currentThread().interrupt();
+                logger.warning("Rate limiting sleep interrupted: " + e.getMessage());
+            }
+        }
+    }
+
+    /**
+     * Process the citation update for a dataset.
+     * This method contains the logic that was previously in updateCitationsForDataset.
+     * @return true if processing was successful, false otherwise
+     */
+    private boolean processCitationUpdate(Dataset dataset, GlobalId pid, PidProvider pidProvider) {
+        String persistentId = pid.asRawIdentifier();
+
+        // Request max page size and then loop to handle multiple pages
+        URL url = null;
+        try {
+            url = new URI(JvmSettings.DATACITE_REST_API_URL.lookup(pidProvider.getId()) +
+                    "/events?doi=" +
+                    persistentId +
+                    "&source=crossref&page[size]=1000&page[cursor]=1").toURL();
+        } catch (URISyntaxException | MalformedURLException e) {
+            //Nominally this means a config error/ bad DATACITE_REST_API_URL for this provider
+            logger.warning("Unable to create URL for " + persistentId + ", pidProvider " + pidProvider.getId());
+            return false;
+        }
+
+        logger.fine("Retrieving Citations from " + url.toString());
+        boolean nextPage = true;
+        JsonArrayBuilder dataBuilder = Json.createArrayBuilder();
+
+        try {
             do {
                 HttpURLConnection connection = (HttpURLConnection) url.openConnection();
                 connection.setRequestMethod("GET");
                 int status = connection.getResponseCode();
                 if (status != 200) {
                     logger.warning("Failed to get citations from " + url.toString());
                     connection.disconnect();
-                    return error(Status.fromStatusCode(status), "Failed to get citations from " + url.toString());
+                    return false;
                 }
+
                 JsonObject report;
                 try (InputStream inStream = connection.getInputStream()) {
                     report = JsonUtil.getJsonObject(inStream);
                 } finally {
                     connection.disconnect();
                 }
+
                 JsonObject links = report.getJsonObject("links");
                 JsonArray data = report.getJsonArray("data");
                 Iterator<JsonValue> iter = data.iterator();
                 while (iter.hasNext()) {
-                    dataBuilder.add(iter.next());
+                    JsonValue citationValue = iter.next();
+                    JsonObject citation = (JsonObject) citationValue;
+
+                    // Filter out relations we don't use (e.g. hasPart) to lower memory req. with many files
+                    if (citation.containsKey("attributes")) {
+                        JsonObject attributes = citation.getJsonObject("attributes");
+                        if (attributes.containsKey("relation-type-id")) {
+                            String relationshipType = attributes.getString("relation-type-id");
+
+                            // Only add citations with relationship types we care about
+                            if (DatasetExternalCitationsServiceBean.inboundRelationships.contains(relationshipType) ||
+                                    DatasetExternalCitationsServiceBean.outboundRelationships.contains(relationshipType)) {
+                                dataBuilder.add(citationValue);
+                            }
+                        }
+                    }
                 }
+
                 if (links.containsKey("next")) {
                     try {
                         url = new URI(links.getString("next")).toURL();
+                        applyRateLimit();
                     } catch (URISyntaxException e) {
                         logger.warning("Unable to create URL from DataCite response: " + links.getString("next"));
-                        return error(Status.INTERNAL_SERVER_ERROR, "Unable to retrieve all results from DataCite");
+                        return false;
                     }
                 } else {
                     nextPage = false;
                 }
+
                 logger.fine("body of citation response: " + report.toString());
             } while (nextPage == true);
+
             JsonArray allData = dataBuilder.build();
             List<DatasetExternalCitations> datasetExternalCitations = datasetExternalCitationsService.parseCitations(allData);
+
             /*
              * ToDo: If this is the only source of citations, we should remove all the existing ones for the dataset and repopulate them.
              * As is, this call doesn't remove old citations if there are now none (legacy issue if we decide to stop counting certain types of citation
@@ -216,14 +326,16 @@ public Response updateCitationsForDataset(@PathParam("id") String id) throws IOE
                     datasetExternalCitationsService.save(dm);
                 }
             }
-
-            JsonObjectBuilder output = Json.createObjectBuilder();
-            output.add("citationCount", datasetExternalCitations.size());
-            return ok(output);
-        } catch (WrappedResponse wr) {
-            return wr.getResponse();
+
+            logger.fine("Citation update completed for dataset " + dataset.getId() +
+                    " with " + datasetExternalCitations.size() + " citations");
+            return true;
+        } catch (IOException e) {
+            logger.log(Level.WARNING, "Error processing citation update for dataset " + dataset.getId(), e);
+            return false;
         }
     }
+
     @GET
     @Path("{yearMonth}/processingState")
     public Response getProcessingState(@PathParam("yearMonth") String yearMonth) {
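Because the endpoint can now return 503 when the queue is full, bulk callers may want to retry with backoff rather than skip datasets. A minimal sketch (the DOI and retry policy are placeholders):

# Hypothetical retry loop: back off while the citation-update queue drains
for attempt in 1 2 3 4 5; do
  CODE=$(curl -s -o /dev/null -w "%{http_code}" -X POST \
    "http://localhost:8080/api/admin/makeDataCount/:persistentId/updateCitationsForDataset?persistentId=doi:10.5072/FK2/EXAMPLE")
  [ "$CODE" != "503" ] && break
  sleep $((attempt * 10))
done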
