166 changes: 87 additions & 79 deletions build.gradle

Large diffs are not rendered by default.

10 changes: 5 additions & 5 deletions doc/Install.md
@@ -33,9 +33,9 @@ After installing all or a selection of bibliographical databases, the bibliograp

The following describes how to build and start the bibliographical service.

### Build the service
### Build the service

You need Java JDK 1.11 or more installed for building and running the tool.
You need **Java JDK 21 (LTS)** installed for building and running the tool. The Gradle wrapper is configured with a Java 21 toolchain — if your default `java` is older, the [Foojay toolchain resolver](https://github.com/gradle/foojay-toolchains) will automatically download and provision a JDK 21 on first build. To install Java 21 manually, use [SDKMAN!](https://sdkman.io) or [Eclipse Temurin](https://adoptium.net/temurin/releases/?version=21).

```sh
./gradlew clean build
@@ -74,11 +74,11 @@ Each item represent a data storage. By default they are LMDB storage and their d

biblio-glutton takes advantage of GROBID for parsing raw bibliographical references. This permits faster and more accurate bibliographical record matching. To use GROBID service:

* First download and install GROBID as indicated in the [documentation](https://grobid.readthedocs.io/en/latest/Install-Grobid/), normally as a docker image to take advantage of Deep Learning models for more accurate parsing of bibliographical references.
* First download and install GROBID as indicated in the [documentation](https://grobid.readthedocs.io/en/latest/Install-Grobid/), normally as a docker image to take advantage of Deep Learning models for more accurate parsing of bibliographical references. **Recommended Grobid version: the latest stable 0.8.x release** (see [Grobid releases](https://github.com/kermitt2/grobid/releases)). biblio-glutton communicates with Grobid only via HTTP (the `/api/isalive` and `/api/processCitation` endpoints), so any Grobid 0.7.x or later release is API-compatible.

* Start the service as documented [here](https://grobid.readthedocs.io/en/latest/Grobid-service/). You can change the `port` used by GROBID when strating the docker container, or by updating the service config file under `grobid/grobid-home/config/grobid.yaml`.
* Start the service as documented [here](https://grobid.readthedocs.io/en/latest/Grobid-service/). You can change the `port` used by GROBID when strating the docker container, or by updating the service config file under `grobid/grobid-home/config/grobid.yaml`.
⚠️ Potential issue | 🟡 Minor

Typo: "strating" should be "starting".

📝 Fix typo
-* Start the service as documented [here](https://grobid.readthedocs.io/en/latest/Grobid-service/). You can change the `port` used by GROBID when strating the docker container, or by updating the service config file under `grobid/grobid-home/config/grobid.yaml`.
+* Start the service as documented [here](https://grobid.readthedocs.io/en/latest/Grobid-service/). You can change the `port` used by GROBID when starting the docker container, or by updating the service config file under `grobid/grobid-home/config/grobid.yaml`.
🧰 Tools
🪛 LanguageTool

[grammar] ~79-~79: Ensure spelling is correct
Context: ...n change the port used by GROBID when strating the docker container, or by updating th...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@doc/Install.md` at line 79, Typo in the documentation: replace the misspelled
word "strating" with "starting" in the sentence that mentions changing the
`port` used by GROBID; edit the Install.md line containing "You can change the
`port` used by GROBID when strating the docker container" to read "starting the
docker container" so the sentence is correct.


* Update if necessary the host and port information of GROBID in the biblio-glutton config file under `biblio-glutton/config/glutton.yml` (parameter `grobidPath`).
* Update if necessary the host and port information of GROBID in the biblio-glutton config file under `biblio-glutton/config/glutton.yml` (parameter `grobidHost`).

While GROBID is not required for running biblio-glutton, in particular if it is used only for bibliographical look-up, it is strongly recommended for performing bibliographical record matching. And vice-vera, configuration the biblio-glutton service for Grobid will provide high quality consolidation services to resolve the bibliographical references automatically extracted by Grobid.
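To confirm that GROBID is reachable before pointing biblio-glutton at it, the `/api/isalive` endpoint mentioned above can be exercised with a few lines of Java. This is a minimal sketch, not code from this PR: the class `GrobidHealthCheck`, its method names, and the default base URL `http://localhost:8070/api` are assumptions for illustration, so substitute the host and port of your own GROBID instance.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper for illustration; not part of biblio-glutton.
final class GrobidHealthCheck {

    // Joins a base URL and an endpoint name, tolerating a trailing slash on the base.
    static URI endpoint(String baseUrl, String name) {
        String base = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
        return URI.create(base + "/" + name);
    }

    // Returns true when GET <baseUrl>/isalive answers with HTTP 200.
    static boolean isAlive(String baseUrl) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10))
                .build();
        HttpRequest request = HttpRequest.newBuilder(endpoint(baseUrl, "isalive"))
                .timeout(Duration.ofSeconds(30))
                .GET()
                .build();
        HttpResponse<Void> response =
                client.send(request, HttpResponse.BodyHandlers.discarding());
        return response.statusCode() == 200;
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumed default base URL; pass your own as the first argument.
        String baseUrl = args.length > 0 ? args[0] : "http://localhost:8070/api";
        try {
            System.out.println(endpoint(baseUrl, "isalive") + " alive: " + isAlive(baseUrl));
        } catch (IOException e) {
            // Connection refused, timeout, and similar failures end up here.
            System.out.println(endpoint(baseUrl, "isalive") + " unreachable: " + e.getMessage());
        }
    }
}
```

If the check fails, the host and port configured in `biblio-glutton/config/glutton.yml` are the first thing to compare against the running GROBID container.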

2 changes: 1 addition & 1 deletion gradle/wrapper/gradle-wrapper.properties
@@ -3,4 +3,4 @@ distributionBase=GRADLE_USER_HOME
distributionPath=wrapper/dists
zipStoreBase=GRADLE_USER_HOME
zipStorePath=wrapper/dists
distributionUrl=https\://services.gradle.org/distributions/gradle-7.2-bin.zip
distributionUrl=https\://services.gradle.org/distributions/gradle-8.10.2-bin.zip
8 changes: 8 additions & 0 deletions pubmed-glutton/Readme.md
@@ -1,5 +1,13 @@
## pubmed-glutton

> **Build status**: this subproject does **not** currently compile from a clean checkout. It depends on
> `com.scienceminer.glutton.data.db.{KBEnvironment, KBStagingEnvironment}` which are not present in this
> repository's source tree and were originally hosted in a local Maven repo on the original developer's
> workstation. To build, you need those JARs available on a local Maven repository or under `lib/`.
> A toolchain is configured for **JDK 17** (the wider repo targets JDK 21, but pubmed-glutton's
> Jersey 1.8 / log4j 1.2.x dependency stack cannot run on JDK 21). Modernizing this subproject's
> dependencies is a tracked follow-up.

This package can be used to parse and store the PubMed data (all MEDLINE data, with abstract, MeSH classes etc.), and provides some mapping functionalities.

Any command will first initialize the `staging area` databases, this is only done the first time a command is launched.
24 changes: 15 additions & 9 deletions pubmed-glutton/build.gradle
@@ -9,9 +9,7 @@ buildscript {
}

dependencies {
classpath 'gradle.plugin.org.kt3k.gradle.plugin:coveralls-gradle-plugin:2.12.0'
classpath "gradle.plugin.com.github.jengelman.gradle.plugins:shadow:7.0.0"
//classpath 'com.github.jengelman.gradle.plugins:shadow:6.1.0'
classpath "com.gradleup.shadow:shadow-gradle-plugin:8.3.5"
}
}

@@ -20,25 +18,33 @@ apply plugin: 'java'
apply plugin: 'java-library'
//apply plugin: 'maven'
apply plugin: 'maven-publish'
apply plugin: 'com.github.johnrengelman.shadow'
apply plugin: 'com.gradleup.shadow'

group 'com.scienceminer.glutton'
version '0.3-SNAPSHOT'

sourceCompatibility = 1.8
java {
toolchain {
languageVersion = JavaLanguageVersion.of(17)
}
}

tasks.withType(JavaCompile) {
options.encoding = 'UTF-8'
}

repositories {
maven { url "file:////home/lopez/biblio-glutton/pubmed-glutton/lib/" }
// NOTE: pubmed-glutton imports com.scienceminer.glutton.data.db.{KBEnvironment, KBStagingEnvironment}
// which are NOT in this repository's source tree. They were originally hosted in a local
// Maven repo on the original developer's workstation (file:////home/lopez/...). If you need
// to build this subproject, drop the JARs into a local lib directory and add a flatDir or
// file-based maven repo here. See pubmed-glutton/Readme.md for context.
mavenLocal()
mavenCentral()
}

wrapper {
gradleVersion "7.2"
gradleVersion "8.10.2"
}

// The main class of the application
@@ -77,10 +83,10 @@ dependencies {
testImplementation group: 'org.hamcrest', name: 'hamcrest-all', version:'1.3'
}

task install(dependsOn: installShadowDist)
task install(dependsOn: 'installShadowDist')

shadowJar {
classifier = 'onejar'
archiveClassifier = 'onejar'
mergeServiceFiles()
zip64 true
manifest {
2 changes: 1 addition & 1 deletion pubmed-glutton/gradle/wrapper/gradle-wrapper.properties
@@ -2,4 +2,4 @@ distributionBase=GRADLE_USER_HOME
distributionPath=wrapper/dists
zipStoreBase=GRADLE_USER_HOME
zipStorePath=wrapper/dists
distributionUrl=https\://services.gradle.org/distributions/gradle-7.2-bin.zip
distributionUrl=https\://services.gradle.org/distributions/gradle-8.10.2-bin.zip
4 changes: 4 additions & 0 deletions settings.gradle
@@ -1,2 +1,6 @@
plugins {
id 'org.gradle.toolchains.foojay-resolver-convention' version '0.8.0'
}
Comment on lines +1 to +3
⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

What is the latest version of org.gradle.toolchains.foojay-resolver-convention plugin?

💡 Result:

The latest version of the org.gradle.toolchains.foojay-resolver-convention Gradle plugin is 1.0.0 (released May 19, 2025). [1]

Usage:

plugins {
  id("org.gradle.toolchains.foojay-resolver-convention") version "1.0.0"
}

Upgrade Foojay resolver plugin to version 1.0.0.

Good addition—the Foojay toolchain resolver enables automatic JDK provisioning. However, version 0.8.0 is outdated; the latest stable version is 1.0.0 (released May 2025). Update to:

plugins {
    id 'org.gradle.toolchains.foojay-resolver-convention' version '1.0.0'
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@settings.gradle` around lines 1 - 3, Update the Foojay toolchain resolver
plugin version used in the Gradle build: locate the plugins block that applies
id 'org.gradle.toolchains.foojay-resolver-convention' and change its version
from '0.8.0' to '1.0.0' so the plugins declaration reads the newer stable
release.


rootProject.name = 'lookup-service'

139 changes: 70 additions & 69 deletions src/main/java/com/scienceminer/glutton/utils/grobid/GrobidClient.java
@@ -2,104 +2,105 @@

import com.ctc.wstx.stax.WstxInputFactory;
import com.scienceminer.glutton.exception.ServiceException;
import com.scienceminer.glutton.utils.xml.StaxUtils;
import com.scienceminer.glutton.utils.grobid.GrobidResponseStaxHandler.GrobidResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.Consts;
import org.apache.http.HttpResponse;
import org.apache.http.client.ResponseHandler;
import org.apache.http.NameValuePair;
import org.apache.http.client.HttpClient;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.concurrent.FutureCallback;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import com.scienceminer.glutton.utils.xml.StaxUtils;
import org.codehaus.stax2.XMLStreamReader2;
import org.apache.commons.io.IOUtils;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import javax.xml.stream.XMLStreamException;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.function.Consumer;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

/**
* Synchronous grobid client
* Synchronous Grobid client built on the JDK 11+ {@link HttpClient}.
* <p>
* Calls only two Grobid REST endpoints:
* <ul>
* <li>{@code GET /isalive} — health check (see {@link #ping()})</li>
* <li>{@code POST /processCitation} — parse a raw citation string (see {@link #processCitation(String, String)})</li>
* </ul>
* Both endpoints have been stable across Grobid 0.7.x and 0.8.x.
*/
public class GrobidClient {

private static final Logger LOGGER = LoggerFactory.getLogger(GrobidClient.class);

//private ClosableHttpClient httpClient;
private String grobidPath;
private WstxInputFactory inputFactory = new WstxInputFactory();
//private GrobidResponseStaxHandler grobidResponseStaxHandler = new GrobidResponseStaxHandler();
private static final Duration CONNECT_TIMEOUT = Duration.ofSeconds(10);
private static final Duration REQUEST_TIMEOUT = Duration.ofSeconds(30);

private final String grobidPath;
private final HttpClient httpClient;
private final WstxInputFactory inputFactory = new WstxInputFactory();

public GrobidClient(String grobidPath) {
this.grobidPath = grobidPath;
//this.httpClient = HttpClients.createDefault();
this.httpClient = HttpClient.newBuilder()
.connectTimeout(CONNECT_TIMEOUT)
.version(HttpClient.Version.HTTP_1_1)
.build();
}

public void ping() throws ServiceException {
try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
final HttpGet httpGet = new HttpGet(grobidPath + "/isalive");
HttpResponse response = httpClient.execute(httpGet);
if (response.getStatusLine().getStatusCode() != HttpURLConnection.HTTP_OK) {
throw new ServiceException(502, "Error while connecting to GROBID service. Error code: " + response.getStatusLine().getStatusCode());
HttpRequest request = HttpRequest.newBuilder(URI.create(grobidPath + "/isalive"))
.timeout(REQUEST_TIMEOUT)
.GET()
.build();
try {
HttpResponse<Void> response = httpClient.send(request, HttpResponse.BodyHandlers.discarding());
if (response.statusCode() != HttpURLConnection.HTTP_OK) {
throw new ServiceException(502, "Error while connecting to GROBID service. Error code: " + response.statusCode());
}
} catch (Exception e) {
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
throw new ServiceException(502, "Interrupted while connecting to GROBID service", e);
} catch (IOException e) {
throw new ServiceException(502, "Error while connecting to GROBID service", e);
}
}

public GrobidResponse processCitation(String rawCitation, String consolidation) throws ServiceException {
GrobidResponse grobidResponse = null;

try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
final HttpPost request = new HttpPost(grobidPath + "/processCitation");

List<NameValuePair> formparams = new ArrayList<>();
formparams.add(new BasicNameValuePair("citations", rawCitation));
formparams.add(new BasicNameValuePair("consolidateCitation", consolidation));
UrlEncodedFormEntity entity = new UrlEncodedFormEntity(formparams, Consts.UTF_8);
request.setEntity(entity);

ResponseHandler<GrobidResponse> responseHandler = new ResponseHandler<GrobidResponse>() {

@Override
public GrobidResponse handleResponse(HttpResponse response) throws ClientProtocolException, IOException {
if (response.getStatusLine().getStatusCode() != HttpURLConnection.HTTP_OK) {
throw new ServiceException(502, "Error while connecting to GROBID service. Error code: " + response.getStatusLine().getStatusCode());
} else {
try {
XMLStreamReader2 reader = (XMLStreamReader2) inputFactory.createXMLStreamReader(response.getEntity().getContent());
GrobidResponseStaxHandler grobidResponseStaxHandler = new GrobidResponseStaxHandler();

StaxUtils.traverse(reader, grobidResponseStaxHandler);

return grobidResponseStaxHandler.getResponse();
} catch (IOException | XMLStreamException e) {
throw new ServiceException(502, "Cannot parse the response from GROBID", e);
}
}
}
};

grobidResponse = httpClient.execute(request, responseHandler);
} catch(IOException e) {
String formBody = "citations=" + URLEncoder.encode(rawCitation, StandardCharsets.UTF_8)
+ "&consolidateCitation=" + URLEncoder.encode(consolidation, StandardCharsets.UTF_8);

HttpRequest request = HttpRequest.newBuilder(URI.create(grobidPath + "/processCitation"))
.timeout(REQUEST_TIMEOUT)
.header("Content-Type", "application/x-www-form-urlencoded")
.POST(HttpRequest.BodyPublishers.ofString(formBody, StandardCharsets.UTF_8))
.build();

try {
HttpResponse<InputStream> response = httpClient.send(request, HttpResponse.BodyHandlers.ofInputStream());
if (response.statusCode() != HttpURLConnection.HTTP_OK) {
throw new ServiceException(502, "Error while connecting to GROBID service. Error code: " + response.statusCode());
}
try (InputStream body = response.body()) {
return parseGrobidResponse(body);
}
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
throw new ServiceException(502, "Interrupted while calling GROBID", e);
} catch (IOException e) {
throw new ServiceException(502, "Error calling GROBID", e);
}
}

return grobidResponse;
private GrobidResponse parseGrobidResponse(InputStream body) throws ServiceException {
try {
XMLStreamReader2 reader = (XMLStreamReader2) inputFactory.createXMLStreamReader(body);
GrobidResponseStaxHandler handler = new GrobidResponseStaxHandler();
StaxUtils.traverse(reader, handler);
return handler.getResponse();
} catch (XMLStreamException e) {
throw new ServiceException(502, "Cannot parse the response from GROBID", e);
}
}
Comment on lines +96 to 105

⚠️ Potential issue | 🟡 Minor

XMLStreamReader is not closed after parsing.

The XMLStreamReader2 created at line 98 is never explicitly closed. While the underlying InputStream is closed by the caller (line 85-87), the XMLStreamReader holds its own parser resources that should be released. Per the relevant snippet in StaxUtils.java, traverse() does not close the reader.

🛡️ Proposed fix to close the reader
 private GrobidResponse parseGrobidResponse(InputStream body) throws ServiceException {
+    XMLStreamReader2 reader = null;
     try {
-        XMLStreamReader2 reader = (XMLStreamReader2) inputFactory.createXMLStreamReader(body);
+        reader = (XMLStreamReader2) inputFactory.createXMLStreamReader(body);
         GrobidResponseStaxHandler handler = new GrobidResponseStaxHandler();
         StaxUtils.traverse(reader, handler);
         return handler.getResponse();
     } catch (XMLStreamException e) {
         throw new ServiceException(502, "Cannot parse the response from GROBID", e);
+    } finally {
+        if (reader != null) {
+            try {
+                reader.close();
+            } catch (XMLStreamException ignored) {
+                // Best-effort cleanup
+            }
+        }
     }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main/java/com/scienceminer/glutton/utils/grobid/GrobidClient.java` around
lines 96 - 105, parseGrobidResponse creates an XMLStreamReader2 via inputFactory
and calls StaxUtils.traverse(reader, handler) but never closes the reader;
update parseGrobidResponse to ensure the XMLStreamReader2 (reader) is closed
after parsing (e.g. use try-with-resources or a try/finally that calls
reader.close()), preserving the existing ServiceException handling (wrap
XMLStreamException as before) and still returning handler.getResponse();
reference symbols: parseGrobidResponse, XMLStreamReader2, inputFactory,
StaxUtils.traverse, GrobidResponseStaxHandler.
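
The close-in-`finally` pattern proposed above can be tried in isolation with the JDK's built-in StAX API. The sketch below is illustrative only: the class `StaxCloseSketch` and the method `countElements` are hypothetical names, and it uses `javax.xml.stream.XMLStreamReader` rather than the Woodstox `XMLStreamReader2` used in `GrobidClient`. The JDK interface does not extend `AutoCloseable`, so the sketch uses an explicit `finally` block instead of try-with-resources.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class; not part of biblio-glutton.
final class StaxCloseSketch {

    private static final XMLInputFactory FACTORY = XMLInputFactory.newInstance();

    // Counts start elements and always releases the reader afterwards.
    static int countElements(InputStream body) {
        XMLStreamReader reader = null;
        try {
            reader = FACTORY.createXMLStreamReader(body);
            int count = 0;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    count++;
                }
            }
            return count;
        } catch (XMLStreamException e) {
            // Stand-in for the ServiceException wrapping done in GrobidClient.
            throw new IllegalStateException("Cannot parse XML", e);
        } finally {
            if (reader != null) {
                try {
                    reader.close();
                } catch (XMLStreamException ignored) {
                    // Best-effort cleanup; the result or the original exception wins.
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<biblStruct><title>A</title><date>2020</date></biblStruct>"
                .getBytes(StandardCharsets.UTF_8);
        // XMLStreamReader.close() does not close the underlying stream,
        // so the caller still owns and closes the InputStream.
        try (InputStream in = new ByteArrayInputStream(xml)) {
            System.out.println(countElements(in)); // prints 3
        }
    }
}
```

Because closing the reader leaves the input source open, the existing try-with-resources around `response.body()` in `processCitation` remains necessary alongside this change.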

}
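
A side note on the hand-built form body in `processCitation`: since `java.net.http` has no equivalent of `UrlEncodedFormEntity`, the encoding is done with `URLEncoder`. The sketch below shows one way to generalize that step; the class `FormBodySketch` and the method `encode` are hypothetical names for illustration and are not part of this PR.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical example class; not part of biblio-glutton.
final class FormBodySketch {

    // Encodes name/value pairs as application/x-www-form-urlencoded text.
    static String encode(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "="
                        + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in insertion order.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("citations", "Smith & Doe, J. Chem. 12 (2020)");
        fields.put("consolidateCitation", "0");
        System.out.println(encode(fields));
        // prints citations=Smith+%26+Doe%2C+J.+Chem.+12+%282020%29&consolidateCitation=0
    }
}
```

One point that may be worth verifying against the callers: `URLEncoder.encode` throws a `NullPointerException` when given a `null` string, so a `null` `rawCitation` or `consolidation` would now fail before any request is sent.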