fix: extract article IDs correctly and add a regression test

jannisborn · jannisborn · commit f489cdd382ba · 2025-05-03T13:06:27.000+02:00
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
@@ -0,0 +1,29 @@
+---
+name: Source
+on: [push, release]
+
+jobs:
+  test-source-install:
+    runs-on: ubuntu-latest
+    strategy:
+      max-parallel: 3
+      matrix:
+        python-version:
+          - "3.10"
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v2
+      - name: Set up Python
+        uses: actions/setup-python@v2
+        with:
+          python-version: ${{ matrix.python-version }}
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -r requirements.txt
+      - name: Install package from source
+        run: pip install -e .
+      - name: Test package from source
+        run: |
+          python -c "import pymed_paperscraper"
+          python -m pytest pymed_paperscraper
diff --git a/README.md b/README.md
@@ -28,10 +28,21 @@ pubmed = PubMed(tool="MyTool", email="my@email.address")
 results = pubmed.query("Some query", max_results=500)
 ```
 
+## Bugfixes compared to the archived [`pymed`](https://github.com/gijswobben/pymed)
+- Article IDs are correctly extracted [`pymed#22`](https://github.com/gijswobben/pymed/issues/22)
+- Automatic retries when the API is unresponsive or overloaded, configurable via `max_tries` in the `PubMed` class.
+
 ## Notes on the API
 The original documentation of the PubMed API can be found here: [PubMed Central](https://www.ncbi.nlm.nih.gov/pmc/tools/developers/). PubMed Central kindly requests you to:
 
 > - Do not make concurrent requests, even at off-peak times; and
 > - Include two parameters that help to identify your service or application to our servers
 >   * _tool_ should be the name of the application, as a string value with no internal spaces, and
 >   * _email_ should be the e-mail address of the maintainer of the tool, and should be a valid e-mail address.
+
+## Citation
+If you use `pymed_paperscraper` in your work, please cite:
+```bib
+(Citation follows)
+```
+
diff --git a/pymed_paperscraper/__init__.py b/pymed_paperscraper/__init__.py
@@ -1,4 +1,4 @@
 from .api import PubMed
 
 __all__ = ["PubMed"]
-__version__ = "1.0.3"
+__version__ = "1.0.4"
diff --git a/pymed_paperscraper/article.py b/pymed_paperscraper/article.py
@@ -43,7 +43,7 @@ def __init__(
                 self.__setattr__(field, kwargs.get(field, None))
 
     def _extractPubMedId(self: object, xml_element: TypeVar("Element")) -> str:
-        path = ".//ArticleId[@IdType='pubmed']"
+        path = ".//PubmedData/ArticleIdList/ArticleId[@IdType='pubmed']"
         return getContent(element=xml_element, path=path)
 
     def _extractTitle(self: object, xml_element: TypeVar("Element")) -> str:
diff --git a/pymed_paperscraper/tests/test_article.py b/pymed_paperscraper/tests/test_article.py
@@ -0,0 +1,12 @@
+from pymed_paperscraper import PubMed
+
+
+def test_unique_id():
+    pubmed = PubMed(tool="MyTool", email="my@email.address")
+    query = "((Haliaeetus leucocephalus[Title/Abstract])) AND ((prey[Title/Abstract]) OR (diet[Title/Abstract]))"
+    results = pubmed.query(query, max_results=30)
+
+    for r in results:
+        # Each article must yield exactly one PubMed ID (regression test for pymed#22).
+        ids = r.pubmed_id.strip().split("\n")
+        assert len(ids) == 1, f"expected one PubMed ID, got {ids}"
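
The XPath change in `_extractPubMedId` can be illustrated offline with a minimal sketch. It assumes that the extra IDs came from `ArticleId` elements nested inside the article's reference list, and that `getContent` joins every match with newlines (which the `split("\n")` in the test suggests). The sample XML and the `pubmed_ids` helper are hypothetical and only mimic the relevant part of a PubMed record.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed PubMed record: the article's own ID sits in
# PubmedData/ArticleIdList, while cited papers carry their IDs deeper down
# in PubmedData/ReferenceList/Reference/ArticleIdList.
SAMPLE = """
<PubmedArticle>
  <MedlineCitation><PMID>11111111</PMID></MedlineCitation>
  <PubmedData>
    <ArticleIdList>
      <ArticleId IdType="pubmed">11111111</ArticleId>
      <ArticleId IdType="doi">10.1000/example</ArticleId>
    </ArticleIdList>
    <ReferenceList>
      <Reference>
        <ArticleIdList>
          <ArticleId IdType="pubmed">22222222</ArticleId>
        </ArticleIdList>
      </Reference>
    </ReferenceList>
  </PubmedData>
</PubmedArticle>
"""


def pubmed_ids(xml_element, path):
    """Return the text of every element matching `path` (illustrative helper)."""
    return [match.text for match in xml_element.findall(path) if match.text]


article = ET.fromstring(SAMPLE.strip())

# Old path: matches any ArticleId descendant, including those of references.
old_ids = pubmed_ids(article, ".//ArticleId[@IdType='pubmed']")

# New path: only ArticleIdList entries that are direct children of PubmedData.
new_ids = pubmed_ids(
    article, ".//PubmedData/ArticleIdList/ArticleId[@IdType='pubmed']"
)

print("old path:", old_ids)
print("new path:", new_ids)
```

With the old path both IDs match, so a newline-joined result would contain two entries; the anchored path leaves only the article's own ID, which is what `test_unique_id` asserts against the live API.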