Commit 124f66d

Bump version to 2.0.0 (#139)
+ Improve `max_results`/`delay_seconds` types, defaults (#138)
+ Eliminate `get`, deprecate `Search.Results` (#137)
+ Accelerate CI integration tests (#140)
1 parent eb930dd commit 124f66d

7 files changed: +75 -107 lines changed

.github/workflows/python-package.yml

Lines changed: 0 additions & 1 deletion
@@ -12,7 +12,6 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       fail-fast: false
-      max-parallel: 1
       matrix:
         python-version: ["3.7", "3.10", "3.11"]
     steps:

README.md

Lines changed: 20 additions & 19 deletions
@@ -36,19 +36,19 @@ A `Search` specifies a search of arXiv's database.
 arxiv.Search(
   query: str = "",
   id_list: List[str] = [],
-  max_results: float = float('inf'),
+  max_results: int | None = None,
   sort_by: SortCriterion = SortCriterion.Relevance,
   sort_order: SortOrder = SortOrder.Descending
 )
 ```

 + `query`: an arXiv query string. Advanced query formats are documented in the [arXiv API User Manual](https://arxiv.org/help/api/user-manual#query_details).
 + `id_list`: list of arXiv record IDs (typically of the format `"0710.5765v1"`). See [the arXiv API User's Manual](https://arxiv.org/help/api/user-manual#search_query_and_id_list) for documentation of the interaction between `query` and `id_list`.
-+ `max_results`: The maximum number of results to be returned in an execution of this search. To fetch every result available, set `max_results=float('inf')` (default); to fetch up to 10 results, set `max_results=10`. The API's limit is 300,000 results.
++ `max_results`: The maximum number of results to be returned in an execution of this search. To fetch every result available, set `max_results=None` (default); to fetch up to 10 results, set `max_results=10`. The API's limit is 300,000 results.
 + `sort_by`: The sort criterion for results: `relevance`, `lastUpdatedDate`, or `submittedDate`.
 + `sort_order`: The sort order for results: `'descending'` or `'ascending'`.

-To fetch arXiv records matching a `Search`, use `search.results()` or `(Client).results(search)` to get a generator yielding `Result`s.
+To fetch arXiv records matching a `Search`, use `(Client).results(search)` to get a generator yielding `Result`s.

 #### Example: fetching results

@@ -63,7 +63,7 @@ search = arxiv.Search(
   sort_by = arxiv.SortCriterion.SubmittedDate
 )

-for result in search.results():
+for result in arxiv.Client().results(search):
   print(result.title)
 ```

@@ -72,16 +72,18 @@ Fetch and print the title of the paper with ID "1605.08386v1:"
 ```python
 import arxiv

+client = arxiv.Client()
 search = arxiv.Search(id_list=["1605.08386v1"])
-paper = next(search.results())
+
+paper = next(arxiv.Client().results(search))
 print(paper.title)
 ```

 ### Result

 <!-- TODO: improve this section. -->

-The `Result` objects yielded by `(Search).results()` include metadata about each paper and some helper functions for downloading their content.
+The `Result` objects yielded by `(Client).results()` include metadata about each paper and some helper functions for downloading their content.

 The meaning of the underlying raw data is documented in the [arXiv API User Manual: Details of Atom Results Returned](https://arxiv.org/help/api/user-manual#_details_of_atom_results_returned).

@@ -108,7 +110,7 @@ To download a PDF of the paper with ID "1605.08386v1," run a `Search` and then u
 ```python
 import arxiv

-paper = next(arxiv.Search(id_list=["1605.08386v1"]).results())
+paper = next(arxiv.Client().results(arxiv.Search(id_list=["1605.08386v1"])))
 # Download the PDF to the PWD with a default filename.
 paper.download_pdf()
 # Download the PDF to the PWD with a custom filename.
@@ -122,7 +124,7 @@ The same interface is available for downloading .tar.gz files of the paper sourc
 ```python
 import arxiv

-paper = next(arxiv.Search(id_list=["1605.08386v1"]).results())
+paper = next(arxiv.Client().results(arxiv.Search(id_list=["1605.08386v1"])))
 # Download the archive to the PWD with a default filename.
 paper.download_source()
 # Download the archive to the PWD with a custom filename.
@@ -133,14 +135,13 @@ paper.download_source(dirpath="./mydir", filename="downloaded-paper.tar.gz")

 ### Client

-A `Client` specifies a strategy for fetching results from arXiv's API; it obscures pagination and retry logic.
-
-For most use cases the default client should suffice. You can construct it explicitly with `arxiv.Client()`, or use it via the `(Search).results()` method.
+A `Client` specifies a strategy for fetching results from arXiv's API; it obscures pagination and retry logic. For most use cases the default client should suffice.

 ```python
+# Default client properties.
 arxiv.Client(
   page_size: int = 100,
-  delay_seconds: int = 3,
+  delay_seconds: float = 3.0,
   num_retries: int = 3
 )
 ```
@@ -151,14 +152,12 @@ arxiv.Client(

 #### Example: fetching results with a custom client

-`(Search).results()` uses the default client settings. If you want to use a client you've defined instead of the defaults, use `(Client).results(...)`:
-
 ```python
 import arxiv

 big_slow_client = arxiv.Client(
   page_size = 1000,
-  delay_seconds = 10,
+  delay_seconds = 10.0,
   num_retries = 5
 )

@@ -173,9 +172,11 @@ To inspect this package's network behavior and API logic, configure an `INFO`-le

 ```pycon
 >>> import logging, arxiv
->>> logging.basicConfig(level=logging.INFO)
->>> paper = next(arxiv.Search(id_list=["1605.08386v1"]).results())
+>>> logging.basicConfig(level=logging.DEBUG)
+>>> client = arxiv.Client()
+>>> paper = next(client.results(arxiv.Search(id_list=["1605.08386v1"])))
 INFO:arxiv.arxiv:Requesting 100 results at offset 0
-INFO:arxiv.arxiv:Requesting page of results
-INFO:arxiv.arxiv:Got first page; 1 of inf results available
+INFO:arxiv.arxiv:Requesting page (first: False, try: 0): https://export.arxiv.org/api/query?search_query=&id_list=1605.08386v1&sortBy=relevance&sortOrder=descending&start=0&max_results=100
+DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): export.arxiv.org:443
+DEBUG:urllib3.connectionpool:https://export.arxiv.org:443 "GET /api/query?search_query=&id_list=1605.08386v1&sortBy=relevance&sortOrder=descending&start=0&max_results=100&user-agent=arxiv.py%2F1.4.8 HTTP/1.1" 200 979
 ```
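
The README changes in this file replace `search.results()` with `(Client).results(search)`, while the `arxiv/arxiv.py` changes in this commit keep the old method and make it emit a `DeprecationWarning`. The sketch below illustrates that migration path without touching the network. The `Search` and `Client` classes here are hypothetical stand-ins, not the library's classes; only the warning message and the delegation to a default client are taken from the diff.

```python
import warnings
from typing import Iterator, List


class Search:
    """Hypothetical stand-in for arxiv.Search (no network access)."""

    def __init__(self, id_list: List[str]):
        self.id_list = id_list

    def results(self) -> Iterator[str]:
        # As in the diff: warn, then delegate to a default client.
        warnings.warn(
            "The '(Search).results' method is deprecated, use 'Client.results' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return Client().results(self)


class Client:
    """Hypothetical stand-in for arxiv.Client."""

    def results(self, search: Search) -> Iterator[str]:
        # The real client pages through the API; this sketch echoes the IDs.
        return iter(search.id_list)


search = Search(id_list=["1605.08386v1"])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_style = next(search.results())          # deprecated call path
    new_style = next(Client().results(search))  # preferred call path
```

Both call paths yield the same item in this sketch; only the deprecated one records a warning in `caught`.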

arxiv/arxiv.py

Lines changed: 42 additions & 63 deletions
@@ -3,8 +3,10 @@

 import logging
 import time
+import itertools
 import feedparser
 import os
+import math
 import re
 import requests
 import warnings
@@ -422,12 +424,12 @@ class Search(object):
     Manual](https://arxiv.org/help/api/user-manual#search_query_and_id_list)
     for documentation of the interaction between `query` and `id_list`.
     """
-    max_results: float
+    max_results: int | None
     """
     The maximum number of results to be returned in an execution of this
     search.

-    To fetch every result available, set `max_results=float('inf')`.
+    To fetch every result available, set `max_results=None`.
     """
     sort_by: SortCriterion
     """The sort criterion for results."""
@@ -438,7 +440,7 @@ def __init__(
         self,
         query: str = "",
         id_list: List[str] = [],
-        max_results: float = float("inf"),
+        max_results: int | None = None,
         sort_by: SortCriterion = SortCriterion.Relevance,
         sort_order: SortOrder = SortOrder.Descending,
     ):
@@ -447,7 +449,8 @@ def __init__(
         """
         self.query = query
         self.id_list = id_list
-        self.max_results = max_results
+        # Handle deprecated v1 default behavior.
+        self.max_results = None if max_results == math.inf else max_results
         self.sort_by = sort_by
         self.sort_order = sort_order

@@ -479,23 +482,19 @@ def _url_args(self) -> Dict[str, str]:
             "sortOrder": self.sort_order.value,
         }

-    def get(self) -> Generator[Result, None, None]:
-        """
-        **Deprecated** after 1.2.0; use `Search.results`.
-        """
-        warnings.warn(
-            "The 'get' method is deprecated, use 'results' instead",
-            DeprecationWarning,
-            stacklevel=2,
-        )
-        return self.results()
-
     def results(self, offset: int = 0) -> Generator[Result, None, None]:
         """
         Executes the specified search using a default arXiv API client.

         For info on default behavior, see `Client.__init__` and `Client.results`.
+
+        **Deprecated** after 2.0.0; use `Client.results`.
         """
+        warnings.warn(
+            "The '(Search).results' method is deprecated, use 'Client.results' instead",
+            DeprecationWarning,
+            stacklevel=2,
+        )
         return Client().results(self, offset=offset)


@@ -511,7 +510,7 @@ class Client(object):
     """The arXiv query API endpoint format."""
     page_size: int
     """Maximum number of results fetched in a single API request."""
-    delay_seconds: int
+    delay_seconds: float
     """Number of seconds to wait between API requests."""
     num_retries: int
     """Number of times to retry a failing API request."""
@@ -520,7 +519,7 @@ class Client(object):
     _session: requests.Session

     def __init__(
-        self, page_size: int = 100, delay_seconds: int = 3, num_retries: int = 3
+        self, page_size: int = 100, delay_seconds: float = 3.0, num_retries: int = 3
     ):
         """
         Constructs an arXiv API client with the specified options.
@@ -548,17 +547,6 @@ def __repr__(self) -> str:
             repr(self.num_retries),
         )

-    def get(self, search: Search) -> Generator[Result, None, None]:
-        """
-        **Deprecated** after 1.2.0; use `Client.results`.
-        """
-        warnings.warn(
-            "The 'get' method is deprecated, use 'results' instead",
-            DeprecationWarning,
-            stacklevel=2,
-        )
-        return self.results(search)
-
     def results(self, search: Search, offset: int = 0) -> Generator[Result, None, None]:
         """
         Uses this client configuration to fetch one page of the search results
@@ -574,46 +562,37 @@ def results(self, search: Search, offset: int = 0) -> Generator[Result, None, No
         For more on using generators, see
         [Generators](https://wiki.python.org/moin/Generators).
         """
+        limit = search.max_results - offset if search.max_results else None
+        if limit and limit < 0:
+            return iter(())
+        return itertools.islice(self._results(search, offset), limit)
+
+    def _results(
+        self, search: Search, offset: int = 0
+    ) -> Generator[Result, None, None]:
+        page_url = self._format_url(search, offset, self.page_size)
+        feed = self._parse_feed(page_url, first_page=True)
+        if not feed.entries:
+            logger.info("Got empty first page; stopping generation")
+            return
+        total_results = int(feed.feed.opensearch_totalresults)
+        logger.info(
+            "Got first page: %d of %d total results",
+            len(feed.entries),
+            total_results,
+        )

-        # total_results may be reduced according to the feed's
-        # opensearch:totalResults value.
-        total_results = search.max_results
-        first_page = True
-        while offset < total_results:
-            page_size = min(self.page_size, search.max_results - offset)
-            logger.info("Requesting %d results at offset %d", page_size, offset)
-            page_url = self._format_url(search, offset, page_size)
-            feed = self._parse_feed(page_url, first_page=first_page)
-            if first_page:
-                # NOTE: this is an ugly fix for a known bug. The totalresults
-                # value is set to 1 for results with zero entries. If that API
-                # bug is fixed, we can remove this conditional and always set
-                # `total_results = min(...)`.
-                if len(feed.entries) == 0:
-                    logger.info("Got empty first page; stopping generation")
-                    total_results = 0
-                else:
-                    total_results = min(
-                        total_results, int(feed.feed.opensearch_totalresults)
-                    )
-                logger.info(
-                    "Got first page: %d of %d total results",
-                    total_results,
-                    search.max_results
-                    if search.max_results != float("inf")
-                    else -1,
-                )
-                # Subsequent pages are not the first page.
-                first_page = False
-            # Update offset for next request: account for received results.
-            offset += len(feed.entries)
-            # Yield query results until page is exhausted.
+        while feed.entries:
             for entry in feed.entries:
                 try:
                     yield Result._from_feed_entry(entry)
                 except Result.MissingFieldError as e:
                     logger.warning("Skipping partial result: %s", e)
-                    continue
+            offset += len(feed.entries)
+            if offset >= total_results:
+                break
+            page_url = self._format_url(search, offset, self.page_size)
+            feed = self._parse_feed(page_url, first_page=False)

     def _format_url(self, search: Search, start: int, page_size: int) -> str:
         """
@@ -679,7 +658,7 @@ def __try_parse_feed(
             "Requesting page (first: %r, try: %d): %s", first_page, try_index, url
         )

-        resp = self._session.get(url, headers={"user-agent": "arxiv.py/1.4.8"})
+        resp = self._session.get(url, headers={"user-agent": "arxiv.py/2.0.0"})
         self._last_request_dt = datetime.now()
         if resp.status_code != requests.codes.OK:
             raise HTTPError(url, try_index, resp.status_code)
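
The new `Client.results` in this file caps output with `itertools.islice` instead of tracking `max_results` inside the paging loop, and `Search.__init__` maps the v1 default `float('inf')` to `None`. A minimal sketch of those two pieces follows, assuming a finite stand-in for the paged generator. The helper names `normalize_max_results`, `fake_results`, and `limited_results` are hypothetical; the limit arithmetic is copied from the diff.

```python
import itertools
import math
from typing import Iterator, Optional


def normalize_max_results(max_results) -> Optional[int]:
    """Map the v1 default (float('inf')) to the v2 default (None)."""
    return None if max_results == math.inf else max_results


def fake_results(total: int, offset: int = 0) -> Iterator[int]:
    """Hypothetical stand-in for Client._results: yields offset..total-1."""
    yield from range(offset, total)


def limited_results(
    total: int, max_results: Optional[int], offset: int = 0
) -> Iterator[int]:
    """Apply the same limit computation as the new Client.results."""
    limit = max_results - offset if max_results else None
    if limit and limit < 0:
        # Offset lies beyond max_results: nothing to yield.
        return iter(())
    # islice(iterable, None) passes everything through unchanged.
    return itertools.islice(fake_results(total, offset), limit)
```

With `total=50` and `max_results=10`, offsets 0, 5, 10, and 11 give 10, 5, 0, and 0 items, which matches the `max(0, search.max_results - offset)` expectation in `test_search_results_offset`. One consequence of the truthiness check as written is that `max_results=0` is treated like `None`.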

setup.py

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 from setuptools import setup

-version = "1.4.8"
+version = "2.0.0"

 with open("README.md", "r") as fh:
     long_description = fh.read()

tests/test_api_bugs.py

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@
 import unittest


-class TestClient(unittest.TestCase):
+class TestAPIBugs(unittest.TestCase):
     def test_missing_title(self):
         """
         Papers with the title "0" do not have a title element in the Atom feed.

tests/test_client.py

Lines changed: 6 additions & 14 deletions
@@ -3,15 +3,9 @@
 import arxiv
 from datetime import datetime, timedelta
 from pytest import approx
-import time


 class TestClient(unittest.TestCase):
-    def tearDown(self) -> None:
-        # Bodge: sleep three seconds between tests to simulate a shared rate limit.
-        time.sleep(3)
-        return super().tearDown()
-
     def test_invalid_format_id(self):
         with self.assertRaises(arxiv.HTTPError):
             list(arxiv.Client(num_retries=0).results(arxiv.Search(id_list=["abc"])))
@@ -58,7 +52,7 @@ def test_query_page_count(self):
                 "https://export.arxiv.org/api/query?search_query=testing&id_list=&sortBy=relevance&sortOrder=descending&start=20&max_results=10",
                 "https://export.arxiv.org/api/query?search_query=testing&id_list=&sortBy=relevance&sortOrder=descending&start=30&max_results=10",
                 "https://export.arxiv.org/api/query?search_query=testing&id_list=&sortBy=relevance&sortOrder=descending&start=40&max_results=10",
-                "https://export.arxiv.org/api/query?search_query=testing&id_list=&sortBy=relevance&sortOrder=descending&start=50&max_results=5",
+                "https://export.arxiv.org/api/query?search_query=testing&id_list=&sortBy=relevance&sortOrder=descending&start=50&max_results=10",
             },
         )

@@ -79,14 +73,12 @@ def test_offset(self):
         self.assertListEqual(offset_above_max_results, [])

     def test_search_results_offset(self):
+        # NOTE: page size is irrelevant here.
+        client = arxiv.Client(page_size=15)
         search = arxiv.Search(query="testing", max_results=10)
-        client = arxiv.Client()
-
-        all_results = list(client.results(search, 0))
+        all_results = list(client.results(search, offset=0))
         self.assertEqual(len(all_results), 10)

-        client.page_size = 5
-
         for offset in [0, 5, 9, 10, 11]:
             client_results = list(client.results(search, offset=offset))
             self.assertEqual(len(client_results), max(0, search.max_results - offset))
@@ -191,12 +183,12 @@ def test_sleep_between_errors(self, patched_time_sleep):
         self.assertEqual(patched_time_sleep.call_count, client.num_retries)
         patched_time_sleep.assert_has_calls(
             [
-                call(approx(client.delay_seconds, rel=1e-3)),
+                call(approx(client.delay_seconds, abs=1e-2)),
             ]
             * client.num_retries
         )

-def get_code_client(code: int, delay_seconds=3, num_retries=3) -> arxiv.Client:
+def get_code_client(code: int, delay_seconds=0.1, num_retries=3) -> arxiv.Client:
     """
     get_code_client returns an arxiv.Cient with HTTP requests routed to
     httpstat.us.
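
The changed expectation in `test_query_page_count` (the request at `start=50` now asks for `max_results=10` rather than 5) appears to follow from the new paging loop, which always requests a full `page_size` and leaves truncation to `itertools.islice`. The sketch below simulates that behavior under stated assumptions: a page size of 10, a search capped at 55 results, and a hypothetical `fetch_page` helper that records each request instead of calling the API.

```python
import itertools
from typing import Iterator, List, Tuple

TOTAL_RESULTS = 1000  # assumed number of matches reported by the API


def fetch_page(
    start: int, page_size: int, requests: List[Tuple[int, int]]
) -> List[int]:
    """Hypothetical stand-in for one API request; records (start, page_size)."""
    requests.append((start, page_size))
    return list(range(start, min(start + page_size, TOTAL_RESULTS)))


def paged_results(
    page_size: int, requests: List[Tuple[int, int]], offset: int = 0
) -> Iterator[int]:
    """Same loop shape as the new Client._results: full pages until exhausted."""
    entries = fetch_page(offset, page_size, requests)
    while entries:
        yield from entries
        offset += len(entries)
        if offset >= TOTAL_RESULTS:
            break
        entries = fetch_page(offset, page_size, requests)


requests_made: List[Tuple[int, int]] = []
# Cap at 55 results from outside the generator, as Client.results does.
first_55 = list(itertools.islice(paged_results(10, requests_made), 55))
```

Because `islice` stops pulling once it has 55 items, the generator never requests a seventh page, and every recorded request uses the full page size of 10.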
