
Commit 25d3844

Merge pull request #53 from allenai/py311_and_uv: "Moving to uv and py311 for main"
2 parents: ca14837 + dea00b7

27 files changed: +2303 −326 lines

.github/workflows/main.yaml

Lines changed: 84 additions & 13 deletions

@@ -2,21 +2,92 @@ name: CI
 
 on:
   pull_request:
-    branches:
-      - main
+    branches: [main]
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true
 
 jobs:
-  build:
+  lint:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      # Install uv (fast Python + package manager) and enable caching
+      - name: Setup uv
+        uses: astral-sh/setup-uv@v3
 
+      # Cache uv tool + resolver cache (speeds up uvx + resolves)
+      - name: Cache uv caches
+        uses: actions/cache@v4
+        with:
+          path: |
+            ~/.cache/uv
+          key: uv-cache-${{ runner.os }}-${{ hashFiles('pyproject.toml', 'uv.lock') }}
+
+      # Style checks via uvx (no env creation needed, blazing fast)
+      - name: black (s2and/)
+        run: uvx --from black==24.8.0 black s2and --check --line-length 120
+      - name: black (scripts/*.py)
+        shell: bash
+        run: |
+          shopt -s nullglob
+          files=(scripts/*.py)
+          if (( ${#files[@]} )); then
+            uvx --from black==24.8.0 black "${files[@]}" --check --line-length 120
+          fi
+
+  typecheck-and-test:
     runs-on: ubuntu-latest
-
+    needs: [lint]
     steps:
-      - uses: actions/checkout@v1
-      - name: Build and test with Docker
-        run: |
-          docker build --tag s2and .
-          docker run --rm s2and pytest tests/ --verbose
-          docker run --rm s2and black s2and --check --line-length 120
-          docker run --rm s2and black scripts/*.py --check --line-length 120
-          docker run --rm s2and bash scripts/mypy.sh
-          docker run --rm s2and pytest tests/ --cov s2and --cov-fail-under=40
+      - uses: actions/checkout@v4
+
+      - name: Setup uv
+        uses: astral-sh/setup-uv@v3
+
+      # Optional: ensure a specific Python (uv can also manage this on its own)
+      - name: Setup Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.11'
+
+      # Cache uv resolver + wheels + project venv
+      - name: Cache uv + venv
+        uses: actions/cache@v4
+        with:
+          path: |
+            ~/.cache/uv
+            .venv
+          key: uv-venv-${{ runner.os }}-py311-${{ hashFiles('pyproject.toml', 'uv.lock') }}
+          restore-keys: |
+            uv-venv-${{ runner.os }}-py311-
+            uv-venv-
+
+      # Sync environment from lock if present (fast; no network if cached)
+      - name: Sync deps (locked if available)
+        shell: bash
+        run: |
+          if [[ -f uv.lock ]]; then
+            uv sync --all-extras --dev --frozen
+          else
+            # No lock present; resolve once, then install
+            uv sync --all-extras --dev
+          fi
+
+      # Type checking (run mypy commands directly)
+      - name: mypy (s2and)
+        run: uv run mypy s2and --ignore-missing-imports
+      - name: mypy (scripts)
+        run: uv run mypy scripts/*.py --ignore-missing-imports
+
+      # Single pytest run with coverage (replaces the two docker pytest calls)
+      - name: pytest (coverage)
+        env:
+          # keep startup lean; avoid user-level plugins on hosted runners
+          PYTHONPATH: .
+        run: |
+          uv run pytest tests/ \
+            --cov=s2and --cov-report=term-missing --cov-fail-under=40
+
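The `black (scripts/*.py)` step above guards against an empty glob with `shopt -s nullglob`: without it, bash passes the literal string `scripts/*.py` to the tool when no files match. A minimal standalone sketch of that guard (using `echo` and a temp directory in place of the real `uvx` call):

```shell
#!/usr/bin/env bash
# With nullglob set, an unmatched glob expands to nothing instead of itself.
shopt -s nullglob

tmp=$(mktemp -d)
files=("$tmp"/*.py)                   # no .py files yet -> empty array
echo "matches before: ${#files[@]}"   # prints 0

touch "$tmp/a.py" "$tmp/b.py"
files=("$tmp"/*.py)
if (( ${#files[@]} )); then           # same guard as the workflow step
  echo "matches after: ${#files[@]}"  # prints 2
fi
rm -rf "$tmp"
```

Without the guard, a formatter invoked on zero files would either error out or format the whole tree, depending on the tool; the `if (( ${#files[@]} ))` check simply skips the step when there is nothing to do.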

README.md

Lines changed: 43 additions & 16 deletions

@@ -3,33 +3,60 @@ This repository provides access to the S2AND dataset and S2AND reference model d
 
 The reference model is live on semanticscholar.org, and the trained model is available now as part of the data download (see below).
 
+## Installation Prereqs (one-time)
+Clone the repo.
+
+If `uv` is not installed yet, install it:
+
+```bash
+# (any OS) install uv into the Python you use to bootstrap environments
+python -m pip install --user --upgrade uv
+# Alternatively (if you use pipx): pipx install uv
+```
+
+---
+
 ## Installation
-To install this package, run the following:
+
+1. From repo root:
+
+```bash
+# create the project venv (uv defaults to .venv if you don't give a name)
+uv venv --python 3.11
+```
+
+2. Activate the venv (choose one):
 
 ```bash
-git clone https://github.com/allenai/S2AND.git
-cd S2AND
-conda create -y --name s2and python==3.8.15
-conda activate s2and
-pip install -r requirements.in
-pip install -e .
+# macOS / Linux (bash / zsh)
+source .venv/bin/activate
+
+# Windows PowerShell
+. .venv\Scripts\Activate.ps1
+
+# Windows CMD
+.venv\Scripts\activate.bat
 ```
 
-If you run into cryptic errors about GCC on macOS while installing the requirments, try this instead:
+3. Install project dependencies (dev extras):
+
 ```bash
-CFLAGS='-stdlib=libc++' pip install -r requirements.in
+# prefer uv --active so uv uses your activated environment
+uv sync --active --all-extras --dev
 ```
 
-Or use uv with a more recent Python version (3.11+):
+## Running Tests
+
+To run the tests, use the following command:
+
 ```bash
-uv venv s2anduv --python 3.11
-source s2anduv\Scripts\activate # macOS/Linux
-# s2anduv\Scripts\activate # Windows
-uv pip install fasttext-wheel pycld2
-uv pip install -r requirements_py_311.in
-uv pip install -e . --no-deps
+uv run pytest tests/
 ```
 
+To run the entire CI suite mimicking the GH Actions, use the following command:
+```bash
+python scripts\run_ci_locally.py
+```
 
 ## Data
 To obtain the S2AND dataset, run the following command after the package is installed (from inside the `S2AND` directory):
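The three README install steps can be sketched end to end. The version below substitutes the stdlib `venv` module for `uv venv` so it also runs where uv is not installed; the temp directory is hypothetical, and the final `uv sync` step is shown only as a comment:

```shell
#!/usr/bin/env bash
set -e
tmp=$(mktemp -d)

# step 1 stand-in for: uv venv --python 3.11
python3 -m venv "$tmp/.venv"

# step 2: activate it (macOS / Linux form from the README)
. "$tmp/.venv/bin/activate"

# step 3 would be: uv sync --active --all-extras --dev
python -c 'import sys; print(sys.prefix)'   # prefix now points inside .venv

deactivate
rm -rf "$tmp"
```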

pyproject.toml

Lines changed: 112 additions & 0 deletions

@@ -0,0 +1,112 @@
+[build-system]
+requires = ["setuptools>=68", "wheel"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "s2and"
+version = "0.1.0"
+description = "S2AND"
+readme = "README.md"
+requires-python = ">=3.11"
+license = { text = "MIT" }
+authors = [{ name = "Sergey Feldman, Daniel King, Shivashankar Subramanian" }]
+
+# --- Runtime dependencies (loosened, conservative) ---
+dependencies = [
+    "fasttext-wheel>=0.9.2",
+    "pycld2>=0.41",
+    "scikit-learn>=1.2,<1.5",
+    "text-unidecode==1.3",
+    "requests>=2.28,<3",
+    "hyperopt>=0.2.4,<0.3",
+    "pandas>=1.5,<2.2",
+    "lightgbm==3.2.1",
+    "fastcluster>=1.2.6,<2",
+    "genieclust>=1.1.4,<2",
+    "matplotlib>=3.7,<3.9",
+    "seaborn>=0.12,<0.14",
+    "tqdm>=4.64,<5",
+    "strsimpy>=0.2,<0.3",
+    "jellyfish>=0.9,<2",
+    "numpy>=1.24,<2",
+    "orjson>=3.9,<4",
+    "shap",
+    "sinonym",
+    # Backport only for older Pythons; not needed on 3.11+
+    'importlib-metadata>=4.13; python_version < "3.10"',
+]
+
+[project.optional-dependencies]
+dev = [
+    # Test stack
+    "pytest==8.4.1",
+    "pytest-cov>=4,<6",
+    # Type checking
+    "mypy>=1.5.1",
+    # Linters/formatters
+    "black==24.8.0",
+    "flake8>=6,<8",  # or prefer ruff below
+    "ruff>=0.4,<0.7",
+    # CLI helpers used in some repos
+    "click>=8,<9",
+]
+
+[tool.setuptools.packages.find]
+include = ["s2and*"]
+
+# ---- Tooling config ----
+[tool.black]
+line-length = 120
+target-version = ["py311"]
+
+[tool.pytest.ini_options]
+minversion = "7.0"
+testpaths = ["tests"]
+
+# (Optional) Ruff config if you use it instead of flake8
+[tool.ruff]
+line-length = 120
+target-version = "py311"
+select = ["E","F","I","UP","B"]
+ignore = []
+
+# If you keep flake8, you can mirror the same line length:
+[tool.flake8]
+max-line-length = 120
+
+# ------------------------
+# If you must replicate the *exact* legacy pins you sent, use this block instead
+# of the loosened dependencies above (comment out the dependencies list above and
+# paste these into it). This is *not* recommended long-term:
+#
+# "scikit-learn==1.2.2",
+# "text-unidecode==1.3",
+# "requests==2.24.0",
+# "hyperopt==0.2.4",
+# "pandas>=1.2",
+# "lightgbm==3.0.0",
+# "fastcluster==1.2.6",
+# "genieclust==1.1.4",
+# "matplotlib==3.7.1",
+# "seaborn==0.12.2",
+# "tqdm==4.49.0",
+# "strsimpy==0.2.0",
+# "jellyfish==0.8.2",
+# "numpy==1.24.3",
+# "orjson",
+# "shap",
+# "sinonym",
+# 'importlib-metadata==4.13.0; python_version < "3.10"',
+# "click>=7.1.2",
+#
+# And dev tools (old pins):
+# dev = [
+#     "pytest==8.4.1",
+#     "pytest-cov==2.10.1",
+#     "flake8==3.8.3",
+#     "black==22.3.0",
+#     "mypy>=1.5.1",
+#     'importlib-metadata==4.13.0; python_version < "3.10"',
+#     "click>=7.1.2",
+# ]
+# ------------------------

requirements.in

Lines changed: 0 additions & 27 deletions
This file was deleted.

requirements_py_311.in

Lines changed: 0 additions & 25 deletions
This file was deleted.

s2and/data.py

Lines changed: 3 additions & 2 deletions

@@ -1464,7 +1464,7 @@ def preprocess_papers_parallel(papers_dict: Dict, n_jobs: int, preprocess: bool)
     output: Dict = {}
     if n_jobs > 1:
         # Use UniversalPool to replicate the original p.imap() streaming behavior
-        with UniversalPool(processes=n_jobs) as p:
+        with UniversalPool(processes=n_jobs) as p:  # type: ignore
             _max = len(papers_dict)
             with tqdm(total=_max, desc="Preprocessing papers 1/2") as pbar:
                 for key, value in p.imap(preprocess_paper_1, papers_dict.items(), 1000):
@@ -1488,7 +1488,8 @@ def preprocess_papers_parallel(papers_dict: Dict, n_jobs: int, preprocess: bool)
                         journal_name=p.journal_name,
                         authors=[a.author_name for a in p.authors],
                     )
-                    for p in filter(None, [output.get(str(rid)) for rid in (value.references or [])])
+                    for p in [output.get(str(rid)) for rid in (value.references or [])]
+                    if p is not None
                 ],
             )
             for key, value in output.items()
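The second hunk above replaces `filter(None, ...)` with an explicit `if p is not None` guard. The two are not equivalent in general: `filter(None, ...)` drops every falsy value (0, empty strings, empty containers), while the explicit guard drops only `None`, which is the safer choice when `dict.get` misses are what you want to skip. Illustrated on plain data:

```python
# Mixed values, only None should be treated as "lookup missed".
values = [None, 0, "", "paper", []]

filtered_none = list(filter(None, values))             # drops every falsy item
explicit_guard = [v for v in values if v is not None]  # drops only None

print(filtered_none)    # ['paper']
print(explicit_guard)   # [0, '', 'paper', []]
```

For this code path the dict values are paper objects (always truthy), so behavior is unchanged, but the rewritten comprehension also happens to be easier for mypy to narrow.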

s2and/subblocking.py

Lines changed: 3 additions & 3 deletions

@@ -250,9 +250,9 @@ def make_subblocks(signature_ids, anddata, maximum_size=7500, first_k_letter_cou
                 key
             )
         for key in list(output_cant_subdivide_single_letter_first_name.keys()):
-            output_cant_subdivide_single_letter_first_name[
-                f"{first_letter}|middle=" + str(key)
-            ] = output_cant_subdivide_single_letter_first_name.pop(key)
+            output_cant_subdivide_single_letter_first_name[f"{first_letter}|middle=" + str(key)] = (
+                output_cant_subdivide_single_letter_first_name.pop(key)
+            )
         output.update(output_single_letter_first_name)
         output_for_specter.update(
             output_cant_subdivide_single_letter_first_name
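The hunk above is a pure black reformat of an in-place key-rename idiom: `d[new_key] = d.pop(old_key)` moves each value under a prefixed key. A standalone sketch of the same pattern with hypothetical block names:

```python
# Rename every key in place by prefixing it, mutating the dict as we go.
blocks = {"smith": ["sig1"], "jones": ["sig2"]}
first_letter = "s"

# list() snapshots the keys so we can safely mutate while iterating.
for key in list(blocks.keys()):
    blocks[f"{first_letter}|middle=" + str(key)] = blocks.pop(key)

print(sorted(blocks))  # ['s|middle=jones', 's|middle=smith']
```

The `list(...)` snapshot is load-bearing: iterating `blocks.keys()` directly while popping and inserting would raise `RuntimeError: dictionary changed size during iteration`.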

scripts/LLM_based_filtering_of_name_tuples.py

Lines changed: 1 addition & 0 deletions

@@ -271,6 +271,7 @@ def generate_chinese(input_tuples):
     else:
         print(f"Unexpected line format: {line}")
 
+
 # Step 2: A bunch of the names in the final_keep_tuples_deduped
 # don't appear in the original name_pairs.txt file, so we need to handle that
 # with LLMs!

scripts/blog_post_eval.py

Lines changed: 0 additions & 1 deletion

@@ -11,7 +11,6 @@
 python scripts/blog_post_eval.py --random_seed 42 --experiment_name dont_use_name_counts --feature_groups_to_skip name_counts
 """
 
-
 from typing import Optional, List, Dict, Any
 
 import os
