Merged
97 changes: 84 additions & 13 deletions .github/workflows/main.yaml
@@ -2,21 +2,92 @@ name: CI

on:
pull_request:
branches:
- main
branches: [main]

concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true

jobs:
build:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4

# Install uv (fast Python package and project manager)
- name: Setup uv
uses: astral-sh/setup-uv@v3

# Cache uv's resolver/download cache (speeds up uvx runs and resolution)
- name: Cache uv caches
uses: actions/cache@v4
with:
path: |
~/.cache/uv
key: uv-cache-${{ runner.os }}-${{ hashFiles('pyproject.toml', 'uv.lock') }}

# Style checks via uvx (runs black from an ephemeral environment; no project env needed)
- name: black (s2and/)
run: uvx --from black==24.8.0 black s2and --check --line-length 120
- name: black (scripts/*.py)
shell: bash
run: |
shopt -s nullglob
files=(scripts/*.py)
if (( ${#files[@]} )); then
uvx --from black==24.8.0 black "${files[@]}" --check --line-length 120
fi

typecheck-and-test:
runs-on: ubuntu-latest

needs: [lint]
steps:
- uses: actions/checkout@v1
- name: Build and test with Docker
run: |
docker build --tag s2and .
docker run --rm s2and pytest tests/ --verbose
docker run --rm s2and black s2and --check --line-length 120
docker run --rm s2and black scripts/*.py --check --line-length 120
docker run --rm s2and bash scripts/mypy.sh
docker run --rm s2and pytest tests/ --cov s2and --cov-fail-under=40
- uses: actions/checkout@v4

- name: Setup uv
uses: astral-sh/setup-uv@v3

# Optional: ensure a specific Python (uv can also manage this on its own)
- name: Setup Python
uses: actions/setup-python@v5
with:
python-version: '3.11'

# Cache uv resolver + wheels + project venv
- name: Cache uv + venv
uses: actions/cache@v4
with:
path: |
~/.cache/uv
.venv
key: uv-venv-${{ runner.os }}-py311-${{ hashFiles('pyproject.toml', 'uv.lock') }}
restore-keys: |
uv-venv-${{ runner.os }}-py311-
uv-venv-

# Sync environment from lock if present (fast; no network if cached)
- name: Sync deps (locked if available)
shell: bash
run: |
if [[ -f uv.lock ]]; then
uv sync --all-extras --dev --frozen
else
# No lock present; resolve once, then install
uv sync --all-extras --dev
fi

# Type checking (run mypy commands directly)
- name: mypy (s2and)
run: uv run mypy s2and --ignore-missing-imports
- name: mypy (scripts)
run: uv run mypy scripts/*.py --ignore-missing-imports

# Single pytest run with coverage (replaces the two docker pytest calls)
- name: pytest (coverage)
env:
# make the repo root importable without installing the package
PYTHONPATH: .
run: |
uv run pytest tests/ \
--cov=s2and --cov-report=term-missing --cov-fail-under=40

59 changes: 43 additions & 16 deletions README.md
@@ -3,33 +3,60 @@ This repository provides access to the S2AND dataset and S2AND reference model d

The reference model is live on semanticscholar.org, and the trained model is available now as part of the data download (see below).

## Installation Prereqs (one-time)
Clone the repo.

If `uv` is not installed yet, install it:

```bash
# (any OS) install uv into the Python you use to bootstrap environments
python -m pip install --user --upgrade uv
# Alternatively (if you use pipx): pipx install uv
```

---

## Installation
To install this package, run the following:

1. From repo root:

```bash
# create the project venv (uv defaults to .venv if you don't give a name)
uv venv --python 3.11
```

2. Activate the venv (choose one):

```bash
git clone https://github.com/allenai/S2AND.git
cd S2AND
conda create -y --name s2and python==3.8.15
conda activate s2and
pip install -r requirements.in
pip install -e .
# macOS / Linux (bash / zsh)
source .venv/bin/activate

# Windows PowerShell
. .venv\Scripts\Activate.ps1

# Windows CMD
.venv\Scripts\activate.bat
```

If you run into cryptic errors about GCC on macOS while installing the requirements, try this instead:
3. Install project dependencies (dev extras):

```bash
CFLAGS='-stdlib=libc++' pip install -r requirements.in
# prefer uv --active so uv uses your activated environment
uv sync --active --all-extras --dev
```

Or use uv with a more recent Python version (3.11+):
## Running Tests

To run the tests, use the following command:

```bash
uv venv s2anduv --python 3.11
source s2anduv/bin/activate # macOS/Linux
# s2anduv\Scripts\activate # Windows
uv pip install fasttext-wheel pycld2
uv pip install -r requirements_py_311.in
uv pip install -e . --no-deps
uv run pytest tests/
```

To run the entire CI suite locally, mimicking the GitHub Actions workflow, use the following command:
```bash
python scripts/run_ci_locally.py
```
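A local CI runner of this kind typically just replays the workflow's commands in order and stops on the first failure. A hypothetical minimal sketch (the repo's actual `scripts/run_ci_locally.py` may differ; the command list mirrors the CI steps above):

```python
"""Hypothetical sketch of a local CI runner; the real script may differ."""
import subprocess
import sys

# Mirror the CI jobs: style check, type check, tests with coverage.
COMMANDS = [
    ["uvx", "--from", "black==24.8.0", "black", "s2and", "--check", "--line-length", "120"],
    ["uv", "run", "mypy", "s2and", "--ignore-missing-imports"],
    ["uv", "run", "pytest", "tests/", "--cov=s2and", "--cov-fail-under=40"],
]

def main() -> int:
    for cmd in COMMANDS:
        print("+", " ".join(cmd))  # echo the command like `set -x` would
        if subprocess.run(cmd).returncode != 0:
            return 1  # fail fast, matching CI behavior
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

Failing fast on the first non-zero exit code matches how the GitHub Actions job aborts on a failed step.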

## Data
To obtain the S2AND dataset, run the following command after the package is installed (from inside the `S2AND` directory):
112 changes: 112 additions & 0 deletions pyproject.toml
@@ -0,0 +1,112 @@
[build-system]
requires = ["setuptools>=68", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "s2and"
version = "0.1.0"
description = "S2AND"
readme = "README.md"
requires-python = ">=3.11"
license = { text = "MIT" }
authors = [
{ name = "Sergey Feldman" },
{ name = "Daniel King" },
{ name = "Shivashankar Subramanian" },
]

# --- Runtime dependencies (loosened, conservative) ---
dependencies = [
"fasttext-wheel>=0.9.2",
"pycld2>=0.41",
"scikit-learn>=1.2,<1.5",
"text-unidecode==1.3",
"requests>=2.28,<3",
"hyperopt>=0.2.4,<0.3",
"pandas>=1.5,<2.2",
"lightgbm==3.2.1",
"fastcluster>=1.2.6,<2",
"genieclust>=1.1.4,<2",
"matplotlib>=3.7,<3.9",
"seaborn>=0.12,<0.14",
"tqdm>=4.64,<5",
"strsimpy>=0.2,<0.3",
"jellyfish>=0.9,<2",
"numpy>=1.24,<2",
"orjson>=3.9,<4",
"shap",
"sinonym",
# Backport only for older Pythons; not needed on 3.11+
'importlib-metadata>=4.13; python_version < "3.10"',
]

[project.optional-dependencies]
dev = [
# Test stack
"pytest==8.4.1",
"pytest-cov>=4,<6",
# Type checking
"mypy>=1.5.1",
# Linters/formatters
"black==24.8.0",
"flake8>=6,<8", # or prefer ruff below
"ruff>=0.4,<0.7",
# CLI helpers used in some repos
"click>=8,<9",
]

[tool.setuptools.packages.find]
include = ["s2and*"]

# ---- Tooling config ----
[tool.black]
line-length = 120
target-version = ["py311"]

[tool.pytest.ini_options]
minversion = "7.0"
testpaths = ["tests"]

# (Optional) Ruff config if you use it instead of flake8
[tool.ruff]
line-length = 120
target-version = "py311"

[tool.ruff.lint]
select = ["E", "F", "I", "UP", "B"]
ignore = []

# Note: flake8 does not read pyproject.toml. If you keep flake8, mirror the
# line length in a .flake8 or setup.cfg file instead:
# [flake8]
# max-line-length = 120

# ------------------------
# If you must replicate the *exact* legacy pins you sent, use this block instead
# of the loosened dependencies above (comment out the dependencies list above and
# paste these into it). This is *not* recommended long-term:
#
# "scikit-learn==1.2.2",
# "text-unidecode==1.3",
# "requests==2.24.0",
# "hyperopt==0.2.4",
# "pandas>=1.2",
# "lightgbm==3.0.0",
# "fastcluster==1.2.6",
# "genieclust==1.1.4",
# "matplotlib==3.7.1",
# "seaborn==0.12.2",
# "tqdm==4.49.0",
# "strsimpy==0.2.0",
# "jellyfish==0.8.2",
# "numpy==1.24.3",
# "orjson",
# "shap",
# "sinonym",
# 'importlib-metadata==4.13.0; python_version < "3.10"',
# "click>=7.1.2",
#
# And dev tools (old pins):
# dev = [
# "pytest==8.4.1",
# "pytest-cov==2.10.1",
# "flake8==3.8.3",
# "black==22.3.0",
# "mypy>=1.5.1",
# 'importlib-metadata==4.13.0; python_version < "3.10"',
# "click>=7.1.2",
# ]
# ------------------------
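The `python_version < "3.10"` suffix on the `importlib-metadata` dependency above is a PEP 508 environment marker: installers evaluate it against the target interpreter and skip the requirement when it is false. A quick way to see that evaluation (uses the `packaging` library, which pip and setuptools build on):

```python
from packaging.markers import Marker

# The same marker string as in the dependency specifier above.
marker = Marker('python_version < "3.10"')

# Evaluate against two hypothetical interpreters; the dict overrides
# the corresponding key of the default environment.
on_39 = marker.evaluate({"python_version": "3.9"})    # backport installed
on_311 = marker.evaluate({"python_version": "3.11"})  # skipped on 3.11+

print(on_39, on_311)
```

Since this project declares `requires-python = ">=3.11"`, the marker never fires here; it only matters if the pins are reused on older interpreters.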
27 changes: 0 additions & 27 deletions requirements.in

This file was deleted.

25 changes: 0 additions & 25 deletions requirements_py_311.in

This file was deleted.

5 changes: 3 additions & 2 deletions s2and/data.py
@@ -1464,7 +1464,7 @@ def preprocess_papers_parallel(papers_dict: Dict, n_jobs: int, preprocess: bool)
output: Dict = {}
if n_jobs > 1:
# Use UniversalPool to replicate the original p.imap() streaming behavior
with UniversalPool(processes=n_jobs) as p:
with UniversalPool(processes=n_jobs) as p: # type: ignore
_max = len(papers_dict)
with tqdm(total=_max, desc="Preprocessing papers 1/2") as pbar:
for key, value in p.imap(preprocess_paper_1, papers_dict.items(), 1000):
@@ -1488,7 +1488,8 @@ def preprocess_papers_parallel(papers_dict: Dict, n_jobs: int, preprocess: bool)
journal_name=p.journal_name,
authors=[a.author_name for a in p.authors],
)
for p in filter(None, [output.get(str(rid)) for rid in (value.references or [])])
for p in [output.get(str(rid)) for rid in (value.references or [])]
if p is not None
],
)
for key, value in output.items()
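The change from `filter(None, ...)` to an explicit `if p is not None` guard is more than a style tweak: `filter(None, seq)` drops every falsy element, not just `None`. A minimal sketch of the difference (hypothetical lookup values, not the real paper objects):

```python
# filter(None, ...) removes ALL falsy values; the guard removes only None.
lookup = {"1": "paper-a", "2": "", "3": None}  # "" is falsy but not None
ids = ["1", "2", "3", "4"]                     # "4" is a missing key

via_filter = list(filter(None, [lookup.get(i) for i in ids]))
via_guard = [p for p in (lookup.get(i) for i in ids) if p is not None]

print(via_filter)  # ['paper-a']      -- the empty string was dropped too
print(via_guard)   # ['paper-a', '']  -- only None/missing entries dropped
```

The explicit guard is also friendlier to type checkers, which can narrow `Optional[T]` to `T` inside the comprehension.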
6 changes: 3 additions & 3 deletions s2and/subblocking.py
@@ -250,9 +250,9 @@ def make_subblocks(signature_ids, anddata, maximum_size=7500, first_k_letter_cou
key
)
for key in list(output_cant_subdivide_single_letter_first_name.keys()):
output_cant_subdivide_single_letter_first_name[
f"{first_letter}|middle=" + str(key)
] = output_cant_subdivide_single_letter_first_name.pop(key)
output_cant_subdivide_single_letter_first_name[f"{first_letter}|middle=" + str(key)] = (
output_cant_subdivide_single_letter_first_name.pop(key)
)
output.update(output_single_letter_first_name)
output_for_specter.update(
output_cant_subdivide_single_letter_first_name
1 change: 1 addition & 0 deletions scripts/LLM_based_filtering_of_name_tuples.py
@@ -271,6 +271,7 @@ def generate_chinese(input_tuples):
else:
print(f"Unexpected line format: {line}")


# Step 2: A bunch of the names in the final_keep_tuples_deduped
# don't appear in the original name_pairs.txt file, so we need to handle that
# with LLMs!
1 change: 0 additions & 1 deletion scripts/blog_post_eval.py
@@ -11,7 +11,6 @@
python scripts/blog_post_eval.py --random_seed 42 --experiment_name dont_use_name_counts --feature_groups_to_skip name_counts
"""


from typing import Optional, List, Dict, Any

import os