Commit cff2c3a — Merge pull request #62 from ziqizhang/dev

feat: JATE v3.0.0 — Python rewrite with 14 ATE algorithms

2 parents: f97d510 + faf0acf

89 files changed: 22,862 additions, 56 deletions

.flake8 (new file, 3 additions)

```ini
[flake8]
max-line-length = 120
extend-ignore = E203, E501
```

.github/workflows/ci.yml (35 additions, 10 deletions)

```diff
@@ -3,19 +3,9 @@ name: CI
 on:
   pull_request:
     branches: ["master"]
-    paths-ignore:
-      - "docs/**"
-      - "*.md"
-      - ".gitignore"
-      - "LICENSE"

   push:
     branches: ["master"]
-    paths-ignore:
-      - "docs/**"
-      - "*.md"
-      - ".gitignore"
-      - "LICENSE"

   workflow_dispatch:

@@ -93,6 +83,9 @@ jobs:
           poetry config virtualenvs.in-project true
           poetry install --no-interaction --with dev

+      - name: Download spaCy model
+        run: poetry run python -m spacy download en_core_web_sm
+
       - name: Run tests
         run: |
           poetry run pytest tests/ \
@@ -135,3 +128,35 @@ jobs:

       - name: Run mypy
         run: poetry run mypy src/jate/
+
+  build:
+    name: Build package
+    runs-on: ubuntu-latest
+    timeout-minutes: 5
+
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v5
+
+      - name: Set up Python ${{ env.PYTHON_VERSION }}
+        uses: actions/setup-python@v6
+        with:
+          python-version: ${{ env.PYTHON_VERSION }}
+
+      - name: Install Poetry
+        run: pipx install "poetry>=2.0.0"
+
+      - name: Build package
+        run: poetry build
+
+      - name: Verify package contents
+        run: |
+          pip install dist/*.whl
+          python -c "import jate; print(jate.__version__)"
+
+      - name: Upload build artifacts
+        uses: actions/upload-artifact@v4
+        with:
+          name: dist
+          path: dist/
+          retention-days: 7
```

.github/workflows/publish.yml (new file, 32 additions)

```yaml
name: Publish to PyPI

on:
  release:
    types: [published]

jobs:
  publish:
    name: Build and publish to PyPI
    runs-on: ubuntu-latest
    permissions:
      id-token: write # required for trusted publishing

    steps:
      - name: Checkout code
        uses: actions/checkout@v5
        with:
          ref: ${{ github.event.release.tag_name }}

      - name: Set up Python 3.11
        uses: actions/setup-python@v6
        with:
          python-version: "3.11"

      - name: Install Poetry
        run: pipx install "poetry>=2.0.0"

      - name: Build package
        run: poetry build

      - name: Publish to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1
```

.github/workflows/release.yml (5 additions, 0 deletions)

```diff
@@ -22,6 +22,11 @@ jobs:
       - name: Checkout repository
         uses: actions/checkout@v5

+      - name: Set up Python 3.11
+        uses: actions/setup-python@v6
+        with:
+          python-version: "3.11"
+
       - name: Set up Node.js
         uses: actions/setup-node@v6
         with:
```

.gitignore (5 additions, 1 deletion)

```diff
@@ -41,10 +41,14 @@ Thumbs.db
 # Environment
 .env
 .env.*
+CLAUDE.md
+docs/plans/

 # Node (semantic-release)
 node_modules/
-package-lock.json
+
+# MkDocs
+site/

 # Distribution
 *.tar.gz
```

.pre-commit-config.yaml (3 additions, 3 deletions)

```diff
@@ -2,7 +2,7 @@ exclude: '^docs/'
 default_stages: [pre-commit]

 default_language_version:
-  python: python3.11
+  python: python3.12

 repos:
   - repo: https://github.com/pre-commit/pre-commit-hooks
@@ -23,7 +23,7 @@ repos:
     rev: 24.8.0
     hooks:
       - id: black
-        language_version: python3.11
+        language_version: python3.12

   - repo: https://github.com/pycqa/isort
     rev: 5.13.2
@@ -35,7 +35,7 @@ repos:
     rev: 7.1.1
     hooks:
       - id: flake8
-        args: ["--max-line-length=120", "--extend-ignore=E203"]
+        args: ["--max-line-length=120", "--extend-ignore=E203,E501"]

 ci:
   autoupdate_schedule: weekly
```

README.md (140 additions, 35 deletions)

````diff
@@ -1,70 +1,175 @@
 # JATE — Just Automatic Term Extraction

-> **JATE is being completely rewritten.** The original Java/Solr library (84+ stars, 10+ algorithms) is being rebuilt from the ground up in Python. The goal: make automatic term extraction as easy as `pip install jate`.
+A Python library for automatic term extraction (ATE) from text corpora. JATE provides 14 ATE algorithms (13 classical + ensemble voting), corpus-level statistics, built-in evaluation, and a CLI — all pip-installable with no external services required.

-## What's happening?
+> Previously known as "Java Automatic Term Extraction" (84+ stars). The original Java/Solr library is preserved on the [`legacy/java`](https://github.com/ziqizhang/jate/tree/legacy/java) branch.

-JATE has been a trusted open-source tool for automatic term extraction since 2014, used by researchers and practitioners across NLP, biomedical text mining, and knowledge engineering. But the Java/Solr dependency made it hard to set up and integrate into modern Python-based NLP workflows.
+## Installation

-**We're fixing that.** Over the coming weeks, JATE is being revamped into a modern Python library — lightweight, pip-installable, and designed to work seamlessly with spaCy, pandas, and the broader Python ecosystem.
+```bash
+pip install jate
+```

-This rebuild is being developed with the help of **agentic coding tools**, combining human expertise in NLP research with AI-assisted development to accelerate the process.
+Or from source:

-> Looking for the original Java version? It's preserved on the [`legacy/java`](https://github.com/ziqizhang/jate/tree/legacy/java) branch.
+```bash
+git clone https://github.com/ziqizhang/jate.git
+cd jate
+pip install .
+```

-## Sneak peek
+Requires Python 3.11+ and a spaCy model:

-**Dead-simple API:**
+```bash
+python -m spacy download en_core_web_sm
+```
+
+## Quick start
+
+### Single document

 ```python
 import jate

-# One line — that's it
-terms = jate.extract("Your document text here...")
+# Extract terms from text (default: C-Value + POS pattern extraction)
+result = jate.extract("Your document text here...")

-# Corpus-level extraction with any algorithm
-terms = jate.extract_corpus(docs, algorithm="cvalue")
+for term in result:
+    print(f"{term.string:30s} score={term.score:.4f} surfaces={term.surface_forms}")
+```

-# Compare algorithms side by side
-results = jate.compare(corpus, algorithms=["cvalue", "tfidf", "weirdness"])
+### Corpus-level extraction
+
+```python
+import jate
+
+# From a list of texts
+result = jate.extract_corpus(
+    ["First document...", "Second document..."],
+    algorithm="tfidf",
+)
+
+# From a directory of text files
+result = jate.extract_corpus("path/to/corpus/", algorithm="cvalue")
+
+# Export results
+df = result.to_dataframe()
+print(result.to_csv())
 ```

````
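The `tfidf` option in the corpus-extraction snippet is corpus-level TF-IDF. As an illustration of the scoring idea only (a generic sketch with an assumed function name `tfidf_scores` and input shape, not JATE's internal implementation):

```python
import math

def tfidf_scores(doc_term_counts: list[dict[str, int]]) -> dict[str, float]:
    """Corpus-level TF-IDF: score(t) = tf(t) * log(N / df(t)),
    where tf is total corpus frequency and df is document frequency."""
    n_docs = len(doc_term_counts)
    tf: dict[str, int] = {}  # total frequency across the corpus
    df: dict[str, int] = {}  # number of documents containing the term
    for counts in doc_term_counts:
        for term, count in counts.items():
            tf[term] = tf.get(term, 0) + count
            df[term] = df.get(term, 0) + 1
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}

# Two toy "documents" as candidate-frequency maps
scores = tfidf_scores([{"term extraction": 2, "the corpus": 1}, {"the corpus": 3}])
```

A candidate that appears in every document gets an IDF of zero, which is how generic phrases like "the corpus" drop out of the ranking.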
**spaCy integration:**
60+
### Compare algorithms
3361

3462
```python
35-
import spacy
36-
nlp = spacy.load("en_core_web_sm")
37-
nlp.add_pipe("jate", config={"algorithm": "cvalue"})
63+
import jate
64+
65+
results = jate.compare(
66+
["Doc one...", "Doc two..."],
67+
algorithms=["cvalue", "tfidf", "rake", "weirdness"],
68+
)
3869

39-
doc = nlp("Your text here")
40-
print(doc._.terms)
70+
for algo_name, result in results.items():
71+
print(f"\n{algo_name}: {len(result)} terms")
72+
for term in list(result)[:5]:
73+
print(f" {term.string:30s} {term.score:.4f}")
4174
```
4275

43-
**CLI:**
76+
For large corpora, speed up with parallel processing:
4477

45-
```bash
46-
jate extract paper.pdf --algorithm cvalue
47-
jate benchmark --dataset genia --algorithms all
48-
jate demo # launches interactive web UI
78+
```python
79+
config = jate.JATEConfig(max_workers=4)
80+
results = jate.compare(docs, algorithms=["cvalue", "tfidf", "rake"], config=config)
81+
```
82+
83+
### Evaluation against a gold standard
84+
85+
```python
86+
import jate
87+
88+
result = jate.extract_corpus(docs, algorithm="cvalue")
89+
90+
evaluator = jate.Evaluator(gold_terms={"machine learning", "neural network", ...})
91+
eval_result = evaluator.evaluate(result)
92+
print(eval_result.summary())
93+
# P=0.2800 R=0.0644 F1=0.1047 TP=28 FP=72 FN=407 predicted=100 gold=435
94+
95+
# Evaluate top-k
96+
eval_at_50 = evaluator.evaluate_at_k(result, k=50)
4997
```
5098

51-
## Planned features
99+
### CLI
100+
101+
```bash
102+
# Extract terms from text
103+
jate extract "Your text here" --algorithm cvalue --top 20
104+
105+
# Extract from a corpus directory
106+
jate corpus path/to/docs/ --algorithm tfidf --output csv
107+
108+
# Compare algorithms on a corpus
109+
jate compare path/to/docs/ --algorithms cvalue tfidf rake
110+
111+
# Run benchmark on built-in dataset
112+
jate benchmark --top 100
113+
```
52114

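The P/R/F1 line in the evaluation summary follows the standard set-based definitions over lemmatised term strings. A minimal sketch with plain Python sets (`evaluate_terms` is a hypothetical helper for illustration, not JATE's `Evaluator`):

```python
def evaluate_terms(predicted: set[str], gold: set[str]) -> dict[str, float]:
    """Set-based precision/recall/F1 over lemmatised term strings."""
    tp = len(predicted & gold)   # extracted and in the gold standard
    fp = len(predicted - gold)   # extracted but wrong
    fn = len(gold - predicted)   # gold terms that were missed
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

scores = evaluate_terms(
    predicted={"neural network", "gradient descent", "bananas"},
    gold={"neural network", "gradient descent", "machine learning"},
)
```

Evaluating at top-k, as `evaluate_at_k` does, simply restricts `predicted` to the k highest-scoring terms before computing the same metrics.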
````diff
-- **13+ ATE algorithms** in one library — C-Value, NC-Value, TFIDF, RIDF, RAKE, ChiSquare, Weirdness, TermEx, GlossEx, and more
+## Algorithms
+
+| Algorithm | Description | Reference |
+|-----------|-------------|-----------|
+| `tfidf` | TF-IDF at corpus level | |
+| `cvalue` | Multi-word term extraction via nested term frequency | Frantzi et al. 2000 |
+| `ncvalue` | C-Value extended with context word information | Frantzi et al. 2000 |
+| `basic` | Frequency + containment scoring | Bordea et al. 2013 |
+| `combobasic` | Basic with parent and child containment | Bordea et al. 2013 |
+| `attf` | Average total term frequency (TTF / DF) | |
+| `ttf` | Raw total term frequency | |
+| `ridf` | Residual IDF (deviation from Poisson) | Church & Gale 1995 |
+| `rake` | Rapid Automatic Keyword Extraction | Rose et al. 2010 |
+| `chi_square` | Chi-square test for term independence | Matsuo & Ishizuka 2003 |
+| `weirdness` | Target vs reference corpus frequency ratio | Ahmad et al. 1999 |
+| `termex` | Domain pertinence + context + lexical cohesion | Sclano et al. 2007 |
+| `glossex` | Domain specificity via glossary comparison | Park et al. 2002 |
+| `voting` | Ensemble via reciprocal rank fusion | |
+
````
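As an illustration of how `cvalue` handles nested multi-word terms, here is a generic sketch of the Frantzi et al. 2000 formula: C-value(a) = log2(|a|) * (f(a) - avg frequency of the longer candidates containing a), where the average term vanishes when nothing contains a. This is not JATE's implementation, and the `+1` length smoothing (so unigrams get a nonzero weight) is an assumption of this sketch:

```python
import math
from collections import defaultdict

def cvalue(freq: dict[str, int]) -> dict[str, float]:
    """Illustrative C-Value over a map of candidate term -> corpus frequency."""
    words = {t: t.split() for t in freq}
    containers: dict[str, list[str]] = defaultdict(list)
    for a in freq:
        for b in freq:
            wa, wb = words[a], words[b]
            if a != b and len(wa) < len(wb):
                # contiguous word-level containment of a inside b
                if any(wb[i:i + len(wa)] == wa for i in range(len(wb) - len(wa) + 1)):
                    containers[a].append(b)

    scores: dict[str, float] = {}
    for a, f_a in freq.items():
        weight = math.log2(len(words[a]) + 1)  # +1 smoothing: assumption of this sketch
        nested_in = containers[a]
        if nested_in:
            # discount by the mean frequency of the candidates that contain a
            f_a = f_a - sum(freq[b] for b in nested_in) / len(nested_in)
        scores[a] = weight * f_a
    return scores

scores = cvalue({"neural network": 10, "deep neural network": 4, "network": 30})
```

Here "neural network" is discounted by the frequency of "deep neural network" that contains it, while longer terms earn a higher length weight.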
````diff
+## Candidate extractors
+
+| Extractor | Description |
+|-----------|-------------|
+| `pos_pattern` (default) | Regex over Universal POS tags (e.g. `(ADJ )*(NOUN )+`) |
+| `ngram` | Contiguous token n-grams (configurable min/max n) |
+| `noun_phrase` | spaCy noun chunk detection |
+
````
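The idea behind the default `pos_pattern` extractor (matching a regex like `(ADJ )*(NOUN )+` against the sequence of Universal POS tags) can be sketched as follows. This is an illustrative reimplementation over pre-tagged `(token, pos)` pairs, e.g. `[(t.text, t.pos_) for t in doc]` from spaCy, not JATE's actual extractor:

```python
import re

def pos_pattern_candidates(tagged: list[tuple[str, str]],
                           pattern: str = r"(ADJ )*(NOUN )+") -> list[str]:
    """Match a regex over the POS-tag sequence, then map matches back to tokens."""
    tags = [tag for _, tag in tagged]
    # Serialise tags as "TAG TAG ... " and remember where each tag starts.
    tag_string = "".join(tag + " " for tag in tags)
    offsets, cursor = [], 0
    for tag in tags:
        offsets.append(cursor)
        cursor += len(tag) + 1

    candidates = []
    for m in re.finditer(pattern, tag_string):
        first = offsets.index(m.start())                            # first matched token
        span = sum(1 for o in offsets if m.start() <= o < m.end())  # tokens covered
        candidates.append(" ".join(tok for tok, _ in tagged[first:first + span]))
    return candidates

tagged = [("The", "DET"), ("quick", "ADJ"), ("brown", "ADJ"), ("fox", "NOUN"),
          ("jumps", "VERB"), ("over", "ADP"), ("the", "DET"), ("lazy", "ADJ"),
          ("dog", "NOUN")]
candidates = pos_pattern_candidates(tagged)
```

Expressing the pattern over tags rather than tokens is what lets one short regex capture "adjective-modified noun phrases" regardless of vocabulary.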
````diff
+## How it works
+
+1. **Candidate extraction** — identifies potential terms using POS patterns, n-grams, or noun phrases
+2. **Lemmatisation** — normalises candidates to their lemmatised form (e.g. "neural networks" and "neural network" become one entry)
+3. **Sentence context** *(automatic)* — builds sentence co-occurrence and adjacency features for algorithms that use them (Chi-Square, NC-Value)
+4. **Corpus statistics** — builds frequency and co-occurrence counts (in-memory or SQLite-backed)
+5. **Scoring** — applies the chosen algorithm to rank candidates
+6. **Output** — returns `TermExtractionResult` with the normalised term, score, and all observed surface forms
+
+Each `Term` in the result contains:
+- `string` — the canonical (lemmatised) form, used for scoring and evaluation
+- `score` — algorithm-assigned score
+- `frequency` — total corpus frequency
+- `surface_forms` — all surface variants observed (e.g. `{"neural network", "neural networks", "Neural Networks"}`)
+
````
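For the `voting` ensemble, the scoring step fuses the per-algorithm rankings rather than their raw scores. A generic reciprocal rank fusion sketch (the standard formula with the conventional constant k = 60; not JATE's internal code):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]],
                           k: int = 60) -> list[tuple[str, float]]:
    """RRF(t) = sum over input rankings of 1 / (k + rank_of_t).

    Combining ranks sidesteps the incompatible score scales of the
    individual algorithms."""
    fused: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, term in enumerate(ranking, start=1):
            fused[term] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

ranked = reciprocal_rank_fusion([
    ["neural network", "machine learning", "bananas"],  # e.g. one algorithm's ranking
    ["neural network", "machine learning", "apples"],   # e.g. another's
])
```

Terms ranked highly by several algorithms float to the top, while a term favoured by only one ranker is dampened.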
````diff
+## Roadmap
+
+- **spaCy pipeline integration** — `nlp.add_pipe("jate")`
+- **Interactive web demo** — Streamlit UI with HuggingFace Spaces deployment
+- **More benchmarks** — ACTER, GENIA, CoastTerm, TermEval datasets
 - **Neural methods** — BERT-based sequence labeling, embedding-based scoring
-- **LLM-augmented extraction** — optional LLM re-ranking and validation of extracted terms
-- **Corpus-level analysis** — SQLite-backed statistics that scale without external services
-- **Built-in benchmarking** — 10 standard datasets (GENIA, ACTER, ACL RD-TEC, CoastTerm, and more) with one-command evaluation
-- **Interactive web demo** — Streamlit UI for trying algorithms, comparing results, and visualizing terms
-- **Multilingual** — works with any spaCy language model
-- **Agentic pipeline** — LangGraph-powered orchestration that automatically selects the best algorithm and parameters for your corpus
-- **Production-ready** — typed, tested, CI/CD, Docker, and published on PyPI
+- **LLM-augmented extraction** — optional LLM re-ranking and validation
+- **Agentic pipeline** — LangGraph-powered orchestration for automatic algorithm selection
+- **Multilingual support** — works with any spaCy language model
+- **Production-ready** — strict typing, >90% test coverage, Docker, PyPI publishing

 ## Get involved

 JATE is in active development. We'd love your input:

-- **Feature requests:** [Open an issue](https://github.com/ziqizhang/jate/issues/new?template=feature_request.yml) — tell us what you need
+- **Feature requests:** [Open an issue](https://github.com/ziqizhang/jate/issues/new?template=feature_request.yml)
 - **Bug reports:** [Report here](https://github.com/ziqizhang/jate/issues/new?template=bug_report.yml)
 - **Star the repo** to follow progress

````