|
1 | 1 | # JATE — Just Automatic Term Extraction |
2 | 2 |
|
3 | | -> **JATE is being completely rewritten.** The original Java/Solr library (84+ stars, 10+ algorithms) is being rebuilt from the ground up in Python. The goal: make automatic term extraction as easy as `pip install jate`. |
| 3 | +A Python library for automatic term extraction (ATE) from text corpora. JATE provides 14 ATE algorithms (13 classical + ensemble voting), corpus-level statistics, built-in evaluation, and a CLI — all pip-installable with no external services required. |
4 | 4 |
|
5 | | -## What's happening? |
| 5 | +> Previously known as "Java Automatic Term Extraction" (84+ stars). The original Java/Solr library is preserved on the [`legacy/java`](https://github.com/ziqizhang/jate/tree/legacy/java) branch. |
6 | 6 |
|
7 | | -JATE has been a trusted open-source tool for automatic term extraction since 2014, used by researchers and practitioners across NLP, biomedical text mining, and knowledge engineering. But the Java/Solr dependency made it hard to set up and integrate into modern Python-based NLP workflows. |
| 7 | +## Installation |
8 | 8 |
|
9 | | -**We're fixing that.** Over the coming weeks, JATE is being revamped into a modern Python library — lightweight, pip-installable, and designed to work seamlessly with spaCy, pandas, and the broader Python ecosystem. |
| 9 | +```bash |
| 10 | +pip install jate |
| 11 | +``` |
10 | 12 |
|
11 | | -This rebuild is being developed with the help of **agentic coding tools**, combining human expertise in NLP research with AI-assisted development to accelerate the process. |
| 13 | +Or from source: |
12 | 14 |
|
13 | | -> Looking for the original Java version? It's preserved on the [`legacy/java`](https://github.com/ziqizhang/jate/tree/legacy/java) branch. |
| 15 | +```bash |
| 16 | +git clone https://github.com/ziqizhang/jate.git |
| 17 | +cd jate |
| 18 | +pip install . |
| 19 | +``` |
14 | 20 |
|
15 | | -## Sneak peek |
| 21 | +Requires Python 3.11+ and a spaCy model: |
16 | 22 |
|
17 | | -**Dead-simple API:** |
| 23 | +```bash |
| 24 | +python -m spacy download en_core_web_sm |
| 25 | +``` |
| 26 | + |
| 27 | +## Quick start |
| 28 | + |
| 29 | +### Single document |
18 | 30 |
|
19 | 31 | ```python |
20 | 32 | import jate |
21 | 33 |
|
22 | | -# One line — that's it |
23 | | -terms = jate.extract("Your document text here...") |
| 34 | +# Extract terms from text (default: C-Value + POS pattern extraction) |
| 35 | +result = jate.extract("Your document text here...") |
24 | 36 |
|
25 | | -# Corpus-level extraction with any algorithm |
26 | | -terms = jate.extract_corpus(docs, algorithm="cvalue") |
| 37 | +for term in result: |
| 38 | + print(f"{term.string:30s} score={term.score:.4f} surfaces={term.surface_forms}") |
| 39 | +``` |
27 | 40 |
|
28 | | -# Compare algorithms side by side |
29 | | -results = jate.compare(corpus, algorithms=["cvalue", "tfidf", "weirdness"]) |
| 41 | +### Corpus-level extraction |
| 42 | + |
| 43 | +```python |
| 44 | +import jate |
| 45 | + |
| 46 | +# From a list of texts |
| 47 | +result = jate.extract_corpus( |
| 48 | + ["First document...", "Second document..."], |
| 49 | + algorithm="tfidf", |
| 50 | +) |
| 51 | + |
| 52 | +# From a directory of text files |
| 53 | +result = jate.extract_corpus("path/to/corpus/", algorithm="cvalue") |
| 54 | + |
| 55 | +# Export results |
| 56 | +df = result.to_dataframe() |
| 57 | +print(result.to_csv()) |
30 | 58 | ``` |
31 | 59 |
|
32 | | -**spaCy integration:** |
| 60 | +### Compare algorithms |
33 | 61 |
|
34 | 62 | ```python |
35 | | -import spacy |
36 | | -nlp = spacy.load("en_core_web_sm") |
37 | | -nlp.add_pipe("jate", config={"algorithm": "cvalue"}) |
| 63 | +import jate |
| 64 | + |
| 65 | +results = jate.compare( |
| 66 | + ["Doc one...", "Doc two..."], |
| 67 | + algorithms=["cvalue", "tfidf", "rake", "weirdness"], |
| 68 | +) |
38 | 69 |
|
39 | | -doc = nlp("Your text here") |
40 | | -print(doc._.terms) |
| 70 | +for algo_name, result in results.items(): |
| 71 | + print(f"\n{algo_name}: {len(result)} terms") |
| 72 | + for term in list(result)[:5]: |
| 73 | + print(f" {term.string:30s} {term.score:.4f}") |
41 | 74 | ``` |
42 | 75 |
|
43 | | -**CLI:** |
| 76 | +For large corpora, parallel processing speeds up the comparison: |
44 | 77 |
|
45 | | -```bash |
46 | | -jate extract paper.pdf --algorithm cvalue |
47 | | -jate benchmark --dataset genia --algorithms all |
48 | | -jate demo # launches interactive web UI |
| 78 | +```python |
| 79 | +config = jate.JATEConfig(max_workers=4) |
| 80 | +results = jate.compare(docs, algorithms=["cvalue", "tfidf", "rake"], config=config) |
| 81 | +``` |
| 82 | + |
| 83 | +### Evaluation against a gold standard |
| 84 | + |
| 85 | +```python |
| 86 | +import jate |
| 87 | + |
| 88 | +result = jate.extract_corpus(docs, algorithm="cvalue") |
| 89 | + |
| 90 | +evaluator = jate.Evaluator(gold_terms={"machine learning", "neural network", ...}) |
| 91 | +eval_result = evaluator.evaluate(result) |
| 92 | +print(eval_result.summary()) |
| 93 | +# P=0.2800 R=0.0644 F1=0.1047 TP=28 FP=72 FN=407 predicted=100 gold=435 |
| 94 | + |
| 95 | +# Evaluate top-k |
| 96 | +eval_at_50 = evaluator.evaluate_at_k(result, k=50) |
49 | 97 | ``` |
50 | 98 |
|
51 | | -## Planned features |
| 99 | +### CLI |
| 100 | + |
| 101 | +```bash |
| 102 | +# Extract terms from text |
| 103 | +jate extract "Your text here" --algorithm cvalue --top 20 |
| 104 | + |
| 105 | +# Extract from a corpus directory |
| 106 | +jate corpus path/to/docs/ --algorithm tfidf --output csv |
| 107 | + |
| 108 | +# Compare algorithms on a corpus |
| 109 | +jate compare path/to/docs/ --algorithms cvalue tfidf rake |
| 110 | + |
| 111 | +# Run benchmark on built-in dataset |
| 112 | +jate benchmark --top 100 |
| 113 | +``` |
52 | 114 |
|
53 | | -- **13+ ATE algorithms** in one library — C-Value, NC-Value, TFIDF, RIDF, RAKE, ChiSquare, Weirdness, TermEx, GlossEx, and more |
| 115 | +## Algorithms |
| 116 | + |
| 117 | +| Algorithm | Description | Reference | |
| 118 | +|-----------|-------------|-----------| |
| 119 | +| `tfidf` | TF-IDF at corpus level | — | |
| 120 | +| `cvalue` | Multi-word term extraction via nested term frequency | Frantzi et al. 2000 | |
| 121 | +| `ncvalue` | C-Value extended with context word information | Frantzi et al. 2000 | |
| 122 | +| `basic` | Frequency + containment scoring | Bordea et al. 2013 | |
| 123 | +| `combobasic` | Basic with parent and child containment | Bordea et al. 2013 | |
| 124 | +| `attf` | Average total term frequency (TTF / DF) | — | |
| 125 | +| `ttf` | Raw total term frequency | — | |
| 126 | +| `ridf` | Residual IDF (deviation from Poisson) | Church & Gale 1995 | |
| 127 | +| `rake` | Rapid Automatic Keyword Extraction | Rose et al. 2010 | |
| 128 | +| `chi_square` | Chi-square test for term independence | Matsuo & Ishizuka 2003 | |
| 129 | +| `weirdness` | Target vs reference corpus frequency ratio | Ahmad et al. 1999 | |
| 130 | +| `termex` | Domain pertinence + context + lexical cohesion | Sclano & Velardi 2007 |
| 131 | +| `glossex` | Domain specificity via glossary comparison | Park et al. 2002 | |
| 132 | +| `voting` | Ensemble via reciprocal rank fusion | — | |
| 133 | + |
| 134 | +## Candidate extractors |
| 135 | + |
| 136 | +| Extractor | Description | |
| 137 | +|-----------|-------------| |
| 138 | +| `pos_pattern` (default) | Regex over Universal POS tags (e.g. `(ADJ )*(NOUN )+`) | |
| 139 | +| `ngram` | Contiguous token n-grams (configurable min/max n) | |
| 140 | +| `noun_phrase` | spaCy noun chunk detection | |
| 141 | + |
| 142 | +## How it works |
| 143 | + |
| 144 | +1. **Candidate extraction** — identifies potential terms using POS patterns, n-grams, or noun phrases |
| 145 | +2. **Lemmatisation** — normalises candidates to their lemmatised form (e.g. "neural networks" and "neural network" become one entry) |
| 146 | +3. **Sentence context** *(automatic)* — builds sentence co-occurrence and adjacency features for algorithms that use them (Chi-Square, NC-Value) |
| 147 | +4. **Corpus statistics** — builds frequency and co-occurrence counts (in-memory or SQLite-backed) |
| 148 | +5. **Scoring** — applies the chosen algorithm to rank candidates |
| 149 | +6. **Output** — returns a `TermExtractionResult` that holds, for each term, the normalised form, its score, and all observed surface forms |
| 150 | + |
| 151 | +Each `Term` in the result contains: |
| 152 | +- `string` — the canonical (lemmatised) form, used for scoring and evaluation |
| 153 | +- `score` — algorithm-assigned score |
| 154 | +- `frequency` — total corpus frequency |
| 155 | +- `surface_forms` — all surface variants observed (e.g. `{"neural network", "neural networks", "Neural Networks"}`) |
| 156 | + |
| 157 | +## Roadmap |
| 158 | + |
| 159 | +- **spaCy pipeline integration** — `nlp.add_pipe("jate")` |
| 160 | +- **Interactive web demo** — Streamlit UI with HuggingFace Spaces deployment |
| 161 | +- **More benchmarks** — ACTER, GENIA, CoastTerm, TermEval datasets |
54 | 162 | - **Neural methods** — BERT-based sequence labeling, embedding-based scoring |
55 | | -- **LLM-augmented extraction** — optional LLM re-ranking and validation of extracted terms |
56 | | -- **Corpus-level analysis** — SQLite-backed statistics that scale without external services |
57 | | -- **Built-in benchmarking** — 10 standard datasets (GENIA, ACTER, ACL RD-TEC, CoastTerm, and more) with one-command evaluation |
58 | | -- **Interactive web demo** — Streamlit UI for trying algorithms, comparing results, and visualizing terms |
59 | | -- **Multilingual** — works with any spaCy language model |
60 | | -- **Agentic pipeline** — LangGraph-powered orchestration that automatically selects the best algorithm and parameters for your corpus |
61 | | -- **Production-ready** — typed, tested, CI/CD, Docker, and published on PyPI |
| 163 | +- **LLM-augmented extraction** — optional LLM re-ranking and validation |
| 164 | +- **Agentic pipeline** — LangGraph-powered orchestration for automatic algorithm selection |
| 165 | +- **Multilingual support** — works with any spaCy language model |
| 166 | +- **Production-ready** — strict typing, >90% test coverage, Docker, PyPI publishing |
62 | 167 |
|
63 | 168 | ## Get involved |
64 | 169 |
|
65 | 170 | JATE is in active development. We'd love your input: |
66 | 171 |
|
67 | | -- **Feature requests:** [Open an issue](https://github.com/ziqizhang/jate/issues/new?template=feature_request.yml) — tell us what you need |
| 172 | +- **Feature requests:** [Open an issue](https://github.com/ziqizhang/jate/issues/new?template=feature_request.yml) |
68 | 173 | - **Bug reports:** [Report here](https://github.com/ziqizhang/jate/issues/new?template=bug_report.yml) |
69 | 174 | - **Star the repo** to follow progress |
70 | 175 |
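The sample `eval_result.summary()` line in the evaluation example reports P, R and F1 next to the raw counts. As a sanity check, those figures follow from the counts under the standard definitions of precision, recall and F1. The helper name `prf1` below is illustrative and is not part of JATE.

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts taken from the sample summary line: TP=28 FP=72 FN=407
p, r, f1 = prf1(tp=28, fp=72, fn=407)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")
# P=0.2800 R=0.0644 F1=0.1047
```

Here `predicted` is TP + FP (100) and `gold` is TP + FN (435), which matches the remaining fields of the summary.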
|
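The `voting` row of the Algorithms table describes an ensemble built on reciprocal rank fusion. A minimal sketch of that technique on plain ranked lists follows. The function name, the constant `k=60` (a value commonly used for RRF) and the toy rankings are assumptions for illustration, not JATE's implementation.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several best-first rankings: each list adds 1 / (k + rank) to a term's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, term in enumerate(ranking, start=1):
            scores[term] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical rankings, e.g. one per algorithm, best term first.
rankings = [
    ["neural network", "machine learning", "gradient descent"],
    ["machine learning", "neural network", "loss function"],
    ["machine learning", "gradient descent", "neural network"],
]
fused = reciprocal_rank_fusion(rankings)
for term, score in fused:
    print(f"{term:20s} {score:.4f}")
```

Because only ranks enter the sum, the raw scores of the individual algorithms do not need to be on a comparable scale.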
|
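The `pos_pattern` row of the Candidate extractors table mentions a regex over Universal POS tags such as `(ADJ )*(NOUN )+`. One way to apply such a pattern is to join the tags into a string, match on it, and map each match back to token positions. The sketch below assumes pre-tagged `(word, tag)` pairs and a hypothetical helper `pos_pattern_candidates`; it returns only maximal matches and is not a description of how JATE itself does it.

```python
import re


def pos_pattern_candidates(
    tokens: list[tuple[str, str]], pattern: str = r"(ADJ )*(NOUN )+"
) -> list[str]:
    """Return word sequences whose POS tags match the pattern (greedy, non-overlapping)."""
    # Each tag is followed by one space, so the pattern's "TAG " units line up with tokens.
    tag_string = "".join(f"{tag} " for _, tag in tokens)

    # Map the character offset where each tag starts to its token index.
    start_to_index: dict[int, int] = {}
    offset = 0
    for index, (_, tag) in enumerate(tokens):
        start_to_index[offset] = index
        offset += len(tag) + 1

    candidates = []
    for match in re.finditer(pattern, tag_string):
        first = start_to_index.get(match.start())
        if first is None:  # match began inside a tag; ignore it
            continue
        length = match.group().count(" ")  # one space per matched tag
        candidates.append(" ".join(word for word, _ in tokens[first:first + length]))
    return candidates


# Hypothetical tagged sentence; in practice the tags would come from a spaCy model.
tagged = [
    ("deep", "ADJ"), ("neural", "ADJ"), ("networks", "NOUN"),
    ("learn", "VERB"),
    ("feature", "NOUN"), ("representations", "NOUN"),
]
print(pos_pattern_candidates(tagged))
# ['deep neural networks', 'feature representations']
```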