Commit f6ed07b

Author: Kabir Khan

Use nlp.pipe in EntityRuler for phrase patterns in add_patterns (#4931)

* Fix ent_ids and labels properties when id attribute used in patterns
* Use set for labels
* Sort ent_ids for comparison in entity_ruler tests
* Fix entity_ruler ent_ids test
* Add to set
* Run make_doc optimistically if using phrase matcher patterns
* Remove unused coveragerc used during testing
* Format
* Refactor EntityRuler.add_patterns to use nlp.pipe for phrase patterns; improves speed substantially
* Remove old add_patterns function
* Fix spacing
* Make sure token_patterns are loaded as well; before, the generator was being emptied in from_disk

1 parent 72c964b · commit f6ed07b

File tree

2 files changed: +65 / -3 lines changed

spacy/pipeline/entityruler.py

Lines changed: 38 additions & 3 deletions

```diff
@@ -8,7 +8,7 @@
 from ..errors import Errors
 from ..compat import basestring_
 from ..util import ensure_path, to_disk, from_disk
-from ..tokens import Span
+from ..tokens import Doc, Span
 from ..matcher import Matcher, PhraseMatcher
 
 DEFAULT_ENT_ID_SEP = "||"
@@ -162,6 +162,7 @@ def ent_ids(self):
     @property
     def patterns(self):
         """Get all patterns that were added to the entity ruler.
+
         RETURNS (list): The original patterns, one dictionary per pattern.
 
         DOCS: https://spacy.io/api/entityruler#patterns
@@ -194,6 +195,7 @@ def add_patterns(self, patterns):
 
         DOCS: https://spacy.io/api/entityruler#add_patterns
         """
+
         # disable the nlp components after this one in case they hadn't been initialized / deserialised yet
         try:
             current_index = self.nlp.pipe_names.index(self.name)
@@ -203,7 +205,33 @@ def add_patterns(self, patterns):
         except ValueError:
             subsequent_pipes = []
         with self.nlp.disable_pipes(subsequent_pipes):
+            token_patterns = []
+            phrase_pattern_labels = []
+            phrase_pattern_texts = []
+            phrase_pattern_ids = []
+
             for entry in patterns:
+                if isinstance(entry["pattern"], basestring_):
+                    phrase_pattern_labels.append(entry["label"])
+                    phrase_pattern_texts.append(entry["pattern"])
+                    phrase_pattern_ids.append(entry.get("id"))
+                elif isinstance(entry["pattern"], list):
+                    token_patterns.append(entry)
+
+            phrase_patterns = []
+            for label, pattern, ent_id in zip(
+                phrase_pattern_labels,
+                self.nlp.pipe(phrase_pattern_texts),
+                phrase_pattern_ids
+            ):
+                phrase_pattern = {
+                    "label": label, "pattern": pattern, "id": ent_id
+                }
+                if ent_id:
+                    phrase_pattern["id"] = ent_id
+                phrase_patterns.append(phrase_pattern)
+
+            for entry in token_patterns + phrase_patterns:
                 label = entry["label"]
                 if "id" in entry:
                     ent_label = label
@@ -212,8 +240,8 @@ def add_patterns(self, patterns):
                     self._ent_ids[key] = (ent_label, entry["id"])
 
                 pattern = entry["pattern"]
-                if isinstance(pattern, basestring_):
-                    self.phrase_patterns[label].append(self.nlp(pattern))
+                if isinstance(pattern, Doc):
+                    self.phrase_patterns[label].append(pattern)
                 elif isinstance(pattern, list):
                     self.token_patterns[label].append(pattern)
                 else:
@@ -226,6 +254,8 @@ def add_patterns(self, patterns):
     def _split_label(self, label):
         """Split Entity label into ent_label and ent_id if it contains self.ent_id_sep
 
+        label (str): The value of label in a pattern entry
+
         RETURNS (tuple): ent_label, ent_id
         """
         if self.ent_id_sep in label:
@@ -239,6 +269,9 @@ def _split_label(self, label):
     def _create_label(self, label, ent_id):
         """Join Entity label with ent_id if the pattern has an `id` attribute
 
+        label (str): The label to set for ent.label_
+        ent_id (str): The label
+
         RETURNS (str): The ent_label joined with configured `ent_id_sep`
         """
         if isinstance(ent_id, basestring_):
@@ -250,6 +283,7 @@ def from_bytes(self, patterns_bytes, **kwargs):
 
         patterns_bytes (bytes): The bytestring to load.
         **kwargs: Other config paramters, mostly for consistency.
+
         RETURNS (EntityRuler): The loaded entity ruler.
 
         DOCS: https://spacy.io/api/entityruler#from_bytes
@@ -292,6 +326,7 @@ def from_disk(self, path, **kwargs):
 
         path (unicode / Path): The JSONL file to load.
         **kwargs: Other config paramters, mostly for consistency.
+
         RETURNS (EntityRuler): The loaded entity ruler.
 
         DOCS: https://spacy.io/api/entityruler#from_disk
```
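The two-pass structure of the refactored `add_patterns` can be sketched in isolation. Below is a minimal, spaCy-free sketch: `fake_pipe` is a hypothetical stand-in for `nlp.pipe`, which the real code uses to process all phrase-pattern texts in one batched call, while token patterns (lists of dicts) pass through unchanged.

```python
def fake_pipe(texts):
    """Hypothetical stand-in for nlp.pipe: yields a processed form of each text."""
    for text in texts:
        yield text.split()  # a real pipeline would yield Doc objects


def add_patterns(patterns):
    token_patterns = []
    phrase_pattern_labels = []
    phrase_pattern_texts = []
    phrase_pattern_ids = []

    # First pass: sort entries by pattern type. Strings are phrase patterns,
    # lists of dicts are token patterns.
    for entry in patterns:
        if isinstance(entry["pattern"], str):
            phrase_pattern_labels.append(entry["label"])
            phrase_pattern_texts.append(entry["pattern"])
            phrase_pattern_ids.append(entry.get("id"))
        elif isinstance(entry["pattern"], list):
            token_patterns.append(entry)

    # Second pass: process all phrase texts in a single batched call instead
    # of invoking the pipeline once per pattern.
    phrase_patterns = []
    for label, doc, ent_id in zip(
        phrase_pattern_labels, fake_pipe(phrase_pattern_texts), phrase_pattern_ids
    ):
        phrase_pattern = {"label": label, "pattern": doc}
        if ent_id:
            phrase_pattern["id"] = ent_id
        phrase_patterns.append(phrase_pattern)

    return token_patterns + phrase_patterns


result = add_patterns([
    {"label": "ORG", "pattern": "Apple Inc"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
    {"label": "ORG", "pattern": "Acme Corp", "id": "acme"},
])
```

After this step, every entry in `result` carries a pre-processed pattern, so the downstream loop only needs to check `isinstance(pattern, Doc)` rather than re-running the pipeline per string.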

website/docs/usage/rule-based-matching.md

Lines changed: 27 additions & 0 deletions
````diff
@@ -1096,6 +1096,33 @@ with the patterns. When you load the model back in, all pipeline components will
 be restored and deserialized – including the entity ruler. This lets you ship
 powerful model packages with binary weights _and_ rules included!
 
+### Using a large number of phrase patterns {#entityruler-large-phrase-patterns new="2.2.4"}
+
+When using a large number of **phrase patterns** (roughly > 10000), it's useful
+to understand how the `add_patterns` function of the EntityRuler works. For
+each **phrase pattern**, the EntityRuler calls the nlp object to construct a
+`Doc` object. This matters if, for example, you add the EntityRuler at the end
+of an existing pipeline with a POS tagger and want to extract matches based on
+the pattern's POS signature.
+
+In this case you would pass a config value of `phrase_matcher_attr="POS"` for
+the EntityRuler.
+
+Running the full language pipeline across every pattern in a large list scales
+linearly and can therefore take a long time on large numbers of phrase patterns.
+
+As of spaCy 2.2.4 the `add_patterns` function has been refactored to use
+nlp.pipe on all phrase patterns, resulting in roughly a 10x-20x speed up with
+5,000-100,000 phrase patterns respectively.
+
+Even with this speedup (but especially if you're using an older version) the
+`add_patterns` function can still take a long time. An easy workaround to make
+this function run faster is disabling the other language pipes while adding
+the phrase patterns.
+
+```python
+entityruler = EntityRuler(nlp)
+patterns = [{"label": "TEST", "pattern": str(i)} for i in range(100000)]
+
+other_pipes = [p for p in nlp.pipe_names if p != "tagger"]
+with nlp.disable_pipes(*other_pipes):
+    entityruler.add_patterns(patterns)
+```
+
 ## Combining models and rules {#models-rules}
 
 You can combine statistical and rule-based components in a variety of ways.
````
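The speed-up described in the docs above comes from amortizing per-call pipeline overhead across a single batched `nlp.pipe` call. A schematic, spaCy-free illustration follows; the `StubPipeline` class and its fixed per-call setup cost are invented for illustration, and real gains also depend on how spaCy batches work inside pipeline components.

```python
class StubPipeline:
    """Toy stand-in for a spaCy pipeline; counts per-call setup work."""

    def __init__(self):
        self.setup_calls = 0

    def __call__(self, text):
        self.setup_calls += 1  # fixed overhead paid on every call
        return text.split()

    def pipe(self, texts):
        self.setup_calls += 1  # fixed overhead paid once for the whole batch
        for text in texts:
            yield text.split()


texts = ["pattern %d" % i for i in range(1000)]

# Pre-2.2.4 behaviour: one pipeline call per phrase pattern.
one_by_one = StubPipeline()
docs_a = [one_by_one(t) for t in texts]

# 2.2.4 behaviour: one batched call for all phrase patterns.
batched = StubPipeline()
docs_b = list(batched.pipe(texts))
```

Both approaches produce identical docs, but the batched variant pays the setup cost once instead of a thousand times, which is the shape of the scaling win the commit targets.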
