Improve load performance of SpanRuler with lots of phrase patterns? #11988
I've created a language model with only a SpanRuler component (nothing else in the pipeline). The ruler is initialized with a few million phrase patterns. The runtime performance is totally fine, but it takes a really long time to load the model. When the model is assembled, the phrases are serialized as their original text, and I understand the text gets tokenized and processed by the underlying PhraseMatcher on load. Is there any way to improve the load performance? Is there some serialization methodology where the SpanRuler+PhraseMatcher can be saved ready-to-go, instead of doing all the parsing on load?
Replies: 1 comment
There is not a special mode for this or anything. What we have suggested in the past is pickling (#4445, #10514). That should avoid the overhead of recreating the Docs, and should be similar to reading from a DocBin and using add. There's no way to directly set the final internal data structures because they aren't exposed on the Python object, but we could think about adding a way to do that.