Best practices generating and using training data for SpanCategorizer? #10049
Replies: 1 comment
-
Sounds like there are a couple of questions here, so I'll take them one at a time. To provide data to the spancat, you just need to annotate spans on a Doc. Specifically, you need to put the spans in a SpanGroup, which is stored under a key in doc.spans. An example of a simple way to do that:
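A minimal sketch of that, assuming a blank Dutch pipeline and the spancat's default spans key "sc". The sentence and the REFERENCE label are made up for illustration:

```python
import spacy
from spacy.tokens import DocBin, Span

# A blank Dutch pipeline is enough here; only the tokenizer is needed.
nlp = spacy.blank("nl")
doc = nlp("Verdachte heeft artikel 8 van de Wegenverkeerswet 1994 overtreden.")

# Span takes token indices (end exclusive). Tokens 2..7 cover the reference.
reference = Span(doc, 2, 8, label="REFERENCE")

# Assigning a list of spans to doc.spans[key] creates a SpanGroup.
# "sc" is the default spans_key of the spancat component.
doc.spans["sc"] = [reference]

# Collect annotated docs in a DocBin, the format `spacy train` reads.
# Calling doc_bin.to_disk("train.spacy") would write it to a file.
doc_bin = DocBin(docs=[doc])
```

Unlike doc.ents, the spans in a SpanGroup are allowed to overlap, which is why the spancat reads its annotations from there.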
As far as getting the data goes, using the EntityRuler with some regex matching rules is one way. If you already have regexes that work on the raw text, you may find it easier to use

Regarding multiple levels of references, it sounds like it might be easiest to label the whole reference as a single span and then use rules to pick out the individual segments as a post-processing step. If that doesn't work as well as you want, you can at least use it as a baseline.

While the references you're looking for are hierarchical, it's not really clear to me why your classifier would need to be hierarchical. Maybe you could expand on that with an example sentence and desired labels?
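To make the "whole reference first, segments afterwards" idea concrete, here is a rough sketch using only Python's re module. The pattern, the group names (article, paragraph, law) and the example sentence are hypothetical; real reference formats will need much broader rules:

```python
import re

# Hypothetical pattern for one reference format: "artikel N [lid M] [van de <law>]".
REFERENCE = re.compile(
    r"artikel\s+(?P<article>\d+[a-z]?)"
    r"(?:,?\s+lid\s+(?P<paragraph>\d+))?"
    r"(?:,?\s+van\s+de\s+(?P<law>[A-Z]\w+(?:\s+\d{4})?))?"
)

def find_references(text):
    """Return (start_char, end_char, label, segments) for each whole reference."""
    results = []
    for match in REFERENCE.finditer(text):
        # Keep only the segments that were actually present in this reference.
        segments = {k: v for k, v in match.groupdict().items() if v is not None}
        results.append((match.start(), match.end(), "REFERENCE", segments))
    return results

text = "Zie artikel 8 lid 2 van de Wegenverkeerswet 1994 en artikel 5."
refs = find_references(text)
for start, end, label, segments in refs:
    print(label, repr(text[start:end]), segments)
```

The character offsets can be turned into spaCy spans with doc.char_span(start, end, label=label), which returns None when the offsets don't line up with token boundaries, so it's worth checking for that before adding the span to the SpanGroup.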
-
Hello everyone! I'm quite new to spaCy in general, so forgive me if any of my questions are quite basic.
For my master's thesis I'm planning to combine model-based text mining (spaCy's SpanCategorizer) with rule-based text mining (regex patterns) as a hybrid approach to extracting and labeling references to laws and regulations in a corpus of Dutch case law. Legal references have very specific formats which generally follow formal conventions laid down in legal reference guides, but the formatting of legal references still varies greatly, for reasons such as the variation of natural language and conventions not being followed closely [1]. Below I'll list a couple of examples of references to (Dutch) laws and regulations:
The regex patterns will try to capture most of the varying formats of references to laws and regulations. They will use names of laws and regulations taken from a dictionary built from a dataset provided and maintained by the Dutch government. Due to the inherent rigidity of regex patterns, though, I'm still likely to miss a lot of references. This is where the SpanCategorizer comes into play. Using the references extracted and labeled through the rule-based approach as training data for the SpanCategorizer, I want to find any references to laws and regulations that the rules missed.
Seeing as there will be a lot of training data, I was wondering how I could best provide the training data to the SpanCategorizer. Can I directly provide the extracted and labeled references to laws and regulations to the SpanCategorizer? If so, is this an efficient way of providing the SpanCategorizer with training data and how should this data be formatted (concerning formatting, see Matching regular expressions on the full text)? If not, what possibilities are there? Would, for example, using PhraseMatcher be an efficient way of providing the SpanCategorizer with training data?
Furthermore, since laws and regulations have a high degree of granularity and there might not be many training examples for references on deeper levels, I'm considering using hierarchical classification: first classify references to laws and regulations on the most abstract level (i.e. any reference to a law or regulation), then attempt to classify those references on deeper levels (i.e. law or regulation > chapter > section > article > paragraph > sub). Does anyone have any advice on how to go about this?
Kind regards,
@romjansen
Footnotes
[1] van Opijnen, M., Verwer, N., & Meijer, J. (2015). Beyond the experiment: the eXtendable Legal Link eXtractor. International Conference on Artificial Intelligence and Law.