Best practices generating and using training data for SpanCategorizer? #10049

romjansen · 2022-01-13T11:23:56Z

Hello everyone! I'm quite new to spaCy in general, so forgive me if any of my questions are basic.

For my master's thesis I'm planning to combine model-based text mining (spaCy's SpanCategorizer) with rule-based text mining (regex patterns) as a hybrid approach to extracting and labeling references to laws and regulations in a corpus of Dutch case law. Legal references have very specific formats, which generally follow formal conventions laid down in legal reference guides, but their formatting still varies greatly because of natural-language variation and poor adherence to those conventions.¹ Below I'll list a few examples of references to (Dutch) laws and regulations:

artikel 38v Wetboek van Strafrecht
artikel 248, tweede lid, Sr ("Sr" is short for "Wetboek van Strafrecht")
artikel 70 lid 1, aanhef en sub 3, Sr ("lid 1" is a variation of "eerste lid", "Sr" is short for "Wetboek van Strafrecht")
art. 71, aanhef en sub 3, Sr ("art." is short for "artikel", "Sr" is short for "Wetboek van Strafrecht")
art. 16 van de Grondwet ("art." is short for "artikel")

The regex patterns will try to capture most of the varying formats of references to laws and regulations, using names of laws and regulations taken from a dictionary built from a dataset provided and maintained by the Dutch government. Because regex patterns are inherently rigid, though, I'm still likely to miss a lot of references. This is where the SpanCategorizer comes into play: using the references extracted and labeled by the rule-based approach as training data for the SpanCategorizer, I want to find any references the rules missed.

Seeing as there will be a lot of training data, I was wondering how I could best provide the training data to the SpanCategorizer. Can I directly provide the extracted and labeled references to laws and regulations to the SpanCategorizer? If so, is this an efficient way of providing the SpanCategorizer with training data and how should this data be formatted (concerning formatting, see Matching regular expressions on the full text)? If not, what possibilities are there? Would, for example, using PhraseMatcher be an efficient way of providing the SpanCategorizer with training data?

Furthermore, since laws and regulations have a high degree of granularity and there might not be many training examples for references at deeper levels, I'm considering hierarchical classification: first classify references to laws and regulations at the most abstract level (i.e. any reference to a law or regulation), then attempt to classify those references at deeper levels (i.e. law or regulation > chapter > section > article > paragraph > sub). Does anyone have any advice on how to go about this?

Kind regards,

@romjansen

van Opijnen, M., Verwer, N., & Meijer, J. (2015). Beyond the experiment: the eXtendable Legal Link eXtractor. International Conference on Artificial Intelligence and Law.

polm · 2022-01-16T11:54:45Z

Sounds like there are a couple of questions here, so I'll take them one at a time.

To provide data to the spancat, you just need to annotate spans on a Doc. Specifically you need to put spans in a SpanGroup. An example of a simple way to do that:

import spacy
nlp = spacy.blank("en")  # any pipeline works; a blank English one is assumed here
doc = nlp("This is a sample.")
doc.spans["sc"] = [doc[0:1]]

sc is the default key used by the spancat but you can use anything. After creating Docs this way just save them in a DocBin.
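
To make the DocBin step concrete, here is a minimal sketch. The blank Dutch pipeline, the example sentence, and the label name LAW_REF are all assumptions chosen for illustration, not something prescribed by spaCy or by the question.

```python
import spacy
from spacy.tokens import DocBin, Span

# Assumption: a blank Dutch pipeline (tokenizer only); any pipeline would do.
nlp = spacy.blank("nl")

doc = nlp("Zie artikel 16 van de Grondwet.")
# Tokens 1 to 5 cover "artikel 16 van de Grondwet".
# "LAW_REF" is a made-up label for this sketch.
doc.spans["sc"] = [Span(doc, 1, 6, label="LAW_REF")]

# Collect the annotated Docs in a DocBin, the container spaCy trains from.
doc_bin = DocBin(docs=[doc])

# Round-trip through bytes to check that the span group is serialized too.
restored = list(DocBin().from_bytes(doc_bin.to_bytes()).get_docs(nlp.vocab))
for span in restored[0].spans["sc"]:
    print(span.text, span.label_)
```

In practice you would call doc_bin.to_disk(...) with a path of your choosing and point the train path in your training config at that file.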

As far as getting the data, using the EntityRuler with some regex matching rules is one way. If you already have regexes that work on the raw text, you may find it easier to use doc.char_span to generate the spans.
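
A rough sketch of the doc.char_span route, assuming you already have a regex that runs on the raw text. The pattern below is a deliberately simplistic stand-in (it only knows "artikel"/"art.", a number, and an optional "Sr"), and the LAW_REF label is made up. Note that doc.char_span returns None when the character offsets don't line up with token boundaries, so those matches need to be handled.

```python
import re
import spacy

# Assumption: a blank Dutch pipeline; only the tokenizer is needed here.
nlp = spacy.blank("nl")

# Hypothetical, incomplete pattern standing in for your real regexes.
PATTERN = re.compile(r"\b(?:artikel|art\.)\s+\d+[a-z]*(?:\s+Sr\b)?")

text = "Op grond van artikel 248 Sr is de verdachte veroordeeld."
doc = nlp(text)

spans = []
misaligned = []
for match in PATTERN.finditer(text):
    start, end = match.span()
    # Default alignment is strict: None if offsets fall inside a token.
    span = doc.char_span(start, end, label="LAW_REF")
    if span is None:
        misaligned.append(match.group())
    else:
        spans.append(span)

doc.spans["sc"] = spans
print([(s.text, s.label_) for s in doc.spans["sc"]], misaligned)
```

If many matches end up in misaligned, passing alignment_mode="expand" or alignment_mode="contract" to doc.char_span snaps the offsets to token boundaries instead of returning None; which one fits depends on your data.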

Regarding multiple levels of references, it sounds like it might be easiest to label the whole reference as a single span and then use rules to pick out the individual segments as a post-processing step. If that doesn't work as well as you want, you can at least use it as a baseline. While the references you're looking for are hierarchical, it's not really clear to me why your classifier would need to be hierarchical. Maybe you could expand on that with an example sentence and the desired labels?
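
As an illustration of the post-processing idea, here is a small sketch that takes the text of a whole reference span and splits it into segments with a regex. The pattern and the segment names (article, paragraph, sub, law) are assumptions tuned only to the five example references in the question; real data would need a much more complete rule set.

```python
import re

# Hypothetical pattern covering only the example formats from the question.
REFERENCE = re.compile(
    r"(?:artikel|art\.)\s+(?P<article>\d+[a-z]*)"
    r"(?:,?\s+(?:lid\s+(?P<paragraph_num>\d+)|(?P<paragraph_word>\w+)\s+lid))?"
    r"(?:,\s+aanhef\s+en\s+sub\s+(?P<sub>\d+))?"
    r",?\s+(?:van\s+de\s+)?(?P<law>.+)$"
)


def split_reference(reference: str) -> dict:
    """Split one whole reference into segments; empty dict if no rule fits."""
    match = REFERENCE.match(reference)
    if match is None:
        return {}
    groups = match.groupdict()
    return {
        "article": groups["article"],
        # "lid 1" and "eerste lid" style paragraphs land in the same field.
        "paragraph": groups["paragraph_num"] or groups["paragraph_word"],
        "sub": groups["sub"],
        "law": groups["law"],
    }


for ref in [
    "artikel 38v Wetboek van Strafrecht",
    "artikel 248, tweede lid, Sr",
    "artikel 70 lid 1, aanhef en sub 3, Sr",
    "art. 16 van de Grondwet",
]:
    print(ref, "->", split_reference(ref))
```

In a pipeline you would call split_reference(span.text) on each span the spancat predicts; references that return an empty dict show where the rules still need work.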
