Clause segmentation of variable length #11616
Unanswered
swetepete asked this question in Help: Coding & Implementations
Replies: 1 comment
-
I'm actually very proud of my work today, here it is:
At this point you go back and forth between flag_terminals with one higher depth each time (1, 2, 3, 4, 5...) and group_by_siblings, until everything has been placed into a group. It is not perfect; I still have to look into new strategies for making the grouping more natural, but it's already really effective in certain ways.
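In rough outline (the exact signatures of flag_terminals and group_by_siblings are only sketched here, and sent is assumed to be the parsed sentence), the driving loop is:

groups = []
depth = 1
while sum(len(g) for g in groups) < len(sent):  # until every token is grouped
    flag_terminals(sent, depth)       # flag subtrees up to the current depth
    groups = group_by_siblings(sent)  # collect flagged neighbours into groups
    depth += 1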
-
I am consulting this SO post on how to segment sentences on the clause level:
https://stackoverflow.com/questions/65227103/clause-extraction-long-sentence-segmentation-in-python
The technique is roughly to find the root of the dependency parse and then seek any immediate children which are tagged as "conjunctions".
Each such detected word has an associated subtree, accessible via the Token.subtree attribute. Basically, you take out every such subtree, and the assumption seems to be that the only thing remaining will be the part before the first conjunction (i.e., the beginning of the sentence). So the author of that script tracks every token that has been included so far and, at the end, collects the words from the beginning of the sentence up to the first already-extracted part of the sentence.
They also store the indices of the subtrees so that they can retain their order for the final output.
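As a minimal sketch, that idea might look roughly like this in spaCy (the model name and the choice of "conj" as the dependency label to split on are my assumptions, not taken from the post):

import spacy

nlp = spacy.load("en_core_web_sm")  # model choice is illustrative

def clauses_from_conjunctions(sent):
    root = sent.root
    chunks = []   # (start index, tokens) for each extracted subtree
    seen = set()  # indices of every token already placed in a chunk
    for child in root.children:
        if child.dep_ == "conj":
            subtree = list(child.subtree)
            chunks.append((subtree[0].i, subtree))
            seen.update(tok.i for tok in subtree)
    # whatever precedes the first extracted subtree is the opening clause
    first = min((start for start, _ in chunks), default=sent[-1].i + 1)
    opening = [tok for tok in sent if tok.i < first and tok.i not in seen]
    if opening:
        chunks.append((opening[0].i, opening))
    return [toks for _, toks in sorted(chunks)]

doc = nlp("She finished the report and her colleague reviewed it before the meeting.")
for clause in clauses_from_conjunctions(list(doc.sents)[0]):
    print(" ".join(tok.text for tok in clause))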
I am considering whether this code could be written in a slightly simpler way, but also with different functionality.
Instead of only focusing on conjunctions, I simply want to break the sentence evenly into clauses of similar length. For example, I never want a clause to be longer than 5 words.
I believe some basic graph traversal algorithm could do this pretty easily. It would start at the "leaves" (terminal nodes) and accumulate upwards. If it finds that accumulating words one level up would exceed the span length, it reverts to the previous level and finalises that as a clause "chunk".
For example, looking at the dependency parse in displaCy, you can imagine how a German sentence could be grouped into units of five words maximum.
I am now drafting a script to do this.
After some thinking, I realise it will require a sense of the tree having "levels": you can't just start at any "leaf" (terminal node) of the tree; you have to start at the lowest (deepest) leaves.
You then move to the token's head and accumulate the nearest token.
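A minimal sketch of that, assuming doc is an already-parsed spaCy Doc (token_depth is just an illustrative helper, not a spaCy API):

def token_depth(token):
    # number of head-hops from the token up to the sentence root
    depth = 0
    while token.head is not token:  # in spaCy, the root is its own head
        token = token.head
        depth += 1
    return depth

sent = list(doc.sents)[0]
leaves = [tok for tok in sent if not list(tok.children)]
# start from the deepest leaves so accumulation proceeds bottom-up
for tok in sorted(leaves, key=token_depth, reverse=True):
    print(tok.text, token_depth(tok))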
Anyway, maybe I haven't thought this through enough because it strikes me as a tricky graph navigation question.
If anyone thinks they know a simple way to do this, I'd appreciate getting some advice.
Thanks very much.
Edit:
I believe I thought of an extremely simple way to do this.
First, iterate over every token and store it if list(word.children) is empty.
For each such "leaf", navigate to its head.
Now check the length, in words, of each head's subtree. Presumably it would be at least 2 (the leaf plus its head).
To keep things simple, keep moving all the heads up to their heads. This may cause some of the heads to overlap, which is fine: discard duplicates.
Once you have climbed to height level 4 or 5 in the tree, you may find that some subtrees are longer or shorter. It doesn't matter - as soon as one of the subtrees has a length of 5, you have reached the maximum possible height.
I will give this a shot now and see if it leads to any weird overlaps or anything between the subtrees.
Thank you.
Here's what I've got so far:
# `doc` is assumed to be an already-parsed spaCy Doc
sent = list(doc.sents)[0]

# get all tokens with no children
leaves = [word for word in sent if not list(word.children)]

# set to remove duplicates, re-sort on word index
level_1 = sorted(set([word.head for word in leaves]), key=lambda x: x.i)

# now filter out all subtrees that are too big
level_1 = [word for word in level_1 if len(list(word.subtree)) <= 5]

# check whether one word's subtree contains the other's (in either direction)
def exists_subtree_relationship(word1, word2):
    subtree1 = list(word1.subtree)
    subtree2 = list(word2.subtree)
    return set(subtree1) <= set(subtree2) or set(subtree2) <= set(subtree1)

# figure out which word has the shorter subtree so it can be removed
def which_sublist_is_shorter(word1, word2):
    s1_length = len(list(word1.subtree))
    s2_length = len(list(word2.subtree))
    if s1_length < s2_length:
        return word1
    else:
        return word2

# go through level_1 and, for every pair of words that has a subtree
# relationship, delete the one with the smaller subtree
to_drop = set()
for word1 in level_1:
    for word2 in level_1:
        if word1.i != word2.i and exists_subtree_relationship(word1, word2):
            to_drop.add(which_sublist_is_shorter(word1, word2))
level_1 = [word for word in level_1 if word not in to_drop]
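The snippet above only climbs one level; the repeated climb described in the edit (keep promoting heads, discard duplicates, stop once a subtree reaches five words) could be sketched roughly like this, where level_up is a helper name I'm introducing just for illustration:

def level_up(words, max_len=5):
    # one climbing step: promote each word to its head unless the head's
    # subtree would exceed the limit, then de-duplicate and re-sort by index
    promoted = []
    for word in words:
        head = word.head
        promoted.append(head if len(list(head.subtree)) <= max_len else word)
    return sorted(set(promoted), key=lambda x: x.i)

current = level_1
while True:
    nxt = level_up(current)
    if {w.i for w in nxt} == {w.i for w in current}:
        break  # nothing could climb any further
    current = nxt
# the nested-subtree cleanup above would need to be re-applied to `current`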
At that point you could just use those chunks as indices to split up the sentence, or you could attempt a second round of chunking on the sequences of words that remain. The second round could make progress if words whose subtree is already a chunk count as "leaves", and you just move upwards from them.
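For the first option, a rough sketch building on the level_1 list above, cutting the sentence right after the last word of each surviving chunk:

chunk_ends = {max(tok.i for tok in word.subtree) for word in level_1}

pieces, acc = [], []
for tok in sent:
    acc.append(tok)
    if tok.i in chunk_ends:
        pieces.append(acc)
        acc = []
if acc:
    pieces.append(acc)

for piece in pieces:
    print(" ".join(tok.text for tok in piece))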
However, I think the biggest problem with this method is that a word's children are not necessarily adjacent to each other. Maybe I could add the condition that you can only chunk subtrees whose words are adjacent. Or I could move in the opposite direction: start at the root, go one level down to see whether any subtrees of length five appear, extract those, and move further down.
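And the opposite, root-down direction with an explicit adjacency check might be sketched like this (is_contiguous and top_down_chunks are made-up names, not existing spaCy calls):

from collections import deque

def is_contiguous(tokens):
    # True if the tokens' indices form one unbroken run in the sentence
    idxs = sorted(tok.i for tok in tokens)
    return idxs[-1] - idxs[0] + 1 == len(idxs)

def top_down_chunks(sent, max_len=5):
    # breadth-first walk down from the root: whenever a subtree is contiguous
    # and at most max_len words long, keep it as a chunk and stop descending;
    # note the head word of an oversized subtree is itself left unassigned here
    chunks = []
    queue = deque([sent.root])
    while queue:
        node = queue.popleft()
        subtree = list(node.subtree)
        if len(subtree) <= max_len and is_contiguous(subtree):
            chunks.append(subtree)
        else:
            queue.extend(node.children)
    return sorted(chunks, key=lambda toks: toks[0].i)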