Clause segmentation of variable length #11616
Unanswered
swetepete asked this question in Help: Coding & Implementations
Replies: 1 comment
-
I'm actually very proud of my work today, here it is:
At this point you go back and forth between flag_terminals with one higher depth each time (1, 2, 3, 4, 5...) and group_by_siblings, until everything has been placed into a group. It is not perfect; I still have to look into new strategies for making the grouping more natural, but it's already really effective in certain ways.
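In rough outline (the exact signatures of flag_terminals and group_by_siblings are only sketched here, and sent is assumed to be the parsed sentence), the driving loop is:

groups = []
depth = 1
while sum(len(g) for g in groups) < len(sent):  # until every token is grouped
    flag_terminals(sent, depth)       # flag subtrees up to the current depth
    groups = group_by_siblings(sent)  # collect flagged neighbours into groups
    depth += 1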
-
I am consulting this SO post on how to segment sentences on the clause level:
https://stackoverflow.com/questions/65227103/clause-extraction-long-sentence-segmentation-in-python
The technique is roughly to find the root of the dependency parse and then seek any immediate children which are tagged as "conjunctions".
Each such detected word has an associated subtree, accessible via the Token.subtree attribute. Basically, you take out every such subtree, and the assumption seems to be that the only thing remaining will be the part before the first conjunction (i.e., the beginning of the sentence). So the author of that script tracks every token that has been included so far and, at the end, collects the words from the beginning of the sentence up to the first already-extracted part of the sentence.
They also store the indices of the subtrees so that they can retain their order for the final output.
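As a minimal sketch, that idea might look roughly like this in spaCy (the model name and the choice of "conj" as the dependency label to split on are my assumptions, not taken from the post):

import spacy

nlp = spacy.load("en_core_web_sm")  # model choice is illustrative

def clauses_from_conjunctions(sent):
    root = sent.root
    chunks = []   # (start index, tokens) for each extracted subtree
    seen = set()  # indices of every token already placed in a chunk
    for child in root.children:
        if child.dep_ == "conj":
            subtree = list(child.subtree)
            chunks.append((subtree[0].i, subtree))
            seen.update(tok.i for tok in subtree)
    # whatever precedes the first extracted subtree is the opening clause
    first = min((start for start, _ in chunks), default=sent[-1].i + 1)
    opening = [tok for tok in sent if tok.i < first and tok.i not in seen]
    if opening:
        chunks.append((opening[0].i, opening))
    return [toks for _, toks in sorted(chunks)]

doc = nlp("She finished the report and her colleague reviewed it before the meeting.")
for clause in clauses_from_conjunctions(list(doc.sents)[0]):
    print(" ".join(tok.text for tok in clause))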
I am considering whether this code could be written in a slightly simpler way, but also with different functionality.
Instead of only focusing on conjunctions, I simply want to break the sentence evenly into clauses of similar length. For example, I never want a clause to be longer than 5 words.
I believe some basic graph traversal algorithm could do this pretty easily. It would start at the "leaves" (terminal nodes) and accumulate upwards. If it finds that accumulating words one level up would exceed the span length, it reverts to the previous level and finalises that as a clause "chunk".
For example, looking at the dependency parse in displaCy, you can imagine how a German sentence could be grouped into units of five words maximum.
I am now drafting a script to do this.
After some thinking, I realise it will require a sense of the tree having "levels": you can't just start at any "leaf" (terminal node) of the tree; you have to start at the lowest (deepest) leaves.
You then move to the token's head and accumulate the nearest token.
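A minimal sketch of that, assuming doc is an already-parsed spaCy Doc (token_depth is just an illustrative helper, not a spaCy API):

def token_depth(token):
    # number of head-hops from the token up to the sentence root
    depth = 0
    while token.head is not token:  # in spaCy, the root is its own head
        token = token.head
        depth += 1
    return depth

sent = list(doc.sents)[0]
leaves = [tok for tok in sent if not list(tok.children)]
# start from the deepest leaves so accumulation proceeds bottom-up
for tok in sorted(leaves, key=token_depth, reverse=True):
    print(tok.text, token_depth(tok))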
Anyway, maybe I haven't thought this through enough because it strikes me as a tricky graph navigation question.
If anyone thinks they know a simple way to do this, I'd appreciate getting some advice.
Thanks very much.
Edit:
I believe I thought of an extremely simple way to do this.
First, iterate over every token and store it if list(word.children) is empty.
For each such "leaf", navigate to its head.
Now check the length, in words, of each head's subtree. Presumably it would be at least 2 (the leaf plus its head).
To keep things simple, keep moving all the heads up to their heads. This may cause some of the heads to overlap, which is fine: discard duplicates.
Once you have climbed to height level 4 or 5 in the tree, you may find that some subtrees are longer or shorter. It doesn't matter - as soon as one of the subtrees has a length of 5, you have reached the maximum possible height.
I will give this a shot now and see if it leads to any weird overlaps or anything between the subtrees.
Thank you.
Here's what I've got so far:
# `doc` is assumed to be an already-parsed spaCy Doc
sent = list(doc.sents)[0]

# get all tokens with no children
leaves = [word for word in sent if not list(word.children)]

# set to remove duplicates, re-sort on word index
level_1 = sorted(set([word.head for word in leaves]), key=lambda x: x.i)

# now filter out all subtrees that are too big
level_1 = [word for word in level_1 if len(list(word.subtree)) <= 5]

# check whether one word's subtree contains the other's (in either direction)
def exists_subtree_relationship(word1, word2):
    subtree1 = list(word1.subtree)
    subtree2 = list(word2.subtree)
    return set(subtree1) <= set(subtree2) or set(subtree2) <= set(subtree1)

# figure out which word has the shorter subtree so it can be removed
def which_sublist_is_shorter(word1, word2):
    s1_length = len(list(word1.subtree))
    s2_length = len(list(word2.subtree))
    if s1_length < s2_length:
        return word1
    else:
        return word2

# go through level_1 and, for every pair of words that has a subtree
# relationship, delete the one with the smaller subtree
to_drop = set()
for word1 in level_1:
    for word2 in level_1:
        if word1.i != word2.i and exists_subtree_relationship(word1, word2):
            to_drop.add(which_sublist_is_shorter(word1, word2))
level_1 = [word for word in level_1 if word not in to_drop]
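The snippet above only climbs one level; the repeated climb described in the edit (keep promoting heads, discard duplicates, stop once a subtree reaches five words) could be sketched roughly like this, where level_up is a helper name I'm introducing just for illustration:

def level_up(words, max_len=5):
    # one climbing step: promote each word to its head unless the head's
    # subtree would exceed the limit, then de-duplicate and re-sort by index
    promoted = []
    for word in words:
        head = word.head
        promoted.append(head if len(list(head.subtree)) <= max_len else word)
    return sorted(set(promoted), key=lambda x: x.i)

current = level_1
while True:
    nxt = level_up(current)
    if {w.i for w in nxt} == {w.i for w in current}:
        break  # nothing could climb any further
    current = nxt
# the nested-subtree cleanup above would need to be re-applied to `current`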
At that point you could just use those chunks as indices to split up the sentence, or you could attempt a second round of chunking on the sequences of words that remain. The second round could make progress if words whose subtree is already a chunk count as "leaves", and you just move upwards from them.
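For the first option, a rough sketch building on the level_1 list above, cutting the sentence right after the last word of each surviving chunk:

chunk_ends = {max(tok.i for tok in word.subtree) for word in level_1}

pieces, acc = [], []
for tok in sent:
    acc.append(tok)
    if tok.i in chunk_ends:
        pieces.append(acc)
        acc = []
if acc:
    pieces.append(acc)

for piece in pieces:
    print(" ".join(tok.text for tok in piece))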
However, I think the biggest problem with this method is that a word's children are not necessarily adjacent to each other. Maybe I could add the condition that you can only chunk subtrees whose words are adjacent. Or I could move in the opposite direction: start at the root, go one level down to see whether any subtrees of length five appear, extract those, and move further down.
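And the opposite, root-down direction with an explicit adjacency check might be sketched like this (is_contiguous and top_down_chunks are made-up names, not existing spaCy calls):

from collections import deque

def is_contiguous(tokens):
    # True if the tokens' indices form one unbroken run in the sentence
    idxs = sorted(tok.i for tok in tokens)
    return idxs[-1] - idxs[0] + 1 == len(idxs)

def top_down_chunks(sent, max_len=5):
    # breadth-first walk down from the root: whenever a subtree is contiguous
    # and at most max_len words long, keep it as a chunk and stop descending;
    # note the head word of an oversized subtree is itself left unassigned here
    chunks = []
    queue = deque([sent.root])
    while queue:
        node = queue.popleft()
        subtree = list(node.subtree)
        if len(subtree) <= max_len and is_contiguous(subtree):
            chunks.append(subtree)
        else:
            queue.extend(node.children)
    return sorted(chunks, key=lambda toks: toks[0].i)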