Same token from 2 subtrees of same sentence have same type and same .text, but are not the same object #11758
-
(code attached as github_query_00055.pdf) Much appreciation!
-
Sorry, I'm a bit of an amateur, but I'll tell you what I do know. In the attached PDF, I didn't see the definitions for the methods you appear to have written, such as "get_dep_ccomp_xcomp_conj_subtrees". Seeing those would be helpful, though it may not be necessary. My guess is that you jerry-rigged a solution from the spaCy features you knew of, whereas, as you said at the end of your paragraph, there might be a much more standard, easier, and more effective way to do this.

Seeing you pass both the sentence object and the nlp object to the same function is another clue that this may not be the most "spaCy" way to go about it. Usually every method you need is available from one of the main spaCy objects themselves, and those are really just Doc, Span (a run of words, like a sentence or phrase), and Token (basically, a word). You might run a couple of methods on the nlp (Language) object at the beginning to set things up, but from then on the Doc represents a complete annotation of the input text, with all the configured linguistic processes (the pipeline components, such as the dependency parser and the morphology analyzer) already applied.

When you retrieve something from that Doc - say, a sentence - I am relatively certain from experience that it stays dynamically linked to the Doc object, so spaCy will evaluate two excerpts as equal only if they cover the exact same token indices in the same document, not merely if they contain the same words in two different parts of the document. In my experience, spaCy's equality operator has always worked fine. I can't tell if this is what you did, but you can index or slice any part of the Doc with standard Python list indices, except they refer to tokens, not characters.
So if the indices are not the problem, it might be that in your custom method you processed some data from the Doc but ended up making a copy of the object, or that you returned only the specific data you wanted - say, the words in the subtree as .text - so the tokens lost their index, or possibly got re-indexed somehow; I don't really know. (Maybe the Doc even has a unique identifier attribute, so you could be absolutely certain that two spans are equivalent only if they are the same span from the same document.)

What you are trying to do - identify subtrees, based on a dependency parse, whose root/head has the label "ccomp", "xcomp", or "conj" - is indeed a very standard thing that spaCy is efficiently set up to do. The dependency parser is already a standard part of the English pipeline (https://spacy.io/models/en) - there you see "parser" listed under "pipeline". Click through to "parser", and at the top of that page you see "Assigned Attributes": the fields this component adds to the returned objects, which you can access. Token.dep_ returns the dependency label as a string, and Token.head returns the Token object that is the syntactic head of the current token.

What I do not understand so clearly is that if you want to access the children, that attribute is not listed on the parser page; but if you click through to the "Token" documentation, you see that Token has the property .children, and it even says it requires a dependency parse. I don't know why this is listed on the Token page rather than the parser page - maybe because a method on the Span or Token object is responsible for algorithmically scanning the heads of the tokens to construct the corresponding list of children.
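To make the attributes above concrete, here is a runnable sketch. The sentence, its head indices, and its labels are made up by me for illustration; I build the parse by hand with the Doc constructor so no trained model needs to be downloaded:

```python
import spacy
from spacy.tokens import Doc

# Hand-specify a small dependency parse: heads are absolute token
# indices, deps are the dependency labels assigned to each token.
nlp = spacy.blank("en")
doc = Doc(nlp.vocab,
          words=["She", "said", "he", "left", "and", "she", "stayed"],
          heads=[1, 1, 3, 1, 3, 6, 3],
          deps=["nsubj", "ROOT", "nsubj", "ccomp", "cc", "nsubj", "conj"])

said = doc[1]
print(said.dep_)                                # → ROOT
print(said.head.text)                           # the root is its own head → said
print([child.text for child in said.children])  # → ['She', 'left']
```

With a loaded pipeline like en_core_web_sm you would get the same attributes from `nlp("...")` directly; the hand-built Doc just keeps the example self-contained.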
Anyway, I think it's a little unintuitive for someone new to this. The docs also say .children returns a "sequence" of child tokens; I think they use that terminology for "generator". Sorry if you already know all this - just trying to be thorough and clear with my answer. So it should be this simple:
All you have to do is start from the root of the dependency tree. Again, this attribute is slightly hidden in the page for "Span": https://spacy.io/api/span
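A quick sketch of getting that root via Span.root, again on a hand-built parse (my own toy sentence) so it runs without a model:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = Doc(nlp.vocab,
          words=["She", "said", "he", "left"],
          heads=[1, 1, 3, 1],
          deps=["nsubj", "ROOT", "nsubj", "ccomp"])

sentence = doc[0:4]        # a Span covering the whole sentence
print(sentence.root.text)  # → said
```

With a loaded pipeline you would typically get the same thing from `next(doc.sents).root`.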
You have the root token of the sentence. I guess you basically want to traverse downwards until you find any tokens with the labels listed above. I am not sure whether you want the smallest possible such subtrees or the largest: for example, if you have a conjunction between two clauses that themselves contain conjunctions, do you want to go all the way to the bottom, or stop at the top? As far as I currently know, the most natural way to search a tree is recursion. Starting at the root, you define a simple repeatable rule, which is basically to scan through a token's immediate children: if they meet some condition, return them; if they don't, re-apply the same method to that token's children. I'll stick with the seemingly much simpler case of stopping as soon as you hit the first sought-after dependency label on each branch.
The tricky bit with recursion is that the method has to return the same type it takes as input, and you have to work out how to orchestrate that so it returns the final product you actually want - some kind of iterable of the sought-after tokens. Does that imply it also has to take an iterable as input? We can work with that by passing the root wrapped in a trivial one-element list. The method then takes a list of tokens and, for each one, asks: does it meet the condition (is it a matrix-clause head)? If so, add it to the transfer list the method is building as its return value. If not, call the method again on that token's children, so that the token replaces itself with whichever descendant tokens do meet the criterion - including nothing, if there are none. Lastly, "flatten" the transfer list, because it will contain nested lists, and return it (to the parent call, or as the final result). So that looks like this:
First, a utility method we need for later - it just flattens nested lists, but only lists with one sub-level of nesting:
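The original code was only attached as a PDF, so here is a sketch of the helper as described - it splices in any element that is itself a list and keeps plain elements as they are (the name `flattened_list` is taken from the discussion):

```python
def flattened_list(mixed):
    """Flatten one level of nesting: elements that are lists are
    spliced into the result; everything else is kept as-is."""
    flat = []
    for item in mixed:
        if isinstance(item, list):
            flat.extend(item)
        else:
            flat.append(item)
    return flat

print(flattened_list([1, [2, 3], 4, []]))  # → [1, 2, 3, 4]
```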
This is pretty bloated, cumbersome and inelegant; it's really just a simple way to sub-iterate over sublists while iterating normally over plain elements, so that every element ends up sequentially packed into a single flat list. I wish I knew better ways to do this in Python - it would be nice to return a stream of elements so you don't have to deal with list impracticalities, maybe using variable numbers of arguments with the unpacking operator *, so the method below could return any number of elements without them needing to be in a list. I think it's possible, but I haven't thought about it enough yet. itertools also has chain.from_iterable, which flattens one level out of the box, though it expects every element to be iterable rather than a mix of tokens and lists. As explained above, here is our recursive method, which takes a list of tokens and essentially replaces each token with its nearest descendant matrix tokens, returning a list of tokens back.
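Since the original code is only in the PDF, here is a self-contained sketch of the recursive method as described (the label set, the toy sentence, and its hand-built parse are my own illustrative assumptions; the flatten helper is repeated so this snippet runs on its own):

```python
import spacy
from spacy.tokens import Doc

MATRIX_DEPS = {"ccomp", "xcomp", "conj"}

def flattened_list(mixed):
    # One-level flatten, as described above (repeated for self-containment).
    flat = []
    for item in mixed:
        if isinstance(item, list):
            flat.extend(item)
        else:
            flat.append(item)
    return flat

def find_matrix_tokens(tokens):
    """Replace each token by the nearest descendants whose dependency
    label is in MATRIX_DEPS, stopping at the first match on each branch."""
    transfer = []
    for token in tokens:
        if token.dep_ in MATRIX_DEPS:
            transfer.append(token)          # condition met: keep this head
        else:
            # Not a matrix head: recurse into its children (a generator
            # is fine here - we only need an iterable).
            transfer.append(find_matrix_tokens(token.children))
    return flattened_list(transfer)

# Hand-built parse so no model download is needed.
nlp = spacy.blank("en")
doc = Doc(nlp.vocab,
          words=["She", "said", "he", "left", "and", "she", "stayed"],
          heads=[1, 1, 3, 1, 3, 6, 3],
          deps=["nsubj", "ROOT", "nsubj", "ccomp", "cc", "nsubj", "conj"])

matrix_heads = find_matrix_tokens([doc[1]])  # start from the root, wrapped in a list
print([t.text for t in matrix_heads])        # → ['left']
```

Note that "stayed" (conj) is not returned: it sits below "left" (ccomp), and the sketch stops at the first matching label on each branch, per the "stop at the top" choice above.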
token.children returns a generator, not a list, but because Python is dynamically typed this is fine - we just need an iterable to pass to the method. Consider the lowest-level case, at the bottom of the recursion: the "else" branch will not have fired, because by definition there are no more matrix clauses beneath that point in the tree. Either we return only matrix tokens - propagating back upwards through the nested recursive calls, each level passing its list of matrix tokens up to the level waiting on it - or we happen to be at a dead end with no matrix clauses beneath us at all, in which case we should simply return nothing. That shouldn't be a problem: at a terminal node - a bottom-most word with no children - calling .children just returns an empty generator, I believe, so the for loop exits immediately, there being no elements to iterate over, and we return whatever the transfer list looks like at that point, i.e. an empty list.

There is only one last problem: you return a list of matrix tokens in place of the governing token it replaced. You replaced a token with a list of tokens, so you are now building a list of nested lists, which we do not want. All you have to do is "flatten" your list, and the most standard way I know is the flattened_list utility defined above.
If I did this correctly, the outermost call returns a clean list of every token that qualifies as a matrix clause - but note that these are just the heads of the matrix clauses. Hopefully, from here, you can see that you can easily substitute each head with its entire subtree, if you wish, via token.subtree:
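A sketch of that last step, again on a hand-built toy parse (my own example sentence) so it runs standalone - token.subtree yields the token and all its descendants in document order:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = Doc(nlp.vocab,
          words=["She", "said", "he", "left", "and", "she", "stayed"],
          heads=[1, 1, 3, 1, 3, 6, 3],
          deps=["nsubj", "ROOT", "nsubj", "ccomp", "cc", "nsubj", "conj"])

head = doc[3]                     # "left", a ccomp head found by the search
subtree = list(head.subtree)      # the head plus all its descendants, in order
print([t.text for t in subtree])  # → ['he', 'left', 'and', 'she', 'stayed']
```

With the recursive search above, this would just be `[list(h.subtree) for h in matrix_heads]`.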
I would recommend running this in the VS Code debugger with breakpoints to check whether my logic is correct; I might do it later when I find some time. One thing I am still learning is how to add the above as a custom attribute or component in your spaCy pipeline, so that it runs automatically when the Doc object is created and you don't have to write an external method in addition - the list of matrix clauses in any given sentence could then be an automatically available attribute. Sorry for my verbosity; I was thinking aloud, which helps me make progress in my thinking. Take care
-
When providing sample code, please provide an example that we can run in isolation to confirm what's happening. It's best to paste the code as text in your question. If you prefer to work in notebooks, you can supply those as a gist, or by linking to a GitHub repo. Please do not post PDFs of code, which are hard to read and cannot be executed.

As it is, it's very hard to see what's going on, because you use several functions without including their definitions. I can kind of guess what they're doing, but it's not clear.

It looks like the problem is that you are accessing the same token in two different ways and it's not equal. If it's the same token in the same Doc object, then it should be equal to itself. The most important value when checking equality of tokens is not the text - since you can have the same word multiple times in a sentence - but the token's position in the Doc. If you can provide a short code sample so we can inspect what's going on, I'd be happy to take a closer look at this.
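For reference, a minimal check of that equality behavior - the example sentence is made up, and only the blank tokenizer is used, so no trained pipeline is needed:

```python
import spacy

nlp = spacy.blank("en")              # tokenizer only, no model download
doc = nlp("the cat saw the dog")

print(doc[0].text == doc[3].text)    # → True: same surface text, "the"
print(doc[0] == doc[3])              # → False: different positions in the Doc
print(doc[0] == doc[0:2][0])         # → True: same token reached via a Span
```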
No, not that I'm aware of. Maybe it would be easier to use a constituency parser?