Same token from 2 subtrees of same sentence have same type and same .text, but are not the same object #11758
-
(code attached as github_query_00055.pdf) Much appreciation!
-
Sorry, I'm a bit of an amateur, but I'll tell you what I do know. In the attached PDF, I didn't see the definitions for the methods you appear to have written, such as "get_dep_ccomp_xcomp_conj_subtrees". Seeing those would be helpful, though it may not be necessary. My guess is that you jerry-rigged a solution from the spaCy features you knew of, whereas, as you said at the end of your paragraph, there might be a much more standard, easier, and more effective way to do this.

Seeing you pass both the sentence object and the nlp object to the same function is another clue that this may not be the most "spaCy" way to go about it. Usually every method you need is available from one of the main spaCy objects themselves, and those are really just Doc, Span (a run of words, like a sentence or phrase), and Token (basically, a word). You might run a couple of methods on the nlp (Language) object at the beginning to set things up, but from then on the Doc represents a complete annotation of the input text, with all the configured linguistic processes (the pipeline components, such as the dependency parser and the morphology analyzer) already applied.

When you retrieve something from that Doc - say, a sentence - I am relatively certain from experience that it stays dynamically linked to the Doc object, so spaCy will evaluate two excerpts as equal only if they cover the exact same token indices in the same document, not merely if they contain the same words in two different parts of the document. In my experience, spaCy's equality operator has always worked fine. I can't tell if this is what you did, but you can index or slice any part of the Doc with standard Python list indices, except they refer to tokens, not characters.
So if the indices are not the problem, it might be that in your custom method you processed some data from the Doc but ended up making a copy of the object, or that you returned only the specific data you wanted - say, the words in the subtree as .text - so the tokens lost their index, or possibly got re-indexed somehow; I don't really know. (Maybe the Doc even has a unique identifier attribute, so you could be absolutely certain that two spans are equivalent only if they are the same span from the same document.)

What you are trying to do - identify subtrees, based on a dependency parse, whose root/head has the label "ccomp", "xcomp", or "conj" - is indeed a very standard thing that spaCy is efficiently set up to do. The dependency parser is already a standard part of the English pipeline (https://spacy.io/models/en) - there you see "parser" listed under "pipeline". Click through to "parser", and at the top of that page you see "Assigned Attributes": the fields this component adds to the returned objects, which you can access. Token.dep_ returns the dependency label as a string, and Token.head returns the Token object that is the syntactic head of the current token.

What I do not understand so clearly is that if you want to access the children, that attribute is not listed on the parser page; but if you click through to the "Token" documentation, you see that Token has the property .children, and it even says it requires a dependency parse. I don't know why this is listed on the Token page rather than the parser page - maybe because a method on the Span or Token object is responsible for algorithmically scanning the heads of the tokens to construct the corresponding list of children.
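To make the attributes above concrete, here is a runnable sketch. The sentence, its head indices, and its labels are made up by me for illustration; I build the parse by hand with the Doc constructor so no trained model needs to be downloaded:

```python
import spacy
from spacy.tokens import Doc

# Hand-specify a small dependency parse: heads are absolute token
# indices, deps are the dependency labels assigned to each token.
nlp = spacy.blank("en")
doc = Doc(nlp.vocab,
          words=["She", "said", "he", "left", "and", "she", "stayed"],
          heads=[1, 1, 3, 1, 3, 6, 3],
          deps=["nsubj", "ROOT", "nsubj", "ccomp", "cc", "nsubj", "conj"])

said = doc[1]
print(said.dep_)                                # → ROOT
print(said.head.text)                           # the root is its own head → said
print([child.text for child in said.children])  # → ['She', 'left']
```

With a loaded pipeline like en_core_web_sm you would get the same attributes from `nlp("...")` directly; the hand-built Doc just keeps the example self-contained.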
Anyway, I think it's a little unintuitive for someone new to this. The docs also say .children returns a "sequence" of child tokens; I think they use that terminology for "generator". Sorry if you already know all this - just trying to be thorough and clear with my answer. So it should be this simple:
All you have to do is start from the root of the dependency tree. Again, this attribute is slightly hidden in the page for "Span": https://spacy.io/api/span
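A quick sketch of getting that root via Span.root, again on a hand-built parse (my own toy sentence) so it runs without a model:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = Doc(nlp.vocab,
          words=["She", "said", "he", "left"],
          heads=[1, 1, 3, 1],
          deps=["nsubj", "ROOT", "nsubj", "ccomp"])

sentence = doc[0:4]        # a Span covering the whole sentence
print(sentence.root.text)  # → said
```

With a loaded pipeline you would typically get the same thing from `next(doc.sents).root`.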
You have the root token of the sentence. I guess you basically want to traverse downwards until you find any tokens with the labels listed above. I am not sure whether you want the smallest possible such subtrees or the largest: for example, if you have a conjunction between two clauses that themselves contain conjunctions, do you want to go all the way to the bottom, or stop at the top? As far as I currently know, the most natural way to search a tree is recursion. Starting at the root, you define a simple repeatable rule, which is basically to scan through a token's immediate children: if they meet some condition, return them; if they don't, re-apply the same method to that token's children. I'll stick with the seemingly much simpler case of stopping as soon as you hit the first sought-after dependency label on each branch.
The tricky bit with recursion is that the method has to return the same type it takes as input, and you have to work out how to orchestrate that so it returns the final product you actually want - some kind of iterable of the sought-after tokens. Does that imply it also has to take an iterable as input? We can work with that by passing the root wrapped in a trivial one-element list. The method then takes a list of tokens and, for each one, asks: does it meet the condition (is it a matrix-clause head)? If so, add it to the transfer list the method is building as its return value. If not, call the method again on that token's children, so that the token replaces itself with whichever descendant tokens do meet the criterion - including nothing, if there are none. Lastly, "flatten" the transfer list, because it will contain nested lists, and return it (to the parent call, or as the final result). So that looks like this:
First, a utility method we need for later - it just flattens nested lists, but only lists with one sub-level of nesting:
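The original code was only attached as a PDF, so here is a sketch of the helper as described - it splices in any element that is itself a list and keeps plain elements as they are (the name `flattened_list` is taken from the discussion):

```python
def flattened_list(mixed):
    """Flatten one level of nesting: elements that are lists are
    spliced into the result; everything else is kept as-is."""
    flat = []
    for item in mixed:
        if isinstance(item, list):
            flat.extend(item)
        else:
            flat.append(item)
    return flat

print(flattened_list([1, [2, 3], 4, []]))  # → [1, 2, 3, 4]
```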
This is pretty bloated, cumbersome and inelegant; it's really just a simple way to sub-iterate over sublists while iterating normally over plain elements, so that every element ends up sequentially packed into a single flat list. I wish I knew better ways to do this in Python - it would be nice to return a stream of elements so you don't have to deal with list impracticalities, maybe using variable numbers of arguments with the unpacking operator *, so the method below could return any number of elements without them needing to be in a list. I think it's possible, but I haven't thought about it enough yet. itertools also has chain.from_iterable, which flattens one level out of the box, though it expects every element to be iterable rather than a mix of tokens and lists. As explained above, here is our recursive method, which takes a list of tokens and essentially replaces each token with its nearest descendant matrix tokens, returning a list of tokens back.
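Since the original code is only in the PDF, here is a self-contained sketch of the recursive method as described (the label set, the toy sentence, and its hand-built parse are my own illustrative assumptions; the flatten helper is repeated so this snippet runs on its own):

```python
import spacy
from spacy.tokens import Doc

MATRIX_DEPS = {"ccomp", "xcomp", "conj"}

def flattened_list(mixed):
    # One-level flatten, as described above (repeated for self-containment).
    flat = []
    for item in mixed:
        if isinstance(item, list):
            flat.extend(item)
        else:
            flat.append(item)
    return flat

def find_matrix_tokens(tokens):
    """Replace each token by the nearest descendants whose dependency
    label is in MATRIX_DEPS, stopping at the first match on each branch."""
    transfer = []
    for token in tokens:
        if token.dep_ in MATRIX_DEPS:
            transfer.append(token)          # condition met: keep this head
        else:
            # Not a matrix head: recurse into its children (a generator
            # is fine here - we only need an iterable).
            transfer.append(find_matrix_tokens(token.children))
    return flattened_list(transfer)

# Hand-built parse so no model download is needed.
nlp = spacy.blank("en")
doc = Doc(nlp.vocab,
          words=["She", "said", "he", "left", "and", "she", "stayed"],
          heads=[1, 1, 3, 1, 3, 6, 3],
          deps=["nsubj", "ROOT", "nsubj", "ccomp", "cc", "nsubj", "conj"])

matrix_heads = find_matrix_tokens([doc[1]])  # start from the root, wrapped in a list
print([t.text for t in matrix_heads])        # → ['left']
```

Note that "stayed" (conj) is not returned: it sits below "left" (ccomp), and the sketch stops at the first matching label on each branch, per the "stop at the top" choice above.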
token.children returns a generator, not a list, but because Python is dynamically typed this is fine - we just need an iterable to pass to the method. Consider the lowest-level case, at the bottom of the recursion: the "else" branch will not have fired, because by definition there are no more matrix clauses beneath that point in the tree. Either we return only matrix tokens - propagating back upwards through the nested recursive calls, each level passing its list of matrix tokens up to the level waiting on it - or we happen to be at a dead end with no matrix clauses beneath us at all, in which case we should simply return nothing. That shouldn't be a problem: at a terminal node - a bottom-most word with no children - calling .children just returns an empty generator, I believe, so the for loop exits immediately, there being no elements to iterate over, and we return whatever the transfer list looks like at that point, i.e. an empty list.

There is only one last problem: you return a list of matrix tokens in place of the governing token it replaced. You replaced a token with a list of tokens, so you are now building a list of nested lists, which we do not want. All you have to do is "flatten" your list, and the most standard way I know is the flattened_list utility defined above.
If I did this correctly, the outermost call returns a clean list of every token that qualifies as a matrix clause - but note that these are just the heads of the matrix clauses. Hopefully, from here, you can see that you can easily substitute each head with its entire subtree, if you wish, via token.subtree:
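A sketch of that last step, again on a hand-built toy parse (my own example sentence) so it runs standalone - token.subtree yields the token and all its descendants in document order:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = Doc(nlp.vocab,
          words=["She", "said", "he", "left", "and", "she", "stayed"],
          heads=[1, 1, 3, 1, 3, 6, 3],
          deps=["nsubj", "ROOT", "nsubj", "ccomp", "cc", "nsubj", "conj"])

head = doc[3]                     # "left", a ccomp head found by the search
subtree = list(head.subtree)      # the head plus all its descendants, in order
print([t.text for t in subtree])  # → ['he', 'left', 'and', 'she', 'stayed']
```

With the recursive search above, this would just be `[list(h.subtree) for h in matrix_heads]`.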
I would recommend running this in the VS Code debugger with breakpoints to check whether my logic is correct; I might do it later when I find some time. One thing I am still learning is how to add the above as a custom attribute or component in your spaCy pipeline, so that it runs automatically when the Doc object is created and you don't have to write an external method in addition - the list of matrix clauses in any given sentence could then be an automatically available attribute. Sorry for my verbosity; I was thinking aloud, which helps me make progress in my thinking. Take care
-
When providing sample code, please provide an example that we can run in isolation to confirm what's happening. It's best to paste the code as text in your question. If you prefer to work in notebooks, you can supply those as a gist, or by linking to a GitHub repo. Please do not post PDFs of code, which are hard to read and cannot be executed.

As it is, it's very hard to see what's going on, because you use several functions without including their definitions. I can kind of guess what they're doing, but it's not clear.

It looks like the problem is that you are accessing the same token in two different ways and it's not equal. If it's the same token in the same Doc object, then it should be equal to itself. The most important value when checking equality of tokens is not the text - since you can have the same word multiple times in a sentence - but the token's position in the Doc. If you can provide a short code sample so we can inspect what's going on, I'd be happy to take a closer look at this.
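For reference, a minimal check of that equality behavior - the example sentence is made up, and only the blank tokenizer is used, so no trained pipeline is needed:

```python
import spacy

nlp = spacy.blank("en")              # tokenizer only, no model download
doc = nlp("the cat saw the dog")

print(doc[0].text == doc[3].text)    # → True: same surface text, "the"
print(doc[0] == doc[3])              # → False: different positions in the Doc
print(doc[0] == doc[0:2][0])         # → True: same token reached via a Span
```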
No, not that I'm aware of. Maybe it would be easier to use a constituency parser?