Feature request: UAST diffing #425
Description
Hey guys!
So for starters, I just wanted to point out that I am not sure if this is the proper repo to ask this, or if it should go in the SDK one. I also want to say I am not taking this lightly: I know it's not a priority, and I do not expect resources to be allocated to it in the near future. That being said, before I get to the question, here is some context:
Currently, we do not do much analysis or learning that would require this feature. In fact, when working on time series of files, we have up until now either:
- not required the structural information provided only by Babelfish, and relied directly on `git` functionalities (`blame`, `log`, ...), for instance as implemented in src-d/hercules
- parsed each blob of interest, extracted features from each UAST, then done manual diffing, either for data compression purposes or in some cases for better features, for instance as implemented in src-d/tm-experiments
Obviously, both of those methods are flawed: the first cannot really be used for deep analysis, while the second does not scale well and involves reprocessing the same data multiple times - most notably when diffing two subsequent commits with only a couple of lines added or removed.
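To make the second workflow's inefficiency concrete, here is a minimal sketch of it. The `extract_features` helper is hypothetical (a stand-in for a real UAST walk - here a "version" is just a list of tokens); the point is that every version is fully reprocessed before any diffing happens:

```python
from typing import List, Set, Tuple

def extract_features(version_tokens: List[str]) -> Set[str]:
    # Stand-in for parsing a blob and walking its UAST;
    # in reality this is the expensive step.
    return set(version_tokens)

def manual_diff(versions: List[List[str]]) -> List[Tuple[Set[str], Set[str]]]:
    """Return (added, removed) feature sets between consecutive versions.

    Every version is processed from scratch, even if only a couple of
    lines changed between two commits.
    """
    feats = [extract_features(v) for v in versions]  # full reprocessing
    return [(b - a, a - b) for a, b in zip(feats, feats[1:])]

diffs = manual_diff([["a", "b"], ["a", "b", "c"], ["a", "c"]])
```

Here `diffs[0]` reports `"c"` as added and nothing removed, and `diffs[1]` reports `"b"` as removed - but to get there, all three versions were parsed in full.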
What I would like to be implemented is essentially a way to parse multiple versions of a file at once, which would avoid this manual work and time loss, as well as enable diffing. The way I see it, given a series of versions of a file, I would be returned something like an augmented UAST, where each node would have an additional attribute indicating the versions in which the node appears. That way, either through implemented methods or manual parsing, we would be able to get the UAST corresponding to each version of the file from that single augmented UAST, as well as diff between any two versions, hopefully with far less compute time.
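To illustrate what I mean by "augmented UAST", here is a toy sketch. The `Node` type and the merge-by-(type, token) matching are purely illustrative assumptions, not the real Babelfish node layout or a proposed matching algorithm; they just show the version-attribute idea and how a single version's tree can be projected back out:

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Node:
    internal_type: str
    token: str = ""
    children: List["Node"] = field(default_factory=list)
    versions: Set[int] = field(default_factory=set)  # versions containing this node

def merge(augmented: Node, tree: Node, version: int) -> None:
    """Fold one version's tree into the augmented tree, in place."""
    augmented.versions.add(version)
    index = {(c.internal_type, c.token): c for c in augmented.children}
    for child in tree.children:
        key = (child.internal_type, child.token)
        if key in index:
            merge(index[key], child, version)
        else:
            copy = Node(child.internal_type, child.token)
            augmented.children.append(copy)
            merge(copy, child, version)

def project(augmented: Node, version: int) -> Node:
    """Recover the plain UAST of a single version from the augmented tree."""
    return Node(
        augmented.internal_type,
        augmented.token,
        [project(c, version) for c in augmented.children if version in c.versions],
    )

v1 = Node("File", children=[Node("Func", "foo"), Node("Func", "bar")])
v2 = Node("File", children=[Node("Func", "foo"), Node("Func", "baz")])
aug = Node("File")
merge(aug, v1, 1)
merge(aug, v2, 2)
```

After merging, `foo` carries versions `{1, 2}` while `bar` and `baz` each carry a single version, and `project(aug, 1)` reconstructs version 1 exactly. Diffing two versions then reduces to comparing version sets on one tree instead of re-parsing both files.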
My questions are:
- Is this a dumb idea with no value, and should we keep parsing each file separately?
- Is this a dumb idea which would result in huge UASTs or longer processing times than parsing each file separately?
  - If that is the case, have you already thought of alternative ways of getting the same kind of result, and could you detail them?
  - If that is not the case, how hard/time-consuming would it be to implement, and would you be ready to do it at some point - be it even with restrictions on the number of versions, file size, etc.?