This repository was archived by the owner on Mar 8, 2020. It is now read-only.

Feature request: UAST diffing #425

@r0mainK

Hey guys!

So for starters, I just wanted to point out that I am not sure whether this is the proper repo for this request, or whether it belongs in the SDK one. I also want to say that I am not taking this lightly: I know it's not a priority, and I do not expect resources to be allocated to it in the near future. That being said, before I get to the question, here is some context:

Currently, we do not do much analysis or learning that would require this feature. In fact, when working on time series of files, we have up until now either:

  • not required the structural information that only Babelfish provides, and relied directly on git functionality (blame, log, ...), as implemented for instance in src-d/hercules
  • parsed each blob of interest, extracted features from each UAST, then done manual diffing, either for data compression purposes or in some cases for better features, as implemented for instance in src-d/tm-experiments

Obviously, both of those methods are flawed: the first cannot really be used for deep analysis, while the second does not scale well and involves reprocessing the same data multiple times, most notably when diffing two subsequent commits with only a couple of lines added or removed.

What I would like to see implemented is essentially a way to parse multiple versions of a file at once, which would avoid this manual work and time loss, as well as enable diffing. The way I see it, given a series of versions of a file, I would like to get back something like an augmented UAST, where each node has an additional attribute indicating the versions in which the node appears. That way, either through implemented methods or manual parsing, we would be able to recover the UAST corresponding to each version of the file from that single augmented UAST, as well as diff between any two versions, hopefully with far less compute time.
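
To make the idea concrete, here is a rough sketch of what such an augmented tree could look like. Everything in it is an assumption made for illustration: `Node` is a toy stand-in for a UAST node (not the Babelfish type), `AugNode`, `merge`, `build`, `project` and `diff` are hypothetical names, and children are matched greedily on type and token, which is far simpler than a real tree-diffing algorithm.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for a UAST node (hypothetical, not the Babelfish type)."""
    type: str
    token: str = ""
    children: list[Node] = field(default_factory=list)


@dataclass
class AugNode:
    """Node of the augmented tree: records the versions it appears in."""
    type: str
    token: str
    versions: set[int] = field(default_factory=set)
    children: list[AugNode] = field(default_factory=list)


def merge(aug: AugNode, node: Node, version: int) -> None:
    """Overlay the tree of one version onto the augmented tree."""
    aug.versions.add(version)
    pos = 0  # cursor that keeps each version's children in their original order
    for child in node.children:
        # Greedy match: first augmented child at or after the cursor
        # with the same type and token.
        idx = next(
            (i for i in range(pos, len(aug.children))
             if (aug.children[i].type, aug.children[i].token)
             == (child.type, child.token)),
            None,
        )
        if idx is None:
            aug.children.insert(pos, AugNode(child.type, child.token))
            idx = pos
        merge(aug.children[idx], child, version)
        pos = idx + 1


def build(trees: list[Node]) -> AugNode:
    """Merge all versions; assumes every version has the same kind of root."""
    root = AugNode(trees[0].type, trees[0].token)
    for version, tree in enumerate(trees):
        merge(root, tree, version)
    return root


def project(aug: AugNode, version: int) -> Node:
    """Recover the plain tree of a single version."""
    kept = [project(c, version) for c in aug.children if version in c.versions]
    return Node(aug.type, aug.token, kept)


def diff(aug: AugNode, old: int, new: int, path: str = "") -> list[tuple[str, str]]:
    """List subtrees added or removed between two versions."""
    here = f"{path}/{aug.type}" + (f"[{aug.token}]" if aug.token else "")
    in_old, in_new = old in aug.versions, new in aug.versions
    if in_old and not in_new:
        return [("removed", here)]
    if in_new and not in_old:
        return [("added", here)]
    if not in_old:
        return []  # node belongs to neither version
    changes: list[tuple[str, str]] = []
    for child in aug.children:
        changes += diff(child, old, new, here)
    return changes


# Two versions of a file; the second adds a call inside `foo`.
v0 = Node("File", children=[Node("Func", "foo", [Node("Return")]),
                            Node("Func", "bar")])
v1 = Node("File", children=[Node("Func", "foo", [Node("Call", "log"),
                                                 Node("Return")]),
                            Node("Func", "bar")])
aug = build([v0, v1])
print(diff(aug, 0, 1))  # [('added', '/File/Func[foo]/Call[log]')]
```

In this toy version the unchanged `bar` function and the `Return` node are stored once and tagged with both versions, which is where the hoped-for savings would come from. It still takes one fully parsed tree per version as input, so it only illustrates the output format, not the incremental parsing itself.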

My questions are:

  1. Is this a dumb idea with no value, and should we keep parsing each file separately?
  2. Is this a dumb idea which would result in huge UASTs or longer processing times than parsing each file separately?
    • If that is the case, have you already thought of alternative ways of getting the same kind of result, and could you detail them?
  3. If that is not the case, how hard/time-consuming would it be to implement, and would you be ready to do it at some point, even with restrictions on the number of versions, file size, etc.?
