Revisit how we determine files-changed #808

@ahal

Problem

Taskgraph currently runs some extra logic to figure out a proper base revision, which we then use to compute files-changed:

parameters["base_rev"] = _determine_more_accurate_base_rev(

This logic works around some deficiencies in GitHub events:

  1. For force pushes, event.before is the newly orphaned head rather than the actual base revision.
  2. For pushes to new branches, event.before is the null revision.
  3. For pull requests, the pull_request.base.sha property contains the revision that the base reference currently points to, not the revision that the PR is based on.

So instead, we essentially use git merge-base <default branch> <head rev> to find the common ancestor of both, and set that as the base revision, which then gets used to compute files-changed.
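For illustration, here is a minimal sketch of the merge-base approach described above. It is not Taskgraph's actual implementation: `run_git` and `files_changed_since_merge_base` are hypothetical names, and it assumes a full (non-shallow) clone where both revisions are present locally.

```python
import subprocess


def run_git(args, cwd):
    """Run a git command in `cwd` and return its stripped stdout."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def files_changed_since_merge_base(repo, default_branch, head_rev, git=run_git):
    """Return (base_rev, files) using the common ancestor as the base.

    `git` is injectable so the logic can be exercised without a real repo.
    """
    # Common ancestor of the default branch and the pushed head.
    base_rev = git(["merge-base", default_branch, head_rev], repo)
    # Paths that differ between that ancestor and the head.
    output = git(["diff", "--name-only", base_rev, head_rev], repo)
    return base_rev, output.splitlines()
```

Something like `files_changed_since_merge_base(".", "origin/main", "HEAD")` would then return the base revision together with the list of changed paths.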

The problem is that now we're starting to use shallow clones, which means that merge-base doesn't work without progressively deepening the repository until we find the ancestor (negating the benefit of shallow clones).
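To make the cost concrete, progressive deepening could look roughly like the sketch below. The function name, the `step` and `max_rounds` parameters, and the `origin` remote are assumptions for illustration. It also assumes that a failing `git merge-base` means the common ancestor has not been fetched yet, which is a simplification.

```python
import subprocess


def run_git(args, cwd):
    """Run a git command in `cwd` and return its stripped stdout."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def merge_base_with_deepening(
    repo, default_branch, head_rev, step=50, max_rounds=10, git=run_git
):
    """Return (base_rev, rounds), or (None, max_rounds) if nothing was found."""
    for rounds in range(max_rounds + 1):
        try:
            return git(["merge-base", default_branch, head_rev], repo), rounds
        except subprocess.CalledProcessError:
            # No common ancestor in the history fetched so far.
            if rounds == max_rounds:
                break
            # Pull in `step` more commits of history and try again.
            git(["fetch", f"--deepen={step}", "origin"], repo)
    return None, max_rounds
```

Every extra round is another network fetch, so a PR based on an old commit can end up downloading most of the history anyway.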

So a new solution is needed!

Possible Solutions

Simply stop using git merge-base

In this solution, we would simply run git diff <base> <head> to get the files modified between the two trees. Here we would be consciously making a tradeoff of sometimes having inaccurate files-changed for simplicity and clone performance. The above scenarios would roughly shake out like:

  1. files-changed would include files that were modified by orphaned commits under <base>, even if <head> didn't touch them. This would cause us to run more tasks than necessary.
  2. We'd probably have to special-case this, maybe only looking at files touched by <head>.
  3. Pull requests would have files-changed derived from all commits on the base branch that the PR hasn't been rebased on top of yet, potentially running many more tasks than expected.
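A rough sketch of this option, including one possible special case for the null revision in scenario 2. The helper names are hypothetical, and using `git diff-tree` for the new-branch case is just one way to list the files touched by <head>. By default it reports nothing for merge commits, so that case would need more thought.

```python
import subprocess


def files_changed_command(base_rev, head_rev):
    """Build the git command that lists files changed between base and head."""
    if not base_rev or set(base_rev) == {"0"}:
        # New branch: event.before is the null revision, so there is no
        # usable base. Only list the files touched by the head commit itself.
        return [
            "git", "diff-tree", "--no-commit-id", "--name-only", "-r",
            "--root", head_rev,
        ]
    # Plain tree-to-tree diff; no ancestry lookup, so it works in shallow clones
    # as long as both revisions have been fetched.
    return ["git", "diff", "--name-only", base_rev, head_rev]


def files_changed(repo, base_rev, head_rev):
    """Run the command in `repo` and return the changed paths."""
    result = subprocess.run(
        files_changed_command(base_rev, head_rev),
        cwd=repo, check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()
```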

Use the GitHub API

It should be possible to get the files modified from a PR or force push using the GitHub API. This would be fast and simple to implement, though we'd need to start worrying about rate limits and tokens, and it isn't portable if we ever want to support non-GitHub repos.

In this case, we'd likely still want a fallback to the merge-base solution for non-GitHub repos or when hitting rate limits.
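As a sketch of what the API route with a fallback might look like for pull requests: this uses GitHub's REST endpoint for listing pull request files (`GET /repos/{owner}/{repo}/pulls/{number}/files`). The function names, the `fallback` callable and the injectable `fetch` are assumptions for illustration, and force pushes would need the compare endpoint instead.

```python
import json
import urllib.error
import urllib.request

API = "https://api.github.com"


def filenames_from_payload(payload):
    """Extract paths from a pulls/{n}/files (list) or compare (dict) response."""
    entries = payload["files"] if isinstance(payload, dict) else payload
    return [entry["filename"] for entry in entries]


def fetch_json(url, token=None):
    """GET `url` and decode the JSON body, optionally sending a token."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def pr_files_changed(owner, repo, number, token=None, fallback=None, fetch=fetch_json):
    """List files changed by a PR, calling `fallback()` if the API fails."""
    files, page = [], 1
    try:
        while True:
            url = (
                f"{API}/repos/{owner}/{repo}/pulls/{number}/files"
                f"?per_page=100&page={page}"
            )
            batch = filenames_from_payload(fetch(url, token))
            files.extend(batch)
            if len(batch) < 100:  # a short page means this was the last one
                return files
            page += 1
    except urllib.error.URLError:
        # HTTPError (e.g. 403/429 when rate limited) is a subclass of URLError.
        if fallback is None:
            raise
        return fallback()
```

The `fallback` here could be the merge-base computation, so non-GitHub repos and rate-limited runs still get a files-changed list.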
