Commit 9df53c5

newren authored and gitster committed
Recommend git-filter-repo instead of git-filter-branch
filter-branch suffers from a deluge of disguised dangers that disfigure
history rewrites (i.e. deviate from the deliberate changes). Many of
these problems are unobtrusive and can easily go undiscovered until the
new repository is in use. This can result in problems ranging from an
even messier history than what led folks to filter-branch in the first
place, to data loss or corruption. These issues cannot be backward
compatibly fixed, so add a warning to both filter-branch and its manpage
recommending that another tool (such as filter-repo) be used instead.

Also, update other manpages that referenced filter-branch. Several of
these needed updates even if we could continue recommending
filter-branch, either due to implying that something was unique to
filter-branch when it applied more generally to all history rewriting
tools (e.g. BFG, reposurgeon, fast-import, filter-repo), or because
something about filter-branch was used as an example despite other more
commonly known examples now existing. Reword these sections to fix these
issues and to avoid recommending filter-branch.

Finally, remove the section explaining BFG Repo Cleaner as an
alternative to filter-branch. I feel somewhat bad about this, especially
since I feel like I learned so much from BFG that I put to good use in
filter-repo (which is much more than I can say for filter-branch), but
keeping that section presented a few problems:

  * In order to recommend that people quit using filter-branch, we need
    to provide them a recommendation for something else to use that can
    handle all the same types of rewrites. To my knowledge, filter-repo
    is the only such tool. So it needs to be mentioned.
  * I don't want to give conflicting recommendations to users
  * If we recommend two tools, we shouldn't expect users to learn both
    and pick which one to use; we should explain which problems one can
    solve that the other can't or when one is much faster than the
    other.
  * BFG and filter-repo have similar performance
  * All filtering types that BFG can do, filter-repo can also do. In
    fact, filter-repo comes with a reimplementation of BFG named bfg-ish
    which provides the same user-interface as BFG but with several
    bugfixes and new features that are hard to implement in BFG due to
    its technical underpinnings.

While I could still mention both tools, it seems like I would need to
provide some kind of comparison and I would ultimately just say that
filter-repo can do everything BFG can, so ultimately it seems that it is
just better to remove that section altogether.

Signed-off-by: Elijah Newren <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>
1 parent 7b6ad97 commit 9df53c5

9 files changed, 288 insertions(+), 59 deletions(-)

Documentation/git-fast-export.txt

Lines changed: 3 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -17,9 +17,9 @@ This program dumps the given revisions in a form suitable to be piped
1717
into 'git fast-import'.
1818

1919
You can use it as a human-readable bundle replacement (see
20-
linkgit:git-bundle[1]), or as a kind of an interactive
21-
'git filter-branch'.
22-
20+
linkgit:git-bundle[1]), or as a format that can be edited before being
21+
fed to 'git fast-import' in order to do history rewrites (an ability
22+
relied on by tools like 'git filter-repo').
2323

2424
OPTIONS
2525
-------
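
The reworded paragraph in this hunk describes the 'git fast-export' output as a stream that can be edited before it is fed to 'git fast-import'. A minimal Python sketch of that idea follows. It is an illustration only, not how filter-repo is implemented: the names `rewrite_emails` and `rewrite_repo` are hypothetical, the sketch assumes the stream uses only the counted `data <n>` form, and `rewrite_repo` assumes the destination repository has already been created with `git init`.

```python
import io
import subprocess


def rewrite_emails(stream: bytes, old: bytes, new: bytes) -> bytes:
    """Replace <old> with <new> on identity lines of a fast-export stream."""
    src = io.BytesIO(stream)
    out = io.BytesIO()
    while True:
        line = src.readline()
        if not line:
            break
        if line.startswith(b"data "):
            # Counted payload (blob contents or a message): copy it verbatim
            # so that look-alike lines inside it are never touched.
            size = int(line.split()[1])
            out.write(line)
            out.write(src.read(size))
            continue
        if line.startswith((b"author ", b"committer ", b"tagger ")):
            line = line.replace(b"<" + old + b">", b"<" + new + b">")
        out.write(line)
    return out.getvalue()


def rewrite_repo(src_repo: str, dst_repo: str, old: bytes, new: bytes) -> None:
    """Export src_repo, edit the stream, and import it into dst_repo."""
    exported = subprocess.run(
        ["git", "-C", src_repo, "fast-export", "--all"],
        check=True, stdout=subprocess.PIPE,
    ).stdout
    subprocess.run(
        ["git", "-C", dst_repo, "fast-import"],
        input=rewrite_emails(exported, old, new), check=True,
    )
```

Skipping the counted payloads is the important design choice here: a purely line-based substitution would also rewrite file contents and commit messages that happen to contain a line starting with `author `.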

Documentation/git-filter-branch.txt

Lines changed: 243 additions & 30 deletions
@@ -16,6 +16,19 @@ SYNOPSIS
1616
[--original <namespace>] [-d <directory>] [-f | --force]
1717
[--state-branch <branch>] [--] [<rev-list options>...]
1818

19+
WARNING
20+
-------
21+
'git filter-branch' has a plethora of pitfalls that can produce non-obvious
22+
manglings of the intended history rewrite (and can leave you with little
23+
time to investigate such problems since it has such abysmal performance).
24+
These safety and performance issues cannot be backward compatibly fixed and
25+
as such, its use is not recommended. Please use an alternative history
26+
filtering tool such as https://github.com/newren/git-filter-repo/[git
27+
filter-repo]. If you still need to use 'git filter-branch', please
28+
carefully read <<SAFETY>> (and <<PERFORMANCE>>) to learn about the land
29+
mines of filter-branch, and then vigilantly avoid as many of the hazards
30+
listed there as reasonably possible.
31+
1932
DESCRIPTION
2033
-----------
2134
Lets you rewrite Git revision history by rewriting the branches mentioned
@@ -445,36 +458,236 @@ warned.
445458
(or if your git-gc is not new enough to support arguments to
446459
`--prune`, use `git repack -ad; git prune` instead).
447460

448-
NOTES
449-
-----
450-
451-
git-filter-branch allows you to make complex shell-scripted rewrites
452-
of your Git history, but you probably don't need this flexibility if
453-
you're simply _removing unwanted data_ like large files or passwords.
454-
For those operations you may want to consider
455-
http://rtyley.github.io/bfg-repo-cleaner/[The BFG Repo-Cleaner],
456-
a JVM-based alternative to git-filter-branch, typically at least
457-
10-50x faster for those use-cases, and with quite different
458-
characteristics:
459-
460-
* Any particular version of a file is cleaned exactly _once_. The BFG,
461-
unlike git-filter-branch, does not give you the opportunity to
462-
handle a file differently based on where or when it was committed
463-
within your history. This constraint gives the core performance
464-
benefit of The BFG, and is well-suited to the task of cleansing bad
465-
data - you don't care _where_ the bad data is, you just want it
466-
_gone_.
467-
468-
* By default The BFG takes full advantage of multi-core machines,
469-
cleansing commit file-trees in parallel. git-filter-branch cleans
470-
commits sequentially (i.e. in a single-threaded manner), though it
471-
_is_ possible to write filters that include their own parallelism,
472-
in the scripts executed against each commit.
473-
474-
* The http://rtyley.github.io/bfg-repo-cleaner/#examples[command options]
475-
are much more restrictive than git-filter branch, and dedicated just
476-
to the tasks of removing unwanted data- e.g:
477-
`--strip-blobs-bigger-than 1M`.
461+
[[PERFORMANCE]]
462+
PERFORMANCE
463+
-----------
464+
465+
The performance of git-filter-branch is glacially slow; its design makes it
466+
impossible for a backward-compatible implementation to ever be fast:
467+
468+
* In editing files, git-filter-branch by design checks out each and
469+
every commit as it existed in the original repo. If your repo has 10\^5
470+
files and 10\^5 commits, but each commit only modifies 5 files, then
471+
git-filter-branch will make you do 10\^10 modifications, despite only
472+
having (at most) 5*10^5 unique blobs.
473+
474+
* If you try and cheat and try to make git-filter-branch only work on
475+
files modified in a commit, then two things happen
476+
477+
** you run into problems with deletions whenever the user is simply
478+
trying to rename files (because attempting to delete files that
479+
don't exist looks like a no-op; it takes some chicanery to remap
480+
deletes across file renames when the renames happen via arbitrary
481+
user-provided shell)
482+
483+
** even if you succeed at the map-deletes-for-renames chicanery, you
484+
still technically violate backward compatibility because users are
485+
allowed to filter files in ways that depend upon topology of
486+
commits instead of filtering solely based on file contents or names
487+
(though this has not been observed in the wild).
488+
489+
* Even if you don't need to edit files but only want to e.g. rename or
490+
remove some and thus can avoid checking out each file (i.e. you can use
491+
--index-filter), you still are passing shell snippets for your filters.
492+
This means that for every commit, you have to have a prepared git repo
493+
where those filters can be run. That's a significant setup.
494+
495+
* Further, several additional files are created or updated per commit by
496+
git-filter-branch. Some of these are for supporting the convenience
497+
functions provided by git-filter-branch (such as map()), while others
498+
are for keeping track of internal state (but could have also been
499+
accessed by user filters; one of git-filter-branch's regression tests
500+
does so). This essentially amounts to using the filesystem as an IPC
501+
mechanism between git-filter-branch and the user-provided filters.
502+
Disks tend to be a slow IPC mechanism, and writing these files also
503+
effectively represents a forced synchronization point between separate
504+
processes that we hit with every commit.
505+
506+
* The user-provided shell commands will likely involve a pipeline of
507+
commands, resulting in the creation of many processes per commit.
508+
Creating and running another process takes a widely varying amount of
509+
time between operating systems, but on any platform it is very slow
510+
relative to invoking a function.
511+
512+
* git-filter-branch itself is written in shell, which is kind of slow.
513+
This is the one performance issue that could be backward-compatibly
514+
fixed, but compared to the above problems that are intrinsic to the
515+
design of git-filter-branch, the language of the tool itself is a
516+
relatively minor issue.
517+
518+
** Side note: Unfortunately, people tend to fixate on the
519+
written-in-shell aspect and periodically ask if git-filter-branch
520+
could be rewritten in another language to fix the performance
521+
issues. Not only does that ignore the bigger intrinsic problems
522+
with the design, it'd help less than you'd expect: if
523+
git-filter-branch itself were not shell, then the convenience
524+
functions (map(), skip_commit(), etc) and the `--setup` argument
525+
could no longer be executed once at the beginning of the program
526+
but would instead need to be prepended to every user filter (and
527+
thus re-executed with every commit).
528+
529+
The https://github.com/newren/git-filter-repo/[git filter-repo] tool is
530+
an alternative to git-filter-branch which does not suffer from these
531+
performance problems or the safety problems (mentioned below). For those
532+
with existing tooling which relies upon git-filter-branch, 'git
533+
filter-repo' also provides
534+
https://github.com/newren/git-filter-repo/blob/master/contrib/filter-repo-demos/filter-lamely[filter-lamely],
535+
a drop-in git-filter-branch replacement (with a few caveats). While
536+
filter-lamely suffers from all the same safety issues as
537+
git-filter-branch, it at least ameliorates the performance issues a
538+
little.
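
To make the cost in the first PERFORMANCE bullet concrete, the arithmetic can be spelled out with the figures used there (10^5 files, 10^5 commits, 5 files modified per commit). The variable names below are illustrative only.

```python
files_in_repo = 10**5
commits = 10**5
files_modified_per_commit = 5

# Work done when every file of every commit is checked out and visited.
checkout_work = files_in_repo * commits
# Upper bound on the distinct blobs, as stated in the bullet.
unique_blobs = files_modified_per_commit * commits

print(checkout_work)                  # 10000000000
print(unique_blobs)                   # 500000
print(checkout_work // unique_blobs)  # 20000
```

With these figures, the per-commit checkout approach does 20000 times more file visits than there are unique blobs to process.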
539+
540+
[[SAFETY]]
541+
SAFETY
542+
------
543+
544+
git-filter-branch is riddled with gotchas resulting in various ways to
545+
easily corrupt repos or end up with a mess worse than what you started
546+
with:
547+
548+
* Someone can have a set of "working and tested filters" which they
549+
document or provide to a coworker, who then runs them on a different OS
550+
where the same commands are not working/tested (some examples in the
551+
git-filter-branch manpage are also affected by this). BSD vs. GNU
552+
userland differences can really bite. If lucky, error messages are
553+
spewed. But just as likely, the commands either don't do the filtering
554+
requested, or silently corrupt by making some unwanted change. The
555+
unwanted change may only affect a few commits, so it's not necessarily
556+
obvious either. (The fact that problems won't necessarily be obvious
557+
means they are likely to go unnoticed until the rewritten history is in
558+
use for quite a while, at which point it's really hard to justify
559+
another flag-day for another rewrite.)
560+
561+
* Filenames with spaces are often mishandled by shell snippets since
562+
they cause problems for shell pipelines. Not everyone is familiar with
563+
find -print0, xargs -0, git-ls-files -z, etc. Even people who are
564+
familiar with these may assume such flags are not relevant because
565+
someone else renamed any such files in their repo back before the person
566+
doing the filtering joined the project. And often, even those familiar
567+
with handling arguments with spaces may not do so just because they
568+
aren't in the mindset of thinking about everything that could possibly
569+
go wrong.
570+
571+
* Non-ascii filenames can be silently removed despite being in a desired
572+
directory. Keeping only wanted paths is often done using pipelines like
573+
`git ls-files | grep -v ^WANTED_DIR/ | xargs git rm`. ls-files will
574+
only quote filenames if needed, so folks may not notice that one of the
575+
files didn't match the regex (at least not until it's much too late).
576+
Yes, someone who knows about core.quotePath can avoid this (unless they
577+
have other special characters like \t, \n, or "), and people who use
578+
ls-files -z with something other than grep can avoid this, but that
579+
doesn't mean they will.
580+
581+
* Similarly, when moving files around, one can find that filenames with
582+
non-ascii or special characters end up in a different directory, one
583+
that includes a double quote character. (This is technically the same
584+
issue as above with quoting, but perhaps an interesting different way
585+
that it can and has manifested as a problem.)
586+
587+
* It's far too easy to accidentally mix up old and new history. It's
588+
still possible with any tool, but git-filter-branch almost invites it.
589+
If lucky, the only downside is users getting frustrated that they don't
590+
know how to shrink their repo and remove the old stuff. If unlucky,
591+
they merge old and new history and end up with multiple "copies" of each
592+
commit, some of which have unwanted or sensitive files and others which
593+
don't. This comes about in multiple different ways:
594+
595+
** the default to only doing a partial history rewrite ('--all' is not
596+
the default and few examples show it)
597+
598+
** the fact that there's no automatic post-run cleanup
599+
600+
** the fact that --tag-name-filter (when used to rename tags) doesn't
601+
remove the old tags but just adds new ones with the new name
602+
603+
** the fact that little educational information is provided to inform
604+
users of the ramifications of a rewrite and how to avoid mixing old
605+
and new history. For example, this man page discusses how users
606+
need to understand that they need to rebase their changes for all
607+
their branches on top of new history (or delete and reclone), but
608+
that's only one of multiple concerns to consider. See the
609+
"DISCUSSION" section of the git filter-repo manual page for more
610+
details.
611+
612+
* Annotated tags can be accidentally converted to lightweight tags, due
613+
to either of two issues:
614+
615+
** Someone can do a history rewrite, realize they messed up, restore
616+
from the backups in refs/original/, and then redo their
617+
git-filter-branch command. (The backup in refs/original/ is not a
618+
real backup; it dereferences tags first.)
619+
620+
** Running git-filter-branch with either --tags or --all in your
621+
<rev-list options>. In order to retain annotated tags as
622+
annotated, you must use --tag-name-filter (and must not have
623+
restored from refs/original/ in a previously botched rewrite).
624+
625+
* Any commit messages that specify an encoding will become corrupted
626+
by the rewrite; git-filter-branch ignores the encoding, takes the original
627+
bytes, and feeds it to commit-tree without telling it the proper
628+
encoding. (This happens whether or not --msg-filter is used.)
629+
630+
* Commit messages (even if they are all UTF-8) by default become
631+
corrupted due to not being updated -- any references to other commit
632+
hashes in commit messages will now refer to no-longer-extant commits.
633+
634+
* There are no facilities for helping users find what unwanted crud they
635+
should delete, which means they are much more likely to have incomplete
636+
or partial cleanups that sometimes result in confusion and people
637+
wasting time trying to understand. (For example, folks tend to just
638+
look for big files to delete instead of big directories or extensions,
639+
and once they do so, then sometime later folks using the new repository
640+
who are going through history will notice a build artifact directory
641+
that has some files but not others, or a cache of dependencies
642+
(node_modules or similar) which couldn't have ever been functional since
643+
it's missing some files.)
644+
645+
* If --prune-empty isn't specified, then the filtering process can
646+
create hordes of confusing empty commits.
647+
648+
* If --prune-empty is specified, then intentionally placed empty
649+
commits from before the filtering operation are also pruned instead of
650+
just pruning commits that became empty due to filtering rules.
651+
652+
* If --prune-empty is specified, sometimes empty commits are missed
653+
and left around anyway (a somewhat rare bug, but it happens...)
654+
655+
* A minor issue, but users who have a goal to update all names and
656+
emails in a repository may be led to --env-filter which will only update
657+
authors and committers, missing taggers.
658+
659+
* If the user provides a --tag-name-filter that maps multiple tags to
660+
the same name, no warning or error is provided; git-filter-branch simply
661+
overwrites each tag in some undocumented pre-defined order resulting in
662+
only one tag at the end. (A git-filter-branch regression test requires
663+
this surprising behavior.)
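
The non-ascii filename hazard in the list above can be reproduced without a repository. The Python sketch below is a simplified model under stated assumptions, not git's actual code: `quote_path` only approximates the default `core.quotePath` behaviour (double quotes around the name, octal escapes for bytes of 0x80 and above), and the three function names are hypothetical. It contrasts a filter on line-oriented output, as in `git ls-files | grep -v ^WANTED_DIR/`, with a filter on the NUL-separated output of `git ls-files -z`.

```python
from typing import List


def quote_path(path: str) -> str:
    """Simplified model of git's default path quoting (an approximation)."""
    raw = path.encode("utf-8")
    escapes = {0x09: "\\t", 0x0A: "\\n", 0x22: '\\"', 0x5C: "\\\\"}
    if not any(b >= 0x80 or b < 0x20 or b in escapes for b in raw):
        return path  # ordinary names are printed as-is
    body = "".join(
        escapes.get(b) or (f"\\{b:03o}" if b >= 0x80 or b < 0x20 else chr(b))
        for b in raw
    )
    return f'"{body}"'


def naive_paths_to_remove(listing: List[str]) -> List[str]:
    """Mimic `grep -v ^WANTED_DIR/` applied to line-oriented output."""
    return [line for line in listing if not line.startswith("WANTED_DIR/")]


def safe_paths_to_remove(nul_separated: bytes) -> List[bytes]:
    """Filter raw, unquoted paths separated by NUL (the `-z` format)."""
    paths = [p for p in nul_separated.split(b"\0") if p]
    return [p for p in paths if not p.startswith(b"WANTED_DIR/")]


files = ["WANTED_DIR/plain.txt", "WANTED_DIR/caf\u00e9.txt", "other/junk.bin"]
listing = [quote_path(f) for f in files]
# The quoted name starts with a double quote, so the regex does not match
# and the wanted file ends up on the removal list.
print(naive_paths_to_remove(listing))
print(safe_paths_to_remove(b"\0".join(f.encode("utf-8") for f in files) + b"\0"))
```

Working on raw bytes split at NUL avoids both the quoting problem and the problems with spaces in filenames, because no character that can appear in a path is used as a separator.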
664+
665+
Also, the poor performance of git-filter-branch often leads to safety
666+
issues:
667+
668+
* Coming up with the correct shell snippet to do the filtering you want
669+
is sometimes difficult unless you're just doing a trivial modification
670+
such as deleting a couple files. Unfortunately, people often learn if
671+
the snippet is right or wrong by trying it out, but the rightness or
672+
wrongness can vary depending on special circumstances (spaces in
673+
filenames, non-ascii filenames, funny author names or emails, invalid
674+
timezones, presence of grafts or replace objects, etc.), meaning they
675+
may have to wait a long time, hit an error, then restart. The
676+
performance of git-filter-branch is so bad that this cycle is painful,
677+
reducing the time available to carefully re-check (to say nothing about
678+
what it does to the patience of the person doing the rewrite even if
679+
they do technically have more time available). This problem is extra
680+
compounded because errors from broken filters may not be shown for a
681+
long time and/or get lost in a sea of output. Even worse, broken
682+
filters often just result in silent incorrect rewrites.
683+
684+
* To top it all off, even when users finally find working commands, they
685+
naturally want to share them. But they may be unaware that their repo
686+
didn't have some special cases that someone else's does. So, when
687+
someone else with a different repository runs the same commands, they
688+
get hit by the problems above. Or, the user just runs commands that
689+
really were vetted for special cases, but they run it on a different OS
690+
where it doesn't work, as noted above.
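
Because git-filter-branch gives no warning when a --tag-name-filter maps several tags to the same name (one of the hazards listed in the SAFETY section), a rename rule can be checked for collisions before it is used. This is a minimal sketch: `find_tag_collisions` is a hypothetical helper, and the rename rule is a Python callable standing in for the shell filter.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def find_tag_collisions(
    tags: Iterable[str], rename: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Return {new_name: [old names]} for new names produced by several tags."""
    targets = defaultdict(list)
    for tag in tags:
        targets[rename(tag)].append(tag)
    return {new: olds for new, olds in targets.items() if len(olds) > 1}


# Hypothetical rule: turn "release-X" into "vX".
collisions = find_tag_collisions(
    ["v1.0", "release-1.0", "v2.0"],
    lambda tag: tag.replace("release-", "v"),
)
print(collisions)  # {'v1.0': ['v1.0', 'release-1.0']}
```

An empty result means every tag keeps a distinct name under the rule; a non-empty one shows which tags would silently overwrite each other.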
478691

479692
GIT
480693
---

Documentation/git-gc.txt

Lines changed: 8 additions & 9 deletions
@@ -115,15 +115,14 @@ NOTES
115115
-----
116116

117117
'git gc' tries very hard not to delete objects that are referenced
118-
anywhere in your repository. In
119-
particular, it will keep not only objects referenced by your current set
120-
of branches and tags, but also objects referenced by the index,
121-
remote-tracking branches, refs saved by 'git filter-branch' in
122-
refs/original/, reflogs (which may reference commits in branches
123-
that were later amended or rewound), and anything else in the refs/* namespace.
124-
If you are expecting some objects to be deleted and they aren't, check
125-
all of those locations and decide whether it makes sense in your case to
126-
remove those references.
118+
anywhere in your repository. In particular, it will keep not only
119+
objects referenced by your current set of branches and tags, but also
120+
objects referenced by the index, remote-tracking branches, notes saved
121+
by 'git notes' under refs/notes/, reflogs (which may reference commits
122+
in branches that were later amended or rewound), and anything else in
123+
the refs/* namespace. If you are expecting some objects to be deleted
124+
and they aren't, check all of those locations and decide whether it
125+
makes sense in your case to remove those references.
127126

128127
On the other hand, when 'git gc' runs concurrently with another process,
129128
there is a risk of it deleting an object that the other process is using
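
When objects that were expected to be deleted survive 'git gc', the NOTES paragraph in this hunk suggests checking every location that can still reference them. The Python sketch below groups ref names by the namespaces named there. It is a hedged illustration: `list_refs` and `group_refs_by_namespace` are hypothetical helpers, and the listing covers refs only, so the index and reflogs still need to be checked separately.

```python
import subprocess
from typing import Dict, Iterable, List

KNOWN_NAMESPACES = ("refs/heads/", "refs/tags/", "refs/remotes/", "refs/notes/")


def list_refs(repo: str) -> List[str]:
    """Return all ref names in repo, one per line of `git for-each-ref`."""
    out = subprocess.run(
        ["git", "-C", repo, "for-each-ref", "--format=%(refname)"],
        check=True, stdout=subprocess.PIPE, text=True,
    ).stdout
    return out.splitlines()


def group_refs_by_namespace(refnames: Iterable[str]) -> Dict[str, List[str]]:
    """Group refs by known namespace; everything else goes under 'other refs/*'."""
    groups: Dict[str, List[str]] = {}
    for name in refnames:
        prefix = next(
            (p for p in KNOWN_NAMESPACES if name.startswith(p)), "other refs/*"
        )
        groups.setdefault(prefix, []).append(name)
    return groups
```

Calling `group_refs_by_namespace(list_refs("."))` in a repository gives a quick overview; leftovers such as refs/original/ entries show up under "other refs/*".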

Documentation/git-rebase.txt

Lines changed: 2 additions & 1 deletion
@@ -832,7 +832,8 @@ Hard case: The changes are not the same.::
832832
This happens if the 'subsystem' rebase had conflicts, or used
833833
`--interactive` to omit, edit, squash, or fixup commits; or
834834
if the upstream used one of `commit --amend`, `reset`, or
835-
`filter-branch`.
835+
a full history rewriting command like
836+
https://github.com/newren/git-filter-repo[`filter-repo`].
836837

837838

838839
The easy case

Documentation/git-replace.txt

Lines changed: 5 additions & 5 deletions
@@ -123,10 +123,10 @@ The following format are available:
123123
CREATING REPLACEMENT OBJECTS
124124
----------------------------
125125

126-
linkgit:git-filter-branch[1], linkgit:git-hash-object[1] and
127-
linkgit:git-rebase[1], among other git commands, can be used to create
128-
replacement objects from existing objects. The `--edit` option can
129-
also be used with 'git replace' to create a replacement object by
126+
linkgit:git-hash-object[1], linkgit:git-rebase[1], and
127+
https://github.com/newren/git-filter-repo[git-filter-repo], among other git commands, can be used to
128+
create replacement objects from existing objects. The `--edit` option
129+
can also be used with 'git replace' to create a replacement object by
130130
editing an existing object.
131131

132132
If you want to replace many blobs, trees or commits that are part of a
@@ -148,13 +148,13 @@ pending objects.
148148
SEE ALSO
149149
--------
150150
linkgit:git-hash-object[1]
151-
linkgit:git-filter-branch[1]
152151
linkgit:git-rebase[1]
153152
linkgit:git-tag[1]
154153
linkgit:git-branch[1]
155154
linkgit:git-commit[1]
156155
linkgit:git-var[1]
157156
linkgit:git[1]
157+
https://github.com/newren/git-filter-repo[git-filter-repo]
158158

159159
GIT
160160
---

Documentation/git-svn.txt

Lines changed: 5 additions & 5 deletions
@@ -769,11 +769,11 @@ option for (hopefully) obvious reasons.
769769
+
770770
This option is NOT recommended as it makes it difficult to track down
771771
old references to SVN revision numbers in existing documentation, bug
772-
reports and archives. If you plan to eventually migrate from SVN to Git
773-
and are certain about dropping SVN history, consider
774-
linkgit:git-filter-branch[1] instead. filter-branch also allows
775-
reformatting of metadata for ease-of-reading and rewriting authorship
776-
info for non-"svn.authorsFile" users.
772+
reports, and archives. If you plan to eventually migrate from SVN to
773+
Git and are certain about dropping SVN history, consider
774+
https://github.com/newren/git-filter-repo[git-filter-repo] instead.
775+
filter-repo also allows reformatting of metadata for ease-of-reading
776+
and rewriting authorship info for non-"svn.authorsFile" users.
777777

778778
svn.useSvmProps::
779779
svn-remote.<name>.useSvmProps::

Documentation/githooks.txt

Lines changed: 6 additions & 4 deletions
@@ -425,10 +425,12 @@ post-rewrite
425425

426426
This hook is invoked by commands that rewrite commits
427427
(linkgit:git-commit[1] when called with `--amend` and
428-
linkgit:git-rebase[1]; currently `git filter-branch` does 'not' call
429-
it!). Its first argument denotes the command it was invoked by:
430-
currently one of `amend` or `rebase`. Further command-dependent
431-
arguments may be passed in the future.
428+
linkgit:git-rebase[1]; however, full-history (re)writing tools like
429+
linkgit:git-fast-import[1] or
430+
https://github.com/newren/git-filter-repo[git-filter-repo] typically
431+
do not call it!). Its first argument denotes the command it was
432+
invoked by: currently one of `amend` or `rebase`. Further
433+
command-dependent arguments may be passed in the future.
432434

433435
The hook receives a list of the rewritten commits on stdin, in the
434436
format
