Commit 0f9e62e

Merge branch 'jk/pack-bitmap'

Borrow the bitmap index into packfiles from JGit to speed up
enumeration of objects involved in a commit range without having to
fully traverse the history.

* jk/pack-bitmap: (26 commits)
  ewah: unconditionally ntohll ewah data
  ewah: support platforms that require aligned reads
  read-cache: use get_be32 instead of hand-rolled ntoh_l
  block-sha1: factor out get_be and put_be wrappers
  do not discard revindex when re-preparing packfiles
  pack-bitmap: implement optional name_hash cache
  t/perf: add tests for pack bitmaps
  t: add basic bitmap functionality tests
  count-objects: recognize .bitmap in garbage-checking
  repack: consider bitmaps when performing repacks
  repack: handle optional files created by pack-objects
  repack: turn exts array into array-of-struct
  repack: stop using magic number for ARRAY_SIZE(exts)
  pack-objects: implement bitmap writing
  rev-list: add bitmap mode to speed up object lists
  pack-objects: use bitmaps when packing objects
  pack-objects: split add_object_entry
  pack-bitmap: add support for bitmap indexes
  documentation: add documentation for the bitmap format
  ewah: compressed bitmap implementation
  ...
2 parents 6784fab + 6b5b3a2 commit 0f9e62e

33 files changed, +4736 -276 lines changed

Documentation/config.txt

Lines changed: 25 additions & 0 deletions
@@ -1870,6 +1870,31 @@ pack.packSizeLimit::
 	Common unit suffixes of 'k', 'm', or 'g' are
 	supported.
 
+pack.useBitmaps::
+	When true, git will use pack bitmaps (if available) when packing
+	to stdout (e.g., during the server side of a fetch). Defaults to
+	true. You should not generally need to turn this off unless
+	you are debugging pack bitmaps.
+
+pack.writebitmaps::
+	When true, git will write a bitmap index when packing all
+	objects to disk (e.g., when `git repack -a` is run). This
+	index can speed up the "counting objects" phase of subsequent
+	packs created for clones and fetches, at the cost of some disk
+	space and extra time spent on the initial repack. Defaults to
+	false.
+
+pack.writeBitmapHashCache::
+	When true, git will include a "hash cache" section in the bitmap
+	index (if one is written). This cache can be used to feed git's
+	delta heuristics, potentially leading to better deltas between
+	bitmapped and non-bitmapped objects (e.g., when serving a fetch
+	between an older, bitmapped pack and objects that have been
+	pushed since the last gc). The downside is that it consumes 4
+	bytes per object of disk space, and that JGit's bitmap
+	implementation does not understand it, causing it to complain if
+	Git and JGit are used on the same repository. Defaults to false.
+
 pager.<cmd>::
 	If the value is boolean, turns on or off pagination of the
 	output of a particular Git subcommand when writing to a tty.

Documentation/git-repack.txt

Lines changed: 8 additions & 1 deletion
@@ -9,7 +9,7 @@ git-repack - Pack unpacked objects in a repository
 SYNOPSIS
 --------
 [verse]
-'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [--window=<n>] [--depth=<n>]
+'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [--window=<n>] [--depth=<n>]
 
 DESCRIPTION
 -----------
@@ -110,6 +110,13 @@ other objects in that pack they already have locally.
 	The default is unlimited, unless the config variable
 	`pack.packSizeLimit` is set.
 
+-b::
+--write-bitmap-index::
+	Write a reachability bitmap index as part of the repack. This
+	only makes sense when used with `-a` or `-A`, as the bitmaps
+	must be able to refer to all reachable objects. This option
+	overrides the setting of `pack.writebitmaps`.
+
 
 Configuration
 -------------

Documentation/git-rev-list.txt

Lines changed: 1 addition & 0 deletions
@@ -55,6 +55,7 @@ SYNOPSIS
 	[ \--reverse ]
 	[ \--walk-reflogs ]
 	[ \--no-walk ] [ \--do-walk ]
+	[ \--use-bitmap-index ]
 	<commit>... [ \-- <paths>... ]
 
 DESCRIPTION

Documentation/rev-list-options.txt

Lines changed: 8 additions & 0 deletions
@@ -257,6 +257,14 @@ See also linkgit:git-reflog[1].
 	Output excluded boundary commits. Boundary commits are
 	prefixed with `-`.
 
+ifdef::git-rev-list[]
+--use-bitmap-index::
+
+	Try to speed up the traversal using the pack bitmap index (if
+	one is available). Note that when traversing with `--objects`,
+	trees and blobs will not have their associated path printed.
+endif::git-rev-list[]
+
 --
 
 History Simplification
Lines changed: 164 additions & 0 deletions
@@ -0,0 +1,164 @@
+GIT bitmap v1 format
+====================
+
+	- A header appears at the beginning:
+
+		4-byte signature: {'B', 'I', 'T', 'M'}
+
+		2-byte version number (network byte order)
+			The current implementation only supports version 1
+			of the bitmap index (the same one as JGit).
+
+		2-byte flags (network byte order)
+
+			The following flags are supported:
+
+			- BITMAP_OPT_FULL_DAG (0x1) REQUIRED
+			This flag must always be present. It implies that the bitmap
+			index has been generated for a packfile with full closure
+			(i.e. where every single object in the packfile can find
+			its parent links inside the same packfile). This is a
+			requirement for the bitmap index format, also present in JGit,
+			that greatly reduces the complexity of the implementation.
+
+			- BITMAP_OPT_HASH_CACHE (0x4)
+			If present, the end of the bitmap file contains
+			`N` 32-bit name-hash values, one per object in the
+			pack. The format and meaning of the name-hash is
+			described below.
+
+		4-byte entry count (network byte order)
+
+			The total count of entries (bitmapped commits) in this bitmap index.
+
+		20-byte checksum
+
+			The SHA1 checksum of the pack this bitmap index belongs to.
+
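As a rough illustration of the header layout described above, the following C sketch reads the fields in the order listed (4 + 2 + 2 + 4 + 20 = 32 bytes). It is not the code added by this merge: `bitmap_header_sketch`, `parse_bitmap_header`, `read_be16` and `read_be32` are hypothetical names chosen for the example, and the only facts used are the field sizes, the network byte order, the `BITM` signature, version 1, and the two flag values from the text.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BITMAP_OPT_FULL_DAG   0x1
#define BITMAP_OPT_HASH_CACHE 0x4

/* Hypothetical in-memory view of the 32-byte header (names are assumptions). */
struct bitmap_header_sketch {
	uint16_t version;
	uint16_t flags;
	uint32_t entry_count;
	unsigned char pack_checksum[20];
};

/* Read big-endian (network order) integers one byte at a time. */
static uint16_t read_be16(const unsigned char *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Returns 0 on success, -1 if the buffer is not a usable v1 header. */
static int parse_bitmap_header(const unsigned char *buf, size_t len,
			       struct bitmap_header_sketch *out)
{
	if (len < 32 || memcmp(buf, "BITM", 4) != 0)
		return -1;
	out->version = read_be16(buf + 4);
	out->flags = read_be16(buf + 6);
	out->entry_count = read_be32(buf + 8);
	memcpy(out->pack_checksum, buf + 12, 20);
	if (out->version != 1)
		return -1;	/* only version 1 is described */
	if (!(out->flags & BITMAP_OPT_FULL_DAG))
		return -1;	/* the text marks this flag as REQUIRED */
	return 0;
}
```

A caller would typically go on to compare `pack_checksum` with the checksum of the pack it is about to use, and test `flags & BITMAP_OPT_HASH_CACHE` before looking for the name-hash section.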
+	- 4 EWAH bitmaps that act as type indexes
+
+		Type indexes are serialized after the hash cache in the shape
+		of four EWAH bitmaps stored consecutively (see Appendix A for
+		the serialization format of an EWAH bitmap).
+
+		There is a bitmap for each Git object type, stored in the following
+		order:
+
+			- Commits
+			- Trees
+			- Blobs
+			- Tags
+
+		In each bitmap, the `n`th bit is set to true if the `n`th object
+		in the packfile is of that type.
+
+		The obvious consequence is that the OR of all 4 bitmaps will result
+		in a full set (all bits set), and the AND of all 4 bitmaps will
+		result in an empty bitmap (no bits set).
+
+	- N entries with compressed bitmaps, one for each indexed commit
+
+		Where `N` is the total amount of entries in this bitmap index.
+		Each entry contains the following:
+
+		- 4-byte object position (network byte order)
+			The position **in the index for the packfile** where the
+			bitmap for this commit is found.
+
+		- 1-byte XOR-offset
+			The xor offset used to compress this bitmap. For an entry
+			in position `x`, a XOR offset of `y` means that the actual
+			bitmap representing this commit is composed by XORing the
+			bitmap for this entry with the bitmap in entry `x-y` (i.e.
+			the bitmap `y` entries before this one).
+
+			Note that this compression can be recursive. In order to
+			XOR this entry with a previous one, the previous entry needs
+			to be decompressed first, and so on.
+
+			The hard-limit for this offset is 160 (an entry can only be
+			xor'ed against one of the 160 entries preceding it). This
+			number is always positive, and hence entries are always xor'ed
+			with **previous** bitmaps, not bitmaps that will come afterwards
+			in the index.
+
+		- 1-byte flags for this bitmap
+			At the moment the only available flag is `0x1`, which hints
+			that this bitmap can be re-used when rebuilding bitmap indexes
+			for the repository.
+
+		- The compressed bitmap itself, see Appendix A.
+
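The recursive XOR scheme described for the 1-byte XOR-offset can be sketched in C as follows. This is only an illustration under several assumptions: bitmaps are treated as already EWAH-decoded, fixed-width word arrays (`SKETCH_WORDS` is an arbitrary illustrative size), an offset of 0 is taken to mean "stored as-is", and `entry_sketch` and `resolve_bitmap` are hypothetical names, not functions from this merge.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_WORDS 2	/* illustrative fixed bitmap width: 128 bits */

/* Hypothetical decoded entry: stored words plus its XOR offset. */
struct entry_sketch {
	uint8_t xor_offset;		/* assumed: 0 means "not XORed" */
	uint64_t stored[SKETCH_WORDS];	/* bitmap as stored, already expanded */
};

/*
 * Compute the actual bitmap for entry x by XORing its stored words with
 * the (recursively resolved) bitmap of entry x - xor_offset.
 * Returns 0 on success, -1 if the offset points before the first entry.
 */
static int resolve_bitmap(const struct entry_sketch *entries, size_t x,
			  uint64_t out[SKETCH_WORDS])
{
	size_t i;
	uint64_t base[SKETCH_WORDS];
	uint8_t y = entries[x].xor_offset;

	for (i = 0; i < SKETCH_WORDS; i++)
		out[i] = entries[x].stored[i];
	if (y == 0)
		return 0;
	if (y > x)
		return -1;	/* offsets may only refer to previous entries */
	if (resolve_bitmap(entries, x - y, base) < 0)
		return -1;
	for (i = 0; i < SKETCH_WORDS; i++)
		out[i] ^= base[i];
	return 0;
}
```

Because offsets only point backwards, the recursion always terminates. A real reader would probably cache resolved bitmaps instead of recomputing the chain for every lookup.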
+== Appendix A: Serialization format for an EWAH bitmap
+
+Ewah bitmaps are serialized in the same protocol as the JAVAEWAH
+library, making them backwards compatible with the JGit
+implementation:
+
+	- 4-byte number of bits of the resulting UNCOMPRESSED bitmap
+
+	- 4-byte number of words of the COMPRESSED bitmap, when stored
+
+	- N x 8-byte words, as specified by the previous field
+
+		This is the actual content of the compressed bitmap.
+
+	- 4-byte position of the current RLW for the compressed
+		bitmap
+
+All words are stored in network byte order for their corresponding
+sizes.
+
+The compressed bitmap is stored in a form of run-length encoding, as
+follows. It consists of a concatenation of an arbitrary number of
+chunks. Each chunk consists of one or more 64-bit words
+
+	H L_1 L_2 L_3 .... L_M
+
+H is called RLW (run length word). It consists of (from lower to higher
+order bits):
+
+	- 1 bit: the repeated bit B
+
+	- 32 bits: repetition count K (unsigned)
+
+	- 31 bits: literal word count M (unsigned)
+
+The bitstream represented by the above chunk is then:
+
+	- K repetitions of B
+
+	- The bits stored in `L_1` through `L_M`. Within a word, bits at
+	  lower order come earlier in the stream than those at higher
+	  order.
+
+The next word after `L_M` (if any) must again be a RLW, for the next
+chunk. For efficient appending to the bitstream, the EWAH stores a
+pointer to the last RLW in the stream.
+
+
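The chunk layout in Appendix A can be illustrated with a small C decoder that expands RLW-prefixed chunks into plain 64-bit words. This is a sketch, not the `ewah/` code from this merge: `ewah_expand` is a hypothetical name, the input words are assumed to be already converted to host byte order, and the repetition count K is assumed to count whole 64-bit words of the bit B (the text only says "K repetitions of B"). The 4-byte framing fields around the word array are not handled here.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Expand EWAH words into uncompressed 64-bit words.
 * Returns the number of words written, or -1 if the input is truncated
 * or the output buffer is too small.
 */
static long ewah_expand(const uint64_t *in, size_t in_words,
			uint64_t *out, size_t out_cap)
{
	size_t i = 0, o = 0;

	while (i < in_words) {
		uint64_t rlw = in[i++];
		uint64_t fill = (rlw & 1) ? ~(uint64_t)0 : 0;	/* bit B */
		uint64_t k = (rlw >> 1) & 0xFFFFFFFFu;		/* 32-bit K */
		uint64_t m = rlw >> 33;				/* 31-bit M */

		if (k > out_cap - o || m > out_cap - o - k || m > in_words - i)
			return -1;
		while (k--)
			out[o++] = fill;	/* run of all-0 or all-1 words */
		while (m--)
			out[o++] = in[i++];	/* literal words copied as-is */
	}
	return (long)o;
}
```

Within each output word, bit 0 corresponds to the earliest position in the stream, matching the "lower order come earlier" rule stated above.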
+== Appendix B: Optional Bitmap Sections
+
+These sections may or may not be present in the `.bitmap` file; their
+presence is indicated by the header flags section described above.
+
+Name-hash cache
+---------------
+
+If the BITMAP_OPT_HASH_CACHE flag is set, the end of the bitmap contains
+a cache of 32-bit values, one per object in the pack. The value at
+position `i` is the hash of the pathname at which the `i`th object
+(counting in index order) in the pack can be found. This can be fed
+into the delta heuristics to compare objects with similar pathnames.
+
+The hash algorithm used is:
+
+	hash = 0;
+	while ((c = *name++))
+		if (!isspace(c))
+			hash = (hash >> 2) + (c << 24);
+
+Note that this hashing scheme is tied to the BITMAP_OPT_HASH_CACHE flag.
+If implementations want to choose a different hashing scheme, they are
+free to do so, but MUST allocate a new header flag (because comparing
+hashes made under two different schemes would be pointless).
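
The hash loop quoted in the name-hash section omits its declarations. A self-contained C version might look like the sketch below, assuming 32-bit unsigned arithmetic (to match the 32-bit cache values) and treating each character as an unsigned byte. The function name `pack_name_hash_sketch` and the NULL check are additions for the example, not part of the documented format.

```c
#include <ctype.h>
#include <stdint.h>

/* Hash a pathname as described above; whitespace characters are skipped. */
static uint32_t pack_name_hash_sketch(const char *name)
{
	uint32_t c, hash = 0;

	if (!name)
		return 0;	/* assumption: objects without a path hash to 0 */
	while ((c = (unsigned char)*name++) != 0) {
		if (isspace((int)c))
			continue;
		/* later characters land in the high bits, so tails dominate */
		hash = (hash >> 2) + (c << 24);
	}
	return hash;
}
```

Since each step shifts the previous state right by two bits, the last characters of the path (typically the file name and extension) contribute most to the result, which is what lets similarly named paths receive nearby values.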

Makefile

Lines changed: 14 additions & 2 deletions
@@ -664,6 +664,8 @@ LIB_H += diff.h
 LIB_H += diffcore.h
 LIB_H += dir.h
 LIB_H += exec_cmd.h
+LIB_H += ewah/ewok.h
+LIB_H += ewah/ewok_rlw.h
 LIB_H += fetch-pack.h
 LIB_H += fmt-merge-msg.h
 LIB_H += fsck.h
@@ -691,8 +693,10 @@ LIB_H += notes-merge.h
 LIB_H += notes-utils.h
 LIB_H += notes.h
 LIB_H += object.h
+LIB_H += pack-objects.h
 LIB_H += pack-revindex.h
 LIB_H += pack.h
+LIB_H += pack-bitmap.h
 LIB_H += parse-options.h
 LIB_H += patch-ids.h
 LIB_H += pathspec.h
@@ -796,6 +800,10 @@ LIB_OBJS += dir.o
 LIB_OBJS += editor.o
 LIB_OBJS += entry.o
 LIB_OBJS += environment.o
+LIB_OBJS += ewah/bitmap.o
+LIB_OBJS += ewah/ewah_bitmap.o
+LIB_OBJS += ewah/ewah_io.o
+LIB_OBJS += ewah/ewah_rlw.o
 LIB_OBJS += exec_cmd.o
 LIB_OBJS += fetch-pack.o
 LIB_OBJS += fsck.o
@@ -827,7 +835,10 @@ LIB_OBJS += notes-cache.o
 LIB_OBJS += notes-merge.o
 LIB_OBJS += notes-utils.o
 LIB_OBJS += object.o
+LIB_OBJS += pack-bitmap.o
+LIB_OBJS += pack-bitmap-write.o
 LIB_OBJS += pack-check.o
+LIB_OBJS += pack-objects.o
 LIB_OBJS += pack-revindex.o
 LIB_OBJS += pack-write.o
 LIB_OBJS += pager.o
@@ -2480,8 +2491,9 @@ profile-clean:
 	$(RM) $(addsuffix *.gcno,$(addprefix $(PROFILE_DIR)/, $(object_dirs)))
 
 clean: profile-clean coverage-clean
-	$(RM) *.o *.res block-sha1/*.o ppc/*.o compat/*.o compat/*/*.o xdiff/*.o vcs-svn/*.o \
-		builtin/*.o $(LIB_FILE) $(XDIFF_LIB) $(VCSSVN_LIB)
+	$(RM) *.o *.res block-sha1/*.o ppc/*.o compat/*.o compat/*/*.o
+	$(RM) xdiff/*.o vcs-svn/*.o ewah/*.o builtin/*.o
+	$(RM) $(LIB_FILE) $(XDIFF_LIB) $(VCSSVN_LIB)
 	$(RM) $(ALL_PROGRAMS) $(SCRIPT_LIB) $(BUILT_INS) git$X
 	$(RM) $(TEST_PROGRAMS) $(NO_INSTALL)
 	$(RM) -r bin-wrappers $(dep_dirs)

block-sha1/sha1.c

Lines changed: 0 additions & 32 deletions
@@ -62,38 +62,6 @@
 #define setW(x, val) (W(x) = (val))
 #endif
 
-/*
- * Performance might be improved if the CPU architecture is OK with
- * unaligned 32-bit loads and a fast ntohl() is available.
- * Otherwise fall back to byte loads and shifts which is portable,
- * and is faster on architectures with memory alignment issues.
- */
-
-#if defined(__i386__) || defined(__x86_64__) || \
-    defined(_M_IX86) || defined(_M_X64) || \
-    defined(__ppc__) || defined(__ppc64__) || \
-    defined(__powerpc__) || defined(__powerpc64__) || \
-    defined(__s390__) || defined(__s390x__)
-
-#define get_be32(p) ntohl(*(unsigned int *)(p))
-#define put_be32(p, v) do { *(unsigned int *)(p) = htonl(v); } while (0)
-
-#else
-
-#define get_be32(p) ( \
-	(*((unsigned char *)(p) + 0) << 24) | \
-	(*((unsigned char *)(p) + 1) << 16) | \
-	(*((unsigned char *)(p) + 2) << 8) | \
-	(*((unsigned char *)(p) + 3) << 0) )
-#define put_be32(p, v) do { \
-	unsigned int __v = (v); \
-	*((unsigned char *)(p) + 0) = __v >> 24; \
-	*((unsigned char *)(p) + 1) = __v >> 16; \
-	*((unsigned char *)(p) + 2) = __v >> 8; \
-	*((unsigned char *)(p) + 3) = __v >> 0; } while (0)
-
-#endif
-
 /* This "rolls" over the 512-bit array */
 #define W(x) (array[(x)&15])
 
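
The `get_be32`/`put_be32` macros removed from `block-sha1/sha1.c` have a portable byte-load fallback, which can be written as inline functions. The sketch below illustrates that fallback and is not the wrapper the series actually introduces. The `_sketch` names are hypothetical, and the casts to `uint32_t` are a small deviation from the removed macro: they keep the `<< 24` shift in unsigned arithmetic.

```c
#include <stdint.h>

/* Load a 32-bit big-endian value without requiring aligned access. */
static inline uint32_t get_be32_sketch(const void *ptr)
{
	const unsigned char *p = ptr;

	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Store a 32-bit value in big-endian order, one byte at a time. */
static inline void put_be32_sketch(void *ptr, uint32_t value)
{
	unsigned char *p = ptr;

	p[0] = (unsigned char)(value >> 24);
	p[1] = (unsigned char)(value >> 16);
	p[2] = (unsigned char)(value >> 8);
	p[3] = (unsigned char)value;
}
```

Byte-wise access like this works on any alignment and any host endianness, which is the property the bitmap and EWAH readers rely on when parsing network-order fields out of a mapped file.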