
Commit 8e97852

Merge branch 'ds/sparse-index-protections'
Builds on top of the sparse-index infrastructure to mark operations
that are not ready to mark with the sparse index, causing them to
fall back on fully-populated index that they always have worked with.

* ds/sparse-index-protections: (47 commits)
  name-hash: use expand_to_path()
  sparse-index: expand_to_path()
  name-hash: don't add directories to name_hash
  revision: ensure full index
  resolve-undo: ensure full index
  read-cache: ensure full index
  pathspec: ensure full index
  merge-recursive: ensure full index
  entry: ensure full index
  dir: ensure full index
  update-index: ensure full index
  stash: ensure full index
  rm: ensure full index
  merge-index: ensure full index
  ls-files: ensure full index
  grep: ensure full index
  fsck: ensure full index
  difftool: ensure full index
  commit: ensure full index
  checkout: ensure full index
  ...
2 parents d250f90 + 4589bca commit 8e97852


48 files changed, +1257 -109 lines changed

Documentation/config/index.txt

Lines changed: 5 additions & 0 deletions
@@ -14,6 +14,11 @@ index.recordOffsetTable::
 	Defaults to 'true' if index.threads has been explicitly enabled,
 	'false' otherwise.
 
+index.sparse::
+	When enabled, write the index using sparse-directory entries. This
+	has no effect unless `core.sparseCheckout` and
+	`core.sparseCheckoutCone` are both enabled. Defaults to 'false'.
+
 index.threads::
 	Specifies the number of threads to spawn when loading the index.
 	This is meant to reduce index load time on multiprocessor machines.

Documentation/git-sparse-checkout.txt

Lines changed: 14 additions & 0 deletions
@@ -45,6 +45,20 @@ To avoid interfering with other worktrees, it first enables the
 When `--cone` is provided, the `core.sparseCheckoutCone` setting is
 also set, allowing for better performance with a limited set of
 patterns (see 'CONE PATTERN SET' below).
++
+Use the `--[no-]sparse-index` option to toggle the use of the sparse
+index format. This reduces the size of the index to be more closely
+aligned with your sparse-checkout definition. This can have significant
+performance advantages for commands such as `git status` or `git add`.
+This feature is still experimental. Some commands might be slower with
+a sparse index until they are properly integrated with the feature.
++
+**WARNING:** Using a sparse index requires modifying the index in a way
+that is not completely understood by external tools. If you have trouble
+with this compatibility, then run `git sparse-checkout init --no-sparse-index`
+to rewrite your index to not be sparse. Older versions of Git will not
+understand the sparse directory entries index extension and may fail to
+interact with your repository until it is disabled.
 
 'set'::
 	Write a set of patterns to the sparse-checkout file, as given as

Documentation/technical/index-format.txt

Lines changed: 19 additions & 0 deletions
@@ -44,6 +44,13 @@ Git index format
   localization, no special casing of directory separator '/'). Entries
   with the same name are sorted by their stage field.
 
+  An index entry typically represents a file. However, if sparse-checkout
+  is enabled in cone mode (`core.sparseCheckoutCone` is enabled) and the
+  `extensions.sparseIndex` extension is enabled, then the index may
+  contain entries for directories outside of the sparse-checkout definition.
+  These entries have mode `040000`, include the `SKIP_WORKTREE` bit, and
+  the path ends in a directory separator.
+
   32-bit ctime seconds, the last time a file's metadata changed
     this is stat(2) data
 
@@ -385,3 +392,15 @@ The remaining data of each directory block is grouped by type:
     in this block of entries.
 
   - 32-bit count of cache entries in this block
+
+== Sparse Directory Entries
+
+  When using sparse-checkout in cone mode, some entire directories within
+  the index can be summarized by pointing to a tree object instead of the
+  entire expanded list of paths within that tree. An index containing such
+  entries is a "sparse index". Index format versions 4 and less were not
+  implemented with such entries in mind. Thus, for these versions, an
+  index containing sparse directory entries will include this extension
+  with signature { 's', 'd', 'i', 'r' }. Like the split-index extension,
+  tools should avoid interacting with a sparse index unless they understand
+  this extension.
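
The format text above lists three properties of a sparse directory entry: mode `040000`, the `SKIP_WORKTREE` bit, and a path ending in a directory separator. The following is a minimal sketch of a predicate that tests those three properties. It is not Git's code: `struct toy_entry`, `TOY_MODE_DIR`, `TOY_SKIP_WORKTREE`, and `toy_is_sparse_dir()` are hypothetical names chosen for illustration, and the flag value is an arbitrary stand-in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; not Git's actual definitions. */
#define TOY_MODE_DIR      040000u   /* mode recorded for a sparse directory entry */
#define TOY_SKIP_WORKTREE (1u << 0) /* arbitrary bit standing in for SKIP_WORKTREE */

struct toy_entry {
	unsigned int mode;  /* file mode, e.g. 0100644 for a regular file */
	unsigned int flags; /* bit flags, including TOY_SKIP_WORKTREE */
	const char *path;   /* NUL-terminated path relative to the repository root */
};

/*
 * Return true only when the entry has all three properties named in the
 * format description: directory mode, skip-worktree bit, trailing '/'.
 */
bool toy_is_sparse_dir(const struct toy_entry *e)
{
	size_t len;

	assert(e != NULL && e->path != NULL);
	len = strlen(e->path);
	return e->mode == TOY_MODE_DIR &&
	       (e->flags & TOY_SKIP_WORKTREE) != 0 &&
	       len > 0 && e->path[len - 1] == '/';
}
```

A reader that does not understand the `sdir` extension has no such check, which is presumably why the text asks tools to avoid a sparse index they do not understand.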
Lines changed: 208 additions & 0 deletions
@@ -0,0 +1,208 @@
+Git Sparse-Index Design Document
+================================
+
+The sparse-checkout feature allows users to focus a working directory on
+a subset of the files at HEAD. The cone mode patterns, enabled by
+`core.sparseCheckoutCone`, allow for very fast pattern matching to
+discover which files at HEAD belong in the sparse-checkout cone.
+
+Three important scale dimensions for a Git working directory are:
+
+* `HEAD`: How many files are present at `HEAD`?
+
+* Populated: How many files are within the sparse-checkout cone.
+
+* Modified: How many files has the user modified in the working directory?
+
+We will use big-O notation -- O(X) -- to denote how expensive certain
+operations are in terms of these dimensions.
+
+These dimensions are ordered by their magnitude: users (typically) modify
+fewer files than are populated, and we can only populate files at `HEAD`.
+
+Problems occur if there is an extreme imbalance in these dimensions. For
+example, if `HEAD` contains millions of paths but the populated set has
+only tens of thousands, then commands like `git status` and `git add` can
+be dominated by operations that require O(`HEAD`) operations instead of
+O(Populated). Primarily, the cost is in parsing and rewriting the index,
+which is filled primarily with files at `HEAD` that are marked with the
+`SKIP_WORKTREE` bit.
+
+The sparse-index intends to take these commands that read and modify the
+index from O(`HEAD`) to O(Populated). To do this, we need to modify the
+index format in a significant way: add "sparse directory" entries.
+
+With cone mode patterns, it is possible to detect when an entire
+directory will have its contents outside of the sparse-checkout definition.
+Instead of listing all of the files it contains as individual entries, a
+sparse-index contains an entry with the directory name, referencing the
+object ID of the tree at `HEAD` and marked with the `SKIP_WORKTREE` bit.
+If we need to discover the details for paths within that directory, we
+can parse trees to find that list.
+
+At time of writing, sparse-directory entries violate expectations about the
+index format and its in-memory data structure. There are many consumers in
+the codebase that expect to iterate through all of the index entries and
+see only files. In fact, these loops expect to see a reference to every
+staged file. One way to handle this is to parse trees to replace a
+sparse-directory entry with all of the files within that tree as the index
+is loaded. However, parsing trees is slower than parsing the index format,
+so that is a slower operation than if we left the index alone. The plan is
+to make all of these integrations "sparse aware" so this expansion through
+tree parsing is unnecessary and they use fewer resources than when using a
+full index.
+
+The implementation plan below follows four phases to slowly integrate with
+the sparse-index. The intention is to incrementally update Git commands to
+interact safely with the sparse-index without significant slowdowns. This
+may not always be possible, but the hope is that the primary commands that
+users need in their daily work are dramatically improved.
+
+Phase I: Format and initial speedups
+------------------------------------
+
+During this phase, Git learns to enable the sparse-index and safely parse
+one. Protections are put in place so that every consumer of the in-memory
+data structure can operate with its current assumption of every file at
+`HEAD`.
+
+At first, every index parse will call a helper method,
+`ensure_full_index()`, which scans the index for sparse-directory entries
+(pointing to trees) and replaces them with the full list of paths (with
+blob contents) by parsing tree objects. This will be slower in all cases.
+The only noticeable change in behavior will be that the serialized index
+file contains sparse-directory entries.
+
+To start, we use a new required index extension, `sdir`, to allow
+inserting sparse-directory entries into indexes with file format
+versions 2, 3, and 4. This prevents Git versions that do not understand
+the sparse-index from operating on one, while allowing tools that do not
+understand the sparse-index to operate on repositories as long as they do
+not interact with the index. A new format, index v5, will be introduced
+that includes sparse-directory entries by default. It might also
+introduce other features that have been considered for improving the
+index, as well.
+
+Next, consumers of the index will be guarded against operating on a
+sparse-index by inserting calls to `ensure_full_index()` or
+`expand_index_to_path()`. If a specific path is requested, then those will
+be protected from within the `index_file_exists()` and `index_name_pos()`
+API calls: they will call `ensure_full_index()` if necessary. The
+intention here is to preserve existing behavior when interacting with a
+sparse-checkout. We don't want a change to happen by accident, without
+tests. Many of these locations may not need any change before removing the
+guards, but we should not do so without tests to ensure the expected
+behavior happens.
+
+It may be desirable to _change_ the behavior of some commands in the
+presence of a sparse index or more generally in any sparse-checkout
+scenario. In such cases, these should be carefully communicated and
+tested. No such behavior changes are intended during this phase.
+
+During a scan of the codebase, not every iteration of the cache entries
+needs an `ensure_full_index()` check. The basic reasons include:
+
+1. The loop is scanning for entries with non-zero stage. These entries
+   are not collapsed into a sparse-directory entry.
+
+2. The loop is scanning for submodules. These entries are not collapsed
+   into a sparse-directory entry.
+
+3. The loop is part of the index API, especially around reading or
+   writing the format.
+
+4. The loop is checking for correct order of cache entries and that is
+   correct if and only if the sparse-directory entries are in the correct
+   location.
+
+5. The loop ignores entries with the `SKIP_WORKTREE` bit set, or is
+   otherwise already aware of sparse directory entries.
+
+6. The sparse-index is disabled at this point when using the split-index
+   feature, so no effort is made to protect the split-index API.
+
+Even after inserting these guards, we will keep expanding sparse-indexes
+for most Git commands using the `command_requires_full_index` repository
+setting. This setting will be on by default and disabled one builtin at a
+time until we have sufficient confidence that all of the index operations
+are properly guarded.
+
+To complete this phase, the commands `git status` and `git add` will be
+integrated with the sparse-index so that they operate with O(Populated)
+performance. They will be carefully tested for operations within and
+outside the sparse-checkout definition.
+
+Phase II: Careful integrations
+------------------------------
+
+This phase focuses on ensuring that all index extensions and APIs work
+well with a sparse-index. This requires significant increases to our test
+coverage, especially for operations that interact with the working
+directory outside of the sparse-checkout definition. Some of these
+behaviors may not be the desirable ones, such as some tests already
+marked for failure in `t1092-sparse-checkout-compatibility.sh`.
+
+The index extensions that may require special integrations are:
+
+* FS Monitor
+* Untracked cache
+
+While integrating with these features, we should look for patterns that
+might lead to better APIs for interacting with the index. Coalescing
+common usage patterns into an API call can reduce the number of places
+where sparse-directories need to be handled carefully.
+
+Phase III: Important command speedups
+-------------------------------------
+
+At this point, the patterns for testing and implementing sparse-directory
+logic should be relatively stable. This phase focuses on updating some of
+the most common builtins that use the index to operate as O(Populated).
+Here is a potential list of commands that could be valuable to integrate
+at this point:
+
+* `git commit`
+* `git checkout`
+* `git merge`
+* `git rebase`
+
+Hopefully, commands such as `git merge` and `git rebase` can benefit
+instead from merge algorithms that do not use the index as a data
+structure, such as the merge-ORT strategy. As these topics mature, we
+may enable the ORT strategy by default for repositories using the
+sparse-index feature.
+
+Along with `git status` and `git add`, these commands cover the majority
+of users' interactions with the working directory. In addition, we can
+integrate with these commands:
+
+* `git grep`
+* `git rm`
+
+These have been proposed as some whose behavior could change when in a
+repo with a sparse-checkout definition. It would be good to include this
+behavior automatically when using a sparse-index. Some clarity is needed
+to make the behavior switch clear to the user.
+
+This phase is the first where parallel work might be possible without too
+much conflicts between topics.
+
+Phase IV: The long tail
+-----------------------
+
+This last phase is less a "phase" and more "the new normal" after all of
+the previous work.
+
+To start, the `command_requires_full_index` option could be removed in
+favor of expanding only when hitting an API guard.
+
+There are many Git commands that could use special attention to operate as
+O(Populated), while some might be so rare that it is acceptable to leave
+them with additional overhead when a sparse-index is present.
+
+Here are some commands that might be useful to update:
+
+* `git sparse-checkout set`
+* `git am`
+* `git clean`
+* `git stash`
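
The design document describes `ensure_full_index()` as replacing each sparse-directory entry with the full list of paths found by parsing tree objects. The sketch below models that expansion on a toy in-memory index. It is an illustration under stated assumptions, not the implementation in `sparse-index.c`: the `toy_*` types and functions are hypothetical, tree parsing is replaced by a lookup in a caller-supplied table, and nested directories, object IDs, and flags are left out.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PATH_MAX 64
#define TOY_TREE_MAX 4

struct toy_entry {
	char path[TOY_PATH_MAX];
	int is_sparse_dir;          /* nonzero for a collapsed directory */
};

struct toy_index {
	struct toy_entry *entries;
	size_t nr;
	int sparse;                 /* nonzero while sparse directories may exist */
};

/* Hypothetical stand-in for a tree object: one directory and its files. */
struct toy_tree {
	const char *dir;                 /* directory path ending in '/' */
	const char *files[TOY_TREE_MAX]; /* file names, NULL-terminated if short */
};

/* Append one entry, growing the array by one slot. */
void toy_append(struct toy_index *idx, const char *path, int is_sparse_dir)
{
	struct toy_entry *grown;

	assert(idx != NULL && path != NULL);
	grown = realloc(idx->entries, (idx->nr + 1) * sizeof(*grown));
	if (!grown)
		abort();
	idx->entries = grown;
	snprintf(grown[idx->nr].path, TOY_PATH_MAX, "%s", path);
	grown[idx->nr].is_sparse_dir = is_sparse_dir;
	idx->nr++;
}

/*
 * Replace every sparse-directory entry with one entry per file listed for
 * that directory. Returns the number of directories expanded; a full index
 * returns 0 without doing any work.
 */
size_t toy_ensure_full_index(struct toy_index *idx,
			     const struct toy_tree *trees, size_t ntrees)
{
	struct toy_index full = { NULL, 0, 0 };
	size_t i, t, f, expanded = 0;

	assert(idx != NULL);
	if (!idx->sparse)
		return 0;

	for (i = 0; i < idx->nr; i++) {
		const struct toy_entry *e = &idx->entries[i];
		char path[2 * TOY_PATH_MAX];

		if (!e->is_sparse_dir) {
			toy_append(&full, e->path, 0);
			continue;
		}
		for (t = 0; t < ntrees; t++) {
			if (strcmp(trees[t].dir, e->path))
				continue;
			for (f = 0; f < TOY_TREE_MAX && trees[t].files[f]; f++) {
				snprintf(path, sizeof(path), "%s%s",
					 e->path, trees[t].files[f]);
				toy_append(&full, path, 0);
			}
		}
		expanded++;
	}
	free(idx->entries);
	*idx = full; /* the index is now full; idx->sparse becomes 0 */
	return expanded;
}
```

A caller that iterates over every entry would call `toy_ensure_full_index()` first, which mirrors the `ensure_full_index(&the_index);` guards this merge places before the `active_nr` loops in the builtins. The early return for a full index is what keeps the guard cheap when no sparse directories are present.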

Makefile

Lines changed: 1 addition & 0 deletions
@@ -995,6 +995,7 @@ LIB_OBJS += setup.o
 LIB_OBJS += shallow.o
 LIB_OBJS += sideband.o
 LIB_OBJS += sigchain.o
+LIB_OBJS += sparse-index.o
 LIB_OBJS += split-index.o
 LIB_OBJS += stable-qsort.o
 LIB_OBJS += strbuf.o

attr.c

Lines changed: 7 additions & 7 deletions
@@ -733,7 +733,7 @@ static struct attr_stack *read_attr_from_file(const char *path, unsigned flags)
 	return res;
 }
 
-static struct attr_stack *read_attr_from_index(const struct index_state *istate,
+static struct attr_stack *read_attr_from_index(struct index_state *istate,
 					       const char *path,
 					       unsigned flags)
 {
@@ -763,7 +763,7 @@ static struct attr_stack *read_attr_from_index(const struct index_state *istate,
 	return res;
 }
 
-static struct attr_stack *read_attr(const struct index_state *istate,
+static struct attr_stack *read_attr(struct index_state *istate,
 				    const char *path, unsigned flags)
 {
 	struct attr_stack *res = NULL;
@@ -855,7 +855,7 @@ static void push_stack(struct attr_stack **attr_stack_p,
 	}
 }
 
-static void bootstrap_attr_stack(const struct index_state *istate,
+static void bootstrap_attr_stack(struct index_state *istate,
 				 struct attr_stack **stack)
 {
 	struct attr_stack *e;
@@ -894,7 +894,7 @@ static void bootstrap_attr_stack(const struct index_state *istate,
 	push_stack(stack, e, NULL, 0);
 }
 
-static void prepare_attr_stack(const struct index_state *istate,
+static void prepare_attr_stack(struct index_state *istate,
 			       const char *path, int dirlen,
 			       struct attr_stack **stack)
 {
@@ -1094,7 +1094,7 @@ static void determine_macros(struct all_attrs_item *all_attrs,
  * If check->check_nr is non-zero, only attributes in check[] are collected.
  * Otherwise all attributes are collected.
  */
-static void collect_some_attrs(const struct index_state *istate,
+static void collect_some_attrs(struct index_state *istate,
 			       const char *path,
 			       struct attr_check *check)
 {
@@ -1123,7 +1123,7 @@ static void collect_some_attrs(const struct index_state *istate,
 	fill(path, pathlen, basename_offset, check->stack, check->all_attrs, rem);
 }
 
-void git_check_attr(const struct index_state *istate,
+void git_check_attr(struct index_state *istate,
 		    const char *path,
 		    struct attr_check *check)
 {
@@ -1140,7 +1140,7 @@ void git_check_attr(const struct index_state *istate,
 	}
 }
 
-void git_all_attrs(const struct index_state *istate,
+void git_all_attrs(struct index_state *istate,
 		   const char *path, struct attr_check *check)
 {
 	int i;

attr.h

Lines changed: 2 additions & 2 deletions
@@ -190,14 +190,14 @@ void attr_check_free(struct attr_check *check);
  */
 const char *git_attr_name(const struct git_attr *);
 
-void git_check_attr(const struct index_state *istate,
+void git_check_attr(struct index_state *istate,
 		    const char *path, struct attr_check *check);
 
 /*
  * Retrieve all attributes that apply to the specified path.
  * check holds the attributes and their values.
  */
-void git_all_attrs(const struct index_state *istate,
+void git_all_attrs(struct index_state *istate,
 		   const char *path, struct attr_check *check);
 
 enum git_attr_direction {
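
The `attr.c` and `attr.h` hunks only drop `const` from `struct index_state *` parameters. A likely reason, going by the design document's statement that `index_name_pos()` will call `ensure_full_index()` if necessary, is that a lookup which looks read-only may now modify the index by expanding it. The sketch below shows that effect with hypothetical `toy_*` names; it assumes nothing about Git's real data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy index; not Git's struct index_state. */
struct toy_index {
	const char **paths; /* entry paths, in index order */
	size_t nr;
	int sparse;         /* nonzero until the index has been expanded */
	int expansions;     /* counts how often an expansion actually ran */
};

/* Stand-in for ensure_full_index(): it writes to the index. */
void toy_ensure_full(struct toy_index *idx)
{
	assert(idx != NULL);
	if (!idx->sparse)
		return;
	idx->sparse = 0; /* a real expansion would also rewrite the entries */
	idx->expansions++;
}

/*
 * A lookup that only reads entries but may first expand the index.
 * With a 'const struct toy_index *' parameter, the call to
 * toy_ensure_full() would discard the qualifier and draw a diagnostic.
 */
int toy_name_pos(struct toy_index *idx, const char *name)
{
	size_t i;

	assert(idx != NULL && name != NULL);
	toy_ensure_full(idx);
	for (i = 0; i < idx->nr; i++)
		if (!strcmp(idx->paths[i], name))
			return (int)i;
	return -1;
}
```

Because the qualifier has to match along the whole call chain, every caller that hands its index down to such a lookup loses `const` as well, which would explain why the change reaches from the static helpers up to `git_check_attr()` and `git_all_attrs()`.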

builtin/add.c

Lines changed: 2 additions & 0 deletions
@@ -141,6 +141,8 @@ static int renormalize_tracked_files(const struct pathspec *pathspec, int flags)
 {
 	int i, retval = 0;
 
+	/* TODO: audit for interaction with sparse-index. */
+	ensure_full_index(&the_index);
 	for (i = 0; i < active_nr; i++) {
 		struct cache_entry *ce = active_cache[i];

builtin/checkout-index.c

Lines changed: 2 additions & 0 deletions
@@ -120,6 +120,8 @@ static void checkout_all(const char *prefix, int prefix_length)
 	int i, errs = 0;
 	struct cache_entry *last_ce = NULL;
 
+	/* TODO: audit for interaction with sparse-index. */
+	ensure_full_index(&the_index);
 	for (i = 0; i < active_nr ; i++) {
 		struct cache_entry *ce = active_cache[i];
 		if (ce_stage(ce) != checkout_stage

builtin/checkout.c

Lines changed: 5 additions & 0 deletions
@@ -369,6 +369,9 @@ static int checkout_worktree(const struct checkout_opts *opts,
 			       NULL);
 
 	enable_delayed_checkout(&state);
+
+	/* TODO: audit for interaction with sparse-index. */
+	ensure_full_index(&the_index);
 	for (pos = 0; pos < active_nr; pos++) {
 		struct cache_entry *ce = active_cache[pos];
 		if (ce->ce_flags & CE_MATCHED) {
@@ -513,6 +516,8 @@ static int checkout_paths(const struct checkout_opts *opts,
 	 * Make sure all pathspecs participated in locating the paths
 	 * to be checked out.
 	 */
+	/* TODO: audit for interaction with sparse-index. */
+	ensure_full_index(&the_index);
 	for (pos = 0; pos < active_nr; pos++)
 		if (opts->overlay_mode)
 			mark_ce_for_checkout_overlay(active_cache[pos],
