Skip to content

Commit cf55870

Browse files
torvaldsgitster
authored andcommitted
Create pathname-based hash-table lookup into index
This creates a hash index of every single file added to the index. Right now that hash index isn't actually used for much: I implemented a "cache_name_exists()" function that uses it to efficiently look up a filename in the index without having to do the O(logn) binary search, but quite frankly, that's not why this patch is interesting. No, the whole and only reason to create the hash of the filenames in the index is that by modifying the hash function, you can fairly easily do things like making it always hash equivalent names into the same bucket. That, in turn, means that suddenly questions like "does this name exist in the index under an _equivalent_ name?" becomes much much cheaper. Guiding principles behind this patch: - it shouldn't be too costly. In fact, my primary goal here was to actually speed up "git commit" with a fully populated kernel tree, by being faster at checking whether a file already existed in the index. I did succeed, but only barely: Best before: [torvalds@woody linux]$ time git commit > /dev/null real 0m0.255s user 0m0.168s sys 0m0.088s Best after: [torvalds@woody linux]$ time ~/git/git commit > /dev/null real 0m0.233s user 0m0.144s sys 0m0.088s so some things are actually faster (~8%). Caveat: that's really the best case. Other things are invariably going to be slightly slower, since we populate that index cache, and quite frankly, few things really use it to look things up. That said, the cost is really quite small. The worst case is probably doing a "git ls-files", which will do very little except puopulate the index, and never actually looks anything up in it, just lists it. Before: [torvalds@woody linux]$ time git ls-files > /dev/null real 0m0.016s user 0m0.016s sys 0m0.000s After: [torvalds@woody linux]$ time ~/git/git ls-files > /dev/null real 0m0.021s user 0m0.012s sys 0m0.008s and while the thing has really gotten relatively much slower, we're still talking about something almost unmeasurable (eg 5ms). And that really should be pretty much the worst case. So we lose 5ms on one "benchmark", but win 22ms on another. Pick your poison - this patch has the advantage that it will _likely_ speed up the cases that are complex and expensive more than it slows down the cases that are already so fast that nobody cares. But if you look at relative speedups/slowdowns, it doesn't look so good. - It should be simple and clean The code may be a bit subtle (the reasons I do hash removal the way I do etc), but it re-uses the existing hash.c files, so it really is fairly small and straightforward apart from a few odd details. Now, this patch on its own doesn't really do much, but I think it's worth looking at, if only because if done correctly, the name hashing really can make an improvement to the whole issue of "do we have a filename that looks like this in the index already". And at least it gets real testing by being used even by default (ie there is a real use-case for it even without any insane filesystems). NOTE NOTE NOTE! The current hash is a joke. I'm ashamed of it, I'm just not ashamed of it enough to really care. I took all the numbers out of my nether regions - I'm sure it's good enough that it works in practice, but the whole point was that you can make a really much fancier hash that hashes characters not directly, but by their upper-case value or something like that, and thus you get a case-insensitive hash, while still keeping the name and the index itself totally case sensitive. Signed-off-by: Linus Torvalds <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>
1 parent 6d91da6 commit cf55870

File tree

3 files changed

+95
-11
lines changed

3 files changed

+95
-11
lines changed

cache.h

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3,6 +3,7 @@
33

44
#include "git-compat-util.h"
55
#include "strbuf.h"
6+
#include "hash.h"
67

78
#include SHA1_HEADER
89
#include <zlib.h>
@@ -109,6 +110,7 @@ struct ondisk_cache_entry {
109110
};
110111

111112
struct cache_entry {
113+
struct cache_entry *next;
112114
unsigned int ce_ctime;
113115
unsigned int ce_mtime;
114116
unsigned int ce_dev;
@@ -131,6 +133,7 @@ struct cache_entry {
131133
#define CE_UPDATE (0x10000)
132134
#define CE_REMOVE (0x20000)
133135
#define CE_UPTODATE (0x40000)
136+
#define CE_UNHASHED (0x80000)
134137

135138
static inline unsigned create_ce_flags(size_t len, unsigned stage)
136139
{
@@ -188,6 +191,7 @@ struct index_state {
188191
struct cache_tree *cache_tree;
189192
time_t timestamp;
190193
void *alloc;
194+
struct hash_table name_hash;
191195
};
192196

193197
extern struct index_state the_index;
@@ -211,6 +215,7 @@ extern struct index_state the_index;
211215
#define refresh_cache(flags) refresh_index(&the_index, (flags), NULL, NULL)
212216
#define ce_match_stat(ce, st, options) ie_match_stat(&the_index, (ce), (st), (options))
213217
#define ce_modified(ce, st, options) ie_modified(&the_index, (ce), (st), (options))
218+
#define cache_name_exists(name, namelen) index_name_exists(&the_index, (name), (namelen))
214219
#endif
215220

216221
enum object_type {
@@ -297,6 +302,7 @@ extern int read_index_from(struct index_state *, const char *path);
297302
extern int write_index(struct index_state *, int newfd);
298303
extern int discard_index(struct index_state *);
299304
extern int verify_path(const char *path);
305+
extern int index_name_exists(struct index_state *istate, const char *name, int namelen);
300306
extern int index_name_pos(struct index_state *, const char *name, int namelen);
301307
#define ADD_CACHE_OK_TO_ADD 1 /* Ok to add */
302308
#define ADD_CACHE_OK_TO_REPLACE 2 /* Ok to replace file/directory */

dir.c

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -346,7 +346,7 @@ static struct dir_entry *dir_entry_new(const char *pathname, int len)
346346

347347
struct dir_entry *dir_add_name(struct dir_struct *dir, const char *pathname, int len)
348348
{
349-
if (cache_name_pos(pathname, len) >= 0)
349+
if (cache_name_exists(pathname, len))
350350
return NULL;
351351

352352
ALLOC_GROW(dir->entries, dir->nr+1, dir->alloc);

read-cache.c

Lines changed: 88 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -23,6 +23,70 @@
2323

2424
struct index_state the_index;
2525

26+
static unsigned int hash_name(const char *name, int namelen)
27+
{
28+
unsigned int hash = 0x123;
29+
30+
do {
31+
unsigned char c = *name++;
32+
hash = hash*101 + c;
33+
} while (--namelen);
34+
return hash;
35+
}
36+
37+
static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
38+
{
39+
void **pos;
40+
unsigned int hash = hash_name(ce->name, ce_namelen(ce));
41+
42+
istate->cache[nr] = ce;
43+
pos = insert_hash(hash, ce, &istate->name_hash);
44+
if (pos) {
45+
ce->next = *pos;
46+
*pos = ce;
47+
}
48+
}
49+
50+
/*
51+
* We don't actually *remove* it, we can just mark it invalid so that
52+
* we won't find it in lookups.
53+
*
54+
* Not only would we have to search the lists (simple enough), but
55+
* we'd also have to rehash other hash buckets in case this makes the
56+
* hash bucket empty (common). So it's much better to just mark
57+
* it.
58+
*/
59+
static void remove_hash_entry(struct index_state *istate, struct cache_entry *ce)
60+
{
61+
ce->ce_flags |= CE_UNHASHED;
62+
}
63+
64+
static void replace_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
65+
{
66+
struct cache_entry *old = istate->cache[nr];
67+
68+
if (ce != old) {
69+
remove_hash_entry(istate, old);
70+
set_index_entry(istate, nr, ce);
71+
}
72+
istate->cache_changed = 1;
73+
}
74+
75+
int index_name_exists(struct index_state *istate, const char *name, int namelen)
76+
{
77+
unsigned int hash = hash_name(name, namelen);
78+
struct cache_entry *ce = lookup_hash(hash, &istate->name_hash);
79+
80+
while (ce) {
81+
if (!(ce->ce_flags & CE_UNHASHED)) {
82+
if (!cache_name_compare(name, namelen, ce->name, ce->ce_flags))
83+
return 1;
84+
}
85+
ce = ce->next;
86+
}
87+
return 0;
88+
}
89+
2690
/*
2791
* This only updates the "non-critical" parts of the directory
2892
* cache, ie the parts that aren't tracked by GIT, and only used
@@ -327,6 +391,9 @@ int index_name_pos(struct index_state *istate, const char *name, int namelen)
327391
/* Remove entry, return true if there are more entries to go.. */
328392
int remove_index_entry_at(struct index_state *istate, int pos)
329393
{
394+
struct cache_entry *ce = istate->cache[pos];
395+
396+
remove_hash_entry(istate, ce);
330397
istate->cache_changed = 1;
331398
istate->cache_nr--;
332399
if (pos >= istate->cache_nr)
@@ -702,8 +769,7 @@ static int add_index_entry_with_check(struct index_state *istate, struct cache_e
702769

703770
/* existing match? Just replace it. */
704771
if (pos >= 0) {
705-
istate->cache_changed = 1;
706-
istate->cache[pos] = ce;
772+
replace_index_entry(istate, pos, ce);
707773
return 0;
708774
}
709775
pos = -pos-1;
@@ -763,7 +829,7 @@ int add_index_entry(struct index_state *istate, struct cache_entry *ce, int opti
763829
memmove(istate->cache + pos + 1,
764830
istate->cache + pos,
765831
(istate->cache_nr - pos - 1) * sizeof(ce));
766-
istate->cache[pos] = ce;
832+
set_index_entry(istate, pos, ce);
767833
istate->cache_changed = 1;
768834
return 0;
769835
}
@@ -892,11 +958,8 @@ int refresh_index(struct index_state *istate, unsigned int flags, const char **p
892958
has_errors = 1;
893959
continue;
894960
}
895-
istate->cache_changed = 1;
896-
/* You can NOT just free istate->cache[i] here, since it
897-
* might not be necessarily malloc()ed but can also come
898-
* from mmap(). */
899-
istate->cache[i] = new;
961+
962+
replace_index_entry(istate, i, new);
900963
}
901964
return has_errors;
902965
}
@@ -971,6 +1034,20 @@ static void convert_from_disk(struct ondisk_cache_entry *ondisk, struct cache_en
9711034
memcpy(ce->name, ondisk->name, len + 1);
9721035
}
9731036

1037+
static inline size_t estimate_cache_size(size_t ondisk_size, unsigned int entries)
1038+
{
1039+
long per_entry;
1040+
1041+
per_entry = sizeof(struct cache_entry) - sizeof(struct ondisk_cache_entry);
1042+
1043+
/*
1044+
* Alignment can cause differences. This should be "alignof", but
1045+
* since that's a gcc'ism, just use the size of a pointer.
1046+
*/
1047+
per_entry += sizeof(void *);
1048+
return ondisk_size + entries*per_entry;
1049+
}
1050+
9741051
/* remember to discard_cache() before reading a different cache! */
9751052
int read_index_from(struct index_state *istate, const char *path)
9761053
{
@@ -1021,7 +1098,7 @@ int read_index_from(struct index_state *istate, const char *path)
10211098
* has room for a few more flags, we can allocate using the same
10221099
* index size
10231100
*/
1024-
istate->alloc = xmalloc(mmap_size);
1101+
istate->alloc = xmalloc(estimate_cache_size(mmap_size, istate->cache_nr));
10251102

10261103
src_offset = sizeof(*hdr);
10271104
dst_offset = 0;
@@ -1032,7 +1109,7 @@ int read_index_from(struct index_state *istate, const char *path)
10321109
disk_ce = (struct ondisk_cache_entry *)((char *)mmap + src_offset);
10331110
ce = (struct cache_entry *)((char *)istate->alloc + dst_offset);
10341111
convert_from_disk(disk_ce, ce);
1035-
istate->cache[i] = ce;
1112+
set_index_entry(istate, i, ce);
10361113

10371114
src_offset += ondisk_ce_size(ce);
10381115
dst_offset += ce_size(ce);
@@ -1070,6 +1147,7 @@ int discard_index(struct index_state *istate)
10701147
istate->cache_nr = 0;
10711148
istate->cache_changed = 0;
10721149
istate->timestamp = 0;
1150+
free_hash(&istate->name_hash);
10731151
cache_tree_free(&(istate->cache_tree));
10741152
free(istate->alloc);
10751153
istate->alloc = NULL;

0 commit comments

Comments
 (0)