Commit 660dd97

Merge branch 'ds/chunked-file-api'

The common code to deal with "chunked file format" that is shared by
the multi-pack-index and commit-graph files has been factored out, to
help codepaths for both filetypes to become more robust.

* ds/chunked-file-api:
  commit-graph.c: display correct number of chunks when writing
  chunk-format: add technical docs
  chunk-format: restore duplicate chunk checks
  midx: use 64-bit multiplication for chunk sizes
  midx: use chunk-format read API
  commit-graph: use chunk-format read API
  chunk-format: create read chunk API
  midx: use chunk-format API in write_midx_internal()
  midx: drop chunk progress during write
  midx: return success/failure in chunk write methods
  midx: add num_large_offsets to write_midx_context
  midx: add pack_perm to write_midx_context
  midx: add entries to write_midx_context
  midx: use context in write_midx_pack_names()
  midx: rename pack_info to write_midx_context
  commit-graph: use chunk-format write API
  chunk-format: create chunk format write API
  commit-graph: anonymize data in chunk_write_fn

2 parents 12bd175 + c4ff24b commit 660dd97

10 files changed

+655
-468
lines changed

Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
+Chunk-based file formats
+========================
+
+Some file formats in Git use a common concept of "chunks" to describe
+sections of the file. This allows structured access to a large file by
+scanning a small "table of contents" for the remaining data. This common
+format is used by the `commit-graph` and `multi-pack-index` files. See
+link:technical/pack-format.html[the `multi-pack-index` format] and
+link:technical/commit-graph-format.html[the `commit-graph` format] for
+how they use the chunks to describe structured data.
+
+A chunk-based file format begins with some header information custom to
+that format. That header should include enough information to identify
+the file type, format version, and number of chunks in the file. From this
+information, a reader can determine the start of the chunk-based region.
+
+The chunk-based region starts with a table of contents describing where
+each chunk starts and ends. This consists of (C+1) rows of 12 bytes each,
+where C is the number of chunks. Consider the following table:
+
+  | Chunk ID (4 bytes) | Chunk Offset (8 bytes) |
+  |--------------------|------------------------|
+  | ID[0]              | OFFSET[0]              |
+  | ...                | ...                    |
+  | ID[C-1]            | OFFSET[C-1]            |
+  | 0x00000000         | OFFSET[C]              |
+
+Each row consists of a 4-byte chunk identifier (ID) and an 8-byte offset.
+Each integer is stored in network-byte order.
+
+The chunk identifier `ID[i]` is a label for the data stored within this
+file from `OFFSET[i]` (inclusive) to `OFFSET[i+1]` (exclusive). Thus, the
+size of the `i`th chunk is equal to the difference between `OFFSET[i+1]`
+and `OFFSET[i]`. This requires that the chunk data appears contiguously
+in the same order as the table of contents.
+
+The final entry in the table of contents must have four zero bytes as its
+ID. This confirms that the table of contents is ending and provides the
+offset for the end of the chunk-based data.
+
+Note: The chunk-based format expects that the file contains _at least_ a
+trailing hash after `OFFSET[C]`.
+
+Functions for working with chunk-based file formats are declared in
+`chunk-format.h`. Using these methods provides extra checks that assist
+developers when creating new file formats.
+
+Writing chunk-based file formats
+--------------------------------
+
+To write a chunk-based file format, create a `struct chunkfile` by
+calling `init_chunkfile()` and pass a `struct hashfile` pointer. The
+caller is responsible for opening the `hashfile` and writing header
+information so the file format is identifiable before the chunk-based
+format begins.
+
+Then, call `add_chunk()` for each chunk that is intended for write. This
+populates the `chunkfile` with information about the order and size of
+each chunk to write. Provide a `chunk_write_fn` function pointer to
+perform the write of the chunk data upon request.
+
+Call `write_chunkfile()` to write the table of contents to the `hashfile`
+followed by each of the chunks. This will verify that each chunk wrote
+the expected amount of data so the table of contents is correct.
+
+Finally, call `free_chunkfile()` to clear the `struct chunkfile` data. The
+caller is responsible for finalizing the `hashfile` by writing the trailing
+hash and closing the file.
+
+Reading chunk-based file formats
+--------------------------------
+
+To read a chunk-based file format, the file must be opened as a
+memory-mapped region. The chunk-format API expects that the entire file
+is mapped as a contiguous memory region.
+
+Initialize a `struct chunkfile` pointer with `init_chunkfile(NULL)`.
+
+After reading the header information from the beginning of the file,
+including the chunk count, call `read_table_of_contents()` to populate
+the `struct chunkfile` with the list of chunks, their offsets, and their
+sizes.
+
+Extract the data information for each chunk using `pair_chunk()` or
+`read_chunk()`:
+
+* `pair_chunk()` assigns a given pointer with the location inside the
+  memory-mapped file corresponding to that chunk's offset. If the chunk
+  does not exist, then the pointer is not modified.
+
+* `read_chunk()` takes a `chunk_read_fn` function pointer and calls it
+  with the appropriate initial pointer and size information. The function
+  is not called if the chunk does not exist. Use this method to read chunks
+  if you need to perform immediate parsing or if you need to execute logic
+  based on the size of the chunk.
+
+After calling these methods, call `free_chunkfile()` to clear the
+`struct chunkfile` data. This will not close the memory-mapped region.
+Callers are expected to own that data for the timeframe the pointers into
+the region are needed.
+
+Examples
+--------
+
+These file formats use the chunk-format API, and can be used as examples
+for future formats:
+
+* *commit-graph:* see `write_commit_graph_file()` and `parse_commit_graph()`
+  in `commit-graph.c` for how the chunk-format API is used to write and
+  parse the commit-graph file format documented in
+  link:technical/commit-graph-format.html[the commit-graph file format].
+
+* *multi-pack-index:* see `write_midx_internal()` and `load_multi_pack_index()`
+  in `midx.c` for how the chunk-format API is used to write and
+  parse the multi-pack-index file format documented in
+  link:technical/pack-format.html[the multi-pack-index file format].
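
The table-of-contents layout described in the document above can be illustrated with a small standalone parser. This is only a sketch under stated assumptions: `struct toc_chunk`, `load_be32()`, `load_be64()` and `parse_toc()` are hypothetical names invented for this example and are not part of Git. It reads 12-byte rows in network byte order, derives each chunk size from the difference between consecutive offsets, and checks that the terminating row has a zero ID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOC_ENTRY_SIZE 12 /* 4-byte chunk ID + 8-byte offset */

/* Hypothetical record for this sketch; not Git's struct chunk_info. */
struct toc_chunk {
	uint32_t id;
	uint64_t offset;
	uint64_t size;
};

/* Read a 32-bit big-endian (network byte order) integer. */
static uint32_t load_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Read a 64-bit big-endian integer as two 32-bit halves. */
static uint64_t load_be64(const unsigned char *p)
{
	return ((uint64_t)load_be32(p) << 32) | load_be32(p + 4);
}

/*
 * Parse a table of contents that holds `nr` chunk rows followed by
 * one terminating row. `out` must have room for `nr` entries.
 * Returns 0 on success and -1 if the table is malformed.
 */
static int parse_toc(const unsigned char *toc, size_t nr,
		     struct toc_chunk *out)
{
	size_t i;

	assert(toc && (nr == 0 || out));

	for (i = 0; i < nr; i++) {
		const unsigned char *row = toc + i * TOC_ENTRY_SIZE;
		/* The next row's offset marks the end of this chunk. */
		uint64_t next = load_be64(row + TOC_ENTRY_SIZE + 4);

		out[i].id = load_be32(row);
		out[i].offset = load_be64(row + 4);

		/* A zero ID may only appear in the terminating row. */
		if (!out[i].id || next < out[i].offset)
			return -1;
		out[i].size = next - out[i].offset;
	}

	/* The final row must carry four zero bytes as its ID. */
	return load_be32(toc + nr * TOC_ENTRY_SIZE) ? -1 : 0;
}
```

Unlike `read_table_of_contents()` in this commit, the sketch does not check offsets against the file size or reject duplicate IDs; it only demonstrates the row layout and the size computation.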

Documentation/technical/commit-graph-format.txt

Lines changed: 3 additions & 0 deletions
@@ -61,6 +61,9 @@ CHUNK LOOKUP:
 the length using the next chunk position if necessary.) Each chunk
 ID appears at most once.
 
+The CHUNK LOOKUP matches the table of contents from
+link:technical/chunk-format.html[the chunk-based file format].
+
 The remaining data in the body is described one chunk at a time, and
 these chunks may be given in any order. Chunks are required unless
 otherwise specified.

Documentation/technical/pack-format.txt

Lines changed: 3 additions & 0 deletions
@@ -336,6 +336,9 @@ CHUNK LOOKUP:
 (Chunks are provided in file-order, so you can infer the length
 using the next chunk position if necessary.)
 
+The CHUNK LOOKUP matches the table of contents from
+link:technical/chunk-format.html[the chunk-based file format].
+
 The remaining data in the body is described one chunk at a time, and
 these chunks may be given in any order. Chunks are required unless
 otherwise specified.

Makefile

Lines changed: 1 addition & 0 deletions
@@ -834,6 +834,7 @@ LIB_OBJS += bundle.o
 LIB_OBJS += cache-tree.o
 LIB_OBJS += chdir-notify.o
 LIB_OBJS += checkout.o
+LIB_OBJS += chunk-format.o
 LIB_OBJS += color.o
 LIB_OBJS += column.o
 LIB_OBJS += combine-diff.o

chunk-format.c

Lines changed: 179 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,179 @@
1+
#include "cache.h"
2+
#include "chunk-format.h"
3+
#include "csum-file.h"
4+
5+
/*
6+
* When writing a chunk-based file format, collect the chunks in
7+
* an array of chunk_info structs. The size stores the _expected_
8+
* amount of data that will be written by write_fn.
9+
*/
10+
struct chunk_info {
11+
uint32_t id;
12+
uint64_t size;
13+
chunk_write_fn write_fn;
14+
15+
const void *start;
16+
};
17+
18+
struct chunkfile {
19+
struct hashfile *f;
20+
21+
struct chunk_info *chunks;
22+
size_t chunks_nr;
23+
size_t chunks_alloc;
24+
};
25+
26+
struct chunkfile *init_chunkfile(struct hashfile *f)
27+
{
28+
struct chunkfile *cf = xcalloc(1, sizeof(*cf));
29+
cf->f = f;
30+
return cf;
31+
}
32+
33+
void free_chunkfile(struct chunkfile *cf)
34+
{
35+
if (!cf)
36+
return;
37+
free(cf->chunks);
38+
free(cf);
39+
}
40+
41+
int get_num_chunks(struct chunkfile *cf)
42+
{
43+
return cf->chunks_nr;
44+
}
45+
46+
void add_chunk(struct chunkfile *cf,
47+
uint32_t id,
48+
size_t size,
49+
chunk_write_fn fn)
50+
{
51+
ALLOC_GROW(cf->chunks, cf->chunks_nr + 1, cf->chunks_alloc);
52+
53+
cf->chunks[cf->chunks_nr].id = id;
54+
cf->chunks[cf->chunks_nr].write_fn = fn;
55+
cf->chunks[cf->chunks_nr].size = size;
56+
cf->chunks_nr++;
57+
}
58+
59+
int write_chunkfile(struct chunkfile *cf, void *data)
60+
{
61+
int i;
62+
uint64_t cur_offset = hashfile_total(cf->f);
63+
64+
/* Add the table of contents to the current offset */
65+
cur_offset += (cf->chunks_nr + 1) * CHUNK_TOC_ENTRY_SIZE;
66+
67+
for (i = 0; i < cf->chunks_nr; i++) {
68+
hashwrite_be32(cf->f, cf->chunks[i].id);
69+
hashwrite_be64(cf->f, cur_offset);
70+
71+
cur_offset += cf->chunks[i].size;
72+
}
73+
74+
/* Trailing entry marks the end of the chunks */
75+
hashwrite_be32(cf->f, 0);
76+
hashwrite_be64(cf->f, cur_offset);
77+
78+
for (i = 0; i < cf->chunks_nr; i++) {
79+
off_t start_offset = hashfile_total(cf->f);
80+
int result = cf->chunks[i].write_fn(cf->f, data);
81+
82+
if (result)
83+
return result;
84+
85+
if (hashfile_total(cf->f) - start_offset != cf->chunks[i].size)
86+
BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
87+
cf->chunks[i].size, cf->chunks[i].id,
88+
hashfile_total(cf->f) - start_offset);
89+
}
90+
91+
return 0;
92+
}
+
+int read_table_of_contents(struct chunkfile *cf,
+			   const unsigned char *mfile,
+			   size_t mfile_size,
+			   uint64_t toc_offset,
+			   int toc_length)
+{
+	int i;
+	uint32_t chunk_id;
+	const unsigned char *table_of_contents = mfile + toc_offset;
+
+	ALLOC_GROW(cf->chunks, toc_length, cf->chunks_alloc);
+
+	while (toc_length--) {
+		uint64_t chunk_offset, next_chunk_offset;
+
+		chunk_id = get_be32(table_of_contents);
+		chunk_offset = get_be64(table_of_contents + 4);
+
+		if (!chunk_id) {
+			error(_("terminating chunk id appears earlier than expected"));
+			return 1;
+		}
+
+		table_of_contents += CHUNK_TOC_ENTRY_SIZE;
+		next_chunk_offset = get_be64(table_of_contents + 4);
+
+		if (next_chunk_offset < chunk_offset ||
+		    next_chunk_offset > mfile_size - the_hash_algo->rawsz) {
+			error(_("improper chunk offset(s) %"PRIx64" and %"PRIx64""),
+			      chunk_offset, next_chunk_offset);
+			return -1;
+		}
+
+		for (i = 0; i < cf->chunks_nr; i++) {
+			if (cf->chunks[i].id == chunk_id) {
+				error(_("duplicate chunk ID %"PRIx32" found"),
+				      chunk_id);
+				return -1;
+			}
+		}
+
+		cf->chunks[cf->chunks_nr].id = chunk_id;
+		cf->chunks[cf->chunks_nr].start = mfile + chunk_offset;
+		cf->chunks[cf->chunks_nr].size = next_chunk_offset - chunk_offset;
+		cf->chunks_nr++;
+	}
+
+	chunk_id = get_be32(table_of_contents);
+	if (chunk_id) {
+		error(_("final chunk has non-zero id %"PRIx32""), chunk_id);
+		return -1;
+	}
+
+	return 0;
+}
+
+static int pair_chunk_fn(const unsigned char *chunk_start,
+			 size_t chunk_size,
+			 void *data)
+{
+	const unsigned char **p = data;
+	*p = chunk_start;
+	return 0;
+}
+
+int pair_chunk(struct chunkfile *cf,
+	       uint32_t chunk_id,
+	       const unsigned char **p)
+{
+	return read_chunk(cf, chunk_id, pair_chunk_fn, p);
+}
+
+int read_chunk(struct chunkfile *cf,
+	       uint32_t chunk_id,
+	       chunk_read_fn fn,
+	       void *data)
+{
+	int i;
+
+	for (i = 0; i < cf->chunks_nr; i++) {
+		if (cf->chunks[i].id == chunk_id)
+			return fn(cf->chunks[i].start, cf->chunks[i].size, data);
+	}
+
+	return CHUNK_NOT_FOUND;
+}
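
The offset bookkeeping in `write_chunkfile()` can be tried outside of Git with a plain byte buffer. The following is a minimal sketch, assuming a caller that already knows how many header bytes precede the table of contents; `store_be32()`, `store_be64()` and `fill_toc()` are hypothetical helpers written for this example, standing in for `hashwrite_be32()`, `hashwrite_be64()` and the hashfile itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOC_ENTRY_SIZE 12 /* 4-byte chunk ID + 8-byte offset */

/* Store a 32-bit integer in big-endian (network) byte order. */
static void store_be32(unsigned char *p, uint32_t v)
{
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}

/* Store a 64-bit integer in big-endian byte order. */
static void store_be64(unsigned char *p, uint64_t v)
{
	store_be32(p, (uint32_t)(v >> 32));
	store_be32(p + 4, (uint32_t)v);
}

/*
 * Serialize a table of contents for `nr` chunks into `buf`, which
 * must hold (nr + 1) * TOC_ENTRY_SIZE bytes. `header_size` is the
 * number of bytes that precede the table in the file.
 * Returns the offset just past the last chunk.
 */
static uint64_t fill_toc(unsigned char *buf, uint64_t header_size,
			 const uint32_t *ids, const uint64_t *sizes,
			 size_t nr)
{
	/* The first chunk starts right after the table of contents. */
	uint64_t cur = header_size + (uint64_t)(nr + 1) * TOC_ENTRY_SIZE;
	size_t i;

	assert(buf && (nr == 0 || (ids && sizes)));

	for (i = 0; i < nr; i++) {
		store_be32(buf, ids[i]);
		store_be64(buf + 4, cur);
		buf += TOC_ENTRY_SIZE;
		cur += sizes[i];
	}

	/* Terminating row: zero ID plus the end of the chunk data. */
	store_be32(buf, 0);
	store_be64(buf + 4, cur);
	return cur;
}
```

For example, with an 8-byte header and two chunks of 1024 and 40 bytes, the table occupies 36 bytes, so the chunks start at offsets 44 and 1068 and the terminating row records 1108.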

chunk-format.h

Lines changed: 68 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,68 @@
1+
#ifndef CHUNK_FORMAT_H
2+
#define CHUNK_FORMAT_H
3+
4+
#include "git-compat-util.h"
5+
6+
struct hashfile;
7+
struct chunkfile;
8+
9+
#define CHUNK_TOC_ENTRY_SIZE (sizeof(uint32_t) + sizeof(uint64_t))
10+
11+
/*
12+
* Initialize a 'struct chunkfile' for writing _or_ reading a file
13+
* with the chunk format.
14+
*
15+
* If writing a file, supply a non-NULL 'struct hashfile *' that will
16+
* be used to write.
17+
*
18+
* If reading a file, use a NULL 'struct hashfile *' and then call
19+
* read_table_of_contents(). Supply the memory-mapped data to the
20+
* pair_chunk() or read_chunk() methods, as appropriate.
21+
*
22+
* DO NOT MIX THESE MODES. Use different 'struct chunkfile' instances
23+
* for reading and writing.
24+
*/
25+
struct chunkfile *init_chunkfile(struct hashfile *f);
26+
void free_chunkfile(struct chunkfile *cf);
27+
int get_num_chunks(struct chunkfile *cf);
28+
typedef int (*chunk_write_fn)(struct hashfile *f, void *data);
29+
void add_chunk(struct chunkfile *cf,
30+
uint32_t id,
31+
size_t size,
32+
chunk_write_fn fn);
33+
int write_chunkfile(struct chunkfile *cf, void *data);
34+
35+
int read_table_of_contents(struct chunkfile *cf,
36+
const unsigned char *mfile,
37+
size_t mfile_size,
38+
uint64_t toc_offset,
39+
int toc_length);
40+
41+
#define CHUNK_NOT_FOUND (-2)
42+
43+
/*
44+
* Find 'chunk_id' in the given chunkfile and assign the
45+
* given pointer to the position in the mmap'd file where
46+
* that chunk begins.
47+
*
48+
* Returns CHUNK_NOT_FOUND if the chunk does not exist.
49+
*/
50+
int pair_chunk(struct chunkfile *cf,
51+
uint32_t chunk_id,
52+
const unsigned char **p);
53+
54+
typedef int (*chunk_read_fn)(const unsigned char *chunk_start,
55+
size_t chunk_size, void *data);
56+
/*
57+
* Find 'chunk_id' in the given chunkfile and call the
58+
* given chunk_read_fn method with the information for
59+
* that chunk.
60+
*
61+
* Returns CHUNK_NOT_FOUND if the chunk does not exist.
62+
*/
63+
int read_chunk(struct chunkfile *cf,
64+
uint32_t chunk_id,
65+
chunk_read_fn fn,
66+
void *data);
67+
68+
#endif
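
The callback contract declared in this header (look up a chunk by ID, call the reader if it exists, otherwise return a "not found" value) can be modeled in a few lines of standalone C. This is an illustrative sketch only: every `sketch_` name and the `read_fixed_table()` callback are hypothetical stand-ins, and the 1024-byte size requirement is a made-up rule used to show why a size-aware callback can be preferable to simply pairing a pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the value of CHUNK_NOT_FOUND in the header above. */
#define SKETCH_CHUNK_NOT_FOUND (-2)

/* Hypothetical in-memory view of one parsed chunk. */
struct sketch_chunk {
	uint32_t id;
	const unsigned char *start;
	size_t size;
};

typedef int (*sketch_read_fn)(const unsigned char *chunk_start,
			      size_t chunk_size, void *data);

/*
 * Find `id` among `nr` chunks and hand its location and size to `fn`.
 * The callback is not invoked when the chunk is missing.
 */
static int sketch_read_chunk(const struct sketch_chunk *chunks, size_t nr,
			     uint32_t id, sketch_read_fn fn, void *data)
{
	size_t i;

	assert(fn);

	for (i = 0; i < nr; i++) {
		if (chunks[i].id == id)
			return fn(chunks[i].start, chunks[i].size, data);
	}
	return SKETCH_CHUNK_NOT_FOUND;
}

/*
 * Example callback: accept the chunk only if it has the size this
 * made-up format requires (256 four-byte entries), then record
 * where it starts.
 */
static int read_fixed_table(const unsigned char *chunk_start,
			    size_t chunk_size, void *data)
{
	const unsigned char **out = data;

	if (chunk_size != 256 * sizeof(uint32_t))
		return -1;
	*out = chunk_start;
	return 0;
}
```

A caller can then tell three outcomes apart: 0 when the chunk was accepted, the callback's own error when the chunk exists but is unusable, and the "not found" value when the ID is absent, in which case the output pointer is left untouched.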
