Commit a43a2e6

derrickstolee authored and gitster committed
chunk-format: add technical docs
The chunk-based file format is now an API in the code, but we should also take time to document it as a file format. Specifically, it matches the CHUNK LOOKUP sections of the commit-graph and multi-pack-index files, but there are some commonalities that should be grouped in this document.

Signed-off-by: Derrick Stolee <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>
1 parent 5387fef commit a43a2e6

3 files changed: +122 -0 lines changed

Documentation/technical/chunk-format.txt

Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
+Chunk-based file formats
+========================
+
+Some file formats in Git use a common concept of "chunks" to describe
+sections of the file. This allows structured access to a large file by
+scanning a small "table of contents" for the remaining data. This common
+format is used by the `commit-graph` and `multi-pack-index` files. See
+link:technical/pack-format.html[the `multi-pack-index` format] and
+link:technical/commit-graph-format.html[the `commit-graph` format] for
+how they use the chunks to describe structured data.
+
+A chunk-based file format begins with some header information custom to
+that format. That header should include enough information to identify
+the file type, format version, and number of chunks in the file. From this
+information, a reader can determine the start of the chunk-based region.
+
+The chunk-based region starts with a table of contents describing where
+each chunk starts and ends. This consists of one 12-byte row per chunk plus
+a terminating row. The following table indexes the chunks from 0 to C:
+
+  | Chunk ID (4 bytes) | Chunk Offset (8 bytes) |
+  |--------------------|------------------------|
+  | ID[0]              | OFFSET[0]              |
+  | ...                | ...                    |
+  | ID[C]              | OFFSET[C]              |
+  | 0x00000000         | OFFSET[C+1]            |
+
+Each row consists of a 4-byte chunk identifier (ID) and an 8-byte offset.
+Each integer is stored in network byte order.
+
+The chunk identifier `ID[i]` is a label for the data stored within this
+file from `OFFSET[i]` (inclusive) to `OFFSET[i+1]` (exclusive). Thus, the
+size of the `i`th chunk is equal to the difference between `OFFSET[i+1]`
+and `OFFSET[i]`. This requires that the chunk data appears contiguously
+in the same order as the table of contents.
+
+The chunk ID of the final entry in the table of contents must be four zero
+bytes. This marks the end of the table of contents and provides the offset
+for the end of the chunk-based data.
+
+Note: The chunk-based format expects that the file contains _at least_ a
+trailing hash after `OFFSET[C+1]`.
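
As an illustration of the layout described above, the following C sketch decodes a table of contents from a byte buffer and derives each chunk's size from consecutive offsets. It is not Git's implementation: `struct toc_entry`, `parse_toc()`, and the locally defined `get_be32()`/`get_be64()` helpers are names assumed for this example, and the error handling (rejecting a zero ID before the terminator, or offsets that decrease) is one reasonable choice rather than a statement of what `chunk-format.c` does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOC_ROW_SIZE 12 /* 4-byte chunk ID + 8-byte offset */

/* Hypothetical in-memory form of one table-of-contents row. */
struct toc_entry {
	uint32_t id;
	uint64_t offset;
	uint64_t size; /* OFFSET[i+1] - OFFSET[i] */
};

/* Read a 32-bit integer stored in network (big-endian) byte order. */
static uint32_t get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Read a 64-bit integer stored in network (big-endian) byte order. */
static uint64_t get_be64(const unsigned char *p)
{
	return ((uint64_t)get_be32(p) << 32) | get_be32(p + 4);
}

/*
 * Parse `nr` chunk rows plus the terminating row from `toc`, which must
 * hold at least (nr + 1) * TOC_ROW_SIZE bytes. Returns 0 on success and
 * -1 if a chunk ID is zero before the terminator, the offsets decrease,
 * or the terminating row does not carry an all-zero ID.
 */
static int parse_toc(const unsigned char *toc, size_t nr,
		     struct toc_entry *out)
{
	size_t i;

	assert(toc && out);
	for (i = 0; i < nr; i++) {
		const unsigned char *row = toc + i * TOC_ROW_SIZE;
		uint64_t start = get_be64(row + 4);
		/* The next row's offset marks the exclusive end of this chunk. */
		uint64_t end = get_be64(row + TOC_ROW_SIZE + 4);

		if (!get_be32(row) || end < start)
			return -1;
		out[i].id = get_be32(row);
		out[i].offset = start;
		out[i].size = end - start;
	}
	return get_be32(toc + nr * TOC_ROW_SIZE) ? -1 : 0;
}
```

Because the size of a chunk is taken from the following row, the terminating row is needed even though it labels no data: it supplies the end offset of the last chunk.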
+
+Functions for working with chunk-based file formats are declared in
+`chunk-format.h`. Using these functions provides extra checks that assist
+developers when creating new file formats.
+
+Writing chunk-based file formats
+--------------------------------
+
+To write a chunk-based file format, create a `struct chunkfile` by
+calling `init_chunkfile()` and passing a `struct hashfile` pointer. The
+caller is responsible for opening the `hashfile` and writing header
+information so the file format is identifiable before the chunk-based
+format begins.
+
+Then, call `add_chunk()` for each chunk that is to be written. This
+populates the `chunkfile` with information about the order and size of
+each chunk to write. Provide a `chunk_write_fn` function pointer to
+perform the write of the chunk data upon request.
+
+Call `write_chunkfile()` to write the table of contents to the `hashfile`
+followed by each of the chunks. This will verify that each chunk wrote
+the expected amount of data so the table of contents is correct.
+
+Finally, call `free_chunkfile()` to clear the `struct chunkfile` data. The
+caller is responsible for finalizing the `hashfile` by writing the trailing
+hash and closing the file.
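
The write sequence above (register every chunk with its size, emit the table of contents, then emit the chunks and check their sizes) can be sketched in standalone C. This is a simplified model, not the code in `chunk-format.c`: `struct out_buf` stands in for `struct hashfile`, and `plan_chunk()`, `write_plan()`, and `write_string()` are hypothetical names. The sketch also assumes that offsets are absolute positions in the file, so the first chunk starts right after the header and the table of contents.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_CHUNKS 8
#define TOC_ROW_SIZE 12 /* 4-byte chunk ID + 8-byte offset */

/* Hypothetical fixed-size output buffer standing in for a hashfile. */
struct out_buf {
	unsigned char data[256];
	size_t len;
};

/* Callback that appends one chunk's payload; returns 0 on success. */
typedef int (*chunk_write_cb)(struct out_buf *out, void *data);

struct planned_chunk {
	uint32_t id;
	uint64_t size;
	chunk_write_cb write;
};

struct chunk_plan {
	struct planned_chunk chunks[MAX_CHUNKS];
	size_t nr;
};

static void put_bytes(struct out_buf *out, const void *src, size_t n)
{
	assert(out->len + n <= sizeof(out->data));
	memcpy(out->data + out->len, src, n);
	out->len += n;
}

/* Append the low `bytes` bytes of `value` in network byte order. */
static void put_be(struct out_buf *out, uint64_t value, int bytes)
{
	while (bytes--) {
		unsigned char b = (unsigned char)(value >> (8 * bytes));
		put_bytes(out, &b, 1);
	}
}

/* Record a chunk's ID, declared size, and writer, in file order. */
static void plan_chunk(struct chunk_plan *plan, uint32_t id, uint64_t size,
		       chunk_write_cb write)
{
	assert(plan->nr < MAX_CHUNKS);
	plan->chunks[plan->nr].id = id;
	plan->chunks[plan->nr].size = size;
	plan->chunks[plan->nr].write = write;
	plan->nr++;
}

/*
 * Emit the table of contents, then every chunk in the same order.
 * Returns -1 if a callback fails or writes a different number of
 * bytes than was declared for its chunk.
 */
static int write_plan(struct chunk_plan *plan, struct out_buf *out, void *data)
{
	uint64_t offset = out->len + (plan->nr + 1) * TOC_ROW_SIZE;
	size_t i;

	for (i = 0; i < plan->nr; i++) {
		put_be(out, plan->chunks[i].id, 4);
		put_be(out, offset, 8);
		offset += plan->chunks[i].size;
	}
	put_be(out, 0, 4);      /* terminating chunk ID */
	put_be(out, offset, 8); /* end of the chunk data */

	for (i = 0; i < plan->nr; i++) {
		size_t start = out->len;

		if (plan->chunks[i].write(out, data) ||
		    out->len - start != plan->chunks[i].size)
			return -1;
	}
	return 0;
}

/* Example callback: writes the NUL-terminated string passed as `data`. */
static int write_string(struct out_buf *out, void *data)
{
	put_bytes(out, data, strlen((const char *)data));
	return 0;
}
```

Declaring sizes up front is what lets the table of contents be written before any chunk data; the size check afterwards catches a callback whose output would make those offsets wrong.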
+
+Reading chunk-based file formats
+--------------------------------
+
+To read a chunk-based file format, the file must be opened as a
+memory-mapped region. The chunk-format API expects that the entire file
+is mapped as a contiguous memory region.
+
+Initialize a `struct chunkfile` pointer with `init_chunkfile(NULL)`.
+
+After reading the header information from the beginning of the file,
+including the chunk count, call `read_table_of_contents()` to populate
+the `struct chunkfile` with the list of chunks, their offsets, and their
+sizes.
+
+Extract the data for each chunk using `pair_chunk()` or
+`read_chunk()`:
+
+* `pair_chunk()` sets a given pointer to the location inside the
+  memory-mapped file corresponding to that chunk's offset. If the chunk
+  does not exist, then the pointer is not modified.
+
+* `read_chunk()` takes a `chunk_read_fn` function pointer and calls it
+  with the appropriate initial pointer and size information. The function
+  is not called if the chunk does not exist. Use this method to read chunks
+  if you need to perform immediate parsing or if you need to execute logic
+  based on the size of the chunk.
+
+After calling these methods, call `free_chunkfile()` to clear the
+`struct chunkfile` data. This will not close the memory-mapped region.
+Callers are expected to own that data for as long as the pointers into
+the region are needed.
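
The difference between the two lookup styles can be modelled in a few lines of standalone C. The names below (`struct mapped_toc`, `pair_by_id()`, `read_by_id()`, `check_word_sized()`) and the return codes are assumptions made for this sketch; they do not reproduce the signatures declared in `chunk-format.h`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parsed table-of-contents entry pointing into the mapping. */
struct mapped_chunk {
	uint32_t id;
	const unsigned char *start;
	size_t size;
};

struct mapped_toc {
	const struct mapped_chunk *chunks;
	size_t nr;
};

typedef int (*chunk_read_cb)(const unsigned char *start, size_t size,
			     void *data);

enum { CHUNK_FOUND = 0, CHUNK_MISSING = 1 };

static const struct mapped_chunk *find_chunk(const struct mapped_toc *toc,
					     uint32_t id)
{
	size_t i;

	for (i = 0; i < toc->nr; i++)
		if (toc->chunks[i].id == id)
			return &toc->chunks[i];
	return NULL;
}

/* "Pair" style: store the chunk's start in *p; leave *p alone if absent. */
static int pair_by_id(const struct mapped_toc *toc, uint32_t id,
		      const unsigned char **p)
{
	const struct mapped_chunk *c = find_chunk(toc, id);

	if (!c)
		return CHUNK_MISSING;
	*p = c->start;
	return CHUNK_FOUND;
}

/* "Read" style: pass start and size to a callback; skip it if absent. */
static int read_by_id(const struct mapped_toc *toc, uint32_t id,
		      chunk_read_cb fn, void *data)
{
	const struct mapped_chunk *c = find_chunk(toc, id);

	assert(fn);
	if (!c)
		return CHUNK_MISSING;
	return fn(c->start, c->size, data);
}

/* Example callback: accept only chunks made of whole 4-byte words. */
static int check_word_sized(const unsigned char *start, size_t size,
			    void *data)
{
	size_t *nr_words = data;

	(void)start; /* this example only inspects the size */
	if (size % 4)
		return -1;
	*nr_words = size / 4;
	return 0;
}
```

The pair style is enough when the caller only needs a pointer to the data. The callback style fits cases such as `check_word_sized()`, where the chunk's size has to be validated before the data is used.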
+
+Examples
+--------
+
+These file formats use the chunk-format API, and can be used as examples
+for future formats:
+
+* *commit-graph:* see `write_commit_graph_file()` and `parse_commit_graph()`
+  in `commit-graph.c` for how the chunk-format API is used to write and
+  parse the commit-graph file format documented in
+  link:technical/commit-graph-format.html[the commit-graph file format].
+
+* *multi-pack-index:* see `write_midx_internal()` and `load_multi_pack_index()`
+  in `midx.c` for how the chunk-format API is used to write and
+  parse the multi-pack-index file format documented in
+  link:technical/pack-format.html[the multi-pack-index file format].

Documentation/technical/commit-graph-format.txt

Lines changed: 3 additions & 0 deletions
@@ -61,6 +61,9 @@ CHUNK LOOKUP:
 the length using the next chunk position if necessary.) Each chunk
 ID appears at most once.
 
+The CHUNK LOOKUP matches the table of contents from
+link:technical/chunk-format.html[the chunk-based file format].
+
 The remaining data in the body is described one chunk at a time, and
 these chunks may be given in any order. Chunks are required unless
 otherwise specified.

Documentation/technical/pack-format.txt

Lines changed: 3 additions & 0 deletions
@@ -301,6 +301,9 @@ CHUNK LOOKUP:
 (Chunks are provided in file-order, so you can infer the length
 using the next chunk position if necessary.)
 
+The CHUNK LOOKUP matches the table of contents from
+link:technical/chunk-format.html[the chunk-based file format].
+
 The remaining data in the body is described one chunk at a time, and
 these chunks may be given in any order. Chunks are required unless
 otherwise specified.
