Commit 31993be

doc: Add an explanation of Git's data model
Git very often uses the terms "object", "reference", or "index" in its documentation. However, it's hard to find a clear explanation of these terms and how they relate to each other in the documentation. The closest candidates currently are:

1. `gitglossary`. This makes a good effort, but it's an alphabetically ordered dictionary and a dictionary is not a good way to learn concepts. You have to jump around too much and it's not possible to present the concepts in the order that they should be explained.
2. `gitcore-tutorial`. This explains how to use the "core" Git commands. This is a nice document to have, but it's not necessary to learn how `update-index` works to understand Git's data model, and we should not be requiring users to learn how to use the "plumbing" commands if they want to learn what the term "index" or "object" means.
3. `gitrepository-layout`. This is a great resource, but it includes a lot of information about configuration and internal implementation details which are not related to the data model. It also does not explain how commits work.

The result of this is that Git users (even users who have been using Git for 15+ years) struggle to read the documentation because they don't know what the core terms mean, and it's not possible to add citations to help them learn more.

Add an explanation of Git's data model. Some choices I've made in deciding what "core data model" means:

1. Omit pseudorefs like `FETCH_HEAD`, because it's not clear to me if those are intended to be user facing.
2. Don't talk about submodules other than by mentioning how they relate to trees. This is because Git has a lot of special features, and explaining how they all work exhaustively could quickly go down a rabbit hole which would make this document less useful for understanding the core behaviour.

Some other choices I've made:

1. Mention packed refs only in a note.
2. Don't mention that the full name of the branch `main` is technically `refs/heads/main`. This should likely change but I haven't worked out how to do it in a clear way yet.
3. Mostly avoid referring to the `.git` directory, because the exact details of how things are stored change over time. This should perhaps change from "mostly" to "entirely" but I haven't worked out how to do it in a clear way yet.
1 parent bb69721 commit 31993be

2 files changed

+223
-0
lines changed

Documentation/Makefile

Lines changed: 1 addition & 0 deletions
@@ -52,6 +52,7 @@ MAN7_TXT += gitcli.adoc
 MAN7_TXT += gitcore-tutorial.adoc
 MAN7_TXT += gitcredentials.adoc
 MAN7_TXT += gitcvs-migration.adoc
+MAN7_TXT += gitdatamodel.adoc
 MAN7_TXT += gitdiffcore.adoc
 MAN7_TXT += giteveryday.adoc
 MAN7_TXT += gitfaq.adoc

Documentation/gitdatamodel.adoc

Lines changed: 222 additions & 0 deletions
@@ -0,0 +1,222 @@
1+
gitdatamodel(7)
2+
===============
3+
4+
NAME
5+
----
6+
gitdatamodel - Git's core data model
7+
8+
DESCRIPTION
9+
-----------
10+
11+
It's not necessary to understand Git's data model to use Git, but it's
12+
very helpful when reading Git's documentation so that you know what it
13+
means when the documentation says "object", "reference", or "index".
14+
15+
Git's core operations use 4 kinds of data:
16+
17+
1. <<objects,Objects>>: commits, trees, blobs, and tag objects
18+
2. <<references,References>>: branches, tags,
19+
remote-tracking branches, etc.
20+
3. <<index,The index>>, also known as the staging area
21+
4. <<reflogs,Reflogs>>
22+
23+
[[objects]]
24+
OBJECTS
25+
-------
26+
27+
Commits, trees, blobs, and tag objects are all stored in Git's object database.
28+
Every object has:
29+
30+
1. an *ID*, which is the SHA-1 hash of its type, size, and contents.
31+
It's fast to look up a Git object using its ID.
32+
The ID is usually represented in hexadecimal, like
33+
`1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a`.
34+
2. a *type*. There are 4 types of objects:
35+
<<commit,commits>>, <<tree,trees>>, <<blob,blobs>>,
36+
and <<tag-object,tag objects>>.
37+
3. *contents*. The structure of the contents depends on the type.
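
As a rough illustration of where an ID comes from, the Python sketch below hashes a blob the way Git does in a SHA-1 repository: a short header (the type `blob`, the length of the contents, and a NUL byte) followed by the contents themselves. The helper name `blob_id` is made up for this example, and repositories created with SHA-256 would produce different IDs.

```python
import hashlib


def blob_id(contents: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with these contents."""
    # Git hashes "<type> <size>\0" followed by the raw contents.
    header = b"blob " + str(len(contents)).encode("ascii") + b"\0"
    return hashlib.sha1(header + contents).hexdigest()


# The empty blob has a well-known ID.
print(blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Assuming no content filters (such as line-ending conversion) are configured, this should agree with what `git hash-object <file>` prints for a file with the same bytes.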
38+
39+
Once an object is created, it can never be changed.
40+
Here are the 4 types of objects:
41+
42+
[[commit]]
43+
commits::
44+
A commit contains:
45+
+
46+
1. Its *parent commit ID(s)*. The first commit in a repository has 0 parents,
47+
regular commits have 1 parent, and merge commits have 2+ parents.
48+
2. A *commit message*
49+
3. All the *files* in the commit, stored as a *<<tree,tree>>*
50+
4. An *author* and the time the commit was authored
51+
5. A *committer* and the time the commit was committed
52+
+
53+
Here's how an example commit is stored:
54+
+
55+
----
56+
tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a
57+
parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647
58+
author Maya <maya@example.com> 1759173425 -0400
59+
committer Maya <maya@example.com> 1759173425 -0400
60+
61+
Add README
62+
----
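
To make the layout of that example concrete, here is a small Python sketch that splits a commit's text into its header fields and its message. The function name `parse_commit` and the dictionary layout are assumptions made for illustration. It only handles simple one-line headers like the ones shown above, so it is not a complete parser for every commit Git can store.

```python
def parse_commit(raw: str) -> dict:
    """Split a commit's text into header fields and the commit message."""
    # Headers and message are separated by the first blank line.
    headers, _, message = raw.partition("\n\n")
    commit = {"parents": [], "message": message.strip()}
    for line in headers.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            # A commit can have zero, one, or several parent lines.
            commit["parents"].append(value)
        else:
            commit[key] = value
    return commit


example = """\
tree 1b61de420a21a2f1aaef93e38ecd0e45e8bc9f0a
parent 4ccb6d7b8869a86aae2e84c56523f8705b50c647
author Maya <maya@example.com> 1759173425 -0400
committer Maya <maya@example.com> 1759173425 -0400

Add README
"""

commit = parse_commit(example)
print(commit["tree"], commit["parents"], commit["message"])
```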
63+
64+
[[tree]]
65+
trees::
66+
A tree is how Git represents a directory. It lists, for each item in
67+
the tree:
68+
+
69+
1. The *permissions*, for example `100644`
70+
2. The *type*: either <<blob,`blob`>> (a file), `tree` (a directory),
71+
or <<commit,`commit`>> (a Git submodule)
72+
3. The *object ID*
73+
4. The *filename*
74+
+
75+
For example, this is how a tree containing one directory (`src`) and one file
76+
(`README.md`) is stored:
77+
+
78+
----
79+
100644 blob 8728a858d9d21a8c78488c8b4e70e531b659141f README.md
80+
040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src
81+
----
82+
+
83+
*NOTE:* The permissions are in the same format as UNIX permissions, but
84+
the only allowed permissions for files (blobs) are 644 and 755.
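
The four fields of a tree entry can be pulled apart with a few lines of Python. This is only a sketch that reads the human-readable listing shown above (not Git's binary storage format), and the names `TreeEntry` and `parse_tree_listing` are invented for the example.

```python
from typing import NamedTuple


class TreeEntry(NamedTuple):
    mode: str       # the "permissions" column, e.g. 100644
    type: str       # blob, tree, or commit
    object_id: str
    name: str


def parse_tree_listing(text: str) -> list[TreeEntry]:
    """Parse lines of the form '<mode> <type> <id> <name>'."""
    # maxsplit=3 keeps any spaces inside the filename intact.
    return [TreeEntry(*line.split(None, 3))
            for line in text.splitlines() if line.strip()]


listing = """\
100644 blob 8728a858d9d21a8c78488c8b4e70e531b659141f README.md
040000 tree 89b1d2e0495f66d6929f4ff76ff1bb07fc41947d src
"""

entries = parse_tree_listing(listing)
directories = [entry.name for entry in entries if entry.type == "tree"]
print(directories)
```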
85+
86+
[[blob]]
87+
blobs::
88+
A blob is how Git represents a file. A blob object contains the
89+
file's contents.
90+
+
91+
Storing a new blob for every new version of a file can get big, so
92+
`git gc` periodically compresses objects into packfiles in `.git/objects/pack`.
93+
94+
[[tag-object]]
95+
tag objects::
96+
Tag objects (also known as "annotated tags") contain:
97+
+
98+
1. The *tagger* and tag date
99+
2. A *tag message*, similar to a commit message
100+
3. The *ID* of the object (often a commit) that they reference
101+
102+
[[references]]
103+
REFERENCES
104+
----------
105+
106+
References are a way to give a name to a commit.
107+
It's easier to remember "the changes I'm working on are on the `turtle`
108+
branch" than "the changes are in commit bb69721404348e".
109+
Git often uses "ref" as shorthand for "reference".
110+
111+
References that you create are stored in the `.git/refs` directory,
112+
and Git has a few special internal references like `HEAD` that are stored
113+
in the base `.git` directory.
114+
115+
References can either be:
116+
117+
1. References to an object ID, usually a <<commit,commit>> ID
118+
2. References to another reference. This is called a "symbolic reference".
119+
120+
Git handles references differently based on which subdirectory of
121+
`.git/refs` they're stored in.
122+
Here are the main types:
123+
124+
[[branch]]
125+
branches: `.git/refs/heads/<name>`::
126+
A branch is a name for a commit ID.
127+
That commit is the latest commit on the branch.
128+
Branches are stored in the `.git/refs/heads/` directory.
129+
+
130+
To get the history of commits on a branch, Git will start at the commit
131+
ID the branch references, and then look at the commit's parent(s),
132+
the parent's parent, etc.
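
The walk described above can be sketched in Python using a plain dictionary in place of the object database. The short commit IDs (`a1`, `b2`, and so on) and the `history` function are hypothetical, and the visiting order here is a simple first-parent-first traversal, which is not necessarily the order `git log` would print.

```python
def history(start: str, parents: dict[str, list[str]]) -> list[str]:
    """Return every commit reachable from `start` by following parents."""
    seen: set[str] = set()
    order: list[str] = []
    stack = [start]
    while stack:
        commit_id = stack.pop()
        if commit_id in seen:
            continue  # already reached through another path
        seen.add(commit_id)
        order.append(commit_id)
        # Push parents in reverse so the first parent is visited first.
        stack.extend(reversed(parents.get(commit_id, [])))
    return order


# Hypothetical history: d4 is a merge of c3 and b2, which both build on a1.
parents = {"d4": ["c3", "b2"], "c3": ["a1"], "b2": ["a1"], "a1": []}
branches = {"main": "d4"}

print(history(branches["main"], parents))  # ['d4', 'c3', 'a1', 'b2']
```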
133+
134+
[[tag]]
135+
tags: `.git/refs/tags/<name>`::
136+
A tag is a name for a commit ID, tag object ID, or other object ID.
137+
Tags are stored in the `.git/refs/tags/` directory.
138+
+
139+
Even though branches and tags are both "a name for a commit ID", Git
140+
treats them very differently.
141+
Branches are expected to be regularly updated as you work on the branch,
142+
but it's expected that a tag will never change after you create it.
143+
144+
[[HEAD]]
145+
HEAD: `.git/HEAD`::
146+
`HEAD` is where Git stores your current <<branch,branch>>.
147+
`HEAD` is normally a symbolic reference to your current branch, for
148+
example `ref: refs/heads/main` if your current branch is `main`.
149+
`HEAD` can also be a direct reference to a commit ID,
150+
which is called "detached HEAD state".
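
Here is a minimal Python sketch of how a symbolic reference such as `HEAD` could be followed to a commit ID. The `refs` dictionary stands in for wherever Git actually stores references, and the `resolve` function and its `max_depth` limit are assumptions for illustration rather than Git's real implementation.

```python
def resolve(name: str, refs: dict[str, str], max_depth: int = 5) -> str:
    """Follow symbolic references until an object ID is reached."""
    for _ in range(max_depth + 1):
        value = refs[name]
        if not value.startswith("ref: "):
            return value  # a direct reference: this is the object ID
        name = value[len("ref: "):]  # symbolic: follow it to the next ref
    raise ValueError("too many levels of symbolic references")


# HEAD points at the branch `main`, which points at a commit.
refs = {
    "HEAD": "ref: refs/heads/main",
    "refs/heads/main": "4ccb6d7b8869a86aae2e84c56523f8705b50c647",
}
print(resolve("HEAD", refs))

# In detached HEAD state, HEAD holds the commit ID directly.
detached = {"HEAD": "4ccb6d7b8869a86aae2e84c56523f8705b50c647"}
print(resolve("HEAD", detached))
```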
151+
152+
[[remote-tracking-branch]]
153+
remote-tracking branches: `.git/refs/remotes/<remote>/<branch>`::
154+
A remote-tracking branch is a name for a commit ID.
155+
It's how Git stores the last-known state of a branch in a remote
156+
repository. `git fetch` updates remote-tracking branches. When
157+
`git status` says "you're up to date with origin/main", it's looking at
158+
this.
159+
160+
[[other-refs]]
161+
Other references::
162+
Git tools may create references in any subdirectory of `.git/refs`.
163+
For example, linkgit:git-stash[1], linkgit:git-bisect[1],
164+
and linkgit:git-notes[1] all create their own references
165+
in `.git/refs/stash`, `.git/refs/bisect`, etc.
166+
Third-party Git tools may also create their own references.
167+
+
168+
Git may also create references in the base `.git` directory
169+
other than `HEAD`, like `ORIG_HEAD`.
170+
171+
*NOTE:* As an optimization, references may be stored as packed
172+
refs instead of in `.git/refs`. See linkgit:git-pack-refs[1].
173+
174+
[[index]]
175+
THE INDEX
176+
---------
177+
178+
The index, also known as the "staging area", contains the current staged
179+
version of every file in your Git repository. When you commit, the files
180+
in the index are used as the files in the next commit.
181+
182+
Unlike a tree, the index is a flat list of files.
183+
Each index entry has 4 fields:
184+
185+
1. The *permissions*
186+
2. The *<<blob,blob>> ID* of the file
187+
3. The *filename*
188+
4. The *number*. This is normally 0, but if there's a merge conflict
189+
there can be multiple versions (with numbers 1, 2, and 3)
190+
of the same filename in the index.
191+
192+
It's extremely uncommon to look at the index directly: normally you'd
193+
run `git status` to see a list of changes between the index and <<HEAD,HEAD>>.
194+
But you can use `git ls-files --stage` to see the index.
195+
Here's the output of `git ls-files --stage` in a repository with 2 files:
196+
197+
----
198+
100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0 README.md
199+
100644 665c637a360874ce43bf74018768a96d2d4d219a 0 src/hello.py
200+
----
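
If you ever want to process that output in a script, the Python sketch below splits each line into the four fields and lists any files with a nonzero number, which is how conflicted files show up. The function name `parse_index_listing` and the dictionary keys are made up for this example.

```python
def parse_index_listing(text: str) -> list[dict]:
    """Parse `git ls-files --stage` style lines into index entries."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Fields: permissions, blob ID, number, filename.
        mode, blob, number, name = line.split(None, 3)
        entries.append(
            {"mode": mode, "blob": blob, "number": int(number), "name": name}
        )
    return entries


listing = """\
100644 8728a858d9d21a8c78488c8b4e70e531b659141f 0 README.md
100644 665c637a360874ce43bf74018768a96d2d4d219a 0 src/hello.py
"""

entries = parse_index_listing(listing)
# Files in a merge conflict have entries with a nonzero number.
conflicted = sorted({e["name"] for e in entries if e["number"] != 0})
print([e["name"] for e in entries], conflicted)
```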
201+
202+
[[reflogs]]
203+
REFLOGS
204+
-------
205+
206+
Git stores the history of branch, tag, and HEAD refs in a reflog
207+
(you should read "reflog" as "ref log"). Not every ref is logged by
208+
default, but any ref can be logged.
209+
210+
Each reflog entry has:
211+
212+
1. *Before/after commit IDs*
213+
2. *User* who made the change, for example `Maya <maya@example.com>`
214+
3. *Timestamp*
215+
4. *Log message*, for example `pull: Fast-forward`
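
One way to picture a reflog is as a list that gets a new entry every time a reference changes. The Python sketch below models that idea in memory. The names `ReflogEntry` and `update_ref`, the new commit ID made of repeated `a` characters, and the dictionaries are all hypothetical; the all-zero ID is used here to stand for "this reference did not exist before".

```python
from dataclasses import dataclass

NULL_ID = "0" * 40  # placeholder for "no previous value"


@dataclass
class ReflogEntry:
    old_id: str
    new_id: str
    user: str
    timestamp: int
    message: str


def update_ref(refs, reflogs, name, new_id, user, timestamp, message):
    """Point `name` at `new_id` and record the change in its reflog."""
    old_id = refs.get(name, NULL_ID)
    refs[name] = new_id
    reflogs.setdefault(name, []).append(
        ReflogEntry(old_id, new_id, user, timestamp, message)
    )


refs = {"refs/heads/main": "4ccb6d7b8869a86aae2e84c56523f8705b50c647"}
reflogs: dict[str, list[ReflogEntry]] = {}

update_ref(refs, reflogs, "refs/heads/main", "a" * 40,
           "Maya <maya@example.com>", 1759173425, "pull: Fast-forward")

entry = reflogs["refs/heads/main"][-1]
print(entry.old_id, "->", entry.new_id, entry.message)
```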
216+
217+
Reflogs only log changes made in your local repository.
218+
They are not shared with remotes.
219+
220+
GIT
221+
---
222+
Part of the linkgit:git[1] suite
