August 2017
Volume 32 Number 8
[DevOps]
Git Internals: Architecture and Index Files
By Jonathan Waldman | August 2017
In my last article (msdn.com/magazine/mt809117), I showed how Git uses a directed acyclic graph (DAG) to organize a repo’s commit objects. I also explored the blob, tree and tag objects to which commit objects can refer. I concluded the article with an introduction to branching, including the distinction between HEAD and head. That article is a prerequisite to this one, in which I’ll discuss the Git “three-tree” architecture and the importance of its index file. Understanding these additional Git internals will build on the foundational knowledge that will make you a more effective Git user and will provide new insights as you explore various Git operations fronted by the graphical Git tooling in the Visual Studio IDE.
Recall from the last article that Visual Studio communicates with Git using a Git API, and that the Visual Studio IDE Git tooling abstracts away the complexity and capabilities of the underlying Git engine. That’s a boon for developers who want to implement a version-control workflow without needing to rely on the Git command-line interface (CLI). Alas, the otherwise helpful Git abstractions of the IDE can sometimes lead to confusion. For example, ponder the basic workflow of adding a project to Git source control, modifying project files, staging them and then committing the staged files. To do that, you open the Team Explorer Changes pane to view the list of changed files and then you select the ones you want to stage. Consider the leftmost image in Figure 1, which shows that I changed two files in the working directory (Marker 1).
Figure 1 The Team Explorer Changes Pane Can Show the Same File in Its Changes and Staged Changes Sections
In the next image to the right, I staged one of those changed files: Program.cs (Marker 2). When I did that, Program.cs appeared to have “moved” from the Changes list to the Staged Changes list. If I further modify and then save the working directory’s copy of Program.cs, it continues to appear in the Staged Changes section (Marker 3), but it also appears in the Changes section (Marker 4)! Without understanding what Git is doing behind the scenes, you might be flummoxed until you figure out that two “copies” of Program.cs exist: one in the working folder and one in the Git internal database of objects. Even if you realize that, you might not have any insight as to what would happen when you unstage the staged copy, try to stage the second changed copy of Program.cs, undo changes to the working copy or switch branches.
To truly grasp what Git is doing as you stage, unstage, undo, commit and check out files, you first must understand how Git is architected.
The Git Three-Tree Architecture
Git implements a three-tree architecture (a “tree” in this context refers to a directory structure and files). Working from left to right in Figure 2, the first tree is the collection of files and folders in the working directory (the OS directory that contains the hidden .git folder); the second tree is typically stored in a single binary file called index, located in the root of the .git folder; the third tree is composed of Git objects that represent the DAG (recall that SHA-1-named Git objects are located in two-hex-digit-named folders under .git\objects and can also be stored in “pack” files located in .git\objects\pack and in file paths defined by the .git\objects\info\alternates file). Keep in mind that the Git repo is defined by all files that sit in the .git folder. Often, people refer to the DAG as the Git repo, and that’s not quite accurate: The index and the DAG are both contained in the Git repo.
Figure 2 The Git Three-Tree Architecture Leverages the All-Important Index File for Its Smart and Efficient Performance
Notice that while each tree stores a directory structure and files, each leverages different data structures in order to retain tree-specific metadata and to optimize storage and retrieval. The first tree (the working directory tree, also called “the working tree”) is plainly the OS files and folders (no special data structures there, other than what’s at the OS level) and serves the needs of the software developer and Visual Studio; the second tree (the Git index) straddles the working directory and the commit objects that form the DAG, thereby helping Git perform speedy working-directory file-content comparisons and quick commits; the third tree (the DAG) makes it possible for Git to track a history of commits, as discussed in the previous article. In its capacity as a robust version control system, Git adds helpful metadata to the items it stores in the index and in commit objects. For example, the metadata it stores in the index helps it detect changes to files in the working directory, while the metadata it stores in commit objects helps it track who issued the commit and for what reason.
To review the three trees in the three-tree architecture and to put some perspective around the remainder of this article’s focus: You already know how the working-directory tree functions, because it’s actually the OS file system you’re already well-versed in using. And if you read my earlier article, you should have good working knowledge of the DAG. Thus, at this point, the missing link is the index tree (hereafter, “the index”) that straddles the working directory and the DAG. In fact, the index plays such an important role that it’s the sole subject of the remainder of this article.
How the Index Works
You might have heard the friendly advice that the index is synonymous with the “staging area.” While that’s somewhat accurate, to speak of it that way understates its true role, which is not only to support a staging area, but also to facilitate the ability of Git to detect changes to files in your working directory; to mediate the branch-merge process, so you can resolve conflicts on a file-by-file basis and safely abort the merge at any time; and to convert staged files and folders into tree objects whose references are written to the next commit object. Git also uses the index to retain information about files in the working tree and about objects retrieved from the DAG, thereby using the index as a type of cache. Let’s investigate the index more thoroughly.
The index implements its own self-contained file system, giving it the ability to store references to folders and files along with metadata about them. How and when Git updates this index depends on the kind of Git command issued and the command options specified (if you’re so inclined, you can even use the Git update-index plumbing command to manage the index yourself), so exhaustive coverage here isn’t possible. However, as you work with the Visual Studio Git tooling, it’s helpful to be aware of the primary ways in which Git updates the index and in which Git uses information stored in the index. Figure 3 shows that Git updates the index with working directory data when you stage a file, and it updates the index with DAG data when you initiate a merge (if there are merge conflicts), clone or pull, or switch branches. On the other hand, Git relies on information stored in the index when it updates the DAG after you issue a commit, and when it updates the working directory after you clone or pull, or after you switch branches. Once you realize that Git relies on the index and that the index straddles so many Git operations, you’ll begin to appreciate the advanced Git commands that modify the index, effectively empowering you to finesse how Git operates.
Figure 3 Primary Git Actions That Update the Index (Green) and Git Actions That Rely on What the Index Contains (Red)
Let’s create a new file in the working directory to see what happens to it as it’s written to the index. As soon as you stage that file, Git creates a header using this string-concatenation formula:
blob{space}{file-length in bytes}{null-termination character}
Git then concatenates the header to the beginning of the file contents. Thus, for a text file containing the string “Hello,” the header + file contents would generate a string that looks like this (keep in mind there’s a null character before the letter “H”):
blob 5Hello
To see that more clearly, here's the hexadecimal version of that string:
62 6C 6F 62 20 35 00 48 65 6C 6C 6F
Git then computes an SHA-1 for the string:
5ab2f8a4323abafb10abb68657d9d39f1a775057
Git next inspects the existing index to determine if an entry for that folder\file name already exists with the same SHA-1. If so, it locates the blob object in the .git\objects folder and updates its date-modified time (Git will never overwrite objects that already exist in the repo; it updates the last-modified date so as to delay this newly added object from being considered for garbage collection). Otherwise, it uses the first two characters of the SHA-1 string as the directory name in .git\objects and the remaining 38 characters to name the blob file before zlib-compressing it and writing its contents. In my example, Git would create a folder in .git\objects called 5a and then write the blob object into that folder as a file with the name b2f8a4323abafb10abb68657d9d39f1a775057.
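The header, hash and storage steps just described can be sketched in a few lines of Python. This is a minimal illustration that assumes the scheme exactly as described; the function names git_blob_object and loose_object_path are hypothetical, and the sketch only computes values rather than writing anything into a repo.

```python
import hashlib
import zlib
from typing import Tuple


def git_blob_object(content: bytes) -> Tuple[str, bytes]:
    """Return (SHA-1 hex digest, raw object bytes) for a blob holding content."""
    # Header formula: blob{space}{file-length in bytes}{NUL}
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    raw = header + content
    return hashlib.sha1(raw).hexdigest(), raw


def loose_object_path(sha1_hex: str) -> str:
    """First two hex digits name the folder; the remaining 38 name the file."""
    return ".git/objects/" + sha1_hex[:2] + "/" + sha1_hex[2:]


sha1_hex, raw = git_blob_object(b"Hello")
print(raw.hex(" ").upper())          # hex dump of header + file contents
print(sha1_hex)                      # SHA-1 of header + file contents
print(loose_object_path(sha1_hex))   # where a loose object would be stored

# The bytes stored on disk are the zlib-compressed header + contents.
compressed = zlib.compress(raw)
assert zlib.decompress(compressed) == raw
```

If the description holds, the first line printed reproduces the hexadecimal string shown earlier, and the digest should agree with the SHA-1 given for the "Hello" example.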
When Git creates a blob object in this manner, you might be surprised that one expected file property is conspicuously missing from the blob object: the file name! That's by design, however. Recall that Git is a content addressable file system and, as such, it manages SHA-1-named blob objects—not files. Each blob object is normally referenced by at least one tree object, and tree objects in turn are normally referenced by commit objects. Ultimately, Git tree objects express the folder structure of the files you stage. But Git doesn't create those tree objects until you issue a commit. Therefore, you can conclude that if Git uses only the index to prepare a commit object, it also must capture the file-path references for each blob in the index—and that's exactly what it does. In fact, even if two blobs have the same SHA-1 value, as long as each maps to a different file name or different path/file value, each will appear as a separate entry in the index.
Git also saves file metadata with each blob object it writes to the index, such as the file's create and modified dates. Git leverages this information to efficiently detect changes to files in your working directory using file-date comparisons and heuristics rather than brute-force re-computing the SHA-1 values for each file in the working directory. Such a strategy speeds up populating the Team Explorer Changes pane, as well as the output of the porcelain Git status command.
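The idea of consulting cached metadata before hashing can be sketched as follows. This is a deliberately simplified Python illustration, not Git's actual algorithm: it compares only the modified time and the length, whereas Git's real heuristics consider more fields and edge cases. CachedStat, snapshot and probably_unchanged are hypothetical names.

```python
import os
import tempfile
from typing import NamedTuple


class CachedStat(NamedTuple):
    """Hypothetical stand-in for the metadata an index entry keeps for a file."""
    mtime_ns: int
    size: int


def snapshot(path: str) -> CachedStat:
    """Record the file's current modified time and length."""
    st = os.stat(path)
    return CachedStat(st.st_mtime_ns, st.st_size)


def probably_unchanged(path: str, cached: CachedStat) -> bool:
    """Cheap test: identical metadata means the file need not be re-hashed."""
    return snapshot(path) == cached


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "Program.cs")
    with open(path, "w") as f:
        f.write("class Program { }")
    cached = snapshot(path)                   # taken when the file is "staged"
    print(probably_unchanged(path, cached))   # True: nothing touched the file
    with open(path, "a") as f:
        f.write("\n// edited")
    print(probably_unchanged(path, cached))   # False: the length changed
```

Only when the cheap comparison fails would a tool need to fall back to hashing the file contents.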
Once armed with an index entry for a working-directory file along with its associated metadata, Git is said to "track" the file because it can readily compare its copy of the file with the copy that remains in the working directory. Technically, a tracked file is one that also exists in the working directory and is to be included in the next commit. This is in contrast to untracked files, of which there are two types: files that are in the working directory but not in the index, and files that are explicitly designated as not to be tracked (see the Index Extensions section). To summarize, the index gives Git the power to determine which files are tracked, which are not tracked, and which should not be tracked.
To better understand the specific contents of the index, let's use a concrete example by starting with a new Visual Studio project. The complexity of this project isn't so important -- you just need a couple of files to adequately illustrate what's going on. Create a new console application called MSDNConsoleApp and check the Create directory for solution and the Create new Git repository checkboxes. Click OK to create the solution.
I'll issue some Git commands in a moment, so if you want to run them on your system, open a command prompt window in the working directory and keep that window within reach as you follow along. One way to quickly open a Git command window for a particular Git repo is to access the Visual Studio Team menu and select Manage Connections. You'll see a list of local Git repositories, along with the path to that repo's working directory. Right-click the repo name and select Open Command Prompt to launch a window into which you can enter Git CLI commands.
Once you create the solution, open the Team Explorer Branches pane (Figure 4, Marker 1) to see that Git created a default branch called master (Marker 2). Right-click the master branch (Marker 2) and select View History (Marker 3) to view the two commits Visual Studio created on your behalf (Marker 4). The first has the commit message "Add .gitignore and .gitattributes"; the second has the commit message "Add project files."
Figure 4 Viewing the History in Order to See What Visual Studio Does When You Create a New Project
Open the Team Explorer Changes pane. Visual Studio relies on the Git API to populate items in this window -- it's the Visual Studio version of the Git status command. Currently, this window indicates there are no unstaged changes in the working directory. The way Git makes this determination is to compare each index entry with each working directory file. With the index's file entries and associated file metadata, Git has all the information it needs to determine whether you've made any changes, additions, deletions, or if you renamed any files in the working directory (excluding any files mentioned in the .gitignore file).
So the index plays a key role in making Git smart about differences between your working directory tree and the commit object pointed to by HEAD. To learn a bit more about what kind of information the index provides to the Git engine, go to the command-line window you opened earlier and issue the following plumbing command:
git ls-files --stage
You can issue this command at any time to generate a complete list of files currently in the index. On my system, this produces the following output:
100644 1ff0c423042b46cb1d617b81efb715defbe8054d 0 .gitattributes
100644 3c4efe206bd0e7230ad0ae8396a3c883c8207906 0 .gitignore
100644 f18cc2fac0bc0e4aa9c5e8655ed63fa33563ab1d 0 MSDNConsoleApp.sln
100644 88fa4027bda397de6bf19f0940e5dd6026c877f9 0 MSDNConsoleApp/App.config
100644 d837dc8996b727d6f6d2c4e788dc9857b840148a 0 MSDNConsoleApp/MSDNConsoleApp.csproj
100644 27e0d58c613432852eab6b9e693d67e5c6d7aba7 0 MSDNConsoleApp/Program.cs
100644 785cfad3244d5e16842f4cf8313c8a75e64adc38 0 MSDNConsoleApp/Properties/AssemblyInfo.cs
The first column of output is a Unix OS file mode, in octal. Git doesn't support the full range of file-mode values, however. You're likely to only ever see 100644 (for non-EXE files) and 100755 (for Unix-based EXE files -- Git for Windows also uses 100644 for executable file types). The second column is the SHA-1 value for the file. The third column represents the merge stage value for the file -- 0 for no conflict or 1, 2 or 3 when a merge conflict exists. Finally, notice that the path and file name for each of the seven blob objects are stored in the index. Git uses the path value when it builds tree objects ahead of the next commit (more on that in a moment).
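If you want to work with this listing programmatically, each line splits cleanly into the four columns just described. The following Python sketch is only an illustration; StagedEntry and parse_stage_line are hypothetical names, and the sample line is taken from the output above.

```python
from typing import NamedTuple


class StagedEntry(NamedTuple):
    mode: int    # file mode, parsed from octal
    sha1: str    # blob object's SHA-1
    stage: int   # merge stage: 0, or 1 to 3 during a conflict
    path: str    # path/file name as stored in the index


def parse_stage_line(line: str) -> StagedEntry:
    """Split one line of `git ls-files --stage` output into its four columns."""
    # maxsplit=3 keeps any spaces inside the path intact.
    mode, sha1, stage, path = line.strip().split(None, 3)
    return StagedEntry(int(mode, 8), sha1, int(stage), path)


entry = parse_stage_line(
    "100644 27e0d58c613432852eab6b9e693d67e5c6d7aba7 0 MSDNConsoleApp/Program.cs"
)
print(oct(entry.mode), entry.stage, entry.path)
```

Splitting on any run of whitespace means the sketch accepts either a space or a tab in front of the path.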
Now, let's examine the index file itself. Because it's a binary file, I'm going to use HexEdit 4 (a freeware hex editor available at hexedit.com) to view its contents (Figure 5 shows an excerpt).
Figure 5 A Hex Dump of the Git Index File for the Project
Figure 6 The Git Index Header Data Format
Index File - Header Entry | ||
00 - 03 (4 bytes) | DIRC | Fixed header for a directory cache entry. All index files begin with this entry. |
04 - 07 (4 bytes) | Version | Index version number (Git for Windows currently uses version 2). |
08 - 11 (4 bytes) | Number of entries | As a 4-byte value, the index supports up to 4,294,967,295 entries! |
The first 12 bytes of the index contain the header (see Figure 6). The first 4 bytes will always contain the characters DIRC (short for directory cache) -- this is one reason the Git index is often referred to as the cache. The next 4 bytes contain the index version number, which defaults to 2 unless you're using certain features of Git (such as sparse checkout), in which case it might be set to version 3 or 4. The final 4 bytes contain the number of file entries contained further down in the index.
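Reading this header takes only a few lines of code. The Python sketch below is a minimal illustration: parse_index_header is a hypothetical name, the sample bytes are synthetic, and the big-endian byte order is inferred from the way multi-byte values appear in the hex dump.

```python
import struct
from typing import Tuple


def parse_index_header(data: bytes) -> Tuple[int, int]:
    """Return (version, entry count) from the first 12 bytes of an index file."""
    # Three big-endian fields: 4-byte signature, 32-bit version, 32-bit count.
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != b"DIRC":
        raise ValueError("not a Git index: missing DIRC signature")
    return version, count


# Synthetic header for illustration: version 2 with seven entries.
sample = b"DIRC" + struct.pack(">II", 2, 7)
print(parse_index_header(sample))   # (2, 7)
```

To try it on a real repo, pass in the contents of the index file, for example open(".git/index", "rb").read().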
Following the 12-byte header is a list of n index entries, where n matches the number of entries described by the index header. The format for each index entry is presented in Figure 7. Git sorts index entries in ascending order based on the path/file name field.
Figure 7 The Git Index File-Index Entry Data Format
Index File - Index Entry | ||
4 bytes | 32-bit created time in seconds | Number of seconds since Jan. 1, 1970, 00:00:00. |
4 bytes | 32-bit created time - nanosecond component | Nanosecond component of the created time in seconds value. |
4 bytes | 32-bit modified time in seconds | Number of seconds since Jan. 1, 1970, 00:00:00. |
4 bytes | 32-bit modified time - nanosecond component | Nanosecond component of the modified time in seconds value. |
4 bytes | device | Metadata associated with the file -- these originate from file attributes used on the Unix OS. |
4 bytes | inode | |
4 bytes | mode | |
4 bytes | user id | |
4 bytes | group id | |
4 bytes | file content length | Number of bytes of content in the file. |
20 bytes | SHA-1 | Corresponding blob object's SHA-1 value. |
2 bytes | Flags | (High to low bits) 1-bit: assume-valid/assume-unchanged flag; 1-bit: extended flag (must be 0 for versions less than 3; if 1 then an additional 2 bytes follow before the path\file name); 2-bit: merge stage; 12-bit: path\file name length (if less than 0xFFF) |
2 bytes (version 3 or higher) | Flags | (High to low bits) 1-bit: future use; 1-bit: skip-worktree flag (sparse checkout); 1-bit: intent-to-add flag (git add -N); 13-bit: unused, must be zero |
Variable Length | Path/file name | NUL terminated |
The first 8 bytes represent the time the file was created as an offset from midnight of Jan. 1, 1970. The second 8 bytes represent the time the file was modified as an offset from midnight of Jan. 1, 1970. Next are five 4-byte values (device, inode, mode, user id and group id) of file-attribute metadata related to the host OS. The only value used under Windows is the mode, which most often will be the octal 100644 I mentioned earlier when showing output from the ls-files command (this converts to the hexadecimal value 81A4H, which you can see at position 26H in Figure 5).
Following the metadata is the 4-byte length of the file contents. In Figure 5, this value starts at 030H, which shows 00 00 0A 15 (2,581 decimal) -- the length of the .gitattributes file on my system:
05/08/2017 09:24 PM <DIR> .
05/08/2017 09:24 PM <DIR> ..
05/08/2017 09:24 PM 2,581 .gitattributes
05/08/2017 09:24 PM 4,565 .gitignore
05/08/2017 09:24 PM <DIR> MSDNConsoleApp
05/08/2017 09:24 PM 1,009 MSDNConsoleApp.sln
3 File(s) 8,155 bytes
3 Dir(s) 92,069,982,208 bytes free
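Conversions like these are easy to double-check. The short Python snippet below (an illustration only) prints the octal mode as a 4-byte hexadecimal value and reads the four length bytes as a big-endian integer, which is the byte order the dump implies.

```python
# Octal file mode from the ls-files output, shown as a 4-byte hex value.
mode = 0o100644
print(f"{mode:08X}")   # 000081A4

# The four length bytes seen in the hex dump, read as a big-endian integer.
length = int.from_bytes(bytes.fromhex("00 00 0A 15"), "big")
print(length)          # 2581
```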
At offset 034H is the 20-byte SHA-1 value for the blob object:
1ff0c423042b46cb1d617b81efb715defbe8054d.
Remember, this SHA-1 points to the blob object that contains the file contents for the file in question: .gitattributes.
At 048H is a 2-byte value containing two 1-bit flags, a 2-bit merge-stage value, and a 12-bit length of the path/file name for the current index entry. Of the two 1-bit flags, the high-order bit designates whether the index entry has its assume-unchanged flag set (typically done using the Git update-index plumbing command); the low-order bit indicates whether another two bytes of data precede the path\file name entry (this bit can be 1 only for index versions 3 and higher). The next 2 bits hold a merge-stage value from 0 to 3, as described earlier. The 12-bit value contains the length of the path\file name string.
If the extended flag was set, a 2-byte value holds the skip-worktree and intent-to-add bit flags, along with filler placeholders.
Finally, a variable length sequence of bytes contains the path\file name. This value is terminated with one or more NUL characters. Following that termination is the next index entry or one or more index extension entries (as you'll see shortly).
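Putting the entry layout together, here is a minimal Python sketch that decodes one index entry, including the flag bits. It's an illustration under stated assumptions rather than a complete reader: IndexEntry and parse_entry are hypothetical names, the sample entry is synthetic (zeroed timestamps, with the .gitattributes values from this article), only index versions 2 and 3 are handled, and the rule that NUL padding rounds each entry up to a multiple of 8 bytes comes from Git's own index-format notes rather than from the text above.

```python
import struct
from typing import NamedTuple

# Ten 32-bit fields, the 20-byte SHA-1 and the 2-byte flags: 62 bytes in all.
ENTRY_FIXED = struct.Struct(">10I20sH")


class IndexEntry(NamedTuple):
    mode: int
    size: int
    sha1: str
    assume_unchanged: bool
    extended: bool
    stage: int
    path: str
    next_offset: int


def parse_entry(data: bytes, offset: int) -> IndexEntry:
    """Decode one version 2 or 3 index entry that starts at offset."""
    fields = ENTRY_FIXED.unpack_from(data, offset)
    mode, size, sha1, flags = fields[6], fields[9], fields[10], fields[11]
    pos = offset + ENTRY_FIXED.size
    extended = bool(flags & 0x4000)
    if extended:
        pos += 2  # skip the extra 2-byte flags field (version 3 and higher)
    end = data.index(b"\x00", pos)  # the path is NUL terminated
    # Assumption: padding rounds the entry up to a multiple of 8 bytes,
    # and at least one NUL is always present.
    next_offset = offset + (end - offset + 8) // 8 * 8
    return IndexEntry(
        mode=mode,
        size=size,
        sha1=sha1.hex(),
        assume_unchanged=bool(flags & 0x8000),
        extended=extended,
        stage=(flags >> 12) & 0x3,
        path=data[pos:end].decode("utf-8"),
        next_offset=next_offset,
    )


# Synthetic entry: 12 placeholder header bytes, then one padded entry.
path = b".gitattributes"
sha1 = bytes.fromhex("1ff0c423042b46cb1d617b81efb715defbe8054d")
fixed = ENTRY_FIXED.pack(0, 0, 0, 0, 0, 0, 0o100644, 0, 0, 2581, sha1, len(path))
body = fixed + path
sample = b"\x00" * 12 + body + b"\x00" * (8 - len(body) % 8)

entry = parse_entry(sample, 12)
print(entry.path, oct(entry.mode), entry.size, entry.stage)
```

The next_offset field is where a reader would look for the following entry, so calling parse_entry in a loop for the number of entries given in the header walks the whole list.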
Earlier, I mentioned that Git doesn't build tree objects until you commit what's been staged. What that means is the index starts out with only path/file names and references to blob objects. As soon as you issue a commit, however, Git updates the index so it contains references to the tree objects it created during the last commit. If those directory references still exist in your working directory during the next commit, the cached tree object references can be used to reduce the work Git needs to do during the next commit. As you can see, the role of the index is multifaceted, and that's why it's described as an index, staging area and cache.
The index entry shown in Figure 7 supports only blob object references. To store tree objects, Git uses an extension.
Index Extensions
The index can include extension entries that store specialized data streams to provide additional information for the Git engine to consider as it monitors files in the working directory and when it prepares the next commit. To cache tree objects created during the last commit, Git adds a tree extension object to the index for the working directory's root as well as for each sub-directory.
Figure 5, Marker 2, shows the final bytes of the index and captures the tree objects that are stored in the index. Figure 8 shows the format for the tree-extension data.
Figure 8 The Git Index File Tree-Extension Object Data Format
Index File - Cached Tree-Extension Header | ||
4 bytes | TREE | Fixed signature for a cached tree-extension entry. |
4 bytes | Length | 32-bit number representing the length of the TREE extension data. |
Cached Tree-Extension Entry | ||
Variable | Path | NUL-terminated path string (null only for the root tree). |
ASCII number | Number of entries | ASCII number representing the number of entries in the index covered by this tree entry. |
1 byte | 20H (space character) | |
ASCII number | Number of subtrees | ASCII number representing the number of subtrees this tree has. |
1 byte | 0AH (linefeed character) | |
20 bytes | Tree object's SHA-1 | SHA-1 value of the tree object this entry produces. |
The tree-extension data header, which appears at offset 284H, is composed of the string "TREE" (marking the start of the cached tree extension data) followed by a 32-bit value that indicates the length of the extension data that follows. Next are entries for each tree entry: The first entry is a variable-length null-terminated string value for the tree path (or simply NUL for the root tree). The following value is an ASCII value, so it is to be read as the "7" you see in the hex editor -- the number of blob entries covered by the current tree (because this is the root tree, it has the same number of entries you saw earlier when issuing the git ls-files --stage command). The next character is a space, followed by another ASCII number to represent the number of subtrees the current tree has.
The root tree for our project has only 1 subtree: MSDNConsoleApp. This value is followed by a linefeed character, then the SHA-1 for the tree. The SHA-1 starts at offset 291H, beginning with 0d21e2.
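The cached-tree layout lends itself to a small parser. The Python sketch below is an illustration only: parse_cached_trees is a hypothetical name, and the sample payload is reconstructed from values that appear in this article (the root counts, the file listing and the tree SHA-1 values) rather than copied from the hex dump, so the counts for the two subdirectories are inferred. The handling of an entry count of -1 as an invalidated entry with no SHA-1 is an assumption taken from Git's index-format notes.

```python
from typing import List, Optional, Tuple

CachedTree = Tuple[str, int, int, Optional[str]]


def parse_cached_trees(payload: bytes) -> List[CachedTree]:
    """Parse TREE extension data (the bytes after its 8-byte header)."""
    trees = []
    pos = 0
    while pos < len(payload):
        nul = payload.index(b"\x00", pos)      # end of the path string
        space = payload.index(b" ", nul + 1)   # ends the ASCII entry count
        lf = payload.index(b"\n", space + 1)   # ends the ASCII subtree count
        path = payload[pos:nul].decode("utf-8")
        entry_count = int(payload[nul + 1:space])
        subtree_count = int(payload[space + 1:lf])
        pos = lf + 1
        sha1 = None
        if entry_count >= 0:                   # assumption: -1 carries no SHA-1
            sha1 = payload[pos:pos + 20].hex()
            pos += 20
        trees.append((path, entry_count, subtree_count, sha1))
    return trees


# Reconstructed sample: root tree, MSDNConsoleApp, then Properties.
payload = (
    b"\x00" + b"7 1\n"
    + bytes.fromhex("0d21e2f7f760f77ead2cb85cc128efb13f56401d")
    + b"MSDNConsoleApp\x00" + b"4 1\n"
    + bytes.fromhex("c7c367f2d5688dddc25e59525cc6b8efd0df914d")
    + b"Properties\x00" + b"1 0\n"
    + bytes.fromhex("2723ceb04eda3051abf913782fadeebc97e0123c")
)

for tree in parse_cached_trees(payload):
    print(tree)
```

Because the parser skips each SHA-1 by its fixed 20-byte length, hash bytes that happen to equal NUL, space or linefeed can't confuse the search for the next delimiter.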
Let's confirm that 0d21e2 is actually the root tree SHA-1. To do that, go to the command window and enter:
git log
This displays details of the recent commits:
commit 5192391e9f907eeb47aa38d1c6a3a4ea78e33564
Author: Jonathan Waldman <jonathan.waldman@live.com>
Date: Mon May 8 21:24:15 2017 -0500
Add project files.
commit dc0d3343fa24e912f08bc18aaa6f664a4a020079
Author: Jonathan Waldman <jonathan.waldman@live.com>
Date: Mon May 8 21:24:07 2017 -0500
Add .gitignore and .gitattributes.
The most recent commit is the one with the timestamp 21:24:15, so that's the one that last updated the index. I can use that commit's SHA-1 to find the root-tree SHA-1 value:
git cat-file -p 51923
This generates the following output:
tree 0d21e2f7f760f77ead2cb85cc128efb13f56401d
parent dc0d3343fa24e912f08bc18aaa6f664a4a020079
author Jonathan Waldman <jonathan.waldman@live.com> 1494296655 -0500
committer Jonathan Waldman <jonathan.waldman@live.com> 1494296655 -0500
The preceding tree entry is the root tree object. It confirms that the 0d21e2 value at offset 291H in the index dump is, in fact, the SHA-1 for the root tree object.
The other tree entries appear immediately after the SHA-1 value, starting at offset 2A5H. To confirm the SHA-1 values for cached tree objects under the root tree, run this command:
git ls-tree -r -d master
This displays only the tree objects, recursively on the current branch:
040000 tree c7c367f2d5688dddc25e59525cc6b8efd0df914d MSDNConsoleApp
040000 tree 2723ceb04eda3051abf913782fadeebc97e0123c MSDNConsoleApp/Properties
The mode value of 040000 in the first column indicates that this object is a directory rather than a file.
Finally, the last 20 bytes of the index contain an SHA-1 hash representing the index itself: As expected, Git uses this SHA-1 value to validate the data integrity of the index.
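That integrity check is simple to reproduce. The following Python sketch assumes, as described above, that the trailing 20 bytes hold the SHA-1 of everything that precedes them; index_checksum_ok is a hypothetical name and the sample data is synthetic.

```python
import hashlib
import struct


def index_checksum_ok(data: bytes) -> bool:
    """True when the trailing 20 bytes equal the SHA-1 of everything before them."""
    return hashlib.sha1(data[:-20]).digest() == data[-20:]


# Synthetic, entry-free index used only to exercise the check.
body = b"DIRC" + struct.pack(">II", 2, 0)
good = body + hashlib.sha1(body).digest()
bad = b"XIRC" + good[4:]   # same trailer, but one content byte altered

print(index_checksum_ok(good))   # True
print(index_checksum_ok(bad))    # False
```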
While I've covered all of the entries in this article's example index file, larger and more complex index files are the norm. The index file format supports additional extension data streams, such as:
- One that supports merging operations and merge-conflict resolution. It has the signature "REUC" (for resolve undo conflict).
- One for maintaining a cache of untracked files (these are files to be excluded from tracking, specified in the .gitignore and .git\info\exclude files and by the file pointed to by core.excludesfile). It has the signature "UNTR."
- One to support a split-index mode in order to speed index updates for very large index files. It has the signature "link."
The index's extension feature makes it possible to continue adding to its capabilities.
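Because every extension starts with a 4-byte signature followed by a 32-bit length, as shown for the TREE extension in Figure 8, a reader can step from one extension to the next without understanding their contents. The Python sketch below illustrates that idea under the assumption that all extensions share this framing; iter_extensions is a hypothetical name, and the sample bytes (including the payloads and the zeroed 20-byte trailer) are placeholders, not real index data.

```python
import struct
from typing import Iterator, Tuple


def iter_extensions(data: bytes, offset: int) -> Iterator[Tuple[str, bytes]]:
    """Yield (signature, payload) pairs from offset up to the trailing SHA-1."""
    end = len(data) - 20   # the final 20 bytes hold the index checksum
    while offset < end:
        signature, size = struct.unpack_from(">4sI", data, offset)
        start = offset + 8
        yield signature.decode("ascii"), data[start:start + size]
        offset = start + size


# Placeholder data: two extensions followed by a zeroed 20-byte trailer.
tree_payload = b"\x00-1 0\n"
reuc_payload = b"placeholder"
sample = (
    b"TREE" + struct.pack(">I", len(tree_payload)) + tree_payload
    + b"REUC" + struct.pack(">I", len(reuc_payload)) + reuc_payload
    + b"\x00" * 20
)

for signature, payload in iter_extensions(sample, 0):
    print(signature, len(payload))
```

In a real index file, the starting offset would be the position just past the last index entry.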
Wrapping Up
In this article, I reviewed the Git three-tree architecture and delved into details behind its index file. I showed you that Git updates the index in response to certain operations and that it also relies on information the index contains in order to carry out other operations.
It's possible to use Git without thinking much about the index. Yet having knowledge about the index provides invaluable insight into Git's core functionality while shedding light on how Git detects changes to files in the working directory, what the staging area is and why it's useful, how Git manages merges, and why Git performs some operations so quickly. It also makes it easy to understand command-line variants of the checkout and rebase commands -- and the difference between soft, mixed and hard resets. Such features let you specify whether the index, the working directory, or both should be updated when issuing certain commands. You'll see such options when reading about Git workflows, strategies and advanced operations. The purpose of this article is to orient you to the important role the index plays so you can better digest the ways in which it can be leveraged.
Jonathan Waldman is a Microsoft Certified Professional who has worked with Microsoft technologies since their inception and who specializes in software ergonomics. Waldman is a member of the Pluralsight technical team and he currently leads institutional and private-sector software-development projects. He can be reached at jonathan.waldman@live.com.
Thanks to the following Microsoft technical experts for reviewing this article: Kraig Brockschmidt, Saeed Noursalehi, Ralph Squillace and Edward Thomson