2015-07-01

September 2014

Volume 29 Number 9

Data Points: Git: It’s Just Data!

Julie Lerman

Source control in a data column? Ahh, but when that source control is just one big database, it’s an invitation to some data-geeky fun. To be clear, you should know right up front I’m not writing about Git this month because I’m an expert. In fact, I’m writing about Git because I have struggled with it. My somewhat anemic GitHub profile is testament to that fact. When the subject of Git comes up, I tend to change the subject for fear of my Gitlessness being discovered, as most of my developer friends interact with it without batting an eyelash.

While talking about my Gitless life recently with friend and local Ruby developer, Alan Peabody (github.com/alanpeabody), at a local hacker happy hour, he said to me, “But Julie, it’s just data.” Data? “Oooh, I love data! Tell me more!” Peabody described how Git relies on a database filled with key/value pairs, and then suggested that, rather than trying to use the available UI tools or what’s known as the “porcelain” commands for working with Git, I explore and play with the lower-level “plumbing” commands and APIs.

So Git has a database and accessible plumbing! Peabody’s advice absolutely inspired me. I’ve now spent some time exploring how Git works at a low level, how to read and write data representing repository activity, and how the different object types in Git relate to one another. I’m still far from expert but feel much more in control of my own activity on Git. More important, I’m having fun playing with Git and no longer fear it as a source of black magic.

If this path appeals to you, don’t miss the “Git Basics” and “Git Internals” chapters of the “Pro Git” book by Scott Chacon (Apress, 2009), which is available online at git-scm.com/book.

I’m sure the Git database was pointed out to me before, but only in passing and while I was already overwhelmed by the learning curve. Now I plan to have a little fun with it. For the rest of this article, I’ll simply interact with the database and some code that it’s tracking and see how it affects the database that represents the repository.

Rather than starting with a new, empty repository, I’ll use an existing repository I’ve already been playing with: the very early work on Entity Framework 7 hosted at github.com/aspnet/EntityFramework. Keep in mind this is a rapidly changing project at the time I’m writing this column, so the files I work with here might change.

I’ll execute the commands from Windows PowerShell, leveraging posh-git, which enhances the command-line experience with additional status information, color-coding and tab completion. You can find instructions for setting this up in the download that accompanies this article.

Getting a Repository and Examining Git Assets

The first step is to clone the existing repository, starting in my existing github folder and using the git clone command:

D:\User Documents\github> git clone git://github.com/aspnet/EntityFramework

By itself, this first step elevated my Git skills! Git creates a new folder in my starting directory using the name of the repository, so I get D:\User Documents\github\EntityFramework. During the cloning process, if you open the EntityFramework folder in File Explorer as soon as it’s created, you’ll see that a .git subfolder is created first. This is the repository and is what we call the git database. And Git is using this data to construct the source in what’s referred to as the working directory, that is, the rest of the EntityFramework folder.

Once the operation is complete, the new folder, EntityFramework, looks much like any other Visual Studio solution folder except for one thing: the .git folder that is, itself, a complete copy of the cloned repository, including all branches. Notice its file structure in Figure 1. In Git, a branch is just a pointer to a commit and a commit is a pointer to a snapshot of your working directory. Although a Git repository’s “main” branch is called master by default, that’s not a requirement. The EF team set their default branch to point to a branch they named dev. Using branches, you can modify and test code safely, before merging changes into other branches.

Figure 1 The Solution Folder Created by Cloning the Repository, Including the Repository Itself in the .git Folder
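The idea that a branch is just a pointer to a commit can be seen in the files themselves. Here is a minimal Python sketch, assuming HEAD is a symbolic ref and the branch is stored as a loose file under refs/heads (Git can also keep refs in a packed-refs file, which this sketch ignores). The function names and the throwaway demo folder are illustrative only; nothing here touches a real repository.

```python
import tempfile
from pathlib import Path


def resolve_head(git_dir):
    """Return (branch_name, commit_id) for the checked-out branch.

    Assumption: the branch ref exists as a loose file, not only in packed-refs.
    """
    git_dir = Path(git_dir)
    head = (git_dir / "HEAD").read_text().strip()
    if not head.startswith("ref: "):
        # Detached HEAD: the file holds a commit ID directly.
        return None, head
    ref = head[len("ref: "):]
    commit_id = (git_dir / ref).read_text().strip()
    prefix = "refs/heads/"
    branch = ref[len(prefix):] if ref.startswith(prefix) else ref
    return branch, commit_id


def make_demo_git_dir(root, branch, commit_id):
    """Demo-only helper: build a stand-in for a .git folder."""
    git_dir = Path(root) / ".git"
    (git_dir / "refs" / "heads").mkdir(parents=True)
    (git_dir / "HEAD").write_text("ref: refs/heads/" + branch + "\n")
    (git_dir / "refs" / "heads" / branch).write_text(commit_id + "\n")
    return git_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # "0" * 40 is a placeholder, not a real commit ID.
        demo = make_demo_git_dir(root, "dev", "0" * 40)
        print(resolve_head(demo))
```

Pointing resolve_head at a real .git folder should return the current branch name and the 40-character name of the commit it points to, as long as the assumptions above hold.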

The rest of the EntityFramework folder contents represent the working directory. At some point, you tell Git to keep track of files you’ve added, changed or deleted in your working directory. This is referred to as “staging”—that is, the changes to these files are staged and ready to be committed. Staged changes are stored in the database and will remain there after you’ve committed them. Eventually, you push those changes up to the server. I won’t be dealing with that step in this article because my focus is on exploring the database activity. There are many other concepts, such as trees and forking, references and headers that I’ll also ignore for this journey.

So what and where is this database? Is it a relational database like SQL Server or SQL CE? Nope. It’s a collection of files that represent Git objects. Each file is named with a hash of the object it stores and holds that object’s contents in compressed form. Remember that the database is a set of key/value pairs. The filename is the key that represents that object in the database, and its contents are the value. The collection of these objects makes up the database that represents the repository. Every key/value pair is stored in the .git/objects folder, and a list of the files in the current snapshot, along with the object each one maps to, is stored in a file called index. There are other files that keep track of the different branches. It’s possible that many files are just duplicated and never edited. Because identical contents produce the same hash, Git can point to the existing object rather than maintain separate copies, which keeps the Git database from bloating.
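To make the key/value idea concrete, here is a short Python sketch of how the key for a file’s contents (a “blob”) can be computed. It assumes a repository that uses SHA-1 object names, and it hashes the bytes exactly as given; on Windows, Git’s line-ending conversion settings may change the bytes that actually get hashed. The function name git_blob_id is made up for this illustration.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the object name for file contents stored as a blob.

    The key is the SHA-1 of a small header ("blob <size>" plus a NUL byte)
    followed by the contents themselves.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # Identical contents always produce the same key, whatever the file name.
    print(git_blob_id(b"// Julie was here\n"))
    print(git_blob_id(b"// Julie was here\n"))
```

Because the key depends only on the contents, two files with the same bytes map to one object. This should match what the plumbing command git hash-object reports for the same bytes.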

But there are no object files to be seen yet, because, initially, everything is compressed into a pack file along with its IDX (index) counterpart. These are in the pack subfolder of .git/objects, as shown in Figure 2.

Figure 2 You Won’t See Objects at First—They’re All Compressed into a PACK File

What Does Git Know About My Code?

Before I start editing, I want to take a look at what Git knows about my working directory and the repository. This means going back to Windows PowerShell and enabling posh-git to enhance my Git command-line experience. I start by changing directory to the EntityFramework folder (cd EntityFramework, just like the good old DOS days). Posh-git will see the .git subfolder and engage in the Windows PowerShell environment. Figure 3 shows my Windows PowerShell window now has the title posh~git and the prompt displays a status in yellow brackets—those are posh-git features. Currently, it indicates that my working directory is using the branch called dev, the main branch of the EF7 repository. When I cloned the repository, by default Git “checked out” this branch and that’s the version it put in my working directory. That’s really all that checking out means to Git. It builds the working directory from the designated branch. A fun exercise is to delete everything but the .git folder in File Explorer, then type “git checkout dev” in Windows PowerShell to see the directory contents get completely recreated. Unplug from the Internet if you want proof that it’s not coming from the server.

Figure 3 Windows PowerShell with posh-git Activated

I can ask Git what it knows about my working directory with the git “status” command. All git commands start by addressing git, so type:

git status

It responds with the message:

On branch dev
nothing to commit, working directory clean

That makes sense. I’m starting with a clean slate.

See How Git Responds to Editing Your Working Directory Files

Now it’s time to start banging on things to see this status change.

I’ll edit one of my favorite EF classes: DbContext.cs (which is in src\EntityFramework). I won’t even bother with Visual Studio. You can just use Notepad++ or your favorite text editor.

I added a comment at the top:

// Julie was here

and saved.

When I run git status again, the response is now more interesting, as Figure 4 shows.

Figure 4 Status After Modifying a File in the Working Directory (Red Font Indicates Working Directory Status)

Git sees that the file (noted in an unfortunately hard-to-read red font) has changed, but the change isn’t staged. In other words, Git hasn’t recorded the change yet, but it has compared the working directory files to its database to make that determination. Notice also the new prompt status for dev is +0 ~1 -0. Status displayed in red reflects the working directory, here indicating 0 new files, 1 modified file and 0 deleted files.

What about the database? If you look in File Explorer, you’ll see by its timestamp and size that the index file hasn’t changed. Nothing has changed in the objects folder, either. The Git database is unaware of the change in my working directory.

Dear Git, Please Track My File Now

Git won’t track files until you tell it to. With typical porcelain commands, you just tell Git to add and it figures out which files to pull in. However, I’m going to use a plumbing command instead of the more common commands devs use with Git so I can be very explicit about each step I want to take. Note that the file path I type is case-sensitive:

git update-index --add src\EntityFramework\DbContext.cs

In response, the prompt’s status changes. It still says +0 ~1 -0, but the font is green now, not red, indicating this is the status of the index. Git is now tracking 0 new objects, 1 modified object and 0 deleted objects, which are ready to be committed.

What about the Git index file and the objects folder?

The timestamp of the index file in the .git directory has changed. The file now has a notation that the object file that represents the DbContext.cs class has changed. This is how the status knows one modified file is being tracked. For you Entity Framework coders, does this sound familiar? The index is akin to the EF DbContext, which tracks changes to your entities! DbContext knows when an object instance has been added or modified or deleted. The index file is similar in this way.
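The index file is binary, but its first 12 bytes are simple to decode: a 4-byte signature (DIRC), a 4-byte version number and a 4-byte count of entries, all big-endian. The Python sketch below reads only that header and assumes nothing about the entries that follow. The function name and the synthetic sample bytes are illustrative.

```python
import struct


def read_index_header(data: bytes):
    """Decode the 12-byte header at the start of a .git/index file.

    Returns (version, entry_count). The entries themselves are not parsed.
    """
    signature, version, entry_count = struct.unpack(">4sII", data[:12])
    if signature != b"DIRC":
        raise ValueError("not a Git index file")
    return version, entry_count


if __name__ == "__main__":
    # Synthetic header for demonstration: version 2 with a single entry.
    sample = struct.pack(">4sII", b"DIRC", 2, 1)
    print(read_index_header(sample))
```

To try it on a real repository, pass in the bytes of .git/index (for example, open(".git/index", "rb").read()). The entry count should line up with the number of files the index lists.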

But now you can see the object, too. Browse to .git/objects and you’ll see a new folder. If you edited in Visual Studio, you might see more, for example, if the project file changed. But I edited in Notepad++ and there’s just one new folder, named ae. In the folder is a file with a name that’s a hash, as shown in Figure 5.

Figure 5 An Object Being Tracked by Git in Its Objects Folder

This object is part of my database. It sits alongside the earlier version of DbContext.cs, which is still stored in the pack file. The new object file contains a compressed copy of the contents of DbContext.cs, including my change, and its name is the hash of those contents.

I can’t read it myself, but Git can. I’ll ask Git to display its entire contents with the cat-file command:

git cat-file -p aeb6db24b9de85b7b7cb833379387f1754caa146

The -p parameter requests a prettified listing of the text. It’s followed by the name of the object to list, which is a combination of the folder name (ae) and the file name. Git has shortcuts for expressing this object name.
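The mapping from an object name to a file on disk, and roughly what cat-file has to do to read it, can be sketched in Python. This assumes a loose (unpacked) object, which is zlib-compressed and starts with a short header giving its type and size. The helper write_demo_blob exists only so the sketch can run against a temporary folder instead of a real repository; all function names are illustrative.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def loose_object_path(git_dir, object_id):
    """Folder = first two hex characters, file name = the remaining 38."""
    return Path(git_dir) / "objects" / object_id[:2] / object_id[2:]


def read_loose_object(git_dir, object_id):
    """Return (type, body) for a loose object. Packed objects are not handled."""
    raw = zlib.decompress(loose_object_path(git_dir, object_id).read_bytes())
    header, _, body = raw.partition(b"\0")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return obj_type, body


def write_demo_blob(git_dir, content):
    """Demo-only helper: store content laid out like a loose blob."""
    raw = b"blob " + str(len(content)).encode("ascii") + b"\0" + content
    object_id = hashlib.sha1(raw).hexdigest()
    path = loose_object_path(git_dir, object_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(raw))
    return object_id


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        git_dir = Path(root) / ".git"
        oid = write_demo_blob(git_dir, b"// Julie was here\n")
        print(loose_object_path(git_dir, oid))
        print(read_loose_object(git_dir, oid))
```

For the object in this column, loose_object_path gives a folder named ae and a file named b6db24b9de85b7b7cb833379387f1754caa146, which matches what File Explorer shows.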

A more interesting view is one that highlights the change. I can ask Git to show me what has changed in this branch (remember its name is “dev”) with:

git diff dev

The response, to which I’ve added line numbers for clarity, is shown in Figure 6.

Figure 6 Response to git diff Command

This text is formatted using colored fonts that make it easy to distinguish the different information. Notice that on line 8, there’s a dash (which is red), indicating I deleted a [blank] line. Line 9 begins with a plus sign, indicating a new line, and it’s displayed in a green font. If I added more objects to the index—for modified or new files—they’d be listed here, as well.
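The minus and plus markers follow the unified diff format, which Python’s standard difflib module can also produce. The sketch below is not Git’s own diff engine, and the three-line “file” is a made-up stand-in for the top of DbContext.cs; it only illustrates how replacing a blank line with a comment appears as one removed line and one added line.

```python
import difflib


def unified_diff_lines(before, after, name):
    """Return unified-diff lines for two versions of a file's lines."""
    return list(
        difflib.unified_diff(
            before, after, fromfile="a/" + name, tofile="b/" + name
        )
    )


# Made-up stand-in for the top of a source file, before and after the edit.
BEFORE = ["using System;\n", "\n", "namespace Demo\n"]
AFTER = ["using System;\n", "// Julie was here\n", "namespace Demo\n"]

if __name__ == "__main__":
    print("".join(unified_diff_lines(BEFORE, AFTER, "DbContext.cs")), end="")
```

In the output, the line that starts with a minus sign is the blank line that was removed, and the line that starts with a plus sign is the new comment.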

Even though the index file is binary, I can also explore it. The git ls-files command is handy for listing all of the files represented by repository objects in the index file. Unfortunately, ls-files lists a combination of the cached index and the working directory, showing every file in my solution, so I have to dig through it to find the files in which I’m interested. The grep command-line utility is useful for filtering here. (This is the standalone grep tool, which needs to be available on your path. It isn’t Git’s own git grep subcommand, which searches file contents.) You can pipe the output of ls-files to grep to filter the list. I’ll add grep (which is case-sensitive) and also ask ls-files to display the object name by using its -s (stage) parameter:

git ls-files -s | grep DbContext.cs

Including the stage parameter forces the output to contain a file-mode indicator (100644 means it’s a regular, non-executable file), the hash filename of the object with which it’s associated and the stage number (0 here):

100644 aeb6db24b9de85b7b7cb833379387f1754caa146 0       src/EntityFramework/DbContext.cs

There’s much more to ls-files, but this is what I’m interested in at the moment—the fact that I can see how Git is mapping the object file to my working directory’s DbContext.cs file.
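If you want to work with that listing in code, each line of git ls-files -s output can be split into its parts: mode, object name and stage number separated by spaces, then a tab, then the path. Here is a minimal Python sketch that assumes plain paths (Git may quote paths containing unusual characters, which this doesn’t handle). The IndexEntry type and function name are invented for the example.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    mode: str
    object_id: str
    stage: int
    path: str


def parse_ls_files_stage_line(line: str) -> IndexEntry:
    """Parse one line of `git ls-files -s` output.

    Layout: <mode> <object> <stage><TAB><path>
    """
    meta, path = line.rstrip("\n").split("\t", 1)
    mode, object_id, stage = meta.split(" ")
    return IndexEntry(mode, object_id, int(stage), path)


if __name__ == "__main__":
    line = (
        "100644 aeb6db24b9de85b7b7cb833379387f1754caa146 0"
        "\tsrc/EntityFramework/DbContext.cs"
    )
    print(parse_ls_files_stage_line(line))
```

The object_id field is the same name passed to cat-file earlier, which is how the index ties a path in the working directory to an object in the database.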

What Does Committing Do to the Git Database?

Next, let’s see a bit of the effects of moving objects from staged to committed. Remember that this commit happens only on your computer. Nothing gets committed directly to the original repository on the server. The git commit command needs a commit message, which I supply here with the -m parameter (without it, Git opens an editor and asks for one). Keep in mind you’re committing all staged changes. I’ve made just one here:

git commit -m "Edited DbContext.cs"

Git responds by telling me the newly generated Git database name of the object (2291c49) that contains the commit information and what was committed. The insertion note is about what happened in the file—I added something:

[dev 2291c49] Edited DbContext.cs
  1 file changed, 1 insertion(+)
D:\User Documents\github\entityframework [dev]>

Notice the prompt is clean. The status now reflects everything that’s happened since the last commit, which is nothing. I’m back to a clean slate.

What has happened in my database in response to the commit? I can tell that the index file wasn’t updated because its timestamp hasn’t changed.

But there are four new folders in the objects directory, each containing its own hash object. I use cat-file to see what’s in those files. The first is a listing, similar to ls-files, of all the files and folders in the src\EntityFramework folder, along with their object names. The second is a list of all the objects in the src folder—that is, a list of the subfolders. These, by the way, are “trees,” not “folders,” in git-speak. The third object contains a list of all the files and folders in the root EntityFramework folder. The final object contains data that will be needed in order to sync to the server—my committer and author identity (my e-mail address), and one object for “tree” and another for “parent.” Interestingly, this final object is the same object that was relayed in the response to the commit. It’s in a folder named 22 and its file name begins with 91c49.

All of these objects will be used for pushing up to the server repository when the time comes.
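The commit object described above is plain text once it’s decompressed: a few header lines (tree, parent, author, committer), a blank line, then the message. Here is a rough Python sketch of splitting that text apart. It assumes simple single-line headers and ignores multi-line ones such as signatures. The sample commit uses placeholder IDs, a made-up identity and a made-up timestamp, not the real values from my repository.

```python
def parse_commit(body: bytes):
    """Split a commit object's body into (headers, message).

    headers maps each key to a list of values, because a merge commit
    can have more than one "parent" line.
    """
    text = body.decode("utf-8")
    header_text, _, message = text.partition("\n\n")
    headers = {}
    for line in header_text.split("\n"):
        key, _, value = line.partition(" ")
        headers.setdefault(key, []).append(value)
    return headers, message


# Placeholder data for illustration only.
SAMPLE_COMMIT = (
    b"tree " + b"1" * 40 + b"\n"
    b"parent " + b"2" * 40 + b"\n"
    b"author Example Dev <dev@example.com> 1400000000 -0400\n"
    b"committer Example Dev <dev@example.com> 1400000000 -0400\n"
    b"\n"
    b"Edited DbContext.cs\n"
)

if __name__ == "__main__":
    headers, message = parse_commit(SAMPLE_COMMIT)
    print(headers["tree"], headers["parent"])
    print(message, end="")
```

The tree value names the object that lists the root folder, and the parent value names the previous commit, which is how one commit links back through the history.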

Have I Pwnd Git?

In a way, I’ve taken ownership of Git by exploring things at a low level. I love exploring cause and effect as well as digging into plumbing. I’ve definitely lost my fear of Git’s magic, and playing around with it in this way has also serendipitously given me an understanding of the command language and the beauty of working with Git at a low level. And it’s probably the most time I’ve ever spent in Windows PowerShell, too.

I have friends who are Git fanatics—like Cori Drew who jumped on Skype with me a number of times when I got confused by what I was seeing—and many more who use Git as though it’s the very air they breathe. They tried to show me how Git works, but I initially gave up on it. This exercise definitely empowered me to learn more and benefit from it, and to see how much more there is to learn. But it’s only my fascination with data and with banging on things to see how they react that got me to this point. And for that I am grateful for Alan Peabody’s inspiring suggestion.

Update: A week after I originally finished writing this article, I found myself using Git on two separate solo projects, branching and merging like an almost-pro! My own plunge was a great success and I hope yours will go just as well.

Julie Lerman is a Microsoft MVP, .NET mentor and consultant who lives in the hills of Vermont. You can find her presenting on data access and other .NET topics at user groups and conferences around the world. She blogs at thedatafarm.com/blog and is the author of “Programming Entity Framework” (2010), as well as a Code First edition (2011) and a DbContext edition (2012), all from O’Reilly Media. Follow her on Twitter at twitter.com/julielerman and see her Pluralsight courses at juliel.me/PS-Videos.

Thanks to the following technical experts for reviewing this article: Cori Drew and Alan Peabody

Cori Drew (@coridrew) is a senior consultant with ImprovingEnterprises.com in Addison, TX. She started her programming career as a Web developer before cutting her OOP teeth (and falling in love with C#) in 2003 in .NET Framework 1.1. Learn more about Cori at about.me/coridrew.

Alan Peabody is a Ruby developer. You can learn plenty about Alan (@alanpeabody) by checking out his GitHub profile at github.com/alanpeabody.
