Why does cloning from VSTS return old unreferenced objects?

UPDATE (2017-08-09):

We rolled out commit reachability bitmap indexes to VSTS and removed the clone cheat mentioned below. Cloning will no longer download unreachable objects! We still don't have true object-level git gc on the server, but clone sizes will be smaller now.

TFS on-prem will get these changes in v.Next (not in any TFS 2017 updates, but the next major release). As Brian Harry mentioned, we should have a release candidate for v.Next in a few weeks.

We'll probably expand on this in future blog posts, but unlike core Git, we use Roaring bitmaps instead of EWAH bitmaps.  Daniel Lemire has some great blog posts and publications on bitmap indexes that we greatly enjoyed and benefited from.

Original Post: 

Note: "core Git" refers to the official base Git implementation, as opposed to Visual Studio, GitHub, or VSTS, which may involve non-standard implementations or behavior.

A customer asked:

We removed some unwanted binaries from our repo on visualstudio.com by following the instructions at https://help.github.com/articles/remove-sensitive-data/. We force-pushed to master and deleted all our other branches.

After running git gc locally, our local repo is now 5 MB, but git clone from visualstudio.com still returns 100MB. The old unreferenced blobs are still being sent down by the server.

How do we git gc (or some equivalent) on the server as well?

There are two issues here:

  1. There is no equivalent to git gc on VSTS yet.

    Our server preserves the history of every ref/branch update to Git repos, including deleted branches. This is analogous to the "reflog" in core Git. On VSTS, we expose the reflog via the REST API and the Branch Updates (i.e. pushes) tab in Web Access. As in core Git, objects referenced by the reflog are still considered reachable and will not be deleted by git gc. Core Git can eventually expire old reflog entries via git reflog expire or git gc, but VSTS does not have that functionality yet.

  2. Large fetches are expensive for the server to calculate, so we cheat a little.

    Large fetches (and clones) have historically been very expensive in both core Git and VSTS due to the "counting objects" phase. https://githubengineering.com/counting-objects/ has a nice explanation of the problem, as well as how core Git and GitHub have (cleverly) improved the perf with bitmap indexes.

    Unfortunately, VSTS does not have that perf fix yet. Instead, it cheats a bit and blindly streams back every object that exists on the server if the client has nothing and asks for all branches and tags (e.g. for git clone). This is generally reasonable, until a user decides to dereference most of the objects in their repo to save space!
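For comparison, here is a minimal sketch of how the first issue plays out in core Git (not VSTS): a commit from a deleted branch survives git gc as long as a reflog entry still mentions it, and goes away once the reflog is expired. The throwaway repo, branch name, file names, and identity are placeholders chosen for illustration.

```shell
#!/bin/sh
set -e

# Throwaway repo in a temp directory (placeholder identity, local to this repo).
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

echo keep > keep.txt
git add keep.txt
git commit -q -m "keep"

# Commit an unwanted file on a side branch, then delete that branch.
git checkout -q -b side
echo unwanted > big.bin
git add big.bin
git commit -q -m "add unwanted file"
dropped=$(git rev-parse HEAD)
git checkout -q -
git branch -q -D side

# No branch points at the commit any more, but HEAD's reflog still does,
# so gc keeps it even with --prune=now.
git gc -q --prune=now
git cat-file -e "$dropped" 2>/dev/null && before=present || before=missing

# Expire every reflog entry, then gc again: now the commit can be pruned.
git reflog expire --expire=now --all
git gc -q --prune=now
git cat-file -e "$dropped" 2>/dev/null && after=present || after=missing

echo "before reflog expiry: $before, after reflog expiry: $after"
```

VSTS had no equivalent of the reflog expiry step at the time of the original post, which is why the objects stayed on the server.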

I suspect that the customer would not have minded the lack of gc in his scenario if we only sent reachable objects during clone.
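To make the distinction between "every object that exists" and "reachable objects" concrete, the following sketch counts both in a small core Git repo after a branch has been deleted. It is only an illustration of the two sets, not of how VSTS computes them; the repo, branch, and file names are made up.

```shell
#!/bin/sh
set -e

# Throwaway repo (placeholder identity, local to this repo).
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

echo one > a.txt
git add a.txt
git commit -q -m "one"

# Add a commit on a side branch, then delete the branch.
git checkout -q -b side
echo two > b.txt
git add b.txt
git commit -q -m "two"
git checkout -q -
git branch -q -D side

# Objects reachable from branches and tags: what a clone actually needs.
reachable=$(git rev-list --objects --all | wc -l | tr -d ' ')

# Every object in the object store, reachable or not.
stored=$(git cat-file --batch-all-objects --batch-check | wc -l | tr -d ' ')

echo "reachable: $reachable, stored: $stored"
```

Here the remaining branch reaches one commit, one tree, and one blob, while the store still holds the commit, tree, and blob from the deleted branch as well. Streaming back "everything stored" is what made the customer's clone stay large.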

Until these issues are fixed for VSTS, what workarounds are there?

  • Delete the repo from the server (EDIT: or rename it) and re-push it.

    This works, but is sub-optimal. In the new repo, you won't be able to see old pull request details, branch update history, or any links from other areas like builds or work items.

  • Trick the server by not cloning everything at once:

    mkdir newRepo
    cd newRepo
    git init
    git remote add origin <url>
    # fetch one branch first
    git fetch origin master
    # fetch everything else
    git fetch origin
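
The two-step fetch leaves you with remote-tracking branches but nothing checked out. A minimal end-to-end sketch of the mechanics follows, using a local bare repo as a stand-in for the server (an assumption made so the script is self-contained; a local core Git remote never sends unreachable objects, so this shows the steps, not the size difference). The paths, the "feature" branch, and the identity are placeholders.

```shell
#!/bin/sh
set -e
work=$(mktemp -d)

# Stand-in for the server: a bare repo with "master" and "feature" branches.
git init -q --bare "$work/server.git"
git init -q "$work/seed"
cd "$work/seed"
git config user.email "demo@example.com"
git config user.name "Demo"
echo hello > readme.txt
git add readme.txt
git commit -q -m "initial"
git push -q "$work/server.git" HEAD:refs/heads/master HEAD:refs/heads/feature

# The workaround: fetch one branch first, then everything else.
mkdir "$work/newRepo"
cd "$work/newRepo"
git init -q .
git remote add origin "$work/server.git"
git fetch -q origin master
git fetch -q origin

# Create a local master from origin/master and check it out.
git checkout -q master

# Report how much object data ended up in the new repo.
git count-objects -vH
```

Against a real server you would replace the stand-in path with your repo URL and compare the size-pack value reported here with that of a plain git clone.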
    

Comments

  • Anonymous
    March 30, 2016
    I've just spent the better part of the day trying to shrink the repo on TFS 2015; now I know why everything failed. Thanks for this post, it is really helpful. What are the plans for fixing this?
    • Anonymous
      June 01, 2016
      The comment has been removed
    • Anonymous
      August 09, 2017
      It rolled out on VSTS, and will ship in TFS on-prem v.Next as well (see the update at the top of the post).
      • Anonymous
        August 28, 2017
        Is it implemented in VSTS now? How can we quickly check?
        • Anonymous
          August 28, 2017
          Yes. At the time that I mentioned that "It rolled out on VSTS", the rollout was already complete. You can verify this with the following steps:
          1. Create a new branch in your local copy of some repo on VSTS. Assume the new branch name is "NewBranch".
          2. Create a new commit in NewBranch. Assume the commit ID is abc123.
          3. Push NewBranch to the repo on VSTS.
          4. At this point, if you reclone the repo from VSTS, in the new clone:
             * "git branch -rv" should show origin/NewBranch and its commit ID abc123
             * "git cat-file -p abc123" should show the contents of the commit
          5. Delete NewBranch from the repo on VSTS.
          6. At this point, if you reclone the repo from VSTS, in the new clone:
             * "git branch -rv" should NOT show origin/NewBranch (this has always been the case)
             * "git cat-file -p abc123" should say that abc123 is not valid (unlike in the past, when abc123 could get cloned even if NewBranch was deleted)
  • Anonymous
    July 06, 2017
    Is this functionality still in the backlog?
    • Anonymous
      July 10, 2017
      The comment has been removed
    • Anonymous
      August 09, 2017
      It rolled out on VSTS, and will ship in TFS on-prem v.Next as well (see the update at the top of the post).