
IBM Rational Software Conference

Anybody attending the Rational Software Conference in Orlando next week?

I've been making very last-minute plans to be there for some meetings, but I'll have some free time, and it's always cool to connect a face with an email address.  So if you're a reader of my blog and will be at the Rational conference next week, drop me an email.

And yes, yes I know this blog entry should really have been a tweet.  I just haven't gotten into the Twitter thing at all yet, but this very moment is the first time I've thought maybe I should.  :-)

 

Time and Space Tradeoffs in Version Control Storage

Storage is one of the most difficult challenges for a version control system.  For every file, we must store every version that has ever existed.  The logical size of a version control repository never shrinks.  It just keeps growing and growing, and every old version needs to remain available.

So, what is the best way to store every version of everything?

As we look for the right scheme, let's remember three things we consider to be important:

  • Data integrity is paramount.  In a version control tool, nothing can be considered to be more important than guarding the safety of the data.

  • Performance is critical.  Software developers have about as much patience as a German Shepherd sitting in front of a pot roast.

  • Space matters too.  We're going to be storing lots of data, much of which is being kept almost entirely for the purpose of archiving history.  We'd prefer to keep this archive as compact as possible.

In this blog entry I will report the results of some exploration I've been doing.  I am experimenting with different ways of storing the full history of one source code file.  In this case, the file comes from the source code for SourceGear Vault.  It has been regularly edited for almost seven years.  There are 508 versions of this file.

As I describe the various things I have tried, a running theme will be the classic tradeoff of space vs. speed.  In physics, we know that matter and energy are interchangeable.  In computer science, we know that time and space are interchangeable.  Usually, we can find a way to make things faster by using more space, or make things smaller by taking more time.

As I said, I'll be storing 508 versions of the same file.  It's a C# source code file.  For each attempt, I will report two things:

  • The total amount of space required to store all 508 versions.

  • The total amount of time required to retrieve (or decompress or decode) all 508 versions, one at a time.
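For readers who want to repeat this kind of experiment, the two measurements above can be gathered with a small harness. This is only a sketch under my own assumptions: `measure`, `stored` and `retrieve` are hypothetical names, and the sample data is a toy stand-in for the real 508 versions.

```python
import time
from typing import Callable, Sequence, Tuple

def measure(stored: Sequence[bytes],
            retrieve: Callable[[int], bytes]) -> Tuple[int, float]:
    """Return (total stored bytes, seconds to retrieve every version once)."""
    total_size = sum(len(blob) for blob in stored)
    start = time.perf_counter()
    for index in range(len(stored)):
        retrieve(index)          # fetch each version, one at a time
    elapsed = time.perf_counter() - start
    return total_size, elapsed

# Toy baseline: every version kept in full, so retrieval is a plain lookup.
versions = [b"class Foo {}\n" * n for n in range(1, 6)]
size, seconds = measure(versions, lambda i: versions[i])
```

Each storage scheme then only has to supply its own `stored` blobs and `retrieve` function.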

Before we get started, a few caveats:

  • I realize that these experiments would yield different results for a different kind of file.  If you're storing source code, there might be some things here you can apply.  If you're storing JPEG images, not so much.

  • All these experiments were done on my MacBook Pro laptop.  The CPU is a Core 2 Duo, which I consider to be decently fast.  But like most laptops, this machine has an I/O system which I consider to be quasi-crappy.  I would probably get somewhat different results if I were running on a more serious piece of hardware.

OK, how should we store these 508 versions of the file?

No compression at all

As a first attempt, let's just store them.  No compression or funky encoding.  Each of the 508 versions will be stored in full and uncompressed form.

This is the starting point, even if it is not very practical.

Size:  112,643 KB

Time:  2.5 sec

#ifdef DIGRESSION

Yes, dear reader, I admit that this file is far too long.

You can do the math.  If the archive takes 112 MB and there are 508 versions, then each one is about 220 KB.  That's pretty big for a source code file.

Actually, it's worse than you think.  The 220 KB figure is just the average.  The first version of the file is around 90 KB.  The latest version is over 400 KB.

In our defense, I'd like to point out that this piece of code needs to stay compatible with .NET 1.1, so the entire class must be in a single file.  However, I'd still have to answer to the charge of "First Degree Failure to Refactor".  Fine.  I'll have my attorney contact you to plead out on a lesser charge.  I'm thinking maybe "Third Degree Contributing to the Delinquency of an Intern", or something like that.

#endif

This "full and uncompressed" format uses an awful lot of space, but it is also the fastest.  We will find ways of making this smaller, but all of those ways will be slower.

The relevant questions are:

  • How much smaller?
  • How much slower? 

Some solutions will allow us to make this a lot smaller and only a little slower.  Those are interesting.  Other possibilities will be only a little smaller but a lot slower.  Those are not so interesting.

Simple compression

OK, for our next idea, let's just compress every version with zlib.

Size:  22,516 KB

Time:  4.0 sec

The results of this idea are surprisingly impressive.  The archive is over 80% smaller, and only about 60% slower.  That's darn good, considering that I didn't have to be terribly clever.

This tradeoff is probably worth it.  In fact, it establishes a new baseline that might be tough to beat.
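The whole scheme fits in a few lines. Here is a minimal sketch using Python's standard `zlib` module; the function names and the toy data are mine, not taken from the actual experiment.

```python
import zlib

def store_all(versions):
    """Compress every version independently; no version depends on another."""
    return [zlib.compress(v) for v in versions]

def fetch(stored, index):
    """Retrieving a version is a single decompress call."""
    return zlib.decompress(stored[index])

# Toy data: five repetitive "source files".
versions = [(b"int x = %d;\n" % n) * 200 for n in range(5)]
stored = store_all(versions)
```

Because every blob stands alone, fetching any version costs exactly one decompress, no matter how old it is.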

How do we get better than this?

Deltas

Instead of just compressing every file independently, we could store things as deltas.  Think of a delta as simply the difference between one version and the next.

Compression with zlib takes one standalone thing and makes an equivalent standalone thing which is smaller.

In contrast, a delta is a representation of the differences between two files.  Suppose that somebody takes file X and makes a few changes to it, resulting in file Y.  With a delta algorithm, we could calculate the delta between X and Y, and call it D.  Then, instead of storing Y, we can store D.

The nice thing here is that D will be approximately the size of the edits, regardless of the size of the two files.  If X was a 100 MB file and Y was the same file with an extra 50 bytes appended to the end, then D will be somewhere around 50 bytes.

A delta is a concept which might be implemented in a lot of different ways.  In my case, the delta algorithm I am using is VCDIFF, which is described in RFC 3284.  We have our own implementation of VCDIFF.  Other implementations include xdelta and open-vcdiff.

The important thing to remember about deltas for storage is that you must have the reference item.  D is a representation of Y, but only if you have X handy.  X is the reference.
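To make the X, Y and D relationship concrete, here is a toy line-based delta built on Python's `difflib`. It is a stand-in for illustration only: it is not VCDIFF and not the implementation used in these experiments, and `make_delta` and `apply_delta` are hypothetical names.

```python
from difflib import SequenceMatcher

def make_delta(reference: bytes, target: bytes) -> list:
    """Encode target as COPY (lines from reference) and INSERT (literal bytes)."""
    ref_lines = reference.splitlines(keepends=True)
    tgt_lines = target.splitlines(keepends=True)
    matcher = SequenceMatcher(None, ref_lines, tgt_lines, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))
        elif j2 > j1:            # "replace" or "insert": new literal data
            delta.append(("insert", b"".join(tgt_lines[j1:j2])))
        # "delete" needs no instruction: those reference lines are simply skipped
    return delta

def apply_delta(reference: bytes, delta: list) -> bytes:
    """Rebuild the target.  This only works if the reference is at hand."""
    ref_lines = reference.splitlines(keepends=True)
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(ref_lines[op[1]:op[2]])
        else:
            out.append(op[1])
    return b"".join(out)

X = b"a\nb\nc\n"
Y = X + b"d\n"
D = make_delta(X, Y)
```

Here D holds one copy instruction covering all of X plus two literal bytes, and `apply_delta(X, D)` gives back Y.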

OK, it should be obvious that this concept can be helpful in storing a repository, but how do we set things up?

One big delta chain

As a first attempt, let's store all 508 versions as a big chain of deltas.  Every version is stored as a delta against the version just before it.  Version 1 is the reference, and is the only version that is not stored as a delta. 

Size:  7,682 KB

Time:  Way too long to wait

Wow -- this is really small.  It's over 93% smaller than the full/uncompressed form.  It'll be hard to find a general purpose approach that is smaller than this.

But good grief this is slow.  Fetching version 508 takes an eternity, because first you have to construct a temporary version of 507.  And to construct version 507, you first have to construct a temporary version of 506.  And so on.
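A quick count shows why. This sketch assumes every delta application costs about the same, which is a simplification.

```python
def chain_cost(version: int) -> int:
    """Delta applications needed to rebuild `version` (1-based) from version 1."""
    return version - 1

# Fetching all 508 versions, one at a time, as in the benchmark.
total = sum(chain_cost(v) for v in range(1, 509))
print(total)  # 128778
```

That is 128,778 undeltify operations for the full benchmark, against 508 decompress calls for plain zlib.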

Key frames

Let's try something else.  The problem with the chaining case above is that retrieving version 508 requires us to go all the way back to version 1, which is incredibly inefficient.  Instead, let's insert "key frames" every 10 versions.  We borrow this idea from the video world where compressed video streams store every frame as a delta, but every 10 seconds they insert a full, uncompressed frame of video.

By using key frames with chaining deltas, we can cut the time required to fetch the average version of the file.  For example, with a key frame every 10 versions, we get most of the benefits of chaining, but in the worst case, we only need 9 delta operations to retrieve any version.
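The same counting exercise for key frames, assuming they sit at versions 1, 11, 21 and so on (my assumption about the layout; the function name is hypothetical):

```python
def keyframe_chain_cost(version: int, interval: int = 10) -> int:
    """Delta applications needed when a key frame opens every `interval` versions."""
    return (version - 1) % interval

worst = max(keyframe_chain_cost(v) for v in range(1, 509))
total = sum(keyframe_chain_cost(v) for v in range(1, 509))
print(worst, total)  # 9 2278
```

The worst case drops from 507 operations to 9, and the total for fetching everything drops from 128,778 to 2,278.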

Size: 18,024 KB

Time: 41.0 sec

This is better, but still not very good.  The compression here isn't much better than zlib, and the perf is still a lot worse.  Compared to zlib, we don't want to pay a 10x speed penalty just to get 20% better compression.

All the key frames are stored as full and uncompressed files, and they're taking up a lot of space.  Maybe we should zlib those key frames?

Size: 9,092 KB

Time: 42.7 sec

Now at least the compression is starting to look interesting.  This is less than half the size of the zlib case, and 91.9% smaller than the full form, which is a level of compression that is probably worth the trouble.  But the overall perf is still quite slow.  In fact, it's even slower here than plain chaining with key frames, because we have to un-zlib the key frame.

Flowers

The big problem here is that chains of deltas are killing our performance.  Chained deltas can be used to make things very small because each delta matches up nicely with one set of user edits.  But chained deltas are slow because we need multiple operations to retrieve a given file.

Another approach would be to use each reference for more than one delta.  I call this the flower approach.  With a flower, we deltify a line of versions by picking one version (say, the first one) and using it as the reference to make all the others into deltas.

Flower deltas should be much faster, since any file can be reconstructed with just one undeltify operation.

So let's try to flower all 508 versions using version 1 as the reference for all of them.

Size:  35,851 KB

Time:  10.9 sec

As expected, the performance here is much better.

But the overall space savings is lousy.  Only version 2 was based directly on version 1.  Every version after that has less and less in common with version 1, so the delta algorithm can't draw as much stuff from the reference.
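The drift is easy to reproduce on synthetic data. In this sketch, `literal_bytes` is a hypothetical helper that uses `difflib` as a rough proxy for delta size (again, not VCDIFF), and each synthetic version edits one more line of a 100-line file.

```python
from difflib import SequenceMatcher

def literal_bytes(reference, target):
    """Characters of `target` that cannot be copied from `reference`."""
    matcher = SequenceMatcher(None, reference, target, autojunk=False)
    return sum(len(line)
               for tag, _i1, _i2, j1, j2 in matcher.get_opcodes() if tag != "equal"
               for line in target[j1:j2])

# Version 1 has 100 lines; each later version rewrites one more line.
versions = [[f"line {i}\n" for i in range(100)]]
for k in range(1, 20):
    edited = list(versions[-1])
    edited[k * 5] = f"edited in version {k + 1}\n"
    versions.append(edited)

against_previous = literal_bytes(versions[18], versions[19])  # one edit's worth
against_first = literal_bytes(versions[0], versions[19])      # 19 edits' worth
```

The delta against the previous version stays the size of one edit, while the delta against version 1 keeps growing with every commit.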

This particular approach isn't going to win.  Plain zlib is both smaller and faster.

Flowers with key frames

Maybe we should try the flower concept with key frames?

Like before, every 10 frames go together as a group.  But instead of chaining, we're going to run each group as a flower.  The first version in the group will serve as the reference for the other 9.  We can reasonably assume that the deltification of frame 10 won't be as good as frame 2, but hopefully 10 and 1 still have enough in common to be worthwhile.
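The difference between the two layouts comes down to which version each delta is stored against. A sketch, with hypothetical function names and key frames assumed at versions 1, 11, 21 and so on:

```python
def base_version(version, interval=10, shape="flower"):
    """Return the version this one is stored against, or None for a key frame."""
    offset = (version - 1) % interval
    if offset == 0:
        return None                  # key frame: stored whole
    if shape == "flower":
        return version - offset      # the key frame that opens this group
    return version - 1               # chain: the version just before

def operations_to_fetch(version, interval=10, shape="flower"):
    """Count delta applications by walking back to the nearest key frame."""
    count = 0
    while (version := base_version(version, interval, shape)) is not None:
        count += 1
    return count

print(operations_to_fetch(18, shape="chain"))   # 7
print(operations_to_fetch(18, shape="flower"))  # 1
```

With flowers, no version ever needs more than one undeltify operation.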

Size:  18,648 KB

Time:  12.2 sec

Wow.  This looks a lot better than chaining.  The space used is about 17% smaller than zlib, but instead of being 10 times slower, it's only 3 times slower.

Of course, we can use the same trick we tried before.  Let's zlib all those key frames.

Size:  9,716 KB

Time:  13.6 sec

This seems like a potentially useful spot.  It's less than half the size of zlib.  The perf is still a lot slower than zlib, but at only about 3X slower, the tradeoff is the best we've seen so far.

OK.  So we've made a lot of progress on saving space, but 3X slower than zlib still seems like a high price to pay.  Do we really want to make that trade?  Do we have to?

Some things get retrieved more often than others

Let's look at the patterns for how this data is going to be accessed.

I've been reporting the total time required to fetch all 508 versions of the file.  However, this benchmark doesn't reflect real usage very well at all.  In practice, the recent stuff gets retrieved a LOT more often than the older stuff.  Most of the time, developers are updating their working copy to whatever is latest.

As a rough guess, I'm going to say that version 508 gets retrieved twice as often as 507, which gets retrieved twice as often as 506, and so on.  A timing test based on that assumption gives us results something like this:

Scheme                     Time
Full                       1.1 sec
Zlib                       1.7 sec
One big flower             4.0 sec
Flower with key frames     5.1 sec
Chain with key frames      24.5 sec

Not too surprising.
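The "twice as often" assumption can be written down as a weighting. This sketch counts expected delta operations per fetch rather than seconds, so it does not reproduce the timings above; the function name and the cost models are my own assumptions.

```python
def expected_operations(cost, latest=508):
    """Average cost per fetch when version v is fetched twice as often as v - 1."""
    weights = {v: 2.0 ** (v - latest) for v in range(1, latest + 1)}
    total = sum(weights.values())
    return sum(weights[v] * cost(v) for v in weights) / total

# Key frames assumed at versions 1, 11, 21, ... so version 508 sits 7 deltas deep.
chain = expected_operations(lambda v: (v - 1) % 10)
flower = expected_operations(lambda v: 0 if (v - 1) % 10 == 0 else 1)
```

Under this weighting, the chain pays about six operations per fetch and the flower pays about one, because nearly all of the weight lands on the last few versions of the final group.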

In the spirit of optimizing performance for the most common operations, why not keep all the more recent versions in a faster form?  We could still use something more aggressive for the older stuff, but we can probably get a nice performance boost if we just refuse to use deltification for the most recent 10 versions of the file.

But how should we store those 10 versions?  In full format?  Or zlib?  This is an arbitrary choice with a clear tradeoff.  For now, I choose zlib.  If we wanted a little more speed at the expense of using a little more space, we could just keep those 10 versions in full form.

By choosing zlib for the most recent 10 revisions, now my "get the recent stuff" benchmark runs in 1.7 seconds no matter what scheme I use.

But we still care about performance for the case where somebody fetches an older version, even if that fetch doesn't happen as often.  That's the point of version control storage.  Every version has to be available.  And when somebody does fetch version 495, we want our version control system to still be reasonably fast.

Reversing the direction of the chains

Since the more recent versions are retrieved more often, obviously, our chains are all going the wrong direction.  If we had them go the other way, then retrieval would get slower as the versions get older instead of as the versions get newer.

But this approach doesn't lend itself well to the way version control repositories naturally grow in the wild.  In these tests, I have mostly ignored the issues of constructing each storage scheme.  I've already got all 508 versions, so I'm just fiddling around with different schemes of storing them all, comparing size and retrieval time.

In practice, those 508 versions happened one at a time, in order.  If we're going to store the versions with backward chains, then each time we commit a new version of the file, we're going to need to re-encode something that was previously stored.  This makes the commit operation slower.  It is also a questionable idea from the perspective of data integrity.  The safest way to maintain data is to not touch it after it has been written.  Once it's there, leave it alone.
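Here is a sketch of what a backward chain does on commit. The class and the toy `difflib`-based delta are hypothetical illustrations, not how any particular tool works; the point is the `rewrites` counter.

```python
from difflib import SequenceMatcher

def make_delta(ref, tgt):
    """Toy delta over lists of lines: copy ranges from ref, or insert literals."""
    ops = SequenceMatcher(None, ref, tgt, autojunk=False).get_opcodes()
    return [("copy", i1, i2) if tag == "equal" else ("insert", tgt[j1:j2])
            for tag, i1, i2, j1, j2 in ops]

def apply_delta(ref, delta):
    out = []
    for op in delta:
        out.extend(ref[op[1]:op[2]] if op[0] == "copy" else op[1])
    return out

class ReverseChainStore:
    def __init__(self):
        self.head = None     # newest version, stored in full
        self.deltas = []     # deltas[v - 1] rebuilds version v from version v + 1
        self.rewrites = 0    # how many times stored data had to be re-encoded

    def commit(self, lines):
        if self.head is not None:
            # The old head was stored in full; it must now become a delta.
            self.deltas.append(make_delta(lines, self.head))
            self.rewrites += 1
        self.head = list(lines)

    def fetch(self, version):
        current = self.head
        for v in range(len(self.deltas), version - 1, -1):
            current = apply_delta(current, self.deltas[v - 1])
        return current

store = ReverseChainStore()
history = [["a\n"], ["a\n", "b\n"], ["a\n", "B\n", "c\n"]]
for lines in history:
    store.commit(lines)
```

Fetching the latest version is free, but every commit after the first has rewritten something that was already on disk.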

One case where we might want to be a bit more liberal toward rewriting data is in a "pack" operation, such as the one Git has.  It wouldn't be terribly crazy to consider a standalone pack operation in a DVCS to be better than rewriting data for each commit, for several reasons:

  • It allows us to keep commit fast.

  • Since pack would be done offline, its implementation can be focused more on data integrity and space savings than on performance.

  • Since the pack code can be separated from the commit code, all the risky code can be kept together where it is easier to maintain.

  • Since the pack operation is separate from commit, a user that does not want to run pack does not have to.

  • A pack operation in a DVCS is happening on just one instance (clone) of the repository, not on the only copy.

Anyway, a pack operation would allow us to use storage schemes that do not work well on the fly, incrementally updating as each version comes in.

Visualizing the results

This plot makes it easier to see which schemes are better than others. 

In my experimentation, I actually did a lot more schemes.  For example, instead of key frames every 10 versions, I also tried every 5, 15 and 20.  However, all those extra data points really cluttered the graph.  So I only included the most important ones here.

  • In the lower right, we find "full".  Very fast and very large.

  • In the upper left, we find "chains".  Very slow and very small.

  • We can ignore any point which is both above AND to the right of any other point.  The "1flower" point is the one where I made one big flower, using version 1 as the reference for every other version.  This scheme ends up being useless since zlib is better in both ways that matter.

  • All the other points represent possible tradeoffs which might be interesting, depending upon our priorities.
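The "above AND to the right" rule can be checked mechanically against the size and time figures reported earlier in this entry. The labels are mine, and the one big chain is left out because it has no measured time.

```python
# (size in KB, time in seconds) as reported for each scheme above
schemes = {
    "full": (112_643, 2.5),
    "zlib": (22_516, 4.0),
    "chain, key frames": (18_024, 41.0),
    "chain, zlib key frames": (9_092, 42.7),
    "1flower": (35_851, 10.9),
    "flowers, key frames": (18_648, 12.2),
    "flowers, zlib key frames": (9_716, 13.6),
}

def dominated(name):
    """True if some other scheme is at least as small AND at least as fast."""
    size, secs = schemes[name]
    return any(s <= size and t <= secs
               for other, (s, t) in schemes.items() if other != name)

frontier = [name for name in schemes if not dominated(name)]
```

By these figures "1flower" drops out because zlib beats it on both axes. The chain with uncompressed key frames drops out as well, since flowers with zlib key frames are both smaller and faster.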

Intuitively, the schemes which are closer to the origin are better.  This graph makes it easy to see that "zlib" and "flowers" are probably the two most interesting options I have discussed here.


 

Ten Quirky Issues with Cross-Platform Version Control

A big chunk of the software industry today can mostly ignore the issues of multiple platforms, for one of the following reasons:

  1. They only support Windows.  It's got like 90% market share, so why not?

  2. They serve a web application and don't care what the end user is actually using as long as their browser works.

But version control tools involve more cross-platform concerns than most other kinds of software.  Neither of the reasons above tends to work very well.

  1. If a software team has 450 Windows users and 50 people on Mac or Unix, then a Windows-only solution just won't do.

  2. Since a primary task of a version control tool is to manage source code trees on the user's hard disk, a web application just won't do.

So, even as most coders have moved on to a world where they can remain blissfully ignorant of the problems of writing software for multiple operating systems, those of us who create version control tools are still wrestling with those problems.

And in fact, I claim that our challenges are tougher than most.  Version control users ask for the darndest things, especially in the big enterprise companies.  It's easy to believe that all you need is Windows, Mac, Linux and maybe Solaris.  Then you find out just how prevalent things like AIX and HPUX are.  Terms like "Irix" and "Win95" and "mainframe" get tossed around until you're numb and nothing surprises you anymore.  When somebody asks for a port to an arcane platform, you roll your eyes and wonder if it uses 8-bit bytes or not.

Worse than that, version control vendors aren't just porting to oddball operating systems.  We actually have to make our software interoperate across all those environments.

And that's where things start to get quirky.

  1. On a Linux system, create a file called "README".  In the same directory, create a file called "readme".  Check them both in.  Now go to a Mac and check them both out.  Since the Mac file system is [usually] case insensitive, something bad is going to happen.  Same goes for Windows/NTFS.

  2. On a Mac, checkin a file called "PRN".  Check it out on a Windows system.  That file name is not allowed under Windows, for backward compatibility with MS-DOS.

  3. Under Linux, checkin a file with a name that ends in a dot.  Check it out under Windows.  The trailing dot is probably gone.  Now check the file back in and go back to your Linux system.  If your version control system handled this badly, you've probably got two copies of the file, one with the trailing dot, and one without.  Same goes for a trailing space.

  4. On a Linux system, checkin a file with a path that is 261 characters long.  Check it out under Windows.  This might work.  It probably won't.  It kind of depends on whether .NET is involved or not.  There's a \\?\ trick to get around the limitations of the Win32 layer, but the .NET libraries don't use it.

  5. On a Mac, checkin a file that has a resource fork and some Finder info.  Check it out on a Linux machine.  What happens?  Did stuff show up as xattrs?  Should it have?  On that same Linux machine, make a change and check it back in.  Then check it out on the Mac again.  Is the Finder info still there?

  6. On a Linux machine, checkin a file with a colon in the name.  Check it out on a Mac.  Not sure what'll happen, but it probably won't be what you want.

  7. On a Windows machine, checkin a file with a name that begins with a dash.  Now check it out under Linux and try manipulating it with command-line utilities.  Apps will think the filename is a command-line option.  What if somebody creates a file named "-rf *" and a Linux user tries to rm it?  OK, bad example.  The point remains:  Filenames which begin with a dash may cause more problems on some platforms than on others.

  8. On a Linux machine, create a source code file and check it in.  Check it out on Windows and open it with notepad.  The line-endings are Unix-standard LF, but Windows apps expect CRLF, so notepad shows the entire file as one line.  Now open the same file under Visual Studio.  The file looks fine now.  Now edit a few lines in the middle of the file, check it back in, and check it out on Linux again.  The lines you edited are messed up.

  9. On a recent Ubuntu Linux system, create a file called "Español".  Do the same thing on Mac OS X 10.5.  These two files have the same name, but even though you are [probably] using the utf-8 encoding of Unicode on both systems, the bytes which represent that name do not match.  On the Linux machine, the file name [probably] will be in NFC normalized form (Espa\u00f1ol).  On the Mac, everything gets normalized to NFD (Espan\u0303ol).  When you check these files in and start working with them, bad things will happen unless your version control tool understands what's going on and deals with it appropriately.

  10. On a Unix machine, checkin a symbolic link.  Check it out on Windows.  What happens?

Like I said, things get quirky.
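A few of these quirks (items 1, 2, 3, 7 and 9) can at least be detected before they bite. The checker below is a rough sketch, not what any particular tool does: the function name and messages are hypothetical, and the list of reserved Windows device names covers only the well-known ones.

```python
import unicodedata

RESERVED = ({"CON", "PRN", "AUX", "NUL"}
            | {f"COM{i}" for i in range(1, 10)}
            | {f"LPT{i}" for i in range(1, 10)})

def portability_problems(names):
    """Flag names in one directory that are likely to misbehave elsewhere."""
    problems = []
    seen = {}
    for name in names:
        if name.split(".")[0].upper() in RESERVED:
            problems.append((name, "reserved device name on Windows"))
        if name.endswith((".", " ")):
            problems.append((name, "trailing dot or space may be dropped on Windows"))
        if name.startswith("-"):
            problems.append((name, "looks like a command-line option"))
        # Compare names the way a case-insensitive, normalizing file system might.
        key = unicodedata.normalize("NFC", name).casefold()
        if key in seen:
            problems.append((name, f"collides with {seen[key]!r}"))
        else:
            seen[key] = name
    return problems

tree = ["README", "readme", "PRN", "notes.", "-rf", "Espa\u00f1ol", "Espan\u0303ol"]
flagged = [name for name, _why in portability_problems(tree)]
```

Detection is the easy half. Deciding what to do when a checkout hits one of these is where the real design work is.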


 

Mercurial, Subversion, and Wesley Snipes

People keep asking me why I don't talk more about Mercurial in this series of blog entries.  There's a simple answer to that question:

Mercurial isn't very interesting.

Wait, that didn't come out quite right.  Let me try again:

Git is Wesley Snipes.

Mercurial is Denzel Washington.

Hmm, that probably needs further explanation.  First let me give a little background.

I am the founder of a version control software company.  I've done lots of writing about the topic here on my blog.  Currently I am in the process of revising and expanding all those writings to turn them into a book. 

As part of that effort, I have undertaken an exploration of the DVCS world.  Several weeks ago I started writing one blog entry every week, mostly focused on DVCS topics.  In chronological order, here they are:

  • The one where I gripe about Git's index
  • The one where I whine about the way Git allows developers to rearrange the DAG
  • The one where it looks like I am against DAG-based version control but I'm really not
  • The one where I fuss about DVCSes that try to act like centralized tools
  • The one where I complain that DVCSes have a lousy story when it comes to bug-tracking
  • The one where I lament that I want to like Darcs but I can't
  • The one where I speculate cluelessly about why Git is so fast

Along the way, I've been spending some time getting hands-on experience with these tools.  I've been using Bazaar for several months.  I don't like it very much.  I am currently in the process of switching to Git, but I don't expect to like it very much either.

Why am I using these tools if I don't like them?  Because I want the experience.  I don't want to write hearsay.  I want to live with these tools and see what I learn.

So why don't I write about Mercurial?  Because I'm pretty sure I would like it.

I chose Bazaar and Git for the experience.  But if I were choosing a DVCS as a regular user, I would choose Mercurial.  I've used it some, and found it to be incredibly pleasant.  It seems like the DVCS that got everything just about right.  That's great if you're a user, but for a writer, what's interesting about that?

Denzel Washington is a great actor.  Other than that, he lives a pretty normal life.  What's interesting about that?

In contrast, Wesley Snipes gives the world lots of things to write about.  Tax evasion.  Conviction for reckless driving.  Martial arts.

People admire Denzel Washington.  But Wesley Snipes is simply more interesting.

I admire Mercurial.  But Git is more interesting.  Like Snipes, Git is an odd juxtaposition of great power with some quirky flaws and failings.

One more thing:

People also ask me why I don't write more about Subversion.  That's easy too:

Subversion is Morgan Freeman.


 

Why is Git so Fast?

In the DVCS world, Git has a reputation for being really fast.  I am curious about how Git got this way.

When I started thinking about this question, seven different answers came to my mind.  Some of those answers seem more interesting or correct than others.

One:  Maybe Git is fast simply because it's a DVCS.

There's probably some truth here.  One of the main benefits touted by the DVCS fanatics is the extra performance you get when everything is "local".

But this answer isn't enough.  Maybe it explains why Git is faster than Subversion, but it doesn't explain why Git is so often described as being faster than the other DVCSes.

Two:  Maybe Git is fast because Linus Torvalds is so smart.

This might very well be correct.  But it's not interesting.

Fine.  So Linus is smarter than all of us.  But how did he use those smarts to make Git so fast?  What are the details?

Three:  Maybe Git is fast because it's written in C instead of one of those newfangled higher-level languages.

Nah, probably not.  Lots of people have written fast software in C#, Java or Python.

And lots of people have written really slow software in traditional native languages like C/C++.  Adobe writes most of their stuff in C++, and they don't have any trouble making sure that release N+1 is slower than release N.

Four:  Maybe Git is fast because being fast is the primary goal for Git.

This is another one of those high-level answers that is probably correct but doesn't have the kind of details about which I am curious.

Still.  Take some time to read through the archives of the Git developers mailing list.  These people spend a LOT of time talking about performance issues.

Five:  Maybe Git is fast because it does less.

One of my favorite recent blog entries is this piece which claims that the way to make code faster is to have it do less.

Predictably, people came out of the woodwork to say how wrong this guy was.  That's what happens to almost any blog entry about performance tuning or optimization.  Readers ignore anything correct in the article and quibble about little stuff.

But this guy was essentially correct.  One way to make software faster is to make it do less.

For example, the way you get something in the Git index is you use the "git add" command.  Git doesn't scan your working copy for changed files unless you explicitly tell it to.  This can be a pretty big performance win for huge trees.  Even when you use the "remember the timestamp" trick, detecting modified files in a really big tree can take a noticeable amount of time.
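For the curious, the "remember the timestamp" trick looks roughly like this. It is a generic sketch, not Git's actual index code, and the function names are hypothetical. Note that it still has to stat every file in the tree, which is exactly the cost that adds up on huge trees.

```python
import os
import tempfile

def snapshot(root):
    """Record (size, mtime) for every file under root."""
    state = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            state[path] = (st.st_size, st.st_mtime_ns)
    return state

def possibly_modified(root, cached):
    """Files that are new, or whose size or timestamp differs from the cache.
    (Deleted files are not reported in this sketch.)"""
    current = snapshot(root)
    return sorted(p for p, sig in current.items() if cached.get(p) != sig)

with tempfile.TemporaryDirectory() as root:
    for name in ("a.c", "b.c"):
        with open(os.path.join(root, name), "w") as f:
            f.write("int main(void) { return 0; }\n")
    cached = snapshot(root)
    with open(os.path.join(root, "b.c"), "a") as f:
        f.write("/* edited */\n")
    changed = [os.path.basename(p) for p in possibly_modified(root, cached)]
```

No file contents are read at all, which is why the trick is fast per file. The walk over the whole tree is what remains.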

Or maybe Git's shortcut for handling renames is faster than doing them more correctly like Bazaar does.

Six:  Maybe Git is fast because it doesn't use much external code.

Very often, when you are facing a decision to use somebody else's code or write it yourself, there is a performance tradeoff.  Not always, but often.  Maybe the third party code is just slower than the code you could write yourself if you had time to do it.  Or maybe there is an impedance mismatch between the API of the external library and your own architecture.

This can happen even when the library is very high quality.  For example, consider libcurl.  This is a great library.  Tons of people use it.  But it does have one problem that will cause performance problems for some users:  When using libcurl to fetch an object, it wants to own the buffer.  In some situations, this can end up forcing you to use extra memcpys or temporary files.  The reason all the low level calls like send() and recv() allow the caller to own the loop and the buffer is because this is the best way to avoid the need to make extra copies of the data on disk or in memory.

People make fun of those with NIH Syndrome, but my observation is that folks who suffer from this disorder tend to create faster software, even if they also tend to ship everything late.  :-)

Maybe Git is fast because every time they faced one of these "buy vs. build" choices, they decided to just write it themselves.

Seven:  Maybe Git isn't really that fast.

If there is one thing I've learned about version control it's that everybody's situation is different.  It is quite likely that Git is a lot faster for some scenarios than it is for others.

How does Git handle really large trees?  Git was designed primarily to support the efforts of the Linux kernel developers.  A lot of people think the Linux kernel is a large tree, but it's really not.  Many enterprise configuration management repositories are FAR bigger than the Linux kernel.

Final thoughts

This week's version control blog entry raises more questions than answers.  I'm not a Git user, nor have I looked much at its code, so I don't really know why it's so fast.  I'm just curious.  If you have better answers than mine (and I admit that's a low hurdle), feel free to send them to me or post them in my comments.

But FWIW, I have decided it is time for me to become a Git user.  When I was writing about Git a few weeks ago, a lot of Git users kept telling me I just don't get it.  I've spent more time thinking about version control implementation and design than most folks, so I tend to think I actually do "get it".  But my curiosity is piqued, and I hate to pass up an opportunity to learn something, so I'm going to give it a try.  I've got a small project here at SourceGear that I work on part-time with a couple other people.  We've decided to switch to Git and see how it goes.  I'll let you know what I find out.


 

Merge History, DAGs and Darcs

One of the popular complaints about CVS and Subversion (before 1.5) was the lack of merge history.  In a nutshell, merge history is remembering what has been merged and taking that information into account on future merges.

In a bucket?

In something bigger than a nutshell, merge history is a lot more complicated.

Unless you completely eschew branching, you are going to frequently find yourself wanting to take some changes you made in one place and re-apply them somewhere else.  When you do this, you want your version control to make it painless.  Without merge history, it is very difficult to make change migration painless, since the tool will try to do things that have already been done.

A common example is the case of two branches that occasionally want to merge changes from one side to the other.

Here I have two branches of development.  On two occasions, somebody merged changes from branch (b) over to branch (a).  (Arrows go from a changeset to its parents.)

Without merge history, this can be a real pain.  When branch (a) tries to grab changes from changeset 5b, CVS doesn't remember that 3b and 2b have already been applied. So it attempts to apply them a second time, resulting in conflicts, pain and frustration and a general fear of branching.

One of the coolest things about DAG-based version control tools is that the DAG is an expression of merge history.  We interpret arrows in the DAG to mean "I've got this".

So, when it comes time to do a merge from 5b over to the (a) branch, we can use information in the DAG to know that 3b and 2b are already done.  I'm not saying the algorithm to use this information properly is trivial, but there are multiple implementations, and they work pretty well in practice.
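The core of the idea is a set difference over ancestors. The graph below is a hypothetical stand-in for the two-branch picture, with parent lists I made up to match the description (two earlier merges from (b) into (a)); real merge algorithms do considerably more than this.

```python
# child -> parents; arrows go from a changeset to its parents
parents = {
    "1": [],
    "2a": ["1"],
    "2b": ["1"],
    "3a": ["2a", "2b"],   # first merge from (b) into (a)
    "3b": ["2b"],
    "4a": ["3a", "3b"],   # second merge from (b) into (a)
    "4b": ["3b"],
    "5b": ["4b"],
}

def ancestors(node):
    """Everything reachable from node by following arrows, including node itself."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

# What does the head of (a) still lack from 5b?
to_merge = ancestors("5b") - ancestors("4a")
print(sorted(to_merge))  # ['4b', '5b']
```

2b and 3b never show up in the result, because the arrows already say the (a) branch has them.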

For example, Git has a reputation for excellent and painless branching and merging, and the DAG is the main reason why.

Cherrypicking

But a DAG is just one implementation of merge history, and it is definitely not perfect.

An arrow in a version control DAG goes from child to parent.  It tells us that the child contains all of the changes in the parent.  And its grandparents.  And its great grandparents.  And so on.

But what if this isn't true?

Consider the following picture:

I want to create changeset 4.  I want to start at changeset 1, and then I want to apply the changes from changeset 3, but NOT the stuff in changeset 2.  This operation is sometimes called "cherrypicking".  I don't want to merge all changes from one branch to another.  I just want to pluck one changeset (or one part of a changeset) and apply it as a patch to some other place.

How do I represent this in the DAG?

I can't.

  • I could draw an arrow from 4 to 3 (shown above in red).  This would correctly say that 4 contains the changes in 3, but it would INCORRECTLY claim that 4 contains the changes in 2.
  • OR, I could draw no arrow.  Effectively, my merge history would simply not record the fact that 4 is really 3 converted to a patch and applied to 1.

In either case, something bad is going to happen next time I merge from one branch to the other:

  • If I draw that lying arrow, I will not be given the chance to apply changeset 2, because the merge history believes I already did it.
  • If I don't draw any arrow, the tool will expect me to deal with changeset 3, because there is no merge history recording the fact that I already did it.

Neither of these problems is disastrous enough to make the evening news, but still.
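The two bad options can be sketched in a few lines.  This is only an illustration under the assumption that "contains" means "reachable through parent arrows", using the changeset numbers from the picture described above.  The `actually_applied` set is my own shorthand for what changeset 4 really includes.

```python
def ancestors(parents, node):
    """Return node plus everything reachable by following parent arrows."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


base = {"1": [], "2": ["1"], "3": ["2"]}

# What changeset 4 really contains: 1, plus 3 applied as a patch.
actually_applied = {"1", "3", "4"}

# Option 1: draw the lying arrow from 4 to 3.
with_arrow = dict(base)
with_arrow["4"] = ["1", "3"]

# Option 2: draw no arrow at all.
without_arrow = dict(base)
without_arrow["4"] = ["1"]

falsely_claimed = ancestors(with_arrow, "4") - actually_applied
forgotten = actually_applied - ancestors(without_arrow, "4")

print(falsely_claimed)  # {'2'}
print(forgotten)        # {'3'}
```

With the arrow, the history claims changeset 2 even though it was never applied.  Without the arrow, the history forgets changeset 3 even though it was.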

DAG-Like Things

It's tempting to think that the problem lies in the way I defined my DAG lines.  Perhaps a line should mean "just you, not your parents?"  But then I would need to have a line from every changeset to every one of its ancestors.  This would be completely infeasible.

Or perhaps we need two kinds of DAG lines?

  • Regular lines are the normal case.  They imply recursive inclusion.  We'll draw them in black.
  • Red lines are for cherrypicking.  When a red line points to a changeset, it says, "I've got this, but not its ancestors."  Red lines imply shallow inclusion.

But now our DAG is not really a DAG anymore.  If we're going to use a DAG, we'd like to be able to use the decades of computer science research about how to deal with them.  AFAIK, all the well understood algorithms about DAGs assume there is only one kind of line.

For example, is changeset 3 a leaf?  Well, maybe.  If you ignore the red lines, then 3 is a leaf.  But if red lines count, then 3 is an interior node.

Many CS algorithms become less useful when questions start getting answered with "maybe".
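Here is a small sketch of that "maybe".  The edge sets are assumptions based on the cherrypicking picture: black lines are the regular parent arrows, and the single red line goes from 4 to 3.  The `leaves` helper is a hypothetical name, not something from any real tool.

```python
# Regular (recursive) lines and cherrypick (shallow) lines, kept separately.
BLACK = {"1": [], "2": ["1"], "3": ["2"], "4": ["1"]}
RED = {"4": ["3"]}


def leaves(nodes, *edge_maps):
    """Nodes that nothing points to, counting only the given kinds of lines."""
    pointed_to = set()
    for edges in edge_maps:
        for targets in edges.values():
            pointed_to.update(targets)
    return set(nodes) - pointed_to


print(sorted(leaves(BLACK, BLACK)))        # ['3', '4']
print(sorted(leaves(BLACK, BLACK, RED)))   # ['4']
```

The same question gets two different answers depending on which lines we decide to count.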

Darcs

So, even though the DAG is a pretty good representation of merge history, it isn't perfect.

Darcs is an attempt to build a better solution to the problem.

Several weeks ago I divided version control tools into two groups:

  1. Those where the history is a Line.
  2. Those where the history is a Directed Acyclic Graph (a DAG).

But darcs doesn't really fit in either of these categories.  Its model of history is certainly not a Line.  But it's not really a DAG either, at least not in the same way as Git and Mercurial.

A darcs changeset records the full merge history at the patch level.  Darcs has a nice well-defined algebra of patches which allows it to accomplish some very clever things.

But while I consider the concepts behind Darcs to be fascinating, I also consider them to be raw and unproven in practice.  I can't see how the algorithm would scale to big problems.  And people who know darcs are always talking about the possibility of the merge algorithm going exponential.

Darcs seems to have a more complete representation for merge history.  But that doesn't mean there is any practical algorithm for making use of that information.

For now, I must consider darcs to be in the category of research, not development.

Final Thoughts

Merge history is a very hard problem.  Some of the imperfect solutions have found their way into common usage and proven themselves to be quite practical.  But there is a lot more that could be done.

Need a thesis topic for your PhD in computer science?  Go find a better solution to the merge history problem.


 

DVCS and Bug Tracking

In last week's entry, I mentioned my interest in Fossil, a relatively new DVCS written by the author of SQLite.  In the comments on that entry, a guy named Benjamin Pollack picked a fight with me about why I think Fossil is interesting.

It turns out that this guy is actually one of Joel's minions over at Fog Creek.  In fact, he joined the company as one of the interns on Project Aardvark back in 2005.

To Benjamin, I would like to say that "interesting != good".  Some things are interesting in spite of the fact that they are crap.  And some things are interesting BECAUSE of the fact that they're crap.

And to D. Richard Hipp, the author of Fossil, I would like to say that I am not saying Fossil is crap.  In fact, I am currently taking no position on whether Fossil is good or bad.  For now, I just think it's interesting, mostly because I think the issues of DVCS integration with the rest of the ALM tool suite are important.

But before I talk more about that, I can't resist offering a few remarks about Fossil itself.

Comments about Fossil

  1. Benjamin Pollack complained that Fossil handles merge conflicts poorly.  And he's right.  When it inserts markers around the conflicting text, it should clearly indicate what came from which file.

  2. Why does each instance of the repo have its own list of users?  I would have expected that this information would sync during a push/pull operation.

  3. The 'fossil ui' command is conceptually cool.  It runs a built-in web server and launches a browser pointing at it, providing a web-based way to interact with all the features of Fossil.  But Fossil's web UI isn't going to win any awards for aesthetics.  It's 2009, and the world is getting less tolerant of ugly things in web browsers every year.  At some point, making Fossil pretty would probably be worthwhile.

  4. Fossil is really easy to configure.  It's just one executable file.  And setting it up as a server is simple, either using its built-in server, or running as a CGI, or running through inetd.  Very nice.

Distributed Bug-Tracking

Industry-wide, there is a trend toward more integration between version control and other stuff like project tracking, wikis, discussion forums, build tracking, etc.  Developers don't just checkin code.  They use a whole bunch of other tools which help them collaborate with each other and with people in other functional areas.

While DVCS is one of the more interesting things happening right now, it does represent a setback in this particular area.  The benefits of a DVCS are somewhat diminished if all of the other tools a developer needs are still "centralized".

Yes, it's cool that I can commit my code while I'm on a plane, but how do I update the FogBugz case to mark it fixed?  So far, the answer is that I have to wait until the plane lands, hope the airport has Wi-Fi, log in to my corporate VPN, bring up a web browser, remember the case ID, find the case, change its status, and try to remember my code changes so I can write something relevant in the comments.

As long as this is the answer, then I assert that the story for DVCS is, well, incomplete.

Other relevant projects

As far as I know, Fossil is the only tool which is a DVCS with bug tracking built-in.  But it is not the only project exploring this area of need.  Others include:

I have spent some time looking at each of these, but not enough to make detailed comments.  Let's just say that I consider all of them interesting in the same way that I think Fossil is interesting.

Things I think I think

After looking at everything I can find in the area of distributed bug-tracking, I found myself with more questions than answers.  But I am starting to collect some things that I think are correct.  I think.

I think bugs deserve their own DAG.

I think everybody's first thought about bug-tracking with DVCS is that the bugs should be stored in the version control tree as text files that can be merged.  Whenever the tree branches, the bugs will automatically branch as well.  A bug can be marked as fixed in the branch where it is fixed.

But the more I think about this design, the more I think it would cause a lot of regrets later.  I think bug tracking records probably need their own place, living in their own DAG.  There are just too many scenarios where the bug-tracking info is being updated without changing anything in the tree.

For example, consider the QA team.  When they update a bug to mark it as "fix verified", you don't really want them doing this operation as a commit to the version control tree, do you?  In fact, you probably want the bug-tracking and version control areas to be controlled by a completely different set of access permissions.

Fossil got this right, sort of.  Tickets are separate from the tree.

But Fossil's design isn't perfect.  Tickets are actually not managed with a DAG at all.  Rather, the algorithm for resolving conflicting changes is something like "the version with the latest timestamp wins".  Do we credit the author for not over-designing?  After all, this guy did SQLite, so he knows a thing or two about how to implement "just enough to be incredibly useful".  Or is this design likely to make users really angry when it causes an unpleasant surprise?
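To see what kind of surprise that might be, here is a minimal sketch of a "latest timestamp wins" rule.  This is not Fossil's actual code or data format.  The tuple layout, the field names and the `latest_wins` function are all assumptions for illustration.

```python
def latest_wins(changes):
    """Replay (timestamp, field, value) changes in time order.

    For each field, whichever change has the latest timestamp wins.
    """
    ticket = {}
    for _timestamp, field, value in sorted(changes, key=lambda c: c[0]):
        ticket[field] = value
    return ticket


# Two people edit the same ticket in separate repository instances.
alice = [(100, "title", "Crash on save"), (200, "status", "fixed")]
bob = [(210, "status", "wontfix")]

print(latest_wins(alice + bob))
# {'title': 'Crash on save', 'status': 'wontfix'}
```

Alice's "fixed" disappears without anybody being asked.  The rule is simple and always produces an answer, but it never reports a conflict.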

I think bugs deserve their own merge algorithm

Once again, the first thought here is probably not the right one.

A DVCS knows how to deal with merging changes to text files.  So if we want to store bugs, then obviously we should keep them in text files so we can re-use all that merge code, right?

I don't think so.

Stuff in a database is very highly structured.  We have lots of information which can be used to implement really good merging.  In theory, merging changes to a bug-tracking database should work much better than merging changes to code. 

(Yes, code is very highly structured as well, but the only way to get that information is to parse the code.  I've seen some interesting research in the area of language-specific version control tools that manage code changes with a parse tree representation, but I don't think those things will be practical mainstream solutions anytime soon.)

Anyway, if you take a bug record and throw it in a text file and then use regular old file merge to resolve changes, it seems like you're throwing away a lot of the information you could be using.

Admittedly, writing a special merge algorithm for this case would be a TON of work.  But the results might be worth it.  It might be the difference between a distributed bug-tracking system that constantly annoys its users and one that Just Works.
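As a taste of what structure buys us, here is a minimal sketch of a field-by-field three-way merge for a bug record.  It is nowhere near the TON of work a real one would be.  The field names, the rule that comments are append-only, and the choice to keep our side when there is a true conflict are all assumptions.

```python
def merge_bug(base, ours, theirs):
    """Three-way merge of one bug record, one field at a time.

    Returns (merged_record, list_of_conflicting_fields).
    """
    merged = {}
    conflicts = []
    for field in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(field), ours.get(field), theirs.get(field)
        if o == t:
            value = o                  # both sides agree
        elif o == b:
            value = t                  # only they changed it
        elif t == b:
            value = o                  # only we changed it
        elif isinstance(o, list) and isinstance(t, list):
            # Append-only lists (comments): keep additions from both sides.
            value = o + [item for item in t if item not in o]
        else:
            conflicts.append(field)    # both changed it, differently
            value = o
        if value is not None:
            merged[field] = value
    return merged, conflicts


base = {"status": "open", "priority": "high", "comments": ["It crashes."]}
ours = {"status": "fixed", "priority": "high",
        "comments": ["It crashes.", "Fixed in my branch."]}
theirs = {"status": "open", "priority": "low",
          "comments": ["It crashes.", "Also happens on Linux."]}

merged, conflicts = merge_bug(base, ours, theirs)
print(merged["status"], merged["priority"], len(merged["comments"]), conflicts)
# fixed low 3 []
```

A line-based text merge would probably have flagged the two new comments as a conflict, since both sides added a line in the same spot.  Knowing that comments only ever get appended lets us keep both without bothering anybody.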

I think bugs deserve their own sync patterns.

The use cases for distributed bug tracking are different than distributed version control.

For example, it seems very likely that we want to sync our local instance of the bug-tracking database a lot more frequently than we want to sync our local instance of the version control tree. 

If I've got a live connection to the central server, then I want to be pulling down updates to the bug db all the time.

If I add a comment to a bug, I probably want that comment pushed up to the central server as soon as my network connectivity allows.

With version control, I want a private sandbox so I can work on a bunch of code changes and only push them up to the central server when I'm done fiddling with them.  That kind of workflow strikes me as far less important for a bug-tracking application.

I think distributed version control needs distributed bug-tracking

I've just explained several ways that distributed bug tracking needs to be different from the way a DVCS works.  But I still think that pairing a DVCS with a centralized bug-tracking solution makes very little sense.

Consider the scenario where a company is doing development in two cities and wants each of them to have their own server.

We actually get this request quite a bit from Vault customers.  Somebody calls and says they have a team in New York City and another team in Strawn.  They want each team to be doing work on their own central server.  And they want the two central servers to synchronize with each other at some regular interval.

These people are asking for a DVCS.  They don't care about the "coding on a plane scenario".  They don't really care so much about private workspaces or the performance benefits of having the entire repository on every developer's machine.  They still want a central server.  The only difference is that they want TWO central servers.  And a DVCS can do that.

And if they are using more than just version control, then what they really want is for ALL developer-related stuff to follow that same workflow.  Every four hours when the two central servers do their sync-up, a bunch of changesets get pushed in each direction.  Some of those changes are modifications to the version control tree.  Others contain changes to the work items or the wiki pages or whatever.
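The sync-up itself can be sketched as a simple exchange of whatever each side is missing.  This is only an illustration: the dictionaries, the changeset IDs and the `sync` function are hypothetical, and a real tool would also have to merge conflicting changes rather than just copy them.

```python
def sync(server_a, server_b):
    """Copy to each server the changesets it does not have yet.

    Each server is a dict mapping a changeset ID to its description.
    Returns the IDs pushed in each direction.
    """
    to_b = {cid: c for cid, c in server_a.items() if cid not in server_b}
    to_a = {cid: c for cid, c in server_b.items() if cid not in server_a}
    server_b.update(to_b)
    server_a.update(to_a)
    return sorted(to_b), sorted(to_a)


# Changesets are not only code: work items and wiki pages travel the same way.
new_york = {"c1": "tree: fix parser", "w1": "wiki: update release notes"}
strawn = {"c1": "tree: fix parser", "b7": "bug: mark case as fixed"}

print(sync(new_york, strawn))   # (['w1'], ['b7'])
print(new_york == strawn)       # True
```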

I think DVCS will stay small until it becomes a "whole product".

My regular readers know that I am a fan of Geoffrey Moore's classic book, Crossing the Chasm.  One of the ideas in that book is that new innovations don't go mainstream until they become a "whole product".  Right now, most of the comments about DVCS that I am hearing out in the industry are negative.

Some of them are saying that "DVCS will never be mainstream".  More and more, I think those people are wrong.

Others are saying that "this DVCS stuff just isn't ready yet".  Right now, those people are right.  For a large portion of the market, version control alone is not a complete solution.  They want the whole product, and they want it to work together seamlessly.

If DVCS wants to reach that part of the market, it needs to figure out what "distributed" means for bug-tracking and wiki and forums and change management and build tracking and test management and requirements.

I think Benjamin Pollack is an irritating kid who quibbles too much.

Or rather, I did until I saw his bitbucket page.  Anybody who writes a C implementation of an AVL tree FOR FUN has my complete respect.  :-)


 

DVCS and DAGs, Part 2

In Part 1 of this article, I talked about the differences between modeling version control history as a DAG vs. a Line.  The two most noteworthy kinds of feedback I received on this entry were:

  1. Several people accused me of spreading pro-Line-model FUD because I mentioned some of the problems that happen with the DAG model and stopped short of saying that the DAG model is going to cure cancer, eliminate global warming and bring peace to the Middle East.

  2. Several people asked me how I drew those really cool diagrams.

Before I continue with Part 2, allow me to briefly respond to these two pieces of feedback.

My response about DVCS advocacy

Yes, my company ships a version control tool that is built on the Line model of history.  Therefore, any DVCS is, to a certain extent, a competitor to my product.

I further acknowledge that I am breaking the rules. 

  • Business folks like me aren't supposed to ever say anything positive about their competitors.

  • Our job is to feel threatened by change, and to spread that fear around to others.

  • We're supposed to pretend like we don't know that every design choice has tradeoffs, and to insist that our way is better in all situations.

As my Mother can confirm, I don't always follow the rules very well.  :-)

The simple fact is that I find this stuff interesting.  I have been working in the version control industry for over a decade.  I am writing a book on the topic.  This is what I do.  It's interesting to me.

Really.

But there is more happening here than just me being an entrepreneurial rebel.  Let's see -- how can I say this nicely?

You Git fans need to chill.

Seriously, rabid advocacy by Git fans is making the world a lousy place to live.  Git is really cool, but it is not the right tool for every situation.

In their defense, let's acknowledge that the apple didn't fall far from the tree on this particular issue.  When people begin exploring DVCS, often one of the first things they find is the video of Linus Torvalds and his 2007 presentation about Git.  And what they find there is someone who doesn't seem to get it.

Folks, Subversion is probably the most popular version control tool in the world right now.  Almost everyone using a version control tool today is using one that is built on the Line model of history, and they're using these tools successfully and productively.  When someone refuses to acknowledge any validity in that model, they look clueless.

The Torvalds video has done plenty of damage.  That kind of attitude is a big turn-off for people interested in what's new in the world of version control.

So, my fellow admirers of Git, if you are trying to prevent people from using DVCS tools and make sure that they stay confined to their current niches, then keep up the good work.

But if you really want to help the world see the benefits of Git and similar tools, then start realizing that people were getting productive work done before they existed.

My response about those cool diagrams

My DAG pictures were drawn by SourceGear's graphic artist, John Woolley, who also did all the artwork for the Evil Mastermind comic books.  John is doing the layout and illustration work for my upcoming source control book as well.

However, because John's DAG pictures got more praise than my "thousand words", I have decided to be bitter and refuse to include any of his work in this blog entry.  :-)

OK, let's talk more about DAGs

As I mentioned in Part 1, if a DAG is allowed to grow without guidance, things can turn into a real mess.  DAGs are easier to create.  Lines are easier to use.  As soon as we embrace the DAG model to gain all its benefits, the very next thing that happens is that we want Lines back.

This is why every DVCS has features that can be used to make sure the DAG grows with guidance.  Those features are designed to discourage people from committing without taking any responsibility for the complexity that increases every time we add another point of divergence.

In other words, every DVCS has features that allow developers to take a piece of the DAG and treat it like a Line.

Git

Git guides the growth of the DAG through its support for named branches.  You are discouraged from committing something unless its parent is a leaf.

So, if I use the git checkout command to point my working directory to a DAG node which is not a leaf, Git politely fusses at me:

eric$ git checkout 9542b
Note: moving to "9542b" which isn't a local branch
If you want to create a new branch from this checkout, you may do so
(now or later) by using -b with the checkout command again. Example:
  git checkout -b <new_branch_name>
HEAD is now at 9542b5f... initial

If you only commit things that are based on the leaf, then your history stays very Line-like.

Mercurial

Historically, Mercurial has been described as supporting only one branch per repository instance.  Comparisons to Git often focused on Mercurial's apparent lack of intra-repository branching.

I speak in the past tense here because I have heard that Mercurial has added additional features in this area.

I mention Mercurial here only so that its fans don't feel too left out.  I can't speak from much experience using this particular tool.

Still, I feel comfortable citing Mercurial as an example of my point:  In [at least] its early releases, Mercurial was guiding the growth of the DAG by preventing the user from diverging it.  This almost certainly contributed to the widespread perception of Mercurial as a very easy-to-use tool.

Bazaar

This tool is the DVCS I have used the most, but I still can't call myself an expert.  From my own experience, I would characterize Bzr as a tool that works very hard to guide the growth of the DAG.

Whenever I push changes from my local repo to a central server, Bzr requires me to merge in other changes and commit from the leaf, just like a Line model tool would do.

It's rather cool that Bazaar offers me the option of using it with a central server instead of as a pure DVCS.  But in this mode, the same basic restriction applies:  I can't commit anything unless my baseline is the leaf in the repo.

When I use Bzr, it usually feels like I am using a Line-model tool.

My own preferences

On this particular issue, I actually prefer Git's way of doing things.

Bazaar seems to believe that DAG divergence is only legitimate when it happens in separate repo instances and must be resolved before anything can be pushed or committed together.  This just feels too heavy-handed for a DVCS.  Once I know about the DAG, I want to be allowed to think that way.  I don't mind being warned when I am about to commit a DAG node which would have an older sibling.  But forcing me to merge in order to commit feels very un-DVCS-like to me.

I like Git's ability to switch my baseline using "git checkout branchname".  I understand that people who are not accustomed to thinking about the DAG do find this capability to be unintuitive.  But I like it.

Note that I still like Line-model tools like Subversion and Vault as well.  I'm just saying that a DAG-model tool should act like one.

Fossil

Lately, the DVCS which intrigues me the most is Fossil.  It was written by D. Richard Hipp, the same guy who wrote SQLite.

Fossil has a number of interesting features.  Most notable is the built-in support for bug tracking.  This is one area where the other DVCS's all fail.  They bring you distributed version control, but when it comes time for a developer to update the bug tracking system, things suddenly go back to the centralized world.

Anyway, I'm just getting started with looking closely at Fossil, but I do like the way its website talks about this problem of DAG divergence:

Having more than one leaf in the check-in tree is usually considered undesirable, and so forks are usually either avoided entirely, as in figure 1, or else quickly resolved as shown in figure 3. But sometimes, one does want to have multiple leaves. For example, a project might have one leaf that is the latest version of the project under development and another leaf that is the latest version that has been tested. When multiple leaves are desirable, we call the phenomenon branching instead of forking.

Nice.  So far, I get the impression that Fossil works like Git does in this respect.  When the DAG diverges, complexity increases.  Feel free to offer me a little protection from that complexity by informing me of what's going on.  But don't get in my way.


 

DVCS and DAGs, Part 1

There are two kinds of people:

  1. Those who tend to divide everything into two groups
  2. Those who do not.

I am one of the former.  :-)

There are two kinds of version control tools:

  1. Those where the history is a Line.
  2. Those where the history is a Directed Acyclic Graph (a DAG).

Traditional tools (like Subversion and Vault) tend to model history as a Line.  In the DVCS tools (like Git and Mercurial), history is a DAG.  The differences between these two models are rather interesting.

The Line model is tried and true.  History is a sequence of versions, one after the other.

To create a new version:

  1. Grab the latest version
  2. Make some changes to it
  3. Check it back in

People like the Line model for its simplicity.  It provides an unambiguous answer to the question of which version is latest.

But the Line model has one big problem:  You can only checkin a new version if it was based on the latest version.  And this kind of thing happens a lot:

  1. I grab the latest version.  At the time I grabbed it, this was version 3.
  2. I make some changes to it
  3. While I am doing this, somebody checks in version 4.
  4. When I go to checkin my changes, I can't, because they are not based on the current version.  The "baseline" for my changes was version 3, because that's what was current when I started.

The Line model of history won't allow me to create version 5 as shown in the picture.  Instead, a Line model SCM tool will require me to take the changes which were made between version 3 and 4 and apply them to my version.  This operation is usually called a "merge".  The result is that my baseline gets changed from 3 to 4, thus allowing me to checkin.  This model is explained in chapter 2 of my Source Control HOWTO as "edit-merge-commit".
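The rule is easy to sketch.  What follows is a toy model, not the implementation of Subversion, Vault or any other real tool.  The `LineRepo` class and the `StaleBaseline` exception are hypothetical names.

```python
class StaleBaseline(Exception):
    """Raised when a checkin is not based on the latest version."""


class LineRepo:
    def __init__(self):
        self.versions = ["initial contents"]   # version 1

    @property
    def latest(self):
        return len(self.versions)

    def checkin(self, baseline, contents):
        """Accept a new version only if it was based on the latest one."""
        if baseline != self.latest:
            raise StaleBaseline(
                f"baseline is {baseline}, latest is {self.latest}: merge first"
            )
        self.versions.append(contents)
        return self.latest


repo = LineRepo()
repo.checkin(1, "version 2")
repo.checkin(2, "version 3")

my_baseline = repo.latest            # I grab version 3
repo.checkin(3, "version 4")         # somebody else checks in first

try:
    repo.checkin(my_baseline, "my changes")
except StaleBaseline as error:
    print(error)                     # baseline is 3, latest is 4: merge first

# After merging the changes between 3 and 4 into my work, my baseline is 4.
print(repo.checkin(4, "my changes, merged"))   # 5
```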

The obvious question is:  What would happen if we allowed 5 to be checked in with 3 as its baseline?  Our history would no longer be a Line.  Instead it would be a DAG.

And why would we do this?

The major feature of the DAG model for history is that it doesn't interrupt the developer at the moment they are trying to commit their work.  In this fashion, the DAG is probably a more pure representation of what happens in a team practicing concurrent development.  Version 5 was in fact based on version 3, so why not just represent that fact?

Well, it turns out there is a darn good reason why not.  In the DAG above, we don't know which version is "the latest".  This causes all kinds of problems:

  • Suppose we need the changes in version 4 and 5 in order to ship our release.  Currently we can't have that.  There is no version in the system that includes both.
  • Our build system is configured to always build the latest version.  What is it supposed to do now?
  • Even if we build both 4 and 5, which one is QA supposed to test?
  • If a developer wants to update her tree to the latest version, which one is it?
  • When a developer wants to make some changes, which version should they use as the baseline?
  • Our project manager wants to know which tasks are done and how much work is left to do.  His notion of "done" is very closely associated with the concept of "latest".  If he can't figure out which version is latest, his brain is likely to just blue screen when he tries to update the Gantt chart.

Yep, this is a bad scene.  Civilization as we know it will probably just shut down. 

In order to avoid dogs and cats living together with mass hysteria, the tools that use a DAG model of history provide a way to resolve the mess.  The answer is the same as it is with Line history.  We need a merge.  But instead of requiring the developer to merge before they commit, we allow that merge to happen later.

Somebody needs to construct a version which contains all the changes in both version 4 and version 5.  When this version gets committed, it will have arrows pointing to both of its "parents". 

Order has been restored.  Once again we know which version is "the latest".  If somebody will remember to reboot the project manager, he will probably realize that this DAG looks almost like a Line.  Except for that weird stuff happening between version 3 and 6, it is a Line.  Best not to lose sleep over it.
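In code, "the latest" is just "the only leaf".  Here is a minimal sketch using the version numbers from the example above.  The `leaves` and `latest` helpers are hypothetical names.

```python
def leaves(parents):
    """Versions that no other version lists as a parent."""
    has_child = {p for ps in parents.values() for p in ps}
    return set(parents) - has_child


def latest(parents):
    """Return the single leaf, or None when the answer is ambiguous."""
    found = leaves(parents)
    return next(iter(found)) if len(found) == 1 else None


# Versions 4 and 5 were both based on version 3.
history = {1: [], 2: [1], 3: [2], 4: [3], 5: [3]}
print(sorted(leaves(history)), latest(history))   # [4, 5] None

# Somebody commits version 6, a merge with two parents.
history[6] = [4, 5]
print(sorted(leaves(history)), latest(history))   # [6] 6
```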

What this project manager doesn't know is that this particular crisis was minor.  He thinks that his paradigm has been completely challenged, but one day he's going to come into his office and find this:

&@#$!

Now what?

If you're living in the Line model paradigm, this DAG is an absolute disaster.  It has FOUR leaf nodes.  Everything that needs to know which version is latest is about to completely fall apart, including the aforementioned project manager who is probably now in his office curled up in a fetal position and hoping that Mommy includes cookies with his SpaghettiOs at lunch.

The Line model is looking pretty good right now.  There's a good reason why 99.44% of developers are using an SCM tool built on the Line model of history.  (Yes, I made that statistic up.)

And yet, despite all this apparent chaos, we should remind ourselves of the primary benefit of the DAG model:  It more accurately describes the way developers work.  It doesn't make developers bend to its will like the Line model does.  When a developer wants to check something in, he does, and the DAG merely records what happened.

Many teams will always prefer the Line model, and there's nothing wrong with that.  Life is simpler that way.

But for some other teams, the DAG model can be really valuable. 

And for other teams, the DAG model might be coming along simply because they want to use a DVCS tool for other reasons.  DVCS tools use a DAG because they have to.  If we can't assume a live connection to a central server, there isn't any way to force developers to make everything fit into the Line model.

So we need to figure out ways of coping with the DAG.  How do we do this?

One way is to reframe every operation.  If you tell a doctor that "it hurts when I need to know which version is latest", the doctor will tell you to "stop doing that".  Instead, always specify exactly which node to use:

  • The build machine doesn't build the latest node.  Instead, it builds whichever node we tell it to build.  Or maybe it builds every node.
  • QA tests whichever build somebody decides they should test.
  • Developers don't update their tree to "the latest".  Instead, they look at the DAG, pick a node, and update to that one.

I'm not saying this approach is practical.  I am merely observing that it is conceptually valid.  As long as you're willing to specify which node you want to use, any operation that needs a node can proceed.

But how do we specify a node?  One thing that makes this approach problematic is that these nodes tend to have odd names.  For example, in Git, the name of a node is something like e69de29bb2d1d6434b8b29ae775ad8c2e48c5391.  Developers are going to find this naming scheme to be a little unintuitive.
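Those odd names are not random.  In a Git repository that uses SHA-1, an object's name is the SHA-1 hash of a short header followed by the object's contents.  The sketch below shows this for a file (a "blob").  Commits are named the same way with a different header, but I have not tried to reproduce the commit format here.  The function name `git_blob_id` is my own.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the name Git gives to a file with these contents."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


print(git_blob_id(b""))   # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

As it happens, the name quoted above is the one Git assigns to an empty file.  Since the name is derived from the contents, nobody gets to choose it, which is exactly why it looks so unfriendly.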

All DVCS tools use a DAG.  And all these tools do various things to either prevent "the crisis of multiple leaf nodes" or to help the team cope with it.  But they all seem to do it a little differently.

Happily, this presents me with an opportunity to divide them all into two groups:

  1. Those who handle this problem in ways that I like.
  2. Those who handle this problem in ways that I do not like.

This blog entry is already longer than I thought it would be, so I'm going to stop here and continue next week.


 

On Git's lack of respect for immutability and the Best Practices for a DVCS

I learned something very important from the feedback after my entry last week on Git's index.  Here's what I learned:

Suppose I wrote a 300 page book describing all the great things about Git and why it is so awesome. 

Further suppose that on page 295 near the bottom, I include a one-sentence mention of a way that I think Git might change for the better. 

Further suppose that I wrote that sentence in Klingon.  And then I encrypted it with Schneier's latest cipher, wrapped it in base64 encoding, ran it through rot13 and then pasted it into the book.

If I did this, the primary response from the Git user community would be:  "Eric's new book says that Git sucks.  He doesn't get it."

Trust me folks -- I get it.  Commits to a DVCS are different.  When you commit to a private instance of the repository, you don't "break the build".  The rules and guidelines for a DVCS are different than the ones for a centralized system.

Best Practices

But some of the best practices are the same.  Here's my off-the-cuff sloppy definition of a "best practice": 

A best practice is a guideline that can be followed lots of times by lots of different people in lots of different situations with minimal likelihood of causing pain to the team.

Actually, I want to give TWO definitions.  Here's another one, speaking as a source control vendor:

A best practice is a guideline that I can give to our customers to minimize the likelihood that they will need to call our tech support staff.

A technique can be "really cool" or "very powerful" and still not qualify for any reasonable person's definition of "best practice".

I stand by my original claims.  I think "git add --p" is "really cool", but it doesn't qualify as a "best practice".  It allows the developer to commit code they have never seen.  Yes, that commit happens in a private instance of the repo, but that code is eligible to be pushed into another instance.

Is there a good outcome here?

Suppose I use "git add -p" to commit some code that doesn't even compile.  What can happen?

  • Maybe this changeset never escapes my private repository instance.  In that case, it has caused no harm.  But it has also caused no benefit.

  • Maybe my next checkin fixes the build.  So now the offending changeset is less likely to cause problems, because the fix will get pushed as well.  But this scenario is equivalent to the centralized case where I break the build but fix it before anybody finds out.  It's not very harmful, but it's not very helpful either.

  • Maybe I later use Git's history rewriting features to eliminate the offending changeset, replacing a chain of small changesets with one larger one that has been well-tested.  In this scenario, I have eliminated all the potentially harmful effects, since the DAG will not have any nodes that are "broken".  But now I have other concerns.
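To make the second and third outcomes concrete, here is a rough Python sketch of a toy history.  Nothing in it is Git's actual data model: the Changeset class, the "builds" flag, and the squash helper are all names I made up for illustration.  It only shows the point about broken nodes: fixing the build in a later changeset leaves the broken node in the chain, while squashing replaces the chain with a single new node.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Changeset:
    """A toy changeset: a name, one parent, and whether it compiles."""
    name: str
    parent: Optional["Changeset"]
    builds: bool


def history(tip):
    """Return the chain from the root to `tip`, oldest first."""
    chain = []
    node = tip
    while node is not None:
        chain.append(node)
        node = node.parent
    return list(reversed(chain))


def broken_nodes(tip):
    """Names of changesets in the history of `tip` that do not build."""
    return [c.name for c in history(tip) if not c.builds]


def squash(tip, base, name, builds):
    """Replace everything after `base` with one new changeset.

    The old chain is not edited in place; a new node is created whose
    parent is `base`, and the old nodes simply stop being reachable
    from the new tip.
    """
    if base not in history(tip):
        raise ValueError("base is not an ancestor of tip")
    return Changeset(name, base, builds)


root = Changeset("root", None, True)
partial = Changeset("partial", root, False)   # the commit that doesn't compile
fix = Changeset("fix", partial, True)         # the next checkin fixes the build

# Outcome two: the fix travels with the broken node, which is still there.
still_broken = broken_nodes(fix)

# Outcome three: one well-tested changeset replaces the chain.
squashed = squash(fix, root, "partial+fix", True)
after_squash = broken_nodes(squashed)
```

In this toy model still_broken contains "partial" and after_squash is empty, which is the "no broken nodes" result described in the third bullet.  What the sketch cannot show is the cost: the history that actually happened is no longer the history that is recorded.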

Immutability

The issue of rewriting history is perhaps my biggest philosophical objection to the way Git works.  Call me old fashioned if you like, but I believe changesets and the history of the repository should be immutable.  Version control features that alter history make me squirm.

My own product supports an "Obliterate" feature and I hate it.  I understand why it's there, but I still wish it wasn't.  One thing I've learned from twelve years of supporting version control products is that customers will find a way to misuse things. 

The purpose of Obliterate is to help with that once-a-year situation where you really screwed up and checked in something that should never have been in the repository and absolutely must be removed.  But every now and then we get a tech support call from somebody who is using Obliterate every day.  Those are the days when I want to ship the product with that feature locked and only enable it for customers where every developer has passed a written exam.

Think about it.  Even if you love Git's ability to rewrite history, does this sound to you like a "best practice"?  Or does it sound like a quick way to get a bunch of geeks addicted to recreational pharmaceuticals?

Sandboxes

Like I said, I get it.  A DVCS gives me a private sandbox, so I can have more freedom while I play.  It's "really cool" that I can kick and throw sand without hurting the other kids.  But that doesn't mean it's a "best practice".

Conceptually, my private instance of the repository is still part of a larger whole.  The entire repository may not exist on any one machine, but it exists in concept.  It is one big Directed Acyclic Graph.  When I use "git add -p" and check in something that doesn't compile, my offending commit is conceptually still a member of that DAG.
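Here is a minimal sketch of that idea, again in Python and again with made-up names.  Each repository instance is modeled as a plain dict mapping a changeset id to the ids of its parents; whole_dag, ancestors, and push are hypothetical helpers, not anything a real DVCS exposes.  The assumption baked in is that pushing a changeset also transfers every ancestor the other side is missing.

```python
def whole_dag(*instances):
    """The conceptual repository: the union of every instance's nodes."""
    merged = {}
    for instance in instances:
        merged.update(instance)
    return merged


def ancestors(instance, tip):
    """Every changeset reachable from `tip`, including `tip` itself."""
    seen = set()
    stack = [tip]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(instance[node])
    return seen


def push(src, dst, tip):
    """Copy `tip` and all of its ancestors from `src` into `dst`."""
    for node in ancestors(src, tip):
        dst[node] = src[node]


# A shared instance, and my private sandbox cloned from it.
shared = {"a": (), "b": ("a",)}
mine = dict(shared)
mine["broken"] = ("b",)      # the commit that doesn't compile
mine["fix"] = ("broken",)    # a later commit that fixes it

# Before any push, the broken commit exists only in my sandbox,
# yet it is already a member of the conceptual DAG.
in_shared_before = "broken" in shared
in_whole_before = "broken" in whole_dag(shared, mine)

# Pushing the fix carries the broken commit along with it.
push(mine, shared, "fix")
in_shared_after = "broken" in shared
```

Before the push, the broken changeset is absent from the shared instance but present in the union.  After the push, it is in the shared instance too, because the fix cannot travel without its parent.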

The best practices for a DVCS are built around this principle:  The extra freedom provided by a private sandbox should be held in the proper tension with a measure of respect for the entire DAG.