2009-03-02 16:10:25

DVCS and DAGs, Part 1

There are two kinds of people:

Those who tend to divide everything into two groups
Those who do not.

I am one of the former. :-)

There are two kinds of version control tools:

Those where the history is a Line.
Those where the history is a Directed Acyclic Graph (a DAG).

Traditional tools (like Subversion and Vault) tend to model history as a Line. In the DVCS tools (like Git and Mercurial), history is a DAG. The differences between these two models are rather interesting.

The Line model is tried and true. History is a sequence of versions, one after the other.

To create a new version:

Grab the latest version
Make some changes to it
Check it back in

People like the Line model for its simplicity. It provides an unambiguous answer to the question of which version is latest.

But the Line model has one big problem: You can only checkin a new version if it was based on the latest version. And this kind of thing happens a lot:

I grab the latest version. At the time I grabbed it, this was version 3.
I make some changes to it
While I am doing this, somebody checks in version 4.
When I go to checkin my changes, I can't, because they are not based on the current version. The "baseline" for my changes was version 3, because that's what was current when I started.

The Line model of history won't allow me to create version 5 as shown in the picture. Instead, a Line model SCM tool will require me to take the changes which were made between version 3 and 4 and apply them to my version. This operation is usually called a "merge". The result is that my baseline gets changed from 3 to 4, thus allowing me to checkin. This model is explained in chapter 2 of my SourceControl HOWTO as "edit-merge-commit".

The obvious question is: What would happen if we allowed 5 to be checked in with 3 as its baseline? Our history would no longer be a Line. Instead it would be a DAG.

And why would we do this?

The major feature of the DAG model for history is that it doesn't interrupt the developer at the moment they are trying to commit their work. In this fashion, the DAG is probably a more pure representation of what happens in a team practicing concurrent development. Version 5 was in fact based on version 3, so why not just represent that fact?

Well, it turns out there is a darn good reason why not. In the DAG above, we don't know which version is "the latest". This causes all kinds of problems:

Suppose we need the changes in version 4 and 5 in order to ship our release. Currently we can't have that. There is no version in the system that includes both.
Our build system is configured to always build the latest version. What is it supposed to do now?
Even if we build both 4 and 5, which one is QA supposed to test?
If a developer wants to update her tree to the latest version, which one is it?
When a developer wants to make some changes, which version should they use as the baseline?
Our project manager wants to know which tasks are done and how much work is left to do. His notion of "done" is very closely associated with the concept of "latest". If he can't figure out which version is latest, his brain is likely to just blue screen when he tries to update the Gannt chart.

Yep, this is a bad scene. Civilization as we know it will probably just shut down.

In order to avoid dogs and cats living together with mass hysteria, the tools that use a DAG model of history provide a way to resolve the mess. The answer is the same as it is with Line history. We need a merge. But instead of requiring the developer to merge before they commit, we allow that merge to happen later.

Somebody needs to construct a version which contains all the changes in both version 4 and version 5. When this version gets committed, it will have arrows pointing to both of its "parents".

Order has been restored. Once again we know which version is "the latest". If somebody will remember to reboot the project manager, he will probably realize that this DAG looks almost like a Line. Except for that weird stuff happening between version 3 and 6, it is a Line. Best not to lose sleep over it.

What this project manager doesn't know is that this particular crisis was minor. He thinks that his paradigm has been completely challenged, but one day he's going to come into his office and find this:

&@#$!

Now what?

If you're living in the Line model paradigm, this DAG is an absolute disaster. It has FOUR leaf nodes. Everything that needs to know which version is latest is about to completely fall apart, including the aforementioned product manager who is probably now in his office curled up in a fetal position and hoping that Mommy includes cookies with his SpaghettiOs at lunch.

The Line model is looking pretty good right now. There's a good reason why 99.44% of developers are using an SCM tool built on the Line model of history. (Yes, I made that statistic up.)

And yet, despite all this apparent chaos, we should remind ourselves of the primary benefit of the DAG model: It more accurately describes the way developers work. It doesn't make developers bend to its will like the Line model does. When a developer wants to check something in, he does, and the DAG merely records what happened.

Many teams will always prefer the Line model, and there's nothing wrong with that. Life is simpler that way.

But for some other teams, the DAG model can be really valuable.

And for other teams, the DAG model might be coming along simply because they want to use a DVCS tool for other reasons. DVCS tools use a DAG because they have to. If we can't assume a live connection to a central server, there isn't any way to force developers to make everything fit into the Line model.

So we need to figure out ways of coping with the DAG. How do we do this?

One way is to reframe every operation. If you tell a doctor that "it hurts when I need to know which version is latest", the doctor will tell you to "stop doing that". Instead, always specify exactly which node to use:

The build machine doesn't build the latest node. Instead, it builds whichever node we tell it to build. Or maybe it builds every node.
QA tests whichever build somebody decides they should test.
Developers don't update their tree to "the latest". Instead, they look at the DAG, pick a node, and update to that one.

I'm not saying this approach is practical. I am merely observing that it is conceptually valid. As long as you're willing to specify which node you want to use, any operation that needs a node can proceed.

But how do we specify a node? One thing that makes this approach problematic is that these nodes tend to have odd names. For example, in Git, the name of a node is something like e69de29bb2d1d6434b8b29ae775ad8c2e48c5391. Developers are going to find this naming scheme to be a little unintuitive.

All DVCS tools use a DAG. And all these tools do various things to either prevent "the crisis of multiple leaf nodes" or to help the team cope with it. But they all seem to do it a little differently.

Happily, this presents me with an opportunity to divide them all into two groups:

Those who handle this problem in ways that I like.
Those who handle this problem in ways that I do not like.

This blog entry is already longer than I thought it would be, so I'm going to stop here and continue next week.